hw2

tags data

2023 Educational Data Mining and Applications HW2.pdf

6.3

a

  1. Support(L) ≥ minimum support threshold.
  2. Since S is a subset of L, it means that any transaction containing all items in S also contains all items in L.
  3. Support(S) ≥ Support(L) (because S includes all transactions that L includes)
  4. Therefore, Support(S) ≥ minimum support threshold.
  5. This proves that S is also a frequent itemset.

b

  1. S’ ⊆ S
  2. {T ∈ D | S ⊆ T} = Support(S) , all possible set T that is the superset of S
  3. {T ∈ D | S ⊆ T} ⊆ {T ∈ D | S’ ⊆ T}
  4. Support(S) ≤ Support(S’)

c

  1. Confidence(X => Y) = Support(X ∪ Y) / Support(X).
  2. Confidence(s => {l-s}) = Support(s ∪ {l-s}) / Support({l-s}).
  3. Confidence(s’ => {l-s’}) = Support(s’ ∪ {l-s’}) / Support({l-s’}).
  4. Support(s’ ∪ (l - s’)) ≤ Support(s ∪ (l - s))
  5. Support(s’) ≤ Support(s)
  6. Confidence(s’ => (l - s’)) ≤ Confidence(s => (l - s))

d

  1. If I is frequent in D, it must have a support count greater than or equal to the minimum support count threshold (minsup) for frequent itemsets in D.
  2. If I is not frequent in any partition Pi, it must have a support count less than the minsup in each partition.
  3. Any itemset that is frequent in the original database D must be frequent in at least one partition of D.

6.6

min_support=0.6
min_confi=0.8

IDItems
T100{M, O, N, K, E, Y}
T200{D, O, N, K, E, Y}
T300{M, A, K, E}
T400{M, U, C, K, Y}
T500{C, O, O, K, I, E}

a

Apriori algorithm

1-Itemsets2-Itemsets3-Itemsets
ItemSupport
K1.0
E0.8
Y0.6
M0.6
O0.6
C0.4
N0.4
D0.1
A0.1
U0.1
I0.1
ItemSupport
{K,E}0.8
{K,Y}0.6
{K,M}0.6
{K,O}0.6
{E,Y}0.4
{E,M}0.4
{E,O}0.6
{Y,M}0.4
{Y,O}0.2
{M,O}0.2
ItemSupport
{K,E,Y}0.4
{K,E,M}0.4
{K,E,O}0.6
{K,Y,M}0.4
{K,Y,O}0.4
{K,M,O}0.2
{K,M,E,O}0.2
frequent itemsets:{{E},{K},{M},{O},{Y},{EK},{EO},{KM},{KO},{KY},{EKO}}\text{frequent itemsets}: \{\{E\},\{ K\},\{ M\},\{ O\},\{ Y\},\{ EK\},\{ EO\}, \{KM\}, \{KO\}, \{KY\}, \{EKO\}\}

FP-growth algorithm

itemsetsconditionsupport \boldsymbol{\geq} 0.6 itemsetsfrequent itemsets
e{k:4}{k:4}{E,K}\{E,K\}
m{e,k:2},{k:1}{k:3}{M,K}\{M,K\}
o{k,e,m:1},{k,e:2}{k,3}{e:3}{O,K},{O,E},{O,E,K}\{O,K\},\{O,E\},\{O,E,K\}
y{k,e,m:1},{k,e,o:1},{k,m:1}{k:3}{Y,K}\{Y,K\}
frequent itemsets:{{E:5},{K:4},{M:3},{O:3},{Y:3},{E,K:4},{E,O:3},{K,M:3},{K,O:3},{K,Y:3},{E,K,O:3}}\begin{aligned}\\ \text{frequent itemsets}: &\{\{E:5\},\{ K:4\},\{ M:3\},\{ O:3\},\{ Y:3\},\\ &\{ E,K:4\},\{ E,O:3\}, \{K,M:3\}, \{K,O:3\}, \{K,Y:3\}, \\ &\{E,K,O:3\}\} \end{aligned}\\

conclusion

By query 2 time to build FP-tree reduce the time to query database .So FP-growth is more efficient compared to a priori.

b

frequent itemsetssupportConfidence
{K,O}{E}\{K,O \}\rightarrow \{E \}0.61.0
{E,O}{K}\{E,O \}\rightarrow \{K \}0.61.0

6.14

hotdogs!(hot dogs)total
hamburgers20005002500
!(hamburgers)100015002500
Total300020005000

a

support(hot dogs , hamburgers)=P(hot dogshamburgers)=0.4\text{support(hot dogs , hamburgers)}=P(\text{hot dogs}\cap \text{hamburgers} )=0.4 confidence(hot dogshamburgers)=P(hot dogshamburgers)P(hamburgers)=0.8\text{confidence(hot dogs} \rightarrow \text{hamburgers)}=\frac{P(\text{hot dogs}\cap \text{hamburgers} )}{P(\text{hamburgers})}=0.8 0.4>0.25 and 0.8>0.5 so it is a strong rule0.4>0.25 \text{ and } 0.8 >0.5 \text{ so it is a strong rule}

b

lift(hot dogshamburgers)=200050003000500025005000=4343>1 so positively correlatedlift(\text{hot dogs}\rightarrow\text{hamburgers})=\frac{\frac{2000}{5000}}{\frac{3000}{5000}\frac{2500}{5000}}=\frac{4}{3}\\ \frac{4}{3}>1 \text{ so positively correlated}

c

AllConf(a,b)=P(a&b)max(P(a),P(b))=0.666MaxConf(a,b)=max(P(ab),P(ba))=0.8Kulc(a,b)=12(P(ba)+P(ab))=0.7333Cosine(a,b)=aba×b=20002500×3000=0.730Lift(a,b)=P(a&b)P(a)P(b)=1.333 \begin{aligned}\\ AllConf(a,b)&=\frac{P(a \& b)}{\max(P(a),P(b))}&=0.666\\ MaxConf(a,b)&=\max(P(a|b),P(b|a))&=0.8&\\ Kulc(a,b)&=\frac{1}{2}(P(b|a)+P(a|b))&=0.7333&\\ Cosine(a,b)&=\frac{a \cdot b}{|a|\times |b|}=\frac{2000}{\sqrt{2500\times 3000}}&=0.730&\\ Lift(a,b)&=\frac{P(a\& b)}{P(a)P(b)}&=1.333&\\ \end{aligned}\\