hw1

tags data

2023 Educational Data Mining and Applications HW1.pdf

2.2(e)

Five number summary: min, Q1, median, Q3, max

| min | Q1 | median | Q3 | max |
|---|---|---|---|---|
| 13 | 20 | 25 | 35 | 70 |

2.2(f)

2.8(a)

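Given the following 2-D data set and the query point x = (1.4, 1.6):

| | A1 | A2 |
|---|---|---|
| x1 | 1.5 | 1.7 |
| x2 | 2.0 | 1.9 |
| x3 | 1.6 | 1.8 |
| x4 | 1.2 | 1.5 |
| x5 | 1.5 | 1.0 |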

Manhattan distance

$$d(x,y)=\sum_{i=1}^{n}|x_{i}-y_{i}|$$

| | A1 | A2 | distance |
|---|---|---|---|
| x | 1.4 | 1.6 | 0 |
| x1 | 1.5 | 1.7 | 0.2 |
| x4 | 1.2 | 1.5 | 0.3 |
| x3 | 1.6 | 1.8 | 0.4 |
| x5 | 1.5 | 1.0 | 0.7 |
| x2 | 2.0 | 1.9 | 0.9 |

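Euclidean distance

$$d(x,y)=\sqrt{\sum_{i=1}^{n}(x_{i}-y_{i})^2}$$

| | A1 | A2 | distance | rank |
|---|---|---|---|---|
| x | 1.4 | 1.6 | 0 | |
| x1 | 1.5 | 1.7 | 0.141 | 1 |
| x2 | 2.0 | 1.9 | 0.671 | 5 |
| x3 | 1.6 | 1.8 | 0.283 | 3 |
| x4 | 1.2 | 1.5 | 0.224 | 2 |
| x5 | 1.5 | 1.0 | 0.608 | 4 |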

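Supremum distance

$$d(x,y)=\max_{1\le i\le n}|x_i-y_i|$$

| | A1 | A2 | distance | rank |
|---|---|---|---|---|
| x | 1.4 | 1.6 | 0 | |
| x1 | 1.5 | 1.7 | 0.1 | 1 |
| x2 | 2.0 | 1.9 | 0.6 | 4 (tie) |
| x3 | 1.6 | 1.8 | 0.2 | 2 (tie) |
| x4 | 1.2 | 1.5 | 0.2 | 2 (tie) |
| x5 | 1.5 | 1.0 | 0.6 | 4 (tie) |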
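As a cross-check on the three distance tables, the following minimal sketch recomputes Manhattan, Euclidean, and supremum distance from x to each point. The function names and the `rank_by` helper are illustrative choices, not part of the assignment. The supremum distance produces ties, so the order within a tie is arbitrary.

```python
# Query point and data set from exercise 2.8(a).
x = (1.4, 1.6)
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def euclidean(a, b):
    """Square root of the sum of squared coordinate differences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5


def supremum(a, b):
    """Largest absolute coordinate difference."""
    return max(abs(p - q) for p, q in zip(a, b))


def rank_by(distance):
    """Return point names sorted from nearest to farthest from x."""
    return sorted(points, key=lambda name: distance(x, points[name]))


for measure in (manhattan, euclidean, supremum):
    distances = {name: round(measure(x, p), 3) for name, p in points.items()}
    print(measure.__name__, distances, rank_by(measure))
```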

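Cosine similarity

$$\text{sim}(x,y)=\frac{x \cdot y}{\|x\|\,\|y\|}$$

| | A1 | A2 | similarity | rank |
|---|---|---|---|---|
| x | 1.4 | 1.6 | 1 | |
| x1 | 1.5 | 1.7 | 0.99999 | 1 |
| x2 | 2.0 | 1.9 | 0.99575 | 4 |
| x3 | 1.6 | 1.8 | 0.99997 | 2 |
| x4 | 1.2 | 1.5 | 0.99903 | 3 |
| x5 | 1.5 | 1.0 | 0.96536 | 5 |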
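Because the cosine values are so close to 1, ranking by hand is error-prone. A small sketch, with an illustrative `cosine_similarity` helper, that recomputes the similarities and sorts from most to least similar:

```python
import math

# Query point and data set from exercise 2.8(a).
x = (1.4, 1.6)
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


similarities = {name: cosine_similarity(x, p) for name, p in points.items()}

# Higher similarity means closer, so sort in descending order.
ranking = sorted(similarities, key=similarities.get, reverse=True)

for name in ranking:
    print(name, round(similarities[name], 5))
```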

2.8(b)

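Normalize each point to unit length, then compute the Euclidean distance to the normalized x:

$$\text{normalize}(x)=\frac{x}{\sqrt{\sum_{i=1}^{n}x_{i}^2}}$$

| | A1 | A2 | distance | rank |
|---|---|---|---|---|
| x | 0.659 | 0.753 | 0 | |
| x1 | 0.662 | 0.750 | 0.0041 | 1 |
| x2 | 0.725 | 0.689 | 0.0922 | 4 |
| x3 | 0.664 | 0.747 | 0.0078 | 2 |
| x4 | 0.625 | 0.781 | 0.0441 | 3 |
| x5 | 0.832 | 0.555 | 0.2632 | 5 |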
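The normalization step can be checked with a short sketch. The `normalize` helper below is an illustrative implementation of the formula above, and `math.dist` (Python 3.8+) gives the Euclidean distance. For unit vectors the squared distance equals 2(1 − cos), so the ranking should match the cosine-similarity ranking.

```python
import math

# Query point and data set from exercise 2.8(a).
x = (1.4, 1.6)
points = {
    "x1": (1.5, 1.7),
    "x2": (2.0, 1.9),
    "x3": (1.6, 1.8),
    "x4": (1.2, 1.5),
    "x5": (1.5, 1.0),
}


def normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


unit_x = normalize(x)
unit_points = {name: normalize(p) for name, p in points.items()}

# Euclidean distance between the normalized query and each normalized point.
distances = {name: math.dist(unit_x, p) for name, p in unit_points.items()}
ranking = sorted(distances, key=distances.get)

for name in ranking:
    a1, a2 = unit_points[name]
    print(name, round(a1, 3), round(a2, 3), round(distances[name], 4))
```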

3.3(a)

| Bin | Data |
|---|---|
| Bin 1 | 13, 15, 16 |
| Bin 2 | 16, 19, 20 |
| Bin 3 | 20, 21, 22 |
| Bin 4 | 22, 25, 25 |
| Bin 5 | 25, 25, 30 |
| Bin 6 | 33, 33, 35 |
| Bin 7 | 35, 35, 35 |
| Bin 8 | 35, 35, 36 |
| Bin 9 | 36, 40, 45 |
| Bin 10 | 46, 52, 70 |

| Bin | Smoothed Data |
|---|---|
| Bin 1 | 14.67, 14.67, 14.67 |
| Bin 2 | 18.33, 18.33, 18.33 |
| Bin 3 | 21.00, 21.00, 21.00 |
| Bin 4 | 24.00, 24.00, 24.00 |
| Bin 5 | 26.67, 26.67, 26.67 |
| Bin 6 | 33.67, 33.67, 33.67 |
| Bin 7 | 35.00, 35.00, 35.00 |
| Bin 8 | 35.33, 35.33, 35.33 |
| Bin 9 | 40.33, 40.33, 40.33 |
| Bin 10 | 56.00, 56.00, 56.00 |
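Smoothing by bin means replaces every value in a bin with that bin's mean. A minimal sketch using the bins exactly as listed in these notes (the variable names are illustrative):

```python
# Equal-depth bins (depth 3) as listed in the notes.
bins = [
    [13, 15, 16],
    [16, 19, 20],
    [20, 21, 22],
    [22, 25, 25],
    [25, 25, 30],
    [33, 33, 35],
    [35, 35, 35],
    [35, 35, 36],
    [36, 40, 45],
    [46, 52, 70],
]

# Replace each value with the mean of its bin, rounded to two decimals.
smoothed = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

for number, values in enumerate(smoothed, start=1):
    print(f"Bin {number}:", values)
```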

3.3(b)

Find the outliers using the IQR method:

$$
\begin{aligned}
\text{IQR} &= Q_3 - Q_1 = 35 - 20 = 15\\
\text{Lower bound} &= Q_1 - 1.5 \times \text{IQR} = 20 - 1.5 \times 15 = 20 - 22.5 = -2.5\\
\text{Upper bound} &= Q_3 + 1.5 \times \text{IQR} = 35 + 1.5 \times 15 = 35 + 22.5 = 57.5
\end{aligned}
$$

Since 70 > 57.5 (the upper bound), 70 is the outlier value.
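The IQR rule can be expressed as a small check. This sketch takes Q1 = 20 and Q3 = 35 from the five-number summary in 2.2(e), and `is_outlier` is an illustrative helper name:

```python
# Quartiles taken from the five-number summary in 2.2(e).
q1, q3 = 20, 35

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr


def is_outlier(value):
    """A value is an outlier if it falls outside [lower_bound, upper_bound]."""
    return value < lower_bound or value > upper_bound


# Check the minimum and maximum of the data set.
for value in (13, 70):
    print(value, is_outlier(value))
```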

3.7(a)

$$
\begin{aligned}
\text{min-max normalization}&=\frac{x-\min}{\max-\min}\\
&=\frac{35-13}{70-13}\\
&\approx 0.3860
\end{aligned}
$$

3.7(b)

$$\mu=29.96,\quad\sigma=12.7$$

$$
\begin{aligned}
\text{z-score}&=\frac{x-\mu}{\sigma}\\
&=\frac{35-29.96}{12.7}\\
&\approx 0.3969
\end{aligned}
$$
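Both normalizations from 3.7 are one-line formulas. The sketch below uses the values stated in these notes (min 13, max 70, μ = 29.96, σ = 12.7); the function names are illustrative.

```python
def min_max(value, minimum, maximum):
    """Map value linearly onto [0, 1] using the observed min and max."""
    return (value - minimum) / (maximum - minimum)


def z_score(value, mean, std):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / std


age = 35
print(round(min_max(age, 13, 70), 4))
print(round(z_score(age, 29.96, 12.7), 4))
```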

3.8(b)

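| age | fat |
|---|---|
| 23 | 9.5 |
| 23 | 26.5 |
| 27 | 7.8 |
| 27 | 17.8 |
| 39 | 31.4 |
| 41 | 25.9 |
| 47 | 27.4 |
| 49 | 27.2 |
| 50 | 31.2 |
| 52 | 34.6 |
| 54 | 28.8 |
| 56 | 33.4 |
| 57 | 30.2 |
| 58 | 34.1 |
| 58 | 32.9 |
| 60 | 41.2 |
| 61 | 35.7 |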

Pearson correlation coefficient:

$$
\begin{aligned}
r&=\frac{\sum_{i}(x_{i}-\bar x)(y_{i}-\bar y)}{\sqrt{\sum_{i}(x_{i}-\bar x)^2}\sqrt{\sum_{i}(y_{i}-\bar y)^2}}\\
&=\frac{1590.6}{\sqrt{2910}\sqrt{1256.731}}\\
&\approx 0.8318
\end{aligned}
$$

Sample covariance (dividing by n − 1 = 16):

$$
\begin{aligned}
\text{cov}(x,y)&=\frac{1}{n-1}\sum_{i}(x_i-\bar x)(y_i-\bar y)\\
&=\frac{1590.6}{16}\\
&\approx 99.41
\end{aligned}
$$
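The sums in the correlation and covariance can be reproduced directly from the 17 (age, fat) pairs in the table. This is a plain-Python sketch with illustrative variable names; the covariance divides by n − 1, which is the convention that yields 99.41.

```python
import math

# The 17 (age, %fat) pairs from the table in 3.8(b).
age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2,
       34.6, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

n = len(age)
mean_age = sum(age) / n
mean_fat = sum(fat) / n

# Sums of cross-products and squared deviations from the means.
sxy = sum((a - mean_age) * (f - mean_fat) for a, f in zip(age, fat))
sxx = sum((a - mean_age) ** 2 for a in age)
syy = sum((f - mean_fat) ** 2 for f in fat)

r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
cov = sxy / (n - 1)              # sample covariance

print(round(sxy, 1), round(sxx, 1), round(syy, 3))
print(round(r, 4), round(cov, 2))
```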

3.11(a)