Cluster Analysis in R
Dmitriy (Dima) Gorenshteyn
Lead Data Scientist, Memorial Sloan Kettering Cancer Center
| | wine | beer | whiskey | vodka |
|---|---|---|---|---|
| 1 | TRUE | TRUE | FALSE | FALSE |
| 2 | FALSE | TRUE | TRUE | TRUE |
| ... | ... | ... | ... | ... |
$$
J(A,B) = \frac{|A \cap B|}{|A \cup B|}
$$
| | wine | beer | whiskey | vodka |
|---|---|---|---|---|
| 1 | TRUE | TRUE | FALSE | FALSE |
| 2 | FALSE | TRUE | TRUE | TRUE |
$$ J(1,2) = \frac{|1 \cap 2|}{|1 \cup 2|} = \frac{1}{4} = 0.25 $$
$$ \text{Distance}(1,2) = 1 - J(1,2) = 0.75 $$
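The same calculation can be sketched in base R. The vectors below copy rows 1 and 2 of the table above; `jaccard_similarity` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Respondents 1 and 2 as named logical vectors (values from the table above)
resp_1 <- c(wine = TRUE,  beer = TRUE, whiskey = FALSE, vodka = FALSE)
resp_2 <- c(wine = FALSE, beer = TRUE, whiskey = TRUE,  vodka = TRUE)

# Hypothetical helper: size of the intersection over size of the union
jaccard_similarity <- function(a, b) {
  sum(a & b) / sum(a | b)
}

jaccard_sim  <- jaccard_similarity(resp_1, resp_2)  # 1 / 4 = 0.25
jaccard_dist <- 1 - jaccard_sim                     # 0.75
print(jaccard_dist)
```

Note that this sketch returns `NaN` when both vectors are entirely `FALSE`, because the union is then empty.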
print(survey_a)
   wine  beer whiskey vodka
  <lgl> <lgl>   <lgl> <lgl>
1  TRUE  TRUE   FALSE FALSE
2 FALSE  TRUE    TRUE  TRUE
3  TRUE FALSE    TRUE FALSE
dist(survey_a, method = "binary")
          1         2
2 0.7500000
3 0.6666667 0.7500000
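The slides do not show how `survey_a` was created. A minimal sketch, assuming it is a plain data frame of logical columns with the values printed above, reproduces the distances:

```r
# Assumed construction of survey_a; the values are copied from the printed output
survey_a <- data.frame(
  wine    = c(TRUE,  FALSE, TRUE),
  beer    = c(TRUE,  TRUE,  FALSE),
  whiskey = c(FALSE, TRUE,  TRUE),
  vodka   = c(FALSE, TRUE,  FALSE)
)

# method = "binary" gives 1 minus the Jaccard similarity for each pair of rows
dist_survey_a <- dist(survey_a, method = "binary")

# Convert to a full symmetric matrix to look up individual pairs
dist_matrix_a <- as.matrix(dist_survey_a)
print(dist_matrix_a[2, 1])  # distance between respondents 1 and 2
```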
| | color | sport |
|---|---|---|
| 1 | red | soccer |
| 2 | green | hockey |
| 3 | blue | hockey |
| 4 | blue | soccer |
| | colorblue | colorgreen | colorred | sporthockey | sportsoccer |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 | 0 | 1 |
print(survey_b)
  color  sport
1   red soccer
2 green hockey
3  blue hockey
4  blue soccer
library(dummies)
dummy.data.frame(survey_b)
  colorblue colorgreen colorred sporthockey sportsoccer
1         0          0        1           0           1
2         0          1        0           1           0
3         1          0        0           1           0
4         1          0        0           0           1
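If the `dummies` package is not available in your R installation, a similar encoding can be sketched with base R's `model.matrix()`. This is an alternative technique, not the approach used in the slides; the construction of `survey_b` as a data frame of factors is an assumption, and `dummy_base` is a name chosen for this illustration.

```r
# Assumed construction of survey_b; the values are copied from the printed output
survey_b <- data.frame(
  color = factor(c("red", "green", "blue", "blue")),
  sport = factor(c("soccer", "hockey", "hockey", "soccer"))
)

# contrasts = FALSE keeps one indicator column per level of every factor,
# and "- 1" drops the intercept column
dummy_base <- model.matrix(
  ~ color + sport - 1,
  data = survey_b,
  contrasts.arg = lapply(survey_b, contrasts, contrasts = FALSE)
)

print(colnames(dummy_base))

# The resulting 0/1 matrix can be passed to dist() in the same way
dist_base <- dist(dummy_base, method = "binary")
```

The result is a numeric matrix rather than a data frame, which `dist()` accepts directly.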
dummy_survey_b <- dummy.data.frame(survey_b)
dist(dummy_survey_b, method = 'binary')
          1         2         3
2 1.0000000
3 1.0000000 0.6666667
4 0.6666667 1.0000000 0.6666667
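As a check on the first column of this matrix: respondents 1 and 4 both have a 1 only in `sportsoccer`, out of the three indicator columns where either has a 1 (`colorred`, `colorblue`, `sportsoccer`). Respondents 1 and 2 share no indicator column across the four where either has a 1.

$$ \text{Distance}(1,4) = 1 - \frac{1}{3} \approx 0.667, \qquad \text{Distance}(1,2) = 1 - \frac{0}{4} = 1 $$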