Market Basket Analysis in R
Christopher Bruffaerts
Statistician
Association rule mining makes it possible to discover interesting relationships between items in a large transactional database.
This mining task can be divided into two subtasks:
Frequent itemset generation: determine all frequent itemsets of a potentially large database of transactions. An itemset is said to be frequent if it satisfies a minimum support threshold.
Rule generation: from the above frequent itemsets, generate association rules with confidence above a minimum confidence threshold.
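For reference, the two thresholds above refer to the standard definitions of support and confidence. The notation is introduced here for illustration and does not come from the slides: $T$ is the set of transactions, $N = |T|$, and $X$, $Y$ are disjoint itemsets.

$$
\mathrm{supp}(X) = \frac{|\{\, t \in T : X \subseteq t \,\}|}{N},
\qquad
\mathrm{conf}(X \rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
$$

An itemset $X$ is frequent when $\mathrm{supp}(X)$ reaches the minimum support threshold, and a rule $X \rightarrow Y$ is kept when $\mathrm{conf}(X \rightarrow Y)$ reaches the minimum confidence threshold.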
The apriori algorithm is a classic association rule mining algorithm. It is fast because it prunes candidate itemsets that cannot be frequent instead of counting every possible itemset.
The apriori algorithm:
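Apriori principle: if an itemset is frequent, then all of its subsets are also frequent. Equivalently, no superset of an infrequent itemset can be frequent.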
TID | Transaction |
---|---|
1 | {A, B, C, D} |
2 | {A, B, D} |
3 | {A, B} |
4 | {B, C, D} |
5 | {B, C} |
6 | {C, D} |
7 | {B, D} |
Itemset | Count | Support |
---|---|---|
{A} | 3 | 0.43 |
{B} | 6 | 0.86 |
{C} | 4 | 0.57 |
{D} | 5 | 0.71 |
{A,B} | 3 | 0.43 |
{B,C} | 3 | 0.43 |
{B,D} | 4 | 0.57 |
{C,D} | 3 | 0.43 |
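The table above can be reproduced with a small level-wise search in base R. This is only a sketch of the idea, not the implementation used by the arules package: `count_itemset()` and `frequent_itemsets()` are hypothetical helper names, and the threshold is expressed as a minimum count of 3 (that is, a support of 3/7).

```r
# Toy transactions from the table above
transactions <- list(
  c("A", "B", "C", "D"),
  c("A", "B", "D"),
  c("A", "B"),
  c("B", "C", "D"),
  c("B", "C"),
  c("C", "D"),
  c("B", "D")
)

# Number of transactions that contain every item of `itemset`
count_itemset <- function(itemset, transactions) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level-wise search (hypothetical helper, not the arules implementation)
frequent_itemsets <- function(transactions, min_count) {
  items <- sort(unique(unlist(transactions)))
  candidates <- as.list(items)  # level 1: single items
  frequent <- list()
  while (length(candidates) > 0) {
    counts <- vapply(candidates, count_itemset, integer(1),
                     transactions = transactions)
    keep <- candidates[counts >= min_count]
    frequent <- c(frequent, keep)
    if (length(keep) < 2) break
    k <- length(keep[[1]])
    # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets
    joined <- list()
    for (i in seq_len(length(keep) - 1)) {
      for (j in (i + 1):length(keep)) {
        u <- sort(union(keep[[i]], keep[[j]]))
        if (length(u) == k + 1) joined[[paste(u, collapse = ",")]] <- u
      }
    }
    # Prune step (apriori principle): every k-subset must be frequent
    keep_keys <- vapply(keep, paste, character(1), collapse = ",")
    is_ok <- vapply(joined, function(u) {
      subsets <- combn(u, k, simplify = FALSE)
      all(vapply(subsets,
                 function(s) paste(s, collapse = ",") %in% keep_keys,
                 logical(1)))
    }, logical(1))
    candidates <- unname(joined[is_ok])
  }
  frequent
}

result <- frequent_itemsets(transactions, min_count = 3)
print(vapply(result, paste, character(1), collapse = ","))
```

On this data, the prune step discards {A,B,C} and {A,B,D} without counting them, because {A,C} and {A,D} are not frequent. Only {B,C,D} is counted at level 3, and it falls below the threshold with a count of 2.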
After the computationally expensive frequent itemset generation, apriori generates rules:
Trick: pruning of association rules
e.g.: if the rule {B,C,D} $\rightarrow$ {A} has low confidence, all rules generated from the same itemset {A,B,C,D} that contain item A in their consequent can be discarded (such as the rules {B,D} $\rightarrow$ {A,C} or {D} $\rightarrow$ {A,B,C}).
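The reason this pruning is safe can be checked numerically. Moving items from the antecedent to the consequent leaves the numerator supp({A,B,C,D}) unchanged, while the support of the shrinking antecedent can only grow, so the confidence cannot increase. A minimal base R sketch on the toy transactions (`count_itemset()` and `confidence()` are hypothetical helper names):

```r
# Toy transactions from the table earlier in the slides
transactions <- list(
  c("A", "B", "C", "D"),
  c("A", "B", "D"),
  c("A", "B"),
  c("B", "C", "D"),
  c("B", "C"),
  c("C", "D"),
  c("B", "D")
)

# Number of transactions containing every item of `itemset`
count_itemset <- function(itemset) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Confidence of lhs -> rhs: count(lhs and rhs together) / count(lhs)
confidence <- function(lhs, rhs) {
  count_itemset(union(lhs, rhs)) / count_itemset(lhs)
}

conf_1 <- confidence(c("B", "C", "D"), "A")     # {B,C,D} -> {A}
conf_2 <- confidence(c("B", "D"), c("A", "C"))  # {B,D}   -> {A,C}
conf_3 <- confidence("D", c("A", "B", "C"))     # {D}     -> {A,B,C}
print(c(conf_1, conf_2, conf_3))
```

The three confidences are 0.5, 0.25 and 0.2: once {B,C,D} $\rightarrow$ {A} fails a confidence threshold, the other two rules necessarily fail it as well.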
Transactional data
inspect(head(trans,2))
items transactionID
[1] {A,B,C,D} 1
[2] {A,B,D} 2
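The slides do not show how `trans` was created. One possible way, assuming the arules package is installed, is to coerce a named list to the `transactions` class. The object name `transaction_list` is an assumption made for this sketch, and the list names are used here to supply the transaction IDs.

```r
library(arules)

# Toy transactions from the table earlier in the slides
transaction_list <- list(
  c("A", "B", "C", "D"),
  c("A", "B", "D"),
  c("A", "B"),
  c("B", "C", "D"),
  c("B", "C"),
  c("C", "D"),
  c("B", "D")
)

# The list names become the transactionID column (assumption: IDs 1 to 7)
names(transaction_list) <- as.character(1:7)

# Coerce the list to an object of class "transactions"
trans <- as(transaction_list, "transactions")

inspect(head(trans, 2))
```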
First call to the apriori function - frequent itemsets
support.all = apriori(trans,
                      parameter = list(supp = 3/7, target = "frequent itemsets"))
Frequent itemsets
inspect(support.all)
items support count
[1] {A} 0.4285714 3
[2] {C} 0.5714286 4
[3] {D} 0.7142857 5
[4] {B} 0.8571429 6
[5] {A,B} 0.4285714 3
[6] {C,D} 0.4285714 3
[7] {B,C} 0.4285714 3
[8] {B,D} 0.5714286 4
Parameter: the mining parameters (e.g. supp, conf, minlen, target) change the characteristics of the mined itemsets or rules.
Call to the apriori function for rule generation with specific arguments
rules.all = apriori(trans,
                    parameter = list(supp = 3/7, conf = 0.6, minlen = 2),
                    control = list(verbose = FALSE)
)
Inspecting the rules
inspect(rules.all)
lhs rhs support confidence lift count
[1] {A} => {B} 0.4285714 1.0000000 1.1666667 3
[2] {C} => {D} 0.4285714 0.7500000 1.0500000 3
[3] {D} => {C} 0.4285714 0.6000000 1.0500000 3
[4] {C} => {B} 0.4285714 0.7500000 0.8750000 3
[5] {D} => {B} 0.5714286 0.8000000 0.9333333 4
[6] {B} => {D} 0.5714286 0.6666667 0.9333333 4
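The measures in this output can be recomputed by hand. The sketch below uses base R only and assumes the usual definition of lift as confidence divided by the support of the right-hand side. `rule_measures()` is a hypothetical helper, not an arules function.

```r
# Toy transactions from the table earlier in the slides
transactions <- list(
  c("A", "B", "C", "D"),
  c("A", "B", "D"),
  c("A", "B"),
  c("B", "C", "D"),
  c("B", "C"),
  c("C", "D"),
  c("B", "D")
)
n <- length(transactions)

# Number of transactions containing every item of `itemset`
count_itemset <- function(itemset) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Support, confidence, lift and count of the rule lhs => rhs
rule_measures <- function(lhs, rhs) {
  n_both <- count_itemset(union(lhs, rhs))
  confidence <- n_both / count_itemset(lhs)
  c(
    support = n_both / n,
    confidence = confidence,
    lift = confidence / (count_itemset(rhs) / n),
    count = n_both
  )
}

print(rule_measures("A", "B"))  # rule [1] in the output above
print(rule_measures("D", "B"))  # rule [5] in the output above
```

A lift above 1, as for {A} => {B}, means the two sides occur together more often than expected if they were independent. A lift below 1, as for {D} => {B}, means they occur together less often than expected.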