Market Basket Analysis in R
Christopher Bruffaerts
Statistician
Market basket analysis
Focus on the what, not on the how much;
i.e. what do customers have in their baskets?
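As a small illustration of "the what, not the how much", the sketch below turns a purchase log with quantities into a plain presence/absence table. The data frame `purchases` and its column names are hypothetical and only serve the example; only base R is used.

```r
# Hypothetical purchase log: one row per (basket, item) pair with a quantity
purchases <- data.frame(
  basket   = c(1, 1, 2, 2, 2, 3),
  item     = c("bread", "milk", "bread", "butter", "milk", "milk"),
  quantity = c(2, 1, 1, 3, 6, 2)
)

# Keep only which items appear in which basket; the quantities are ignored
incidence <- unclass(table(purchases$basket, purchases$item)) > 0
incidence
```

Each row is a basket and each column an item; a cell is TRUE when the item is in the basket, regardless of how many units were bought.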
Main metrics
A word of caution
The set of extracted rules can be very large!
Do not inspect or display all rules in that case. Always work with a subset of the rules, or use the functions head or tail!
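A minimal sketch of this advice, assuming the Groceries data from the arules package. The support and confidence thresholds are illustrative assumptions, not values taken from the slides.

```r
library(arules)
data(Groceries)

# Thresholds are illustrative assumptions; lower values produce more rules
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.5),
                 control = list(verbose = FALSE))

# Check how many rules were extracted before displaying anything
length(rules)

# Display only the first few rules
inspect(head(rules, n = 5))

# Or display only the five rules with the highest lift
top_rules <- head(sort(rules, by = "lift"), n = 5)
inspect(top_rules)
```

Checking length(rules) first is a cheap way to decide whether a full inspect() call would flood the console.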
Let's go back to the Grocery store
Dataset from arules package
# Loading the arules package
library(arules)
# Loading the Groceries dataset
data(Groceries)
summary(Groceries)
transactions as itemMatrix in sparse format with
9835 rows (elements/itemsets/transactions) and
169 columns (items) and a density of 0.02609146
most frequent items:
      whole milk other vegetables       rolls/buns             soda
            2513             1903             1809             1715
          yogurt          (Other)
            1372            34055
element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29
  18   19   20   21   22   23   24   26   27   28   29   32
  14   14    9   11    4    6    1    1    1    1    3    1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000
includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage
# Plotting a sample of 200 transactions
image(sample(Groceries, 200))
Most popular items
itemFrequencyPlot(Groceries, type = "relative",
                  topN = 10, horiz = TRUE, col = 'steelblue3')
Least popular items
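# Widen the left margin so that the item labels fit
par(mar = c(2, 10, 2, 2), mfrow = c(1, 1))
# Count each item across all transactions and plot the 10 rarest
barplot(sort(table(unlist(LIST(Groceries))))[1:10],
        horiz = TRUE, las = 1, col = 'orange')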
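The same ten items can also be obtained without flattening the transactions into a list. The sketch below uses itemFrequency from arules; the variable names counts and least_popular are chosen here for illustration.

```r
library(arules)
data(Groceries)

# Absolute item counts as a named numeric vector
counts <- itemFrequency(Groceries, type = "absolute")

# Ten least frequent items, in ascending order of count
least_popular <- head(sort(counts), 10)
least_popular

# Same horizontal bar chart as above
par(mar = c(2, 10, 2, 2))
barplot(least_popular, horiz = TRUE, las = 1, col = "orange")
```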
Contingency tables
# Contingency table
tbl = crossTable(Groceries)
tbl[1:4,1:4]
            frankfurter sausage liver loaf ham
frankfurter         580      99          7  25
sausage              99     924         10  49
liver loaf            7      10         50   3
ham                  25      49          3 256
Sorted contingency table
# Sorted contingency table
tbl = crossTable(Groceries, sort = TRUE)
tbl[1:4,1:4]
                 whole milk other vegetables rolls/buns soda
whole milk             2513              736        557  394
other vegetables        736             1903        419  322
rolls/buns              557              419       1809  377
soda                    394              322        377 1715
Contingency tables
# Counts
tbl['whole milk','flour']
[1] 83
# Chi-square test of independence (the reported value is the p-value)
crossTable(Groceries, measure='chi')['whole milk', 'flour']
[1] 0.003595389
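To show what a test of independence on a pair of items involves, the sketch below builds the 2x2 table by hand in base R. It uses whole milk and other vegetables rather than flour, because their counts (2513, 1903, and 736 joint purchases out of 9835 transactions) all appear in the output above. Turning off the continuity correction is an assumption made here; the slides do not say how crossTable computes its value.

```r
n      <- 9835  # number of transactions (from summary(Groceries))
n_milk <- 2513  # transactions containing whole milk
n_veg  <- 1903  # transactions containing other vegetables
n_both <- 736   # transactions containing both items

# 2x2 table: rows = whole milk yes/no, columns = other vegetables yes/no
tab <- matrix(
  c(n_both,         n_milk - n_both,
    n_veg - n_both, n - n_milk - n_veg + n_both),
  nrow = 2, byrow = TRUE,
  dimnames = list("whole milk"       = c("yes", "no"),
                  "other vegetables" = c("yes", "no"))
)
tab

# Chi-square test of independence; correct = FALSE is an assumption
test <- chisq.test(tab, correct = FALSE)
test$p.value
```

A small p-value suggests that the two items do not occur independently of each other.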
Contingency tables with other metrics
crossTable(Groceries, measure = 'lift', sort = TRUE)[1:4, 1:4]
                 whole milk other vegetables rolls/buns      soda
whole milk               NA        1.5136341   1.205032 0.8991124
other vegetables  1.5136341               NA   1.197047 0.9703476
rolls/buns        1.2050318        1.1970465         NA 1.1951242
soda              0.8991124        0.9703476   1.195124        NA
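The lift values can be recomputed by hand from the counts in the sorted contingency table. The sketch below does this in base R for whole milk and soda, using the standard definition of lift as the joint support divided by the product of the individual supports. The variable names are chosen here for illustration.

```r
n      <- 9835  # number of transactions
n_milk <- 2513  # transactions containing whole milk
n_soda <- 1715  # transactions containing soda
n_both <- 394   # transactions containing both items

# Support: share of transactions containing the item(s)
supp_milk <- n_milk / n
supp_soda <- n_soda / n
supp_both <- n_both / n

# Lift: observed joint support relative to what independence would give
lift <- supp_both / (supp_milk * supp_soda)
lift
```

The result agrees with the 0.8991124 reported in the soda row for whole milk. Because the formula is symmetric in the two items, the table is symmetric as well, and a lift below 1 means the pair is bought together less often than independence would suggest.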
MovieLens: Web-based recommender system that recommends movies for its users to watch.