Market Basket Analysis in Python
Isaiah Hull
Visiting Associate Professor of Finance, BI Norwegian Business School
Cross-Promotion
Aggregation
List of Lists
One-Hot Encoding
Apriori Algorithm
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
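# The transactions have different lengths, so the array holds Python objects;
# recent NumPy versions need allow_pickle=True to load such an array
itemsets = np.load('itemsets.npy', allow_pickle=True)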
print(itemsets)
[['EASTER CRAFT 4 CHICKS'],
['CERAMIC CAKE DESIGN SPOTTED MUG', 'CHARLOTTE BAG APPLES DESIGN'],
['SET 12 COLOUR PENCILS DOLLY GIRL'],
...
['JUMBO BAG RED RETROSPOT', ... 'LIPSTICK PEN FUSCHIA']]
# One-hot encode data
encoder = TransactionEncoder()
onehot = encoder.fit(itemsets).transform(itemsets)
onehot = pd.DataFrame(onehot, columns = encoder.columns_)
# Apply Apriori algorithm and print
frequent_itemsets = apriori(onehot, use_colnames=True, min_support=0.001)
print(frequent_itemsets)
      support                                           itemsets
0    0.001504                               ( DOLLY GIRL BEAKER)
1    0.002256                         ( RED SPOT GIFT BAG LARGE)
...
428  0.001504  (BIRTHDAY CARD, RETRO SPOT, JUMBO BAG RED RETR...
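To make the one-hot encoding and the `min_support` threshold more concrete, the following sketch reproduces both steps by hand with plain pandas. The transactions and the name `onehot_toy` are hypothetical and are not taken from the retail dataset; the sketch only illustrates the idea and is not how mlxtend implements it.

```python
import pandas as pd

# Hypothetical toy transactions (not from the retail dataset).
transactions = [
    ['bag', 'candle'],
    ['bag', 'sign'],
    ['bag', 'candle', 'sign'],
    ['candle'],
]

# One-hot encode by hand: one row per transaction, one boolean column per item.
items = sorted({item for basket in transactions for item in basket})
onehot_toy = pd.DataFrame(
    [[item in basket for item in items] for basket in transactions],
    columns=items,
)

# Support of a single item: share of transactions that contain it.
item_support = onehot_toy.mean()

# Support of an itemset: share of transactions that contain every item in it.
bag_candle_support = (onehot_toy['bag'] & onehot_toy['candle']).mean()

print(item_support)
print(bag_candle_support)
```

An itemset is kept by Apriori only if its support is at least `min_support`, so a threshold of 0.001 keeps itemsets that appear in at least 0.1% of transactions.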
print(len(data.columns))
4201
print(len(frequent_itemsets))
2328
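The gap between 4201 items and 2328 frequent itemsets is easier to appreciate with a rough count of how many candidate itemsets exist before any pruning. The sketch below uses only the column count printed above; the variable names are illustrative.

```python
from math import comb

n_items = 4201  # number of one-hot columns reported above

# Number of distinct candidate itemsets of size 2 and size 3 before pruning.
pairs = comb(n_items, 2)
triples = comb(n_items, 3)

print(pairs)
print(triples)
```

Apriori avoids enumerating most of these candidates because any superset of an infrequent itemset is itself infrequent and can be skipped.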
rules = association_rules(frequent_itemsets)
print(rules['consequents'])
0 (DOTCOM POSTAGE)
...
9 (HERB MARKER THYME)
...
234 (JUMBO BAG RED RETROSPOT)
235 (WOODLAND CHARLOTTE BAG)
236 (RED RETROSPOT CHARLOTTE BAG)
237 (STRAWBERRY CHARLOTTE BAG)
238 (CHARLOTTE BAG SUKI DESIGN)
Name: consequents, Length: 239, dtype: object
targeted_rules = rules[rules['consequents'] == {'HERB MARKER THYME'}].copy()
filtered_rules = targeted_rules[(targeted_rules['antecedent support'] > 0.01) &
                                (targeted_rules['support'] > 0.009) &
                                (targeted_rules['confidence'] > 0.85) &
                                (targeted_rules['lift'] > 1.00)]
print(filtered_rules['antecedents'])
9 (HERB MARKER BASIL)
25 (HERB MARKER PARSLEY)
27 (HERB MARKER ROSEMARY)
Name: antecedents, dtype: object
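The filter above combines four metrics. As a reminder of what each one measures, here is a minimal sketch that computes them by hand for a single rule, basil -> thyme, on hypothetical one-hot data. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical one-hot data: each row is a transaction.
onehot_toy = pd.DataFrame({
    'basil': [True, True, True, False, False],
    'thyme': [True, True, False, True, False],
})

# Antecedent and consequent support: share of transactions with each item.
support_antecedent = onehot_toy['basil'].mean()
support_consequent = onehot_toy['thyme'].mean()

# Rule support: share of transactions that contain both items.
support_rule = (onehot_toy['basil'] & onehot_toy['thyme']).mean()

# Confidence: how often the consequent appears when the antecedent does.
confidence = support_rule / support_antecedent

# Lift: confidence relative to the consequent's baseline frequency.
lift = confidence / support_consequent

print(support_rule, confidence, lift)
```

A lift above 1 means the two items appear together more often than they would if they were purchased independently, which is why the filter requires `lift > 1.00`.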
# Load aggregated data
aggregated = pd.read_csv('datasets/online_retail_aggregated.csv')
# Compute frequent itemsets
onehot = encoder.fit(aggregated).transform(aggregated)
data = pd.DataFrame(onehot, columns = encoder.columns_)
frequent_itemsets = apriori(data, use_colnames=True)
# Compute standard metrics
rules = association_rules(frequent_itemsets)
# Compute Zhang's metric (zhangs_rule is a user-defined helper, not an mlxtend function)
rules['zhang'] = zhangs_rule(rules)
# Print rules that indicate dissociation
print(rules[rules['zhang'] < 0][['antecedents','consequents']])
  antecedents consequents
2       (bag)    (candle)
3    (candle)       (bag)
4      (sign)       (bag)
5       (bag)      (sign)
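The `zhangs_rule` helper called above is not defined in this section. A minimal sketch follows, assuming it takes the DataFrame returned by `association_rules` and applies the standard definition of Zhang's metric. The column names are the ones mlxtend produces; the toy values in `toy_rules` are hypothetical.

```python
import numpy as np
import pandas as pd

def zhangs_rule(rules):
    """Sketch of Zhang's metric for each rule; values lie in [-1, 1]."""
    support_a = rules['antecedent support']
    support_c = rules['consequent support']
    support_ac = rules['support']

    # Numerator: observed joint support minus support expected under independence.
    numerator = support_ac - support_a * support_c

    # Denominator: normalizes the metric to the range [-1, 1].
    denominator = np.maximum(
        support_ac * (1 - support_a),
        support_a * (support_c - support_ac),
    )
    return numerator / denominator

# Hypothetical rules table with the columns the helper needs.
toy_rules = pd.DataFrame({
    'antecedent support': [0.5, 0.5],
    'consequent support': [0.5, 0.5],
    'support': [0.1, 0.4],
})
toy_rules['zhang'] = zhangs_rule(toy_rules)
print(toy_rules['zhang'])
```

Negative values indicate dissociation (the items are bought together less often than independence would predict), which is what the `rules['zhang'] < 0` filter selects.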