Identifying association rules

Market Basket Analysis in Python

Isaiah Hull

Visiting Associate Professor of Finance, BI Norwegian Business School

Loading and preparing data

import pandas as pd

# Load transactions from a CSV file using pandas.
books = pd.read_csv("datasets/bookstore.csv")

# Split transaction strings into lists.
transactions = books['Transaction'].apply(lambda t: t.split(','))

# Convert the Series of lists into a list of lists.
transactions = list(transactions)
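
The loading steps above can be tried without the CSV file. This is a minimal sketch, assuming the file holds one comma-separated string per row in a `Transaction` column; the in-memory `csv_text` is a hypothetical stand-in built from the first five transactions shown on the next slide, with one space added on purpose to show why stripping whitespace can help.

```python
import io

import pandas as pd

# Hypothetical stand-in for datasets/bookstore.csv (the real file layout is an
# assumption): one quoted, comma-separated string per row.
csv_text = '''Transaction
"language,travel,humor,fiction"
"humor, language"
"humor,biography,cooking"
"cooking,language"
travel
'''

books = pd.read_csv(io.StringIO(csv_text))

# Split each string on commas and strip stray whitespace around item names.
transactions = [
    [item.strip() for item in t.split(',')]
    for t in books['Transaction']
]

print(transactions[:2])
```

Without the `strip()` call, `' language'` and `'language'` would be treated as two different items.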

Exploring the data

print(transactions[:5])

[['language', 'travel', 'humor', 'fiction'],
 ['humor', 'language'],
 ['humor', 'biography', 'cooking'],
 ['cooking', 'language'],
 ['travel']]

Association rules

Association rule
- Contains antecedent and consequent
  - {health} $\rightarrow$ {cooking}
Multi-antecedent rule
- {humor, travel} $\rightarrow$ {language}
Multi-consequent rule
- {biography} $\rightarrow$ {history, language}
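
One way to hold such rules in code is as a pair of item sets. The sketch below is illustrative only: the `Rule` class is a hypothetical helper, not part of pandas or mlxtend.

```python
from typing import FrozenSet, NamedTuple


class Rule(NamedTuple):
    """Hypothetical container: antecedent -> consequent, both sets of items."""
    antecedent: FrozenSet[str]
    consequent: FrozenSet[str]

    def __str__(self) -> str:
        left = ", ".join(sorted(self.antecedent))
        right = ", ".join(sorted(self.consequent))
        return f"{{{left}}} -> {{{right}}}"


# The three example rules from this slide.
simple = Rule(frozenset({"health"}), frozenset({"cooking"}))
multi_antecedent = Rule(frozenset({"humor", "travel"}), frozenset({"language"}))
multi_consequent = Rule(frozenset({"biography"}), frozenset({"history", "language"}))

for rule in (simple, multi_antecedent, multi_consequent):
    print(rule)
```

Using frozensets makes the item order inside each side irrelevant, while the direction from antecedent to consequent is preserved.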

Difficulty of selecting rules

Finding useful rules is difficult.
- Set of all possible rules is large.
- Most rules are not useful.
- Must discard most rules.
What if we restrict ourselves to simple rules?
- One antecedent and one consequent.
- Still challenging, even for a small dataset.
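
To get a feel for how large the set of all possible rules is, the sketch below counts every rule whose antecedent and consequent are non-empty and share no items. That definition of a valid rule is an assumption made here for illustration, and `enumerate_rules` and `count_rules` are hypothetical helper names.

```python
from itertools import product


def enumerate_rules(items):
    """Yield every (antecedent, consequent) pair of disjoint, non-empty item sets."""
    # Give each item a role: 0 = unused, 1 = antecedent, 2 = consequent.
    for roles in product((0, 1, 2), repeat=len(items)):
        antecedent = frozenset(i for i, r in zip(items, roles) if r == 1)
        consequent = frozenset(i for i, r in zip(items, roles) if r == 2)
        if antecedent and consequent:
            yield antecedent, consequent


def count_rules(n_items):
    """Closed form for the same count: 3^n - 2^(n+1) + 1."""
    return 3 ** n_items - 2 ** (n_items + 1) + 1


sample_items = ["fiction", "poetry", "history", "biography"]
n_enumerated = sum(1 for _ in enumerate_rules(sample_items))

print(n_enumerated, count_rules(len(sample_items)))  # 50 50
print(count_rules(9))  # 18660
```

Under this definition, four items already allow 50 rules and nine items allow 18,660, which is why restricting attention to simple rules is a natural first step.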

Generating the rules

Items: fiction, poetry, history, biography, cooking, health, travel, language, humor

Generating the rules

Fiction Rules	Poetry Rules	...	Humor Rules
fiction->poetry	poetry->fiction	...	humor->fiction
fiction->history	poetry->history	...	humor->history
fiction->biography	poetry->biography	...	humor->biography
fiction->cooking	poetry->cooking	...	humor->cooking
...	...	...	...
fiction->humor	poetry->humor	...

Generating rules with itertools

from itertools import permutations

# Extract unique items.
flattened = [item for transaction in transactions for item in transaction]
items = list(set(flattened))

# Compute and print rules.
rules = list(permutations(items, 2))
print(rules)

[('fiction', 'poetry'), 
 ('fiction', 'history'),
 ...
 ('humor', 'travel'), 
 ('humor', 'language')]

Counting the rules

# Print the number of rules
print(len(rules))

The plot shows the total number of rules as a function of the number of unique items.
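
For the one-antecedent, one-consequent rules generated with `permutations`, the count can be checked directly: n unique items give n(n - 1) ordered pairs. The sketch below is a small illustration, with `n_simple_rules` as a hypothetical helper name and the item list taken from the slides above.

```python
from itertools import permutations


def n_simple_rules(n_items: int) -> int:
    """Number of ordered pairs of distinct items: n * (n - 1)."""
    return n_items * (n_items - 1)


items = ['fiction', 'poetry', 'history', 'biography', 'cooking',
         'health', 'travel', 'language', 'humor']

# Enumerated count and closed-form count agree for the nine items.
n_enumerated = len(list(permutations(items, 2)))
print(n_enumerated, n_simple_rules(len(items)))  # 72 72

# The count grows quadratically with the number of unique items.
for n in (10, 100, 1000):
    print(n, n_simple_rules(n))
```

Even with this restriction, a catalogue of 1,000 items yields 999,000 candidate rules.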

Looking ahead

# Import the Apriori and association rules functions.
from mlxtend.frequent_patterns import apriori, association_rules

# Compute frequent itemsets using the Apriori algorithm.
# onehot is assumed to be a one-hot encoded DataFrame of the transactions
# (one boolean column per item).
frequent_itemsets = apriori(onehot, min_support=0.001,
                            max_len=2, use_colnames=True)

# Compute association rules from frequent_itemsets, keeping those with lift >= 1.0.
rules = association_rules(frequent_itemsets,
                          metric="lift",
                          min_threshold=1.0)
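
The `onehot` input is not built on these slides. The sketch below shows one possible way to construct it with pandas alone, using the five transactions printed earlier; treating `onehot` as a boolean DataFrame with one column per item is an assumption about what the code above expects.

```python
import pandas as pd

# First five transactions, as printed on the "Exploring the data" slide.
transactions = [
    ['language', 'travel', 'humor', 'fiction'],
    ['humor', 'language'],
    ['humor', 'biography', 'cooking'],
    ['cooking', 'language'],
    ['travel'],
]

# One column per unique item, in alphabetical order.
items = sorted({item for t in transactions for item in t})

# One row per transaction: True if the item appears in it, False otherwise.
onehot = pd.DataFrame(
    [{item: (item in t) for item in items} for t in transactions],
    columns=items,
)

# The column means give the share of transactions containing each item.
print(onehot.mean())
```

mlxtend also provides a `TransactionEncoder` class in `mlxtend.preprocessing` for the same purpose, which may be more convenient on larger datasets.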

Let's practice!
