The simplest metric

Market Basket Analysis in Python

Isaiah Hull

Visiting Associate Professor of Finance, BI Norwegian Business School

Metrics and pruning

A metric is a measure of performance for rules.
- {humor} $\rightarrow$ {poetry}
  - 0.81
- {fiction} $\rightarrow$ {travel}
  - 0.23
Pruning is the use of metrics to discard rules.
- Retain: {humor} $\rightarrow$ {poetry}
- Discard: {fiction} $\rightarrow$ {travel}
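
As a rough illustration of pruning, the two rules above can be filtered against a cutoff. The metric values come from the slide; the 0.5 threshold and the variable names are assumptions made for this sketch, not part of the course material.

```python
# Metric values from the slide; the 0.5 cutoff is a hypothetical choice.
rules = {
    ("humor", "poetry"): 0.81,
    ("fiction", "travel"): 0.23,
}
THRESHOLD = 0.5

# Pruning: keep rules whose metric meets the threshold, drop the rest.
retained = {rule: value for rule, value in rules.items() if value >= THRESHOLD}
discarded = {rule: value for rule, value in rules.items() if value < THRESHOLD}

for (antecedent, consequent), value in retained.items():
    print(f"Retain: {{{antecedent}}} -> {{{consequent}}} ({value})")
for (antecedent, consequent), value in discarded.items():
    print(f"Discard: {{{antecedent}}} -> {{{consequent}}} ({value})")
```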

The simplest metric

The support metric measures the share of transactions that contain an itemset.

$$\frac{\text{number of transactions with item(s)}}{\text{number of transactions}}$$

For example, the support for milk is:

$$\frac{\text{number of transactions with milk}}{\text{number of transactions}}$$
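
The formula can be written as a small function. This is a minimal sketch: the function name `support` and the grocery baskets are hypothetical, invented only to make the milk example concrete.

```python
def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    matches = sum(1 for t in transactions if itemset.issubset(t))
    return matches / len(transactions)


# Hypothetical grocery baskets, invented for illustration only.
baskets = [
    ["milk", "bread"],
    ["milk", "coffee"],
    ["bread", "jam"],
    ["milk"],
]

# 3 of the 4 baskets contain milk.
milk_support = support({"milk"}, baskets)
print(milk_support)
```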

Support for language

TID	Transaction
0	travel, humor, fiction
1	humor, language
2	humor, biography, cooking
3	cooking, language
4	travel
5	poetry, health, travel, history
6	humor
7	travel
8	poetry, fiction, humor
9	fiction, biography

Support for {language} = 2 / 10 = 0.2

Support for {humor} $\rightarrow$ {language}

TID	Transaction
0	travel, humor, fiction
1	humor, language
2	humor, biography, cooking
3	cooking, language
4	travel
5	poetry, health, travel, history
6	humor
7	travel
8	poetry, fiction, humor
9	fiction, biography

Support for {humor} $\rightarrow$ {language} = 1 / 10 = 0.1

Only transaction 1 contains both humor and language. Support counts co-occurrence, so {language} $\rightarrow$ {humor} has the same support.
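
One way to see that support does not depend on rule direction is to count co-occurrences for every pair of items at once. The sketch below uses the ten transactions from the table; storing each pair as a `frozenset` is a design choice made here so that {humor, language} and {language, humor} share a single key.

```python
from collections import Counter
from itertools import combinations

transactions = [
    ["travel", "humor", "fiction"],
    ["humor", "language"],
    ["humor", "biography", "cooking"],
    ["cooking", "language"],
    ["travel"],
    ["poetry", "health", "travel", "history"],
    ["humor"],
    ["travel"],
    ["poetry", "fiction", "humor"],
    ["fiction", "biography"],
]

# Count how many transactions contain each unordered pair of items.
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(set(t)), 2):
        pair_counts[frozenset(pair)] += 1

# Support of a pair = share of transactions containing both items.
pair_support = {
    pair: count / len(transactions) for pair, count in pair_counts.items()
}

print(pair_support[frozenset({"humor", "language"})])
print(pair_support[frozenset({"language", "humor"})])  # same key, same value
```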

Preparing the data

print(transactions)

[['travel', 'humor', 'fiction'],
...
['fiction', 'biography']]

from mlxtend.preprocessing import TransactionEncoder

# Instantiate the transaction encoder and fit it to the transactions
encoder = TransactionEncoder().fit(transactions)

Preparing the data

import pandas as pd

# One-hot encode the transactions with the fitted encoder
onehot = encoder.transform(transactions)

# Convert the one-hot encoded array to a DataFrame
onehot = pd.DataFrame(onehot, columns=encoder.columns_)
print(onehot)

   biography  cooking  ...  poetry  travel
0      False    False  ...   False    True
...
9       True    False  ...   False   False
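
To make the encoding step less opaque, the same table can be built by hand in plain Python. This is only a sketch of the idea (one sorted column per distinct item, one boolean per transaction), not mlxtend's implementation; the names `columns` and `onehot_rows` are assumptions for this example.

```python
transactions = [
    ["travel", "humor", "fiction"],
    ["humor", "language"],
    ["humor", "biography", "cooking"],
    ["cooking", "language"],
    ["travel"],
    ["poetry", "health", "travel", "history"],
    ["humor"],
    ["travel"],
    ["poetry", "fiction", "humor"],
    ["fiction", "biography"],
]

# One column per distinct item, in alphabetical order.
columns = sorted({item for t in transactions for item in t})

# One row per transaction: True where the transaction contains the item.
onehot_rows = [[item in t for item in columns] for t in transactions]

print(columns)
print(onehot_rows[0])
```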

Computing support for single items

print(onehot.mean())

biography    0.2
cooking      0.2
fiction      0.3
health       0.1
history      0.1
humor        0.5
language     0.2
poetry       0.2
travel       0.4
dtype: float64
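
The column mean equals support because `True` counts as 1 and `False` as 0, so the mean of a one-hot column is the share of transactions that contain the item. A minimal check for the language column, written out by hand from the ten transactions:

```python
# One-hot column for 'language': True only for transactions 1 and 3.
language = [False, True, False, True, False, False, False, False, False, False]

# sum() counts the True values; dividing by the length gives the mean.
support_language = sum(language) / len(language)
print(support_language)
```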

Computing support for multiple items

import numpy as np

# Define itemset that contains fiction and poetry
onehot['fiction+poetry'] = np.logical_and(onehot['fiction'], onehot['poetry'])

print(onehot.mean())

biography         0.2
cooking           0.2
...               ...
travel            0.4
fiction+poetry    0.1
dtype: float64
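
The same idea extends to itemsets of any size. The sketch below assumes pandas is available and builds the one-hot table directly from the transactions rather than with the encoder; `itemset_support` is a hypothetical helper, not an mlxtend function.

```python
import pandas as pd

transactions = [
    ["travel", "humor", "fiction"],
    ["humor", "language"],
    ["humor", "biography", "cooking"],
    ["cooking", "language"],
    ["travel"],
    ["poetry", "health", "travel", "history"],
    ["humor"],
    ["travel"],
    ["poetry", "fiction", "humor"],
    ["fiction", "biography"],
]

# Build a boolean one-hot table: one row per transaction, one column per item.
items = sorted({item for t in transactions for item in t})
onehot = pd.DataFrame([{item: item in t for item in items} for t in transactions])


def itemset_support(onehot, itemset):
    """Share of rows in which every column in `itemset` is True."""
    return onehot[list(itemset)].all(axis=1).mean()


two_items = itemset_support(onehot, ["fiction", "poetry"])
three_items = itemset_support(onehot, ["fiction", "poetry", "humor"])
print(two_items, three_items)
```

Computing the value on the fly, rather than adding a derived column, keeps `onehot` limited to single items, so later calls to `onehot.mean()` still report only item-level support.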

Let's practice!
