
Market Basket Analysis


Given a dataset of customer transactions, where each transaction is a set of items, Market Basket Analysis (MBA) finds groups of items that are frequently purchased together. It is helpful for identifying complementary products that are not otherwise similar. The outcome of MBA is a recommendation of the form: "Item A is often purchased together with item B, consider cross-selling ..."


Market Basket Analysis can be used to:

  • Build a movie/song recommendation engine
  • Build a live recommendation algorithm on an e-commerce store
  • Cross-sell or upsell products in a supermarket
!pip install mlxtend
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
from mlxtend.frequent_patterns import association_rules, apriori
from mlxtend.preprocessing import TransactionEncoder
from pandas.plotting import parallel_coordinates

1. Load your data

# Upload your data as a CSV file. 
df = pd.read_csv('example.csv')
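The contents of example.csv are not shown here; the later steps assume a single Transaction column holding comma-separated item names. A hypothetical stand-in frame in that shape:

```python
import pandas as pd

# Hypothetical stand-in for example.csv: one row per transaction,
# items joined by commas in a single 'Transaction' column.
df = pd.DataFrame({
    'Transaction': [
        'bread,milk',
        'bread,diapers,beer,eggs',
        'milk,diapers,beer,cola',
        'bread,milk,diapers,beer',
        'bread,milk,diapers,cola',
    ]
})
print(df.head())
```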

2. Set parameters

# Set parameters to use for the analysis.
MIN_SUPPORT = 0.001     # Set minimum value to accept for the support metric 
MAX_LEN = 3             # Set max transaction length to consider
METRIC = "lift"         # Metric for association rule creation 
MIN_THRESHOLD = 1       # Threshold for association rule creation

3. Derive Rules

Create a table with antecedents, their consequents, and all relevant metrics.

# Get all the transactions as a list
transactions = list(df['Transaction'].apply(lambda x: sorted(x.split(','))))

# Instantiate the transaction encoder
encoder = TransactionEncoder().fit(transactions)
onehot = encoder.transform(transactions)

# Convert the one-hot encoded data to a DataFrame
onehot = pd.DataFrame(onehot, columns=encoder.columns_)
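As a sanity check on the encoding step, a pandas-only alternative (assuming the same comma-separated Transaction column) produces an equivalent one-hot table:

```python
import pandas as pd

df = pd.DataFrame({'Transaction': ['bread,milk', 'bread,beer', 'milk,beer']})

# str.get_dummies splits each row on the separator and one-hot
# encodes the items, much like TransactionEncoder does.
onehot = df['Transaction'].str.get_dummies(sep=',').astype(bool)
print(onehot)
```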

# Compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(onehot, 
                            min_support = MIN_SUPPORT, 
                            max_len = MAX_LEN, 
                            use_colnames = True)
rules = association_rules(frequent_itemsets, 
                          metric = METRIC, 
                          min_threshold = MIN_THRESHOLD)

The resulting rules table has the columns: antecedents, consequents, antecedent support, consequent support, support, confidence, lift, leverage, conviction.

4. Visualize as Heatmap

Visually identify the most promising antecedents and consequents to analyze.

# General strategy:
# 1. Generate the rules
# 2. Convert antecedents and consequents into strings
# 3. Convert rules into matrix format

# Count the number of items on each rule's left-hand side
rules['lhs items'] = rules['antecedents'].apply(len)

# Replace frozen sets with strings
rules['antecedents_'] = rules['antecedents'].apply(lambda a: ','.join(list(a)))
rules['consequents_'] = rules['consequents'].apply(lambda a: ','.join(list(a)))

# Transform the DataFrame of rules into a matrix using the lift metric
pivot = rules[rules['lhs items']>0].pivot(index = 'antecedents_', 
                    columns = 'consequents_', values= 'lift')

# Generate a heatmap with annotations on and the colorbar off
sns.heatmap(pivot, annot = True, cbar = False)
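The pivot step can be illustrated on a small, hand-made rules table (the lift values are hypothetical):

```python
import pandas as pd

# Toy rules table with made-up lift values, already string-encoded
rules = pd.DataFrame({
    'antecedents_': ['bread', 'bread', 'milk'],
    'consequents_': ['milk', 'beer', 'beer'],
    'lift': [1.2, 0.9, 1.5],
})

# Each (antecedent, consequent) pair becomes one cell in the matrix;
# pairs without a rule are left as NaN, which the heatmap leaves blank.
pivot = rules.pivot(index='antecedents_', columns='consequents_', values='lift')
print(pivot)
```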

5. Visualize as Parallel Coordinates Plot

Visualize interdependencies of items through connecting lines.

# Generate frequent itemsets
frequent_itemsets = apriori(onehot, min_support = 0.10, use_colnames = True, max_len = 2)

# Generate association rules
rules = association_rules(frequent_itemsets, metric = 'support', min_threshold = 0.00)
# Function to convert rules to coordinates.
def rules_to_coordinates(rules):
    rules['antecedent'] = rules['antecedents'].apply(lambda antecedent: list(antecedent)[0])
    rules['consequent'] = rules['consequents'].apply(lambda consequent: list(consequent)[0])
    rules['rule'] = rules.index
    return rules[['antecedent','consequent','rule']]

# Generate frequent itemsets
frequent_itemsets = apriori(onehot, min_support = 0.01, use_colnames = True, max_len = 2)

# Generate association rules
rules = association_rules(frequent_itemsets, metric = 'lift', min_threshold = 1.00)

# Generate coordinates and print example
coords = rules_to_coordinates(rules)

# Generate parallel coordinates plot
parallel_coordinates(coords, 'rule');
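The rules_to_coordinates helper can be exercised on a toy frame of frozensets (a self-contained sketch, independent of the mlxtend output above):

```python
import pandas as pd

# Same helper as above, repeated so this sketch runs on its own
def rules_to_coordinates(rules):
    rules['antecedent'] = rules['antecedents'].apply(lambda antecedent: list(antecedent)[0])
    rules['consequent'] = rules['consequents'].apply(lambda consequent: list(consequent)[0])
    rules['rule'] = rules.index
    return rules[['antecedent', 'consequent', 'rule']]

# Toy rules frame: mlxtend stores itemsets as frozensets
rules = pd.DataFrame({
    'antecedents': [frozenset({'bread'}), frozenset({'milk'})],
    'consequents': [frozenset({'milk'}), frozenset({'beer'})],
})
coords = rules_to_coordinates(rules)
print(coords)
```

Note that the helper keeps only the first element of each frozenset, so it is meaningful for the max_len = 2 rules generated above, where every antecedent and consequent holds a single item.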


The Algorithms:

  • Apriori Algorithm

Given a set of transactions, find rules that predict the occurrence of an item based on the occurrences of other items in the transaction. For example, we can extract information on purchasing behavior such as: "If someone buys beer and sausage, then they are likely to buy mustard as well." Let's define the main association rule metrics:


Support: it calculates how often an itemset is purchased, and is given by the formula:

    Support(X) = (transactions containing X) / (total transactions)


Confidence: it measures how often items in Y appear in transactions that contain X, and is given by the formula:

    Confidence(X → Y) = Support(X ∪ Y) / Support(X)


Lift: it tells us how likely item Y is bought together with item X, and is given by the formula:

    Lift(X → Y) = Confidence(X → Y) / Support(Y)

Values greater than one indicate that the items are likely to be purchased together. When lift > 1, the rule is better at predicting the result than guessing; when lift < 1, the rule is doing worse than informed guessing.
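The three metrics can be checked by hand on a toy set of five transactions, using their standard definitions:

```python
# Hand-computed support, confidence and lift on five toy transactions
transactions = [
    {'bread', 'milk'},
    {'bread', 'beer'},
    {'bread', 'milk', 'beer'},
    {'milk', 'beer'},
    {'bread', 'milk'},
]
n = len(transactions)

def support(items):
    # Fraction of transactions that contain all of `items`
    return sum(items <= t for t in transactions) / n

sup_bread = support({'bread'})         # 4/5
sup_milk = support({'milk'})           # 4/5
sup_both = support({'bread', 'milk'})  # 3/5
confidence = sup_both / sup_bread      # ~0.75: how often milk follows bread
lift = confidence / sup_milk           # ~0.94: slightly below 1
print(confidence, lift)
```

Here lift comes out just under 1, so despite the high raw confidence, buying bread makes milk slightly *less* likely than its baseline popularity would suggest.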

  • Our transactions are lists of comma-separated items
  • We need to get our data into a one-hot encoded format.
