
Market Basket Analysis

Goals

Given a dataset of customer transactions, where each transaction is a set of items, Market Basket Analysis (MBA) finds groups of items that are frequently purchased together. It is helpful for identifying complementary products that are not obviously similar. The outcome of MBA is a recommendation of the type: "Item A is often purchased together with item B, consider cross-selling ..."
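For instance, on a handful of made-up baskets, the same mlxtend calls used later in this template already surface that kind of pairing (the item names, thresholds, and toy_* variable names below are illustrative only):

# A minimal, self-contained sketch on made-up baskets; the real walkthrough on a CSV file starts below.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

toy_baskets = [['bread', 'butter'],
               ['bread', 'butter', 'jam'],
               ['bread', 'jam'],
               ['butter']]

# One-hot encode the baskets: one boolean column per item
toy_encoder = TransactionEncoder().fit(toy_baskets)
toy_onehot = pd.DataFrame(toy_encoder.transform(toy_baskets), columns=toy_encoder.columns_)

# Mine frequent itemsets and turn them into association rules
toy_itemsets = apriori(toy_onehot, min_support=0.5, use_colnames=True)
toy_rules = association_rules(toy_itemsets, metric='lift', min_threshold=1.0)

# Each row reads as: the antecedent is often purchased together with the consequent
print(toy_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])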

Uses

Market Basket Analysis can be used to:

  • Build a movie/song recommendation engine
  • Build a live recommendation engine for an e-commerce store
  • Cross-sell or upsell products in a supermarket
%%capture
!pip install mlxtend
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
from mlxtend.frequent_patterns import association_rules, apriori
from mlxtend.preprocessing import TransactionEncoder
from pandas.plotting import parallel_coordinates

1. Load your data

# Upload your data as a CSV file. 
df = pd.read_csv('example.csv')
df.head()
          Transaction
0    History,Bookmark
1    History,Bookmark
2    Fiction,Bookmark
3  Biography,Bookmark
4    History,Bookmark

2. Set parameters

# Set parameters to use for the analysis.
MIN_SUPPORT = 0.001     # Set minimum value to accept for the support metric 
MAX_LEN = 3             # Set max transaction length to consider
METRIC = "lift"         # Metric for association rule creation 
MIN_THRESHOLD = 1       # Threshold for association rule creation

3. Derive Rules

Create a table of antecedents, their consequents, and all the important metrics (support, confidence, lift, leverage, conviction).

# Get all the transactions as a list
transactions = list(df['Transaction'].apply(lambda x: sorted(x.split(','))))

# Instantiate the transaction encoder
encoder = TransactionEncoder().fit(transactions)
onehot = encoder.transform(transactions)

# Convert the one-hot encoded data to a DataFrame
onehot = pd.DataFrame(onehot, columns=encoder.columns_)

# Compute frequent itemsets using the Apriori algorithm
frequent_itemsets = apriori(onehot, 
                            min_support = MIN_SUPPORT, 
                            max_len = MAX_LEN, 
                            use_colnames = True)
rules = association_rules(frequent_itemsets, 
                          metric = METRIC, 
                          min_threshold = MIN_THRESHOLD)

rules.head()
   antecedents  consequents  antecedent support  consequent support   support  confidence  lift  leverage  conviction
0   (Bookmark)  (Biography)            1.000000            0.404040  0.404040    0.404040   1.0       0.0         1.0
1  (Biography)   (Bookmark)            0.404040            1.000000  0.404040    1.000000   1.0       0.0         inf
2   (Bookmark)    (Fiction)            1.000000            0.252525  0.252525    0.252525   1.0       0.0         1.0
3    (Fiction)   (Bookmark)            0.252525            1.000000  0.252525    1.000000   1.0       0.0         inf
4   (Bookmark)    (History)            1.000000            0.252525  0.252525    0.252525   1.0       0.0         1.0
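One way to turn this table into the kind of cross-selling recommendation described in the Goals section is to keep reasonably confident rules and rank them by lift; the 0.5 confidence cut-off and the top-10 limit below are arbitrary, illustrative choices rather than part of the template.

# Keep reasonably confident rules and rank them by lift, then confidence
top_rules = (rules[rules['confidence'] >= 0.5]
             .sort_values(['lift', 'confidence'], ascending=False)
             .head(10))

# Print each rule as a cross-selling suggestion
for _, r in top_rules.iterrows():
    print(f"Customers who buy {set(r['antecedents'])} often also buy {set(r['consequents'])} "
          f"(lift={r['lift']:.2f}, confidence={r['confidence']:.2f})")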

4. Visualize as Heatmap

Visually identify the most promising antecedents and consequents to analyze.

# General strategy:
# 1. Generate the rules
# 2. Convert antecedents and consequents into strings
# 3. Convert rules into matrix format

# Count the number of items in the antecedent of each rule
rules['lhs items'] = rules['antecedents'].apply(lambda x: len(x))

# Replace frozen sets with strings
rules['antecedents_'] = rules['antecedents'].apply(lambda a: ','.join(list(a)))
rules['consequents_'] = rules['consequents'].apply(lambda a: ','.join(list(a)))

# Transform the DataFrame of rules into a matrix using the lift metric
pivot = rules[rules['lhs items'] > 0].pivot(index = 'antecedents_',
                                            columns = 'consequents_',
                                            values = 'lift')

# Generate a heatmap with annotations on and the colorbar off
sns.heatmap(pivot, annot = True, cbar = False)
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.show()

5. Visualize as Parallel Coordinates Plot

Visualize interdependencies of items through connecting lines.

# Function to convert rules to coordinates.
def rules_to_coordinates(rules):
    rules['antecedent'] = rules['antecedents'].apply(lambda antecedent: list(antecedent)[0])
    rules['consequent'] = rules['consequents'].apply(lambda consequent: list(consequent)[0])
    rules['rule'] = rules.index
    return rules[['antecedent','consequent','rule']]


# Generate frequent itemsets
frequent_itemsets = apriori(onehot, min_support = 0.01, use_colnames = True, max_len = 2)

# Generate association rules
rules = association_rules(frequent_itemsets, metric = 'lift', min_threshold = 1.00)

# Generate coordinates from the rules
coords = rules_to_coordinates(rules)

# Generate parallel coordinates plot
parallel_coordinates(coords, 'rule');

Appendix

The Algorithm:

  • Apriori Algorithm

Given a set of transactions, the Apriori algorithm finds rules that predict the occurrence of an item based on the occurrences of other items in the same transaction. For example, we can extract a purchasing pattern like "If someone buys beer and sausage, they are likely to also buy mustard." The algorithm stays tractable because an itemset can only be frequent if all of its subsets are frequent, so infrequent itemsets are pruned before their supersets are ever counted. Let's define the main association rule metrics:

Support

It measures how often an itemset is purchased and is given by the formula:

Support(X) = (number of transactions containing X) / (total number of transactions)

Confidence

It measures how often items in Y appear in transactions that contain X and is given by the formula:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

Lift

It tells us how much more likely item Y is to be bought when item X is bought, compared with how often Y is bought on its own, and is given by the formula:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) × Support(Y)) = Confidence(X → Y) / Support(Y)

Values greater than one indicate that the items are likely to be purchased together: when lift > 1, the rule predicts the consequent better than guessing from its overall frequency; when lift < 1, the rule does worse than such a guess.
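As a quick worked example of all three metrics, the snippet below computes them by hand for the rule {beer, sausage} -> {mustard} on a handful of made-up baskets (illustrative numbers only, not the example.csv data):

# Made-up baskets for illustration
baskets = [{'beer', 'sausage', 'mustard'},
           {'beer', 'sausage', 'mustard'},
           {'beer', 'sausage'},
           {'beer'},
           {'mustard', 'bread'}]
n = len(baskets)

def support(items):
    """Fraction of baskets that contain every item in `items`."""
    return sum(items <= basket for basket in baskets) / n

X, Y = {'beer', 'sausage'}, {'mustard'}

rule_support = support(X | Y)            # 2/5 = 0.40
confidence = rule_support / support(X)   # 0.40 / 0.60 ≈ 0.67
lift = confidence / support(Y)           # 0.67 / 0.60 ≈ 1.11

print(f"support={rule_support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")

Because the lift is just above 1, buying beer and sausage makes mustard only slightly more likely than its 60% baseline popularity in these toy baskets.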

  • Our transactions are lists of comma-separated items.
  • We need to get our data into a one-hot encoded format before running Apriori (see the sketch below).
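For reference, this is roughly what the one-hot encoded format looks like for two toy comma-separated transactions, reusing item names from the sample data above (step 3 of the template does the same thing for the whole CSV):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder

# Two toy comma-separated transactions
raw = ['History,Bookmark', 'Fiction,Bookmark']
split = [sorted(t.split(',')) for t in raw]

# One boolean column per distinct item, one row per transaction
enc = TransactionEncoder().fit(split)
print(pd.DataFrame(enc.transform(split), columns=enc.columns_))
#    Bookmark  Fiction  History
# 0      True    False     True
# 1      True     True    False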