Warning zone
Before you dive in, a little heads-up: this notebook is a showcase of my Python coding skills and a walkthrough of the analysis process. So, yeah, it might be a bit lengthy, but that's par for the course when it comes to diving into a thorough analysis.
Having spent numerous years in the realm of business intelligence and data analysis, I've come to realize the importance of communicating our work to various audiences. While this notebook keeps things light-hearted and non-techy for a fun read, I've also provided some additional links here for those who want to explore this work from different angles:
CLICK ON THIS LINK if you just want to see how I would present this work to an operational marketing team, the team that would use this data to put marketing strategies into practice.
CLICK ON THIS LINK if you would like to see the presentation I would make if a Marketing Manager or another higher-level stakeholder had asked me for insights about the study.
Want to connect?
Introduction and Considerations
Welcome to this analysis, where we explore the world of market basket analysis: an invaluable technique for uncovering shopping patterns.
The magic of market basket analysis lies in spotting connections between products that often end up in the same shopping cart. These connections help stores smartly arrange related items, encouraging customers to buy more or (why not) buy smarter. By using the Apriori algorithm and a set of appropriate libraries and techniques, we'll uncover significant item combinations and provide insights on how often these combos appear and how confident we are in their relationships.
The dataset we are about to explore comes from an actual E-commerce grocery store and contains records of what customers buy during their online shopping trips. It was downloaded from Kaggle, and you can access it by clicking here.
The recipe we are going to follow
As a Data Analyst, my analytical approach is guided by key questions that drive my investigation. In the context of this analysis, I will focus on addressing three pivotal inquiries to extract valuable insights:
Identifying High-Performing Pairs: My first question centers on uncovering pairs of products that exhibit both elevated support and confidence metrics. By doing so, I aim to pinpoint product combinations characterized by substantial sales volume and a strong likelihood of being purchased together. These associations highlight potential opportunities for targeted promotions or strategic bundling to further capitalize on customer preferences.
Optimizing Low-Selling Products: My second focus revolves around products with low support but high confidence when combined with their corresponding partners. Here, I seek out instances where specific items, despite lower individual sales, exhibit promising potential when sold as part of a combination. By identifying these products, I can suggest strategies to boost their sales by crafting attractive product bundles, leveraging the strength of their association with another item.
Elevating Average Purchase Value: Lastly, my analysis centers on determining pairs of products that present the greatest potential for increasing the average transaction value. By concentrating on these pairs, I aim to identify combinations that encourage customers to add complementary items to their purchase. This approach aligns with the objective of enhancing the overall value of each transaction, ultimately contributing to higher revenue.
Through these guiding questions, I seek to extract actionable insights that can inform strategic decisions aimed at maximizing sales, optimizing product offerings, and enhancing the shopping experience for customers. I will approach the data using the Apriori algorithm and association rules, and I will explore the results through tables and scatter plots.
Hold onto your shopping carts, because we're about to unravel the mysteries of grocery shopping! Get ready to peel back the layers of shopping patterns, mix and match like a pro, and cook up strategies for a cart full of delight. Let's turn those shopping lists into epic tales of grocery adventures!
E-comm's Flavorful Beginnings: Spice Up Your Insights
- In the analysis, we can observe two purchasing patterns that provide insight into customer behavior. The first pattern points to product pairs where buying one noticeably raises the chance of buying the other. The second pattern highlights products that often land in the same cart simply because both are popular, without one really driving the other. This distinction offers a glimpse into the complex world of customer preferences and how different product associations shape purchasing decisions.
- The analysis reveals an interesting relationship between 'preserved dips spreads' (antecedent) and 'chips pretzels' (consequent). Despite the relatively low occurrence of 'preserved dips spreads', their presence significantly increases the likelihood of 'chips pretzels' being purchased, as indicated by the lift value of 2.3. This connection suggests a targeted behavior where customers who choose 'preserved dips spreads' often complement their choice with 'chips pretzels'.
- An expected pattern also showed up in the data: the relationship between "dry pasta" and "pasta sauce". What's intriguing here is that although the two items don't appear together extremely often, a noteworthy lift value of 4 indicates a robust connection between them. This essentially means that customers are far more likely to purchase "dry pasta" and "pasta sauce" together than they would be if the two were chosen independently, highlighting a strong association that is not immediately apparent from the items' individual occurrence rates.
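For readers who like to see the math behind these numbers, here are the standard definitions of the three metrics used throughout this notebook (these are the textbook formulas, not anything specific to this dataset):

support(A → B) = P(A and B)
confidence(A → B) = P(A and B) / P(A)
lift(A → B) = confidence(A → B) / P(B) = P(A and B) / (P(A) × P(B))

So a lift of 4 for "dry pasta" → "pasta sauce" means the pair shows up in carts four times more often than it would if the two products were picked independently.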
The pasta and the Sauce of our Analysis (code structure overview)
For the pasta:
Pandas: Our trusty sous-chef that'll help us import the CSV file and prep our dataset.
Seaborn: Our artistic flair to add a splash of visualization, making those insights pop!
Recipe Steps:
Importing the Data: Pandas swings into action, importing our CSV file with precision.
Data Transformation: Like a master chef, we'll season our dataset with essential modifications, making it ready for analysis.
Visualization Magic: Seaborn steps up to create visually appetizing plots, making insights easy to savor.
And now, for the secret sauce:
mlxtend.preprocessing.TransactionEncoder: Like a skillful blender, it'll whip up our data into Apriori-ready goodness using one-hot encoding.
mlxtend.frequent_patterns apriori & association_rules: Imagine a taste test for your data; these algorithms bring out the true flavors! They'll unveil hidden patterns, just like discovering the perfect blend of ingredients in a recipe.
So, grab your apron, because we're about to cook up data-driven insights that are tastier than ever!
Get set to cook
import pandas as pd
import seaborn as sns
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
Take a peek at the ingredients before you start cooking
When working with data, it is common to have a very large number of entries, which is why we are using Python instead of opening the CSV file in Excel. And the dataset we are working with here is small compared to what you meet in real-world situations; I have worked with tables that held over 11 TB of data.
Just like when you are cooking a large meal, it is helpful to have a recipe to follow. The recipe tells you what ingredients you need and how to combine them. The dataset is like the recipe, and the data entries are like the ingredients.
The large number of entries makes it difficult to see what is inside the dataset. However, we don't need to see all the lines to get a grasp of it. We can simply print a few lines using the .head() function. This is like taking a look at the ingredients in the recipe before you start cooking. It should be enough to understand the columns that we can work with.
Once we have a good understanding of the dataset, we can start cooking, or in this case, analyzing the data.
#load table
ecom_data = pd.read_csv("ECommerce_consumer behaviour.csv")
#make sure all the columns are visible
pd.set_option('display.max_columns', None)
#print the header to have a clear vision
print(ecom_data.head(2))
Make sure you have all you need in your data kitchen
It is a good practice to inspect the missing values and data types in your dataset. The dataset we imported from Kaggle is expected to be clean, but it is always a good idea to double-check for any potential issues. After all, you don't want to find salt in the pepper jar in the middle of cooking!
# Check for missing values
missing_values = ecom_data.isnull().sum()
print("\nMissing Values:\n", missing_values)
# Display data types of each column
data_types = ecom_data.dtypes
print("\nData Types:\n", data_types)
Dice, Slice, and Chop: Get Your Ingredients Ready to Cook
Just like in the kitchen, where we need to prepare some of the ingredients before we can start cooking, in market basket analysis, we also need to prepare some of the data before we can start analyzing it.
Specifically, we need to make sure that all the items belonging to the same order end up on the same line. This is because we will be creating a matrix of the transactions, where each row represents one order. If an order's items are scattered across several rows, the matrix will be difficult to create and analyze.
30% of a Data Analyst job is to find workarounds (source: no source)
Before you judge me, let me explain my workaround. During this analysis I was facing a lot of memory problems and low performance when trying to run the Apriori algorithm. That is why you will see in the code below that I am using the department column instead of the products.
What happens is that we have a large number of products, and this consumes a lot of memory. So I decided to first perform the analysis based on the departments, which means creating a first matrix at the department level. This will allow me (when I get to the products part) to focus on the departments that hold the most important rules, and it will also save memory.
Once I have analyzed the data by department, I can then use this information to pre-filter the data and make the product analysis better. This is similar to how, in the kitchen, we might pre-cook some of the ingredients before we start cooking the entire dish. This can save time and make the cooking process more efficient.
# Keep only the desired columns
ecom_data_cleaned = ecom_data[['order_id', 'department']].copy()
# Change the 'order_id' to string using .loc accessor
ecom_data_cleaned.loc[:, 'order_id'] = ecom_data_cleaned['order_id'].astype(str)
# Rename the 'department' column to 'transaction'
ecom_data_cleaned.rename(columns={'department': 'transaction'}, inplace=True)
# Group by 'order_id' and aggregate unique 'transaction' entries
grouped_data = ecom_data_cleaned.groupby('order_id')['transaction'].agg(lambda x: ','.join(pd.unique(x))).reset_index()
# Now grouped_data contains the desired result
print(grouped_data)
Let's uncover the secret ingredients that make our customers' carts tick
As you can see, even though we chose to first apply the Apriori algorithm to the departments instead of the products, we still could not use a lower min_support or a higher max_len because of processing constraints. Even with a min_support of 0.009 (I would have liked to use 0.001) and a max_len of 4, we found 1538 frequent itemsets. That is a lot!
Remember the matrix I told you about? Here we used one-hot encoding to transform our previous dataset into a matrix format. This is the easiest way of doing it and the best technique to use with Apriori.
One-hot encoding is like putting each item in its own box: every possible item gets its own column, which makes it easy for the algorithm to find patterns. It is also a natural, efficient way to represent categorical data such as departments and products.
For example, if each order can contain one or more departments, we can build a one-hot encoded matrix where each row represents a transaction and each column represents a department. The value in a cell is 1 (True) if that transaction includes that department, and 0 (False) otherwise.
This format is exactly what the Apriori algorithm needs to find patterns in the data, such as which products are often purchased together. Just like in the kitchen, where different dishes call for different ingredients, different analyses call for different data representations, and this one is the standard choice for Apriori.
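To make this concrete, here is a tiny toy example (the two orders below are made up for illustration; they are not from our dataset) showing what the encoder produces:
from mlxtend.preprocessing import TransactionEncoder
import pandas as pd
# Two made-up orders, each a list of departments
toy_orders = [['produce', 'dairy eggs'], ['produce', 'snacks']]
toy_encoder = TransactionEncoder()
toy_matrix = pd.DataFrame(toy_encoder.fit_transform(toy_orders),
                          columns=toy_encoder.columns_)
print(toy_matrix)
# Expected output:
#    dairy eggs  produce  snacks
# 0        True     True   False
# 1       False     True    True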
# Split transaction strings into lists
transactions = grouped_data['transaction'].apply(lambda t: t.split(','))
# Convert the DataFrame column to a list of lists
transactions = list(transactions)
# Initialize the TransactionEncoder
encoder = TransactionEncoder()
purchases = encoder.fit_transform(transactions)
# Convert one-hot encoded data to a DataFrame
purchases_df = pd.DataFrame(purchases, columns=encoder.columns_)
# Apply the Apriori algorithm
frequent_itemsets = apriori(purchases_df, min_support=0.009, max_len=4, use_colnames=True)
print(len(frequent_itemsets))
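A side note for anyone hitting the same memory wall I did: mlxtend's apriori also accepts a low_memory=True flag, which counts candidate itemsets in a slower but more memory-friendly way. I have not benchmarked it on this dataset, so treat the line below as an option to experiment with rather than a guaranteed fix:
# Optional: slower, but uses less memory while counting candidate itemsets
frequent_itemsets = apriori(purchases_df, min_support=0.009, max_len=4,
                            use_colnames=True, low_memory=True)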
Filtering ingredients to find the most delicious ones
Imagine having a pantry stocked with a whopping 1538 ingredients that could potentially come together to create delightful culinary masterpieces. However, let's be real, not all these ingredients are meant to dance in harmony; we need some method to uncover the true culinary companions. Think of it as searching for the perfect recipe online β except here, we're delving into the world of market basket analysis.
So, here's where our trusty algorithm, known as association_rules, strides in. Just like how you'd browse the web to find which ingredients complement each other in a recipe, association_rules takes the frequent itemsets conjured by the apriori algorithm and presents us with pairs that deserve to be on each other's shopping lists. It's like matchmaking for groceries!
This approach doesn't stop at e-commerce, either. Think about Netflix suggesting movies based on your viewing history: the same family of association ideas is at play there, making movie recommendations.
And in our case, it's not just about carts or films; it's about deciphering patterns in our e-commerce grocery data. We're uncovering pairs of products that have a strong likelihood of being picked up together from the virtual aisles of our online store.
So, whether it's popcorn and a cozy blanket for movie night or spaghetti and tomato sauce for a comforting dinner, association_rules is the behind-the-scenes wizard making it all happen.
To start this journey, I decided on a scatter plot. A scatter plot is the ideal visualization for seeing the whole picture of our data, especially when we have a large number of rules, which is the case here.
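Here is a minimal sketch of that step: turning the frequent itemsets into rules with association_rules and plotting them, with support on one axis, confidence on the other, and lift as the color scale (the lift > 1 threshold is my own choice for illustration):
import seaborn as sns
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import association_rules
# Keep only the rules where the pairing occurs more often than chance (lift > 1)
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
# One dot per rule: position shows how common/reliable it is, color shows lift
sns.scatterplot(data=rules, x="support", y="confidence", hue="lift")
plt.title("Association rules: support vs confidence")
plt.show()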