Where should a drinks company run promotions?
📖 Background
Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.
Contents
- The Dataset
- Analysis Plan
- Exploratory Data Analysis
- Feature Selection and Engineering
- Clustering Implementation
- Final Recommendation
1. The Dataset
The marketing team has provided you with historical sales volumes per capita for several different drink types.
1.1 Key Variables
- "year" - year (1998-2016)
- "region" - name of a federal subject of Russia. It can be an oblast, republic, krai, autonomous okrug, federal city or the single autonomous oblast
- "wine" - sale of wine in litres by year per capita
- "beer" - sale of beer in litres by year per capita
- "vodka" - sale of vodka in litres by year per capita
- "champagne" - sale of champagne in litres by year per capita
- "brandy" - sale of brandy in litres by year per capita
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.read_csv(r'./data/russian_alcohol_consumption.csv')
df.head(3)
1.2 Handling missing values
Missing values identification
# Define segments
segments = ['wine', 'beer', 'vodka', 'champagne', 'brandy']
segments_shr = [s + '_shr' for s in segments]
# Check missing values
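# Keep only rows with at least one missing segment value; copy to avoid SettingWithCopyWarning
nan_count = df[df[segments].isnull().any(axis=1)].copy()
nan_count['na_count'] = nan_count[segments].isnull().sum(axis=1)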
nan_pivot = nan_count.pivot_table(index='region', columns='year', values='na_count', aggfunc='sum').fillna(0)
nan_pivot
Handling missing values
Based on the counts above, it makes sense to drop the regions with most of their data missing: Chechen Republic, Republic of Crimea, Republic of Ingushetia and Sevastopol.
# Drop regions with mostly missing data
# (this removes every region listed in nan_pivot, i.e. any region with at least one missing value)
print(df.shape)
df = df[~df.region.isin(nan_pivot.index.to_list())]
print(df.shape)
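If the goal is to drop only regions where most values are missing, a threshold on the share of missing values is one possible alternative to dropping every region that appears in the table. The sketch below is an illustration on made-up toy data, not the method used in this analysis; the names `toy`, `toy_segments`, `missing_share` and the 0.5 cut-off are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): region "B" has most values missing
toy = pd.DataFrame({
    'year':   [2000, 2001, 2002, 2000, 2001, 2002],
    'region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'wine':   [1.0, 1.2, np.nan, np.nan, np.nan, 2.0],
    'beer':   [10.0, 11.0, 12.0, np.nan, np.nan, 20.0],
})
toy_segments = ['wine', 'beer']

# Share of missing segment values per region (0 = complete, 1 = all missing)
missing_share = (
    toy[toy_segments].isnull()
    .groupby(toy['region'])
    .mean()          # share missing per region and segment
    .mean(axis=1)    # averaged across segments
)

# Hypothetical cut-off: drop regions with more than half of their values missing
threshold = 0.5
sparse_regions = missing_share[missing_share > threshold].index.to_list()
toy_clean = toy[~toy['region'].isin(sparse_regions)]
print(sparse_regions, toy_clean.shape)
```

With a rule like this, a region with a handful of gaps would be kept and could be imputed later, while regions with mostly empty history would still be removed.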
2. Analysis Plan
This project is about identifying patterns in an unlabelled dataset, which makes it an unsupervised machine learning problem, and clustering is the most reasonable choice of model. Since the dataset is relatively small and we only care about identifying regions similar to Saint Petersburg, hierarchical clustering is the method of choice. Once clustering is complete, we will rank the regions in the same cluster as Saint Petersburg to select the top 10. To rank them well, we need a good understanding of what makes Saint Petersburg different from other regions, which will tell us how to prioritize the candidates.
Here's the plan of analysis to implement this approach:
- Exploratory Data Analysis (EDA) to visualize average trends across all regions and trends in Saint Petersburg, highlighting the differences
- Selecting and engineering the clustering variables (features) based on the results of EDA
- Implementing clustering and tuning the parameters to get to a tight group of regions
- Exploring reduction of the features (variables) to eliminate noise using Principal Component Analysis (PCA)
- Ranking the regions from the same cluster as Saint Petersburg for the final recommendations
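The clustering and ranking steps of this plan can be sketched roughly as follows. This is a minimal illustration on made-up toy features, assuming SciPy is available; the region names other than Saint Petersburg, the feature values, Ward linkage, the choice of two clusters and ranking by Euclidean distance are all assumptions for the example, not the final settings of the analysis.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

# Toy feature matrix (made up): one row per region, columns are segment shares
features = pd.DataFrame(
    {'wine_shr':  [0.12, 0.11, 0.13, 0.04, 0.05],
     'beer_shr':  [0.70, 0.72, 0.69, 0.60, 0.58],
     'vodka_shr': [0.18, 0.17, 0.18, 0.36, 0.37]},
    index=['Saint Petersburg', 'Region A', 'Region B', 'Region C', 'Region D'],
)

# Standardize so that every feature contributes on the same scale
scaled = (features - features.mean()) / features.std()

# Agglomerative (hierarchical) clustering with Ward linkage, cut into 2 clusters
Z = linkage(scaled.values, method='ward')
labels = pd.Series(fcluster(Z, t=2, criterion='maxclust'), index=features.index)

# Regions that share a cluster with the target region
target = 'Saint Petersburg'
peers = labels[labels == labels[target]].index.drop(target)

# Rank the peers by Euclidean distance to the target (closest first)
dist = cdist(scaled.loc[[target]].values, scaled.loc[peers].values)[0]
ranking = pd.Series(dist, index=peers).sort_values()
print(ranking)
```

On the real data, the top 10 entries of a ranking like this would form the shortlist, provided the cluster containing Saint Petersburg has at least 10 other members.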
3. Exploratory Data Analysis
Data Preparation Code
# Add total alcohol consumption (sum of all segments; stays NaN if any segment is missing)
df['total_alc'] = df[segments].sum(axis=1, skipna=False)
# Add consumption as share of total
for s in segments:
    df[s + '_shr'] = df[s] / df['total_alc']
# Subset Saint Petersburg
spb_df = df[df['region'] == 'Saint Petersburg'].set_index('year')
# Set up seaborn
sns.set_context('talk')
# Set up charting function
def spb_charts():
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), tight_layout=True)
    # Left panel: total consumption over time
    sns.lineplot(data=spb_df['total_alc'], ax=ax1)
    ax1.set_title('Total alcohol consumption in St Petersburg', fontsize=20)
    ax1.set(
        xlabel='year',
        ylabel='l per capita',
        xlim=(1995, 2018)
    )
    ax1.legend('')
    # Right panel: share of each segment over time
    sns.lineplot(data=spb_df[segments_shr], ax=ax2)
    ax2.set_title('Alcohol segments, St Petersburg', fontsize=20)
    ax2.set(
        xlabel='year',
        ylabel='segment share of total',
        xlim=(1995, 2018)
    )
    ax2.legend(['wine', 'beer', 'vodka', 'champagne', 'brandy'])
    plt.show()
3.1 How did consumer preferences in alcohol consumption change in St Petersburg over the years?
Total alcohol and share of segments trends for Saint Petersburg
spb_charts()
Conclusions
- Overall alcohol consumption peaked in 2010 and then declined steadily, dropping back to pre-millennial levels
- Beer represents the highest share of all segments and drove both the peak and the decline
- Wine's share of alcohol consumption grew steadily across the observed period, while the other segments were mostly flat
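Conclusions read off a chart can be cross-checked numerically. The sketch below shows one way to do that, using a small made-up series in place of `spb_df`; the values, the name `toy_spb` and the use of a straight-line fit as a trend measure are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy series (values invented for illustration), indexed by year like spb_df
toy_spb = pd.DataFrame(
    {'total_alc': [60.0, 80.0, 100.0, 90.0, 70.0],
     'wine_shr':  [0.05, 0.06, 0.08, 0.09, 0.11]},
    index=pd.Index([2002, 2006, 2010, 2013, 2016], name='year'),
)

# Year in which total consumption is highest
peak_year = toy_spb['total_alc'].idxmax()

# Slope of a straight-line fit: a positive value means the share grows over time
wine_slope = np.polyfit(toy_spb.index.to_numpy(), toy_spb['wine_shr'].to_numpy(), deg=1)[0]

print(peak_year, wine_slope > 0)
```

Running the same two checks on `spb_df` would confirm the peak year and the direction of the wine trend without relying on the chart alone.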
Now let's compare these trends with national averages.