Duplicate of Competition - drinks promotions

Where should a drinks company run promotions?

📖 Background

Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.

Content

The Dataset
Analysis Plan
Exploratory Data Analysis
Feature Selection and Engineering
Clustering Implementation
Final Recommendation

1. The Dataset

The marketing team has sourced you with historical sales volumes per capita for several different drinks types.

1.1 Key Variables

"year" - year (1998-2016)
"region" - name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city and a single autonomous oblast
"wine" - sale of wine in litres by year per capita
"beer" - sale of beer in litres by year per capita
"vodka" - sale of vodka in litres by year per capita
"champagne" - sale of champagne in litres by year per capita
"brandy" - sale of brandy in litres by year per capita

from matplotlib import pyplot as plt
import seaborn as sns

import pandas as pd
df = pd.read_csv(r'./data/russian_alcohol_consumption.csv')
df.head(3)

1.2 Handling missing values

Missing values identification

# Define segments
segments = ['wine', 'beer', 'vodka', 'champagne', 'brandy']
segments_shr = [s + '_shr' for s in segments]

# Check missing values
nan_count = df[df[segments].isnull().any(axis=1)]
nan_count['na_count'] = nan_count[segments].isnull().sum(axis=1)
nan_pivot = nan_count.pivot_table(index='region', columns='year', values='na_count', aggfunc='sum').fillna(0)
nan_pivot

Handling missing values

Based on the counts above, it would make sense to completely drop regions with most of the data missing - Chechen Republic, Republic of Crimea, Republic of Ingushetia and Sevastopol

# Drop regions with mostly missing data
print(df.shape)
df = df[~df.region.isin(nan_pivot.index.to_list())]
print(df.shape)

2. Analysis plan

This project is about identifying patterns in unlabelled dataset which makes it unsupervised machine learning and clustering appears to be the most reasonable choice of the model. Since the size of the dataset is relatively small and we only really care about identifying regions similar to Saint Patersburg, hierarchical clustering appears to be the method of choice. Once we complete clustering, we will need to rank the regions from the same cluster as Saint Petersburg to select top 10. To be able to do to the ranking successfully, we need to have a good understanding of what makes Saint Petersburg different from other regions. This will allow us to to prioritize the regions in the optimal way.

Here's the plan of analysis to implement this approach:

Exploratory Data Analyis (EDA) to visualize average trends across all regions and trends in Saint Petersburg region to highlight the difference
Selecting and engineering the clustering variables (features) based on the resutls of EDA
Implementing clustering and tuning the parameters to get to a tight group of regions
Exploring reduction of the features (variables) to eliminate noise using Principal Component Analysis (PCA)
Ranking the regions from the same cluster as Saint Petersburg for the final recommendations

3. Exploratory Data Analysis

Data Preparation Code

# Add total alc consumption
df['total_alc'] = df.apply(lambda x: x.wine + x.beer + x.vodka + x.champagne + x.brandy, axis=1)

# Add consumption as share of total
for s in segments:
    df[s + '_shr'] = df.apply(lambda x: x[s] / x.total_alc, axis=1)

# Subset Sain Petersburg
spb_df = df[df['region'] == 'Saint Petersburg'].set_index('year')

# Set up seaborn 
sns.set_context('talk')

# Set up charting function
def spb_charts():
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (14, 5), tight_layout=True)
    sns.lineplot(data=spb_df['total_alc'], ax=ax1)
    ax1.set_title('Total alcohol comsumption in St Petersburg', fontsize=20)
    ax1.set(
        xlabel = 'year', 
        ylabel = 'l per capita',
        xlim = (1995, 2018)
    )
    ax1.legend('')
    
    sns.lineplot(data=spb_df[segments_shr], ax=ax2)
    ax2.set_title('Alcohol segments, St Petersburg', fontsize=20)
    ax2.set(
        xlabel = 'year',
        ylabel = 'segment share of total',
        xlim = (1995, 2018)
    )
    ax2.legend(['wine', 'beer', 'vodka', 'champagne', 'brandy'])
    plt.show()

3.1 How did consumer preferences in alcohol consumption changed in St Petersburg over years?

Total alcohol and share of segments trends for Saint Petersburg

spb_charts()

Conclusions

Overall alcohol consumtion peaked in 2010 and steadily declined after dropping back to pre-millenial levels
Beer represents the highest share of all segments and was the driver of peak and decline
Wine share of alcohol consumption has been steadily growing across observed period while other segments were mostly flat.
Now let's compare these trends with national averages.

‌
‌
‌