Finding Similar Alcohol Markets in Russia

Where should a drinks company run promotions?

📖 Background

Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.

The data

The marketing team has sourced you with historical sales volumes per capita for several different drinks types.

"year" - year (1998-2016)
"region" - name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city and a single autonomous oblast
"wine" - sale of wine in litres by year per capita
"beer" - sale of beer in litres by year per capita
"vodka" - sale of vodka in litres by year per capita
"champagne" - sale of champagne in litres by year per capita
"brandy" - sale of brandy in litres by year per capita

💪 Competition challenge

Recommend 10 additional regions they should select for the promotion.
Tell the story that supports your recommendations.

Imports

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

# dataset
alcohol = pd.read_csv("data/russian_alcohol_consumption.csv")

# key variables
target_region = 'Saint Petersburg'
alcohol_categories = ['wine', 'beer', 'vodka', 'champagne', 'brandy']
years = sorted(alcohol['year'].unique())
print(len(years))

Preprocessing

# fill in nans
alcohol = alcohol.fillna(0)

# create total column
alcohol['total'] = alcohol[alcohol_categories].sum(axis = 1)

# filter out nulls
alcohol = alcohol[alcohol['total'] > 0]

alcohol[target_region] = alcohol['region'].apply(lambda x: 1 if x == target_region else 0)

alcohol.describe()

EDA

What are the largest regions by alcohol category?

def is_target_region(region):
    return region == target_region

alcohol_category = 'beer'

df = alcohol.copy()

fig, axs = plt.subplots(7, 3, figsize = (16, 20))
axs = axs.flatten()

tmp = df.groupby(['year','region'])[alcohol_category].sum().reset_index()
tmp.columns = ['year', 'region', alcohol_category]
top_list = 10

i = 0
for year in years:
    tmp2 = tmp[tmp['year'] == year]
    tmp2 = tmp2.sort_values(by = alcohol_category, ascending = False)
    tmp2 = tmp2.head(top_list)
    tmp2[target_region] = tmp2['region'].apply(is_target_region) 
    sns.barplot(data = tmp2, x = alcohol_category, y = 'region', ax = axs[i], hue = target_region)
    axs[i].set_title(f"{year} Top-10 List {alcohol_category.capitalize()}")
    axs[i].legend().remove()
    i += 1
plt.tight_layout()

df = alcohol[alcohol['region'] == target_region]
df = df.groupby(['year','region'])[alcohol_categories].sum().reset_index()
df = df.melt(id_vars=['year', 'region'])
df.columns = ['year', 'region', 'category', 'liters']
df = df.sort_values(by = 'liters', ascending = False)

fig, axs = plt.subplots(1, 2, figsize = (20, 4))
sns.lineplot(data = df, x = 'year', y = 'liters', hue = 'category', ax = axs[0], hue_order = alcohol_categories)
axs[0].set_title(f"{target_region} Alcohol Consumption Over Time")


df = alcohol.copy()
df = alcohol[alcohol['region'] == target_region].reset_index(drop = True)

for cat in alcohol_categories:
    df[f'pct_{cat}'] = df[cat] / df['total']

df = df.drop(alcohol_categories + ['total', target_region], axis = 1).fillna(0)

df = df.melt(id_vars = ['year', 'region'])
df.columns = ['year', 'region', 'category', 'pct']

plt.figure(figsize=(16, 6))
axs[1].set_title(f"{target_region} Alcohol consumption as a percent of total")
sns.lineplot(data = df, x = 'year', y = 'pct', hue = 'category', ax = axs[1])

plt.tight_layout()

Feature Engineering

‌
‌
‌