Skip to content
Finding Similar Alcohol Markets in Russia
  • AI Chat
  • Code
  • Report
  • Where should a drinks company run promotions?

    📖 Background

    Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.

    The data

    The marketing team has sourced you with historical sales volumes per capita for several different drinks types.

    • "year" - year (1998-2016)
    • "region" - name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city and a single autonomous oblast
    • "wine" - sale of wine in litres by year per capita
    • "beer" - sale of beer in litres by year per capita
    • "vodka" - sale of vodka in litres by year per capita
    • "champagne" - sale of champagne in litres by year per capita
    • "brandy" - sale of brandy in litres by year per capita

    💪 Competition challenge

    1. Recommend 10 additional regions they should select for the promotion.
    2. Tell the story that supports your recommendations.

    Imports

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    sns.set_style('whitegrid')
    
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler
    from sklearn.cluster import KMeans
    
    from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
    # dataset
    alcohol = pd.read_csv("data/russian_alcohol_consumption.csv")
    
    # key variables
    target_region = 'Saint Petersburg'
    alcohol_categories = ['wine', 'beer', 'vodka', 'champagne', 'brandy']
    years = sorted(alcohol['year'].unique())
    print(len(years))

    Preprocessing

    # fill in nans
    alcohol = alcohol.fillna(0)
    
    # create total column
    alcohol['total'] = alcohol[alcohol_categories].sum(axis = 1)
    
    # filter out nulls
    alcohol = alcohol[alcohol['total'] > 0]
    alcohol[target_region] = alcohol['region'].apply(lambda x: 1 if x == target_region else 0)
    alcohol.describe()

    EDA

    What are the largest regions by alcohol category?

    def is_target_region(region):
        return region == target_region
    alcohol_category = 'beer'
    
    df = alcohol.copy()
    
    fig, axs = plt.subplots(7, 3, figsize = (16, 20))
    axs = axs.flatten()
    
    tmp = df.groupby(['year','region'])[alcohol_category].sum().reset_index()
    tmp.columns = ['year', 'region', alcohol_category]
    top_list = 10
    
    i = 0
    for year in years:
        tmp2 = tmp[tmp['year'] == year]
        tmp2 = tmp2.sort_values(by = alcohol_category, ascending = False)
        tmp2 = tmp2.head(top_list)
        tmp2[target_region] = tmp2['region'].apply(is_target_region) 
        sns.barplot(data = tmp2, x = alcohol_category, y = 'region', ax = axs[i], hue = target_region)
        axs[i].set_title(f"{year} Top-10 List {alcohol_category.capitalize()}")
        axs[i].legend().remove()
        i += 1
    plt.tight_layout()
    df = alcohol[alcohol['region'] == target_region]
    df = df.groupby(['year','region'])[alcohol_categories].sum().reset_index()
    df = df.melt(id_vars=['year', 'region'])
    df.columns = ['year', 'region', 'category', 'liters']
    df = df.sort_values(by = 'liters', ascending = False)
    
    fig, axs = plt.subplots(1, 2, figsize = (20, 4))
    sns.lineplot(data = df, x = 'year', y = 'liters', hue = 'category', ax = axs[0], hue_order = alcohol_categories)
    axs[0].set_title(f"{target_region} Alcohol Consumption Over Time")
    
    
    df = alcohol.copy()
    df = alcohol[alcohol['region'] == target_region].reset_index(drop = True)
    
    for cat in alcohol_categories:
        df[f'pct_{cat}'] = df[cat] / df['total']
    
    df = df.drop(alcohol_categories + ['total', target_region], axis = 1).fillna(0)
    
    df = df.melt(id_vars = ['year', 'region'])
    df.columns = ['year', 'region', 'category', 'pct']
    
    plt.figure(figsize=(16, 6))
    axs[1].set_title(f"{target_region} Alcohol consumption as a percent of total")
    sns.lineplot(data = df, x = 'year', y = 'pct', hue = 'category', ax = axs[1])
    
    plt.tight_layout()

    Feature Engineering