This was for a DataCamp competition.
%%capture
!pip install numpy==1.19.2 convertdate lunarcalendar holidays pystan==2.19.1.1 --no-cache-dir
!pip install numpy==1.19.2 six==1.15.0 dill==0.2.7.1 jinja2==2.10 jupytext pyyaml==5.4 typing-extensions~=3.7.4 \
umap-learn prophet --no-cache-dir
import os
from IPython.display import clear_output
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from prophet import Prophet
import seaborn as sns
from sklearn.cluster import AgglomerativeClustering
import umap
plt.ioff();
Regions for additional wine promotions based upon success in Saint Petersburg
📖 Background
Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.
Data
The marketing team has provided you with historical sales volumes per capita for several different drink types.
- "year" - year (1998-2016)
- "region" - name of a federal subject of Russia. It may be an oblast, republic, krai, autonomous okrug, federal city, or the single autonomous oblast
- "wine" - sale of wine in litres by year per capita
- "beer" - sale of beer in litres by year per capita
- "vodka" - sale of vodka in litres by year per capita
- "champagne" - sale of champagne in litres by year per capita
- "brandy" - sale of brandy in litres by year per capita
df = pd.read_csv(r'./data/russian_alcohol_consumption.csv')
df.head()
Region types
The region type could be determined from the region column to create a categorical variable. However, there has been only one promotion so far in the city of Saint Petersburg. Therefore, there is currently no basis for determining whether the promotion may be more or less effective in a city compared to other region types. More data would be required for the region type to be useful.
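Should more promotion data make region type useful later, the categorical variable could be derived roughly as sketched here. The keyword list, the `region_type` helper, and the sample names are illustrative assumptions; the spellings in the actual `region` column may differ and should be checked first.

```python
import pandas as pd

# Keywords are checked in order, so the more specific "autonomous ..." labels
# are matched before the plain "oblast" keyword.
REGION_TYPE_KEYWORDS = [
    "autonomous okrug",
    "autonomous oblast",
    "oblast",
    "krai",
    "republic",
]


def region_type(name: str) -> str:
    """Guess the federal subject type from a region name (hypothetical helper)."""
    lowered = name.lower()
    for keyword in REGION_TYPE_KEYWORDS:
        if keyword in lowered:
            return keyword
    # Assumption: names without any keyword are federal cities.
    return "federal city"


# Sample names for illustration only; not read from the dataset.
examples = pd.Series(
    ["Saint Petersburg", "Altai Krai", "Amur Oblast",
     "Republic of Adygea", "Jewish Autonomous Oblast"]
)
region_types = examples.map(region_type).astype("category")
print(region_types)
```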
Geographical location of regions
Regions geographically closer to Saint Petersburg might be more likely to respond well to the promotion, since their demographics may be similar, and geographic locations would be easy to obtain. However, as with region type, there is currently no data to support that.
Alcohol Sales Volumes
The sales volume data will be used to find regions with drinking preferences similar to those of Saint Petersburg. To reduce the effect of changes in total alcohol sales over time, the data should be converted to each drink's fraction of total alcohol sales (wine, beer, vodka, etc.). The volumes used should be pure alcohol volumes, since alcohol content differs greatly between drinks. For example, an increase in brandy volume may be offset by a much larger decrease in beer volume while the total alcohol volume remains constant. This requires assuming a typical alcohol content for each type of drink. Although this normalization removes the effect of total alcohol volume changing over time, the fractions themselves could still change over time, indicating changing preferences; this will be investigated early in the analysis. Once regions with drinking preferences similar to those of Saint Petersburg are found, those with higher per capita alcohol sales should be favored, since the promotion will probably be less effective where per capita sales are lower.
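The normalization described above can be illustrated with a minimal sketch on toy numbers. The alcohol contents in `ASSUMED_ABV`, the `alcohol_shares` helper, and the two toy regions are placeholders chosen for illustration, not values taken from the data or from the analysis that follows.

```python
import pandas as pd

# Assumed typical alcohol-by-volume values (illustrative placeholders only).
ASSUMED_ABV = {"wine": 0.12, "beer": 0.05, "vodka": 0.40,
               "champagne": 0.12, "brandy": 0.40}


def alcohol_shares(sales: pd.DataFrame, abv: dict) -> pd.DataFrame:
    """Convert per-capita drink volumes to each drink's share of pure alcohol."""
    drinks = list(abv)
    # Litres of pure alcohol per capita; the Series aligns on column names.
    pure = sales[drinks] * pd.Series(abv)
    total = pure.sum(axis=1)
    shares = pure.div(total, axis=0)
    # Keep the total so that high-consumption regions can be favored later.
    shares["total_alcohol"] = total
    return shares


# Toy data: Region B drinks exactly twice as much of everything as Region A.
toy = pd.DataFrame(
    {"wine": [5.0, 10.0], "beer": [60.0, 120.0], "vodka": [10.0, 20.0],
     "champagne": [1.0, 2.0], "brandy": [0.5, 1.0]},
    index=["Region A", "Region B"],
)
result = alcohol_shares(toy, ASSUMED_ABV)
print(result.round(3))
```

Both toy regions end up with identical shares even though Region B's total is twice as large, which is the property the normalization is meant to provide: preferences are compared independently of overall consumption, while `total_alcohol` remains available for ranking.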
Analysis
Remove rows with missing values
df.info()
df.dropna(inplace=True)
Obtain alcohol volumes and fractions of total alcohol volume
Alcohol contents (alcohol_fractions) below are assumed values. The values for wine and champagne were assumed to be equal; wine was assumed to include only unfortified types. If fortified wines were included, then the percentage for wine would be increased somewhat. It would be helpful if additional data could be obtained which might improve the values for alcohol_fractions. The effect on the results when varying the alcohol_fractions will be explored later.