eh# Where should a drinks company run promotions?
📖 Background
Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.
The data
The marketing team has sourced you with historical sales volumes per capita for several different drinks types.
- "year" - year (1998-2016)
- "region" - name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city and a single autonomous oblast
- "wine" - sale of wine in litres by year per capita
- "beer" - sale of beer in litres by year per capita
- "vodka" - sale of vodka in litres by year per capita
- "champagne" - sale of champagne in litres by year per capita
- "brandy" - sale of brandy in litres by year per capita
library(tidyverse)
library(skimr)
data <- readr::read_csv('./data/russian_alcohol_consumption.csv')
skim(data)The data frame contains data of 5 kinds of brewery annual usage in litre per ca-pita for 85 different regions in Russian federation. We are planning to find the regions that have the most similarity with Saint Petersburg which is the capital of Leningrad region. Here are the line plot of alcoholic consumption of people of "Leningrad Oblast" during the time. Plus, as the medians, are not so much far away from means of the variables, we continue with means but we can do the same with meadians too.
library(ggplot2)
library(dplyr)
data %>%
count(region)
data %>%
filter(region=="Leningrad Oblast") %>%
ggplot(aes(year,wine))+
geom_line("#5CB85C")
data %>%
filter(region=="Leningrad Oblast") %>%
ggplot(aes(year,vodka))+
geom_line("#EEA236")
data %>%
filter(region=="Leningrad Oblast") %>%
ggplot(aes(year,beer))+
geom_line(color="#D43F3A")
data %>%
filter(region=="Leningrad Oblast") %>%
ggplot(aes(year,champagne))+
geom_line("#46B8DA")
data %>%
filter(region=="Leningrad Oblast") %>%
ggplot(aes(year,brandy))+
geom_line("#9632B8")
data %>%
group_by(region) %>%
summarize(m_wine=mean(wine, na.rm=TRUE), m_wine1=median(wine, na.rm=TRUE), m_beer=mean(beer, na.rm=TRUE), m_beer1=median(beer, na.rm=TRUE), m_vodka=mean(vodka, na.rm=TRUE), m_vodka1=median(vodka, na.rm=TRUE), m_champagne=mean(champagne, na.rm=TRUE), m_champagne1=median(champagne, na.rm=TRUE), m_brandy=mean(brandy, na.rm=TRUE), m_brandy1=median(brandy, na.rm=TRUE)) %>%
filter(region== "Leningrad Oblast")For the first step, we need to pre-process the data since there are missing values in some observations. As we do not have data for most of the observations in "Chechen Republic", we exclude this region from our data set. Plus, to have an estimation of usage habits of people in different regions, we computed the mean of usage within different years for each alcoholic drink.
first_analysis <- data %>%
group_by(region) %>%
summarize(m_wine=mean(wine, na.rm=TRUE), m_beer=mean(beer, na.rm=TRUE), m_vodka=mean(vodka, na.rm=TRUE), m_champagne=mean(champagne, na.rm=TRUE), m_brandy=mean(brandy, na.rm=TRUE)) %>%
filter(region != "Chechen Republic")
To begin our cluster analysis, we exclude the regions' names from the data frame as the names do not have any effect on defining our clusters. Plus, since the data are gathered in the same unit, litre, we do not need to scale the data to have a correct distance measurement.
scaled_first_analysis <- first_analysis %>%
select(-region)
To perform cluster analysis, we try two different unsupervised algorithms. For the first algorithm, Kmeans, we need to specify the number of cluster which captures the correct variability of our data and give us valid outcomes. To achieve this result, we create a scree plot in which we can see that there is an elbow on the number of 3.
library(purrr)
tot_withinss <- map_dbl(1:15, function(k){
model <- kmeans(x = scaled_first_analysis, centers = k, nstart=25)
model$tot.withinss
})
elbow_df1 <- data.frame(
k = 1:15 ,
tot_withinss = tot_withinss
)
ggplot(elbow_df1, aes(x = k, y = tot_withinss)) +
geom_line() +
scale_x_continuous(breaks = 1:15)
In the other analysis, we perform "Silhouette analysis" on the data in which the diagram shows that number of clusters, 2 and 3 would result in the best outcomes.
library(cluster)
sil_width <- map_dbl(2:15, function(k){
model <- pam(x = scaled_first_analysis, k = k)
model$silinfo$avg.width
})
sil_df <- data.frame(
k = 2:15,
sil_width = sil_width)
ggplot(sil_df, aes(x = k, y = sil_width)) +
geom_line() +
scale_x_continuous(breaks = 2:15)
Following the Scree plot and Silhouette analysis, we use 3 clusters to classify our observations. This number of clusters result in valid outcomes as repeating the algorithm on a different seed would result in almost the same number of group members in our clusters.
set.seed(100)
clust1_kmean <- kmeans(scaled_first_analysis, centers=3, nstart=25)
clust1_kmean
set.seed(320)
clust1_kmean2 <- kmeans(scaled_first_analysis, centers=3, nstart=25)
clust1_kmean2
cluster_outcome <- mutate(first_analysis,cluster_kmeans=clust1_kmean$cluster)
ggplot(cluster_outcome, aes(m_wine, m_beer, color=as.factor(cluster_kmeans)))+
geom_point()
ggplot(cluster_outcome, aes(m_wine, m_champagne, color=as.factor(cluster_kmeans)))+
geom_point()
ggplot(cluster_outcome, aes(m_wine, m_vodka, color=as.factor(cluster_kmeans)))+
geom_point()
ggplot(cluster_outcome, aes(m_wine, m_brandy, color=as.factor(cluster_kmeans)))+
geom_point()
cluster_outcome %>%
group_by(as.factor(cluster_kmeans)) %>%
summarise(count=n())
The second method that we followed is hierarchical clustering in which we calculate the distance by "euclidean" method between our observations. The hierarchical clustering can be performed by three methods which we chose to continue with complete method to have a more balanced clusters.