Where should a drinks company run promotions?
📖 Background
Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.
The data
The marketing team has sourced you with historical sales volumes per capita for several different drinks types.
- "year" - year (1998-2016)
- "region" - name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city and a single autonomous oblast
- "wine" - sale of wine in litres by year per capita
- "beer" - sale of beer in litres by year per capita
- "vodka" - sale of vodka in litres by year per capita
- "champagne" - sale of champagne in litres by year per capita
- "brandy" - sale of brandy in litres by year per capita
library(psych)
library(tidyverse)
library(forcats)
library(skimr)
library(naniar) # manage missing data
library(caret)
data <- readr::read_csv('./data/russian_alcohol_consumption.csv')
EDA
Reviewing dataset, missing data and statistics about variables.
General reviewing
# General overview
skim(data)
# reviewing records
glimpse(data)
** Conclusions**
- All variables have lower than 5% of missing data, therefore, we are going to drop NA records.
- All consumption variables have a positive skew
- compsumption variables have a different scale of amount, therefore, mean of brandy is 0.57 and mean of beer 51.3, almost 100 times, we apply a normalize transformation for consumption.
Reviewing distribution
# dropping missing data
df_data <- drop_na(data)
summary(df_data)
#records dropped
data.frame(dropped_records = nrow(data) - nrow(df_data),
percentage = (nrow(df_data)/nrow(data) - 1) * 100 )
# Normalizing comsumption variables
df_data_norm <- df_data
df_data_norm[ , 3:7] <- scale(df_data[ ,3:7])
# reviewing summary
summary(df_data_norm)
# tidy_data
df_data_tidy <- gather(df_data_norm, key = "drink", value=consumption, -c(year, region))
# scatterplot
pairs(df_data_norm[, 3:7])
# correlations
corPlot(df_data_norm[, 3:7],
diag = FALSE,
upper = F)
# median of cosumption of wine by region
median_wine <- df_data %>% group_by(region) %>% summarize(median_consumption = median(wine)) %>% summarize(quantile(median_consumption, 0.70))
# wine consumption by region 30% upper
df_bar <- df_data %>% group_by(region) %>% summarize(median_consumption = median(wine)) %>%
filter(median_consumption >= as.double( median_wine)) %>%
mutate(type = ifelse( region == "Saint Petersburg", "Saint Petersburg", "Others"))
df_bar %>%
ggplot(aes(x = median_consumption, y = fct_reorder(region, median_consumption), fill = type )) +
geom_col() +
labs( title = "Median consumption of wine by region",
subtitle = "30% upper consumption",
x = "median consumption",
y = "region")
# distribution of consumption for all regions
# normalized data
ggplot(df_data_tidy, aes(x = consumption)) +
geom_histogram(bins = 9) +
facet_wrap(~drink) +
labs(title = "Distribution for all regions",
subtitle = "nomralized data")
# distribution of consumption for Saint Peterburg region
filter(df_data_tidy, region == "Saint Petersburg" ) %>%
ggplot( aes(x = consumption)) +
geom_histogram(bins = 9) +
facet_wrap(~drink) +
labs(title = "Distribution for Saint Petersburg regions",
subtitle = "nomralized data")
# growth wine consumption
df_sp_wine <- df_data %>% filter( region == "Saint Petersburg") %>% select(year, region, wine) %>% arrange(year) %>%
mutate(region = "Saint Petersburg", wine_1998 = wine / wine[1])
df_other_wine <- df_data %>% filter (region != "Saint Peterburg") %>%
mutate(region = "other regions") %>% group_by(year, region)%>%
summarize(wine = sum(wine)) %>% ungroup() %>% arrange(year) %>%
mutate(wine_1998 = wine / wine[1])
bind_rows(df_sp_wine, df_other_wine) %>%
ggplot(aes(x=year, y=wine_1998, color = region)) +
geom_line() +
labs(title = "Wine consumption Saint Petersburg vs All regions",
subtitle = "base year = 1998",
y = "consumption")
Conclusions
- The percentage of dropped observations were 4.09%
- The scatter plot shown a low positive correlations among drinks, only champagne and brandy have a correlation of 0.78
- The consumption of wine for Saint Petersbur is on top 30%, the region with most median consumption in the period was Kraelia.
- The normalize transformation were usefull to compare distribution of drink consumption, therefore, Saint Petersburg shows a pattern different from all regions.
- About the growth of wine consumption, Saint Petersburg have had a growth greater than the other regions 2.6 vs 2.0 on basis of 1998 consumption.
Modeling
We are going to follow:
- We are going to summarize data by region, and add a feature of growth on each drink, it will be calculated as:
growth_drink = drink[2016] / drink[1998]
-
Scale all variables on dataset
-
Calculate distances and select the 10 nearest observations to Saint Petersburg region.
#region
v_region <- unique(df_data$region)
f_growth <- function(region, df_region) {
obs <- nrow(df_region)
if (obs > 0) {
df_cum <- as.data.frame(t(apply(df_region[ ,3:7], 2, sum)))
df_cum <- df_cum %>% mutate(region = region,
g_wine = df_region$wine[obs]/df_region$wine[1] - 1,
g_beer = df_region$beer[obs]/df_region$beer[1] - 1,
g_vodka = df_region$vodka[obs]/df_region$vodka[1] -1,
g_champagne = df_region$champagne[obs]/df_region$champagne[1] - 1,
g_brandy = df_region$brandy[obs]/df_region$brandy[1] -1 )
}
return(df_cum)
}
i <- 0
for (i_region in v_region) {
i <- i+1
df_region <- filter(df_data, region == i_region) %>% arrange(year)
if (i == 1) {
df_result <- f_growth(i_region, df_region)}
else {
df_result <- bind_rows(df_result, f_growth(i_region, df_region) )
}
}
# preparamos matriz de distancias
v_colums <- colnames(df_result)
rownames(df_result) <- df_result$region
df_result_clean <- df_result[ , !v_colums %in% "region"]
# normalized variables
df_result_clean_norm <- scale(df_result_clean)
mat_dist <- dist(df_result_clean_norm)
dist_sp <- sort(as.matrix(mat_dist)[ , "Saint Petersburg"])
df_dist_sp <- data.frame(region = names(dist_sp), distance = dist_sp)
row.names(df_dist_sp) <- NULL
# top 10 nearest to Saint Petersburg
df_dist_sp[2:10, ] %>%
ggplot(aes(x = distance, y = fct_reorder(region, distance, .desc = T))) +
geom_col() +
labs( title = "Top 10 regions similar to Saint Petersburg",
subtitle = "based in distances for normalized drinks and growth",
x = "distance",
y = "region")
Conclusions
- According to summarize the consumption in the period 1998 - 2016 we created variables for growth of drinks in that period.
- All variables consumption and their growths were normalized, this is a key point, therefore we found a hihg skewness on the consumption.
- The work horse was the function f_growth
- The distance was the metric to find the regions looks like the Saint Petersburg region, the top 10 was the 10 regions nearest to Saint Petersburg.
- Other considerations to compare this simple modeling could be:
- Apply Clustering, find the best number of cluster and select among the cluster which belongs Saint Petersburg the top 10 regions.
- Applied time series for each region, such as ARIMA models, then apply clustering on the parameters and select the best 10 regions which belong Saint Peterburg.