Competition - drinks promotions

Where should a drinks company run promotions?

📖 Background

Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.

The data

The marketing team has sourced you with historical sales volumes per capita for several different drinks types.

"year" - year (1998-2016)
"region" - name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city and a single autonomous oblast
"wine" - sale of wine in litres by year per capita
"beer" - sale of beer in litres by year per capita
"vodka" - sale of vodka in litres by year per capita
"champagne" - sale of champagne in litres by year per capita
"brandy" - sale of brandy in litres by year per capita

library(psych)
library(tidyverse)
library(forcats)
library(skimr)
library(naniar) # manage missing data
library(caret)

data <- readr::read_csv('./data/russian_alcohol_consumption.csv')

EDA

Reviewing dataset, missing data and statistics about variables.

General reviewing

# General overview
skim(data)

# reviewing records
glimpse(data)

** Conclusions**

All variables have lower than 5% of missing data, therefore, we are going to drop NA records.
All consumption variables have a positive skew
compsumption variables have a different scale of amount, therefore, mean of brandy is 0.57 and mean of beer 51.3, almost 100 times, we apply a normalize transformation for consumption.

Reviewing distribution


# dropping missing data
df_data <- drop_na(data)

summary(df_data)

#records dropped 
data.frame(dropped_records = nrow(data) - nrow(df_data),
           percentage = (nrow(df_data)/nrow(data) - 1) * 100 )

# Normalizing comsumption variables
df_data_norm <- df_data

df_data_norm[ , 3:7] <- scale(df_data[ ,3:7])

# reviewing summary
summary(df_data_norm)

# tidy_data
df_data_tidy <- gather(df_data_norm, key = "drink", value=consumption, -c(year, region))

# scatterplot
pairs(df_data_norm[, 3:7])

# correlations
corPlot(df_data_norm[, 3:7],
        diag = FALSE,
        upper = F)


# median of cosumption of wine by region
median_wine <- df_data %>% group_by(region) %>% summarize(median_consumption = median(wine)) %>%  summarize(quantile(median_consumption, 0.70)) 

# wine consumption by region 30% upper
df_bar <- df_data %>% group_by(region) %>% summarize(median_consumption = median(wine)) %>%
  filter(median_consumption >= as.double( median_wine)) %>%
  mutate(type = ifelse( region == "Saint Petersburg", "Saint Petersburg", "Others"))


df_bar %>%
ggplot(aes(x = median_consumption, y = fct_reorder(region, median_consumption), fill = type )) +
  geom_col() +
  labs( title = "Median consumption of wine by region",
        subtitle = "30% upper consumption",
        x = "median consumption",
        y = "region") 

# distribution of consumption for all regions
# normalized data
ggplot(df_data_tidy, aes(x = consumption)) +
  geom_histogram(bins = 9) + 
  facet_wrap(~drink) +
  labs(title = "Distribution for all regions",
       subtitle = "nomralized data")


# distribution of consumption for Saint Peterburg region
filter(df_data_tidy, region == "Saint Petersburg" ) %>%
ggplot( aes(x = consumption)) +
  geom_histogram(bins = 9) + 
  facet_wrap(~drink) + 
  labs(title = "Distribution for Saint Petersburg regions",
       subtitle = "nomralized data")


# growth wine consumption
df_sp_wine <- df_data %>% filter( region == "Saint Petersburg") %>% select(year, region, wine) %>% arrange(year) %>%
  mutate(region = "Saint Petersburg", wine_1998 = wine / wine[1])

df_other_wine <- df_data %>% filter (region != "Saint Peterburg") %>% 
  mutate(region = "other regions") %>% group_by(year, region)%>% 
  summarize(wine = sum(wine)) %>% ungroup() %>% arrange(year) %>% 
  mutate(wine_1998 = wine / wine[1])

bind_rows(df_sp_wine, df_other_wine) %>%
  ggplot(aes(x=year, y=wine_1998, color = region)) +
  geom_line() +
  labs(title = "Wine consumption Saint Petersburg vs All regions",
       subtitle = "base year = 1998",
       y = "consumption")

Conclusions

The percentage of dropped observations were 4.09%
The scatter plot shown a low positive correlations among drinks, only champagne and brandy have a correlation of 0.78
The consumption of wine for Saint Petersbur is on top 30%, the region with most median consumption in the period was Kraelia.
The normalize transformation were usefull to compare distribution of drink consumption, therefore, Saint Petersburg shows a pattern different from all regions.
About the growth of wine consumption, Saint Petersburg have had a growth greater than the other regions 2.6 vs 2.0 on basis of 1998 consumption.

Modeling

We are going to follow:

We are going to summarize data by region, and add a feature of growth on each drink, it will be calculated as:

growth_drink = drink[2016] / drink[1998]

Scale all variables on dataset
Calculate distances and select the 10 nearest observations to Saint Petersburg region.

#region

v_region <- unique(df_data$region)

f_growth <- function(region, df_region) {
  obs <- nrow(df_region)
  if (obs > 0) {
    df_cum <- as.data.frame(t(apply(df_region[ ,3:7], 2, sum)))
    df_cum <- df_cum %>% mutate(region = region,
                                g_wine = df_region$wine[obs]/df_region$wine[1] - 1,
                                g_beer = df_region$beer[obs]/df_region$beer[1] - 1,
                                g_vodka = df_region$vodka[obs]/df_region$vodka[1] -1,
                                g_champagne = df_region$champagne[obs]/df_region$champagne[1] - 1,
                                g_brandy = df_region$brandy[obs]/df_region$brandy[1] -1 )
    } 
  return(df_cum)
}

i <- 0

for (i_region in v_region) {
  
  i <- i+1
  
  df_region <- filter(df_data, region == i_region) %>% arrange(year)
  
  
  if (i == 1) {
    df_result <- f_growth(i_region, df_region)}
  else {
    df_result <- bind_rows(df_result, f_growth(i_region, df_region) )
  }
  
}


# preparamos matriz de distancias
v_colums <- colnames(df_result)

rownames(df_result) <- df_result$region

df_result_clean <- df_result[ , !v_colums %in% "region"]

# normalized variables
df_result_clean_norm <- scale(df_result_clean)

mat_dist <- dist(df_result_clean_norm)

dist_sp <- sort(as.matrix(mat_dist)[ , "Saint Petersburg"])

df_dist_sp <- data.frame(region = names(dist_sp), distance = dist_sp)

row.names(df_dist_sp) <- NULL

# top 10 nearest to Saint Petersburg

df_dist_sp[2:10, ] %>%
  ggplot(aes(x = distance, y = fct_reorder(region, distance, .desc = T))) + 
  geom_col() +
  labs( title = "Top 10 regions similar to Saint Petersburg",
        subtitle = "based in distances for normalized drinks and growth",
        x = "distance", 
        y = "region")