Set-up
install.packages("plotly")
install.packages("patchwork")
install.packages("factoextra")
library(skimr)
library(tidyverse)
library(tidymodels)
library(plotly)
library(patchwork)
theme_set(theme_light())Data
data <-
read_csv('./data/russian_alcohol_consumption.csv')
data %>%
skimr::skim_without_charts()Knowing our data
Background
Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn't possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.
Data dictionary
The marketing team has sourced you with historical sales volumes per capita for several different drinks types.
+-----------------+----------------------------------------------------------------------------------------------------------------------------------------+ | year | year (1998-2016) | +-----------------+----------------------------------------------------------------------------------------------------------------------------------------+ | region | Name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city and a single autonomous oblast | +-----------------+----------------------------------------------------------------------------------------------------------------------------------------+ | wine | Sale of wine in litres by year per capita | +-----------------+----------------------------------------------------------------------------------------------------------------------------------------+ | beer | Sale of beer in litres by year per capita | +-----------------+----------------------------------------------------------------------------------------------------------------------------------------+ | vodka | Sale of vodka in litres by year per capita | +-----------------+----------------------------------------------------------------------------------------------------------------------------------------+ | champagne | Sale of champagne in litres by year per capita | +-----------------+----------------------------------------------------------------------------------------------------------------------------------------+ | brandy | Sale of brandy in litres by year per capita | +-----------------+----------------------------------------------------------------------------------------------------------------------------------------+
Objective/Goals
- Recommend 10 additional regions they should select for the promotion.
- Tell the story that supports your recommendations.
Notes
-
Region should be a factor
-
We need to pivot longer
-
Missing data across all numerical vars, but the complete rate looks good enough
-
We are looking for patterns of regions that behave more similar to Saint Petersburg. looks like a clustering task
-
There is ts data (historical sales for diferent regions) maybe time series models ?
-
Top regions with most "type" liters per cap sold
-
Do some summary and plot functions
-
do a clustering and maybe an interactive graph
Exploratory Data Analysis
data <-
data %>%
pivot_longer(-c(year,region),
names_to = "type",
values_to = "value")Summaries of the data
data %>%
na.omit() %>%
group_by(type) %>%
summarise(min = min(value),
median = median(value),
mean = mean(value),
max = max(value),
sd = sd(value),
IQR = IQR(value))data %>%
view()
data %>%
distinct(region) %>%
summarise(n = n())
data %>%
# filter(region == "Saint Petersburg") %>%
group_by(type,region) %>%
summarise(mean = mean(value),
median = median(value),
min = min(value),
max = max(value)) %>%
filter(type == "beer") %>%
arrange(desc(mean))
data %>%
filter(region == "Saint Petersburg") %>%
ggplot(aes(x = year, y = value))+
geom_line()+
geom_point()+
facet_wrap(~type, scales = "free_y")+
stat_smooth(method = "lm",formula = y ~ splines::bs(x, 3))Whats happening in Saint Petersburg ?
- Sells more beer than any other type of alchohol in average across all years
- Beer presents a downward trend but the volume of sales its still more compared to the others
- Vodka has a stepper downtrend
- Wine and brandy possibly upper trend
- Champange possibly downward trend
Look for regions with the most similar sells patterns (trends)
K-means clustering
data %>%
mutate(year_type = paste(type,year),
year_type = snakecase::to_snake_case(year_type)) %>%
select(-c(type,year)) %>%
pivot_wider(names_from = year_type,
values_from = value,
values_fill = 0)whats happening ? more columns than rows because there are 4 types of alcochol for 1998 to 2016, and dont know how to deal with it yet
For now lets summarise across years by the mean
clust_data <-
data %>%
filter(!is.na(value)) %>%
group_by(region,type) %>%
summarise(mean = mean(value)) %>%
ungroup() %>%
pivot_wider(names_from = type,
values_from = mean,
values_fill = 0) %>%
mutate(across(is.numeric,~ as.numeric(scale(.))))k_clust <- kmeans(select(clust_data,-region),centers = 4)
tidy(k_clust)
glance(k_clust)