Skip to content
New Workbook
Sign up
Product Sales 10 Aug 2023

Product Sales Analysis 10 August 2023

You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary report.

You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

📝 Task List

Your written report should include written text summaries and graphics of the following:

  • Data validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Definition of a metric for the business to monitor
    • How should the business use the metric to monitor the business problem
    • Can you estimate initial value(s) for the metric based on the current data
  • Final summary including recommendations that the business should undertake

Start writing report here..

Data Validation and Cleaning

This step has purpose to prepare the workspace and clean the dataset before analysis. I imported several libraries and the dataset "Pens and Printers" company. The preview of original dataset is showed in table below.

Hidden code
sales_df

Here is the result of data checking in each variable:

  1. variable week : no NA or blank data
  2. variable sales_method : many typos such as "em + call" and "email" --> it needs cleaning
  3. variable customer_id : no NA or blank data
  4. variable nb_sold : no NA or blank data
  5. variable revenue : 1074 data has NA or blank value --> it needs cleaning
  6. variable years_as_customer : no NA or blank data
  7. variable nb_site_visits : no NA or blank data
  8. variable state : no NA/blank data, no typo data also

To fix typo data in column sales_method, I used dplyr::case_when to relabel all typos into just 3 methods: "Email", "Call", "Email + Call". Meanwhile to fix NA data in 1074 rows in column revenue, I imputed those with predicted revenue values from each sales method's linear regression model lm(formula = revenue ~ nb_sold). The model confidence to impute missing values is good with adjusted R square > 0.95. This imputation method has background from the chart below, showing high correlation between variable nb_sold and revenue per sales_method.

#check NA
sales_df[which(is.na(sales_df$week)),]
sales_df[which(is.na(sales_df$sales_method)),]
sales_df[which(is.na(sales_df$customer_id)),]
sales_df[which(is.na(sales_df$nb_sold)),]
#found there is 1074 rows having NA revenue
sales_df[which(is.na(sales_df$revenue)),]
sales_df[which(is.na(sales_df$years_as_customer)),]
sales_df[which(is.na(sales_df$nb_site_visits)),]
sales_df[which(is.na(sales_df$state)),]

#check values for categorical data
#found typo in sales_method
unique(sales_df$sales_method)
unique(sales_df$state) %>% sort()

#fix typo in sales_method
sales_df <- sales_df %>% mutate(
  sales_method_cleaned = case_when(
    sales_method == "em + call" ~ "Email + Call",
    sales_method == "email" ~ "Email",
    sales_method == "Email" ~ "Email",
    sales_method == "Email + Call" ~ "Email + Call",   
    sales_method == "Call" ~ "Call",       
  )
)
#fix NA in revenue
sales_df[which(!is.na(sales_df$revenue)),] %>% ggplot(aes(x = nb_sold, y = revenue, color = sales_method_cleaned)) + geom_point() 
lm_email <- sales_df[which(sales_df$sales_method_cleaned == "Email" & !is.na(sales_df$revenue)),] %>% lm(formula = revenue ~ nb_sold)
lm_emailcall <- sales_df[which(sales_df$sales_method_cleaned == "Email + Call" & !is.na(sales_df$revenue)),] %>% lm(formula = revenue ~ nb_sold)
lm_call <- sales_df[which(sales_df$sales_method_cleaned == "Call" & !is.na(sales_df$revenue)),] %>% lm(formula = revenue ~ nb_sold)
summary(lm_email)
summary(lm_emailcall)
summary(lm_call)
sales_df_email_na <- sales_df[which(sales_df$sales_method_cleaned == "Email" & is.na(sales_df$revenue)),]
sales_df_email_na$revenue_cleaned <- predict(lm_email,sales_df_email_na)
sales_df_emailcall_na <- sales_df[which(sales_df$sales_method_cleaned == "Email + Call" & is.na(sales_df$revenue)),]
sales_df_emailcall_na$revenue_cleaned <- predict(lm_emailcall,sales_df_emailcall_na)
sales_df_call_na <- sales_df[which(sales_df$sales_method_cleaned == "Call" & is.na(sales_df$revenue)),]
sales_df_call_na$revenue_cleaned <- predict(lm_call,sales_df_call_na)
sales_df_all_notna <- sales_df[which(!is.na(sales_df$revenue)),] %>% mutate(revenue_cleaned = revenue) 
sales_df_cleaned <- sales_df_all_notna %>% dplyr::bind_rows(sales_df_email_na) %>% 
  dplyr::bind_rows(sales_df_emailcall_na) %>% 
  dplyr::bind_rows(sales_df_call_na)

Into dataset, I added new variable sales_method_cleaned: variable contained cleaned sales method values. We don't find any typo anymore with this variable, and instead just find three unique values as expected.

sales_df_cleaned$sales_method_cleaned %>% unique()

Into dataset, I added new variable revenue_cleaned : variable contained fully completed revenue value after linear regression imputation. From table below, NA/blank in certain customer_id data of column revenue now has imputed value in column revenue_cleaned.

sales_df_cleaned[which(is.na(sales_df_cleaned$revenue)),] %>% select(customer_id, sales_method, revenue, revenue_cleaned)

The data validation is finished and data has been cleaned comprehensively, maintaining 15000 rows of data. No any rows of data was removed. Below table is preview of cleaned dataset ready for analysis. Variables sales_method and revenue has been removed from the dataset, and instead variables sales_method_cleaned and revenue_cleaned are available for analysis.

sales_df_cleaned %>% select(week, customer_id, nb_sold, years_as_customer, nb_site_visits, state, sales_method_cleaned, revenue_cleaned)