Product Sales 10 Aug 2023

Product Sales Analysis 10 August 2023

You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary report.

You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

📝 Task List

Your written report should include written text summaries and graphics of the following:

Data validation:
- Describe validation and cleaning steps for every column in the data
Exploratory Analysis:
- Include two different graphics showing single variables only to demonstrate the characteristics of data
- Include at least one graphic showing two or more variables to represent the relationship between features
- Describe your findings
Definition of a metric for the business to monitor
- How should the business use the metric to monitor the business problem
- Can you estimate initial value(s) for the metric based on the current data
Final summary including recommendations that the business should undertake

Start writing report here..

Data Validation and Cleaning

This step has purpose to prepare the workspace and clean the dataset before analysis. I imported several libraries and the dataset "Pens and Printers" company. The preview of original dataset is showed in table below.

Hidden code

sales_df

Here is the result of data checking in each variable:

variable week : no NA or blank data
variable sales_method : many typos such as "em + call" and "email" --> it needs cleaning
variable customer_id : no NA or blank data
variable nb_sold : no NA or blank data
variable revenue : 1074 data has NA or blank value --> it needs cleaning
variable years_as_customer : no NA or blank data
variable nb_site_visits : no NA or blank data
variable state : no NA/blank data, no typo data also

To fix typo data in column sales_method, I used dplyr::case_when to relabel all typos into just 3 methods: "Email", "Call", "Email + Call". Meanwhile to fix NA data in 1074 rows in column revenue, I imputed those with predicted revenue values from each sales method's linear regression model lm(formula = revenue ~ nb_sold). The model confidence to impute missing values is good with adjusted R square > 0.95. This imputation method has background from the chart below, showing high correlation between variable nb_sold and revenue per sales_method.

#check NA
sales_df[which(is.na(sales_df$week)),]
sales_df[which(is.na(sales_df$sales_method)),]
sales_df[which(is.na(sales_df$customer_id)),]
sales_df[which(is.na(sales_df$nb_sold)),]
#found there is 1074 rows having NA revenue
sales_df[which(is.na(sales_df$revenue)),]
sales_df[which(is.na(sales_df$years_as_customer)),]
sales_df[which(is.na(sales_df$nb_site_visits)),]
sales_df[which(is.na(sales_df$state)),]

#check values for categorical data
#found typo in sales_method
unique(sales_df$sales_method)
unique(sales_df$state) %>% sort()

#fix typo in sales_method
sales_df <- sales_df %>% mutate(
  sales_method_cleaned = case_when(
    sales_method == "em + call" ~ "Email + Call",
    sales_method == "email" ~ "Email",
    sales_method == "Email" ~ "Email",
    sales_method == "Email + Call" ~ "Email + Call",   
    sales_method == "Call" ~ "Call",       
  )
)

#fix NA in revenue
sales_df[which(!is.na(sales_df$revenue)),] %>% ggplot(aes(x = nb_sold, y = revenue, color = sales_method_cleaned)) + geom_point()

lm_email <- sales_df[which(sales_df$sales_method_cleaned == "Email" & !is.na(sales_df$revenue)),] %>% lm(formula = revenue ~ nb_sold)
lm_emailcall <- sales_df[which(sales_df$sales_method_cleaned == "Email + Call" & !is.na(sales_df$revenue)),] %>% lm(formula = revenue ~ nb_sold)
lm_call <- sales_df[which(sales_df$sales_method_cleaned == "Call" & !is.na(sales_df$revenue)),] %>% lm(formula = revenue ~ nb_sold)
summary(lm_email)
summary(lm_emailcall)
summary(lm_call)
sales_df_email_na <- sales_df[which(sales_df$sales_method_cleaned == "Email" & is.na(sales_df$revenue)),]
sales_df_email_na$revenue_cleaned <- predict(lm_email,sales_df_email_na)
sales_df_emailcall_na <- sales_df[which(sales_df$sales_method_cleaned == "Email + Call" & is.na(sales_df$revenue)),]
sales_df_emailcall_na$revenue_cleaned <- predict(lm_emailcall,sales_df_emailcall_na)
sales_df_call_na <- sales_df[which(sales_df$sales_method_cleaned == "Call" & is.na(sales_df$revenue)),]
sales_df_call_na$revenue_cleaned <- predict(lm_call,sales_df_call_na)
sales_df_all_notna <- sales_df[which(!is.na(sales_df$revenue)),] %>% mutate(revenue_cleaned = revenue) 
sales_df_cleaned <- sales_df_all_notna %>% dplyr::bind_rows(sales_df_email_na) %>% 
  dplyr::bind_rows(sales_df_emailcall_na) %>% 
  dplyr::bind_rows(sales_df_call_na)

Into dataset, I added new variable sales_method_cleaned: variable contained cleaned sales method values. We don't find any typo anymore with this variable, and instead just find three unique values as expected.

sales_df_cleaned$sales_method_cleaned %>% unique()

Into dataset, I added new variable revenue_cleaned : variable contained fully completed revenue value after linear regression imputation. From table below, NA/blank in certain customer_id data of column revenue now has imputed value in column revenue_cleaned.

sales_df_cleaned[which(is.na(sales_df_cleaned$revenue)),] %>% select(customer_id, sales_method, revenue, revenue_cleaned)

The data validation is finished and data has been cleaned comprehensively, maintaining 15000 rows of data. No any rows of data was removed. Below table is preview of cleaned dataset ready for analysis. Variables sales_method and revenue has been removed from the dataset, and instead variables sales_method_cleaned and revenue_cleaned are available for analysis.

sales_df_cleaned %>% select(week, customer_id, nb_sold, years_as_customer, nb_site_visits, state, sales_method_cleaned, revenue_cleaned)

‌
‌
‌

Product Sales 10 Aug 2023

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}Product Sales Analysis 10 August 2023

📝 Task List

Data Validation and Cleaning

Product Sales Analysis 10 August 2023