Data Analyst Product Sales
Hi, I'm really hoping you can help me out with some analysis. This is quite short notice, but I have a meeting with the executive team to update them on sales approaches for the new product line, so I need this information as soon as possible. We need to know:
- How many customers were there for each approach?
- What does the spread of the revenue look like overall? And for each method?
- Was there any difference in revenue over time for each of the methods?
- Based on the data, which method would you recommend we continue to use? Some of these methods take more time from the team, so they may not be the best choice if the results are similar.

We don't really know whether there are other differences between the customers in each group, so anything you can tell us would be helpful context for what went well. I need to report to the executive team in 4 weeks, and you need to present to me before then so I understand what is going on and what we do next. Attached are more details about the data and what I expect you to do. I look forward to seeing your presentation.
New Product Sales Methods
Six weeks ago, we launched a new line of office stationery. Despite the world becoming increasingly digital, there is still demand for notebooks, pens, and sticky notes. Our focus has been on selling products to enable our customers to be more creative, focused on tools for brainstorming.
We have tested three different sales strategies for this: targeted email, phone calls, and a combination of the two:
- Email: Customers in this group received an email when the product line was launched, and a further email three weeks later. This required very little work for the team.
- Call: Customers in this group were called by a member of the sales team. On average, team members were on the phone for around thirty minutes per customer.
- Email and Call: Customers in this group were first sent the product information email, then called a week later by the sales team to talk about their needs and how this new product may support their work. The email required little work from the team, and the call took around ten minutes per customer.
Data Validation
# Importing Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# load the data and view first few rows
sales = pd.read_csv('product_sales.csv')
display(sales.head())
# Display general information about the 'sales' dataset using the info() method.
print(sales.info())
print('\n')
# Display summary statistics for the 'sales' dataset using the describe() method.
display(sales.describe())
# check for the sum of the null values in the dataframe
sales.isna().sum()
Dataset Summary
- Missing Values: The 'revenue' column has missing values and requires handling.
- Summary Statistics: The displayed summary statistics offer insights into data distribution for analysis.
# Specify the column for missing values
column_to_check = 'revenue'
# Step 1: Count the number of missing values in the column
missing_values_count = sales[column_to_check].isnull().sum()
# Step 2: Calculate the threshold for missing values (5 percent)
total_rows = len(sales)
threshold = 0.05 * total_rows
# Step 3: Compare the count of missing values with the threshold
if missing_values_count <= threshold:
    # Step 4: If the count is at or below the 5% threshold, drop the rows with missing values
    sales.dropna(subset=[column_to_check], inplace=True)
    print(f"Dropped rows with missing '{column_to_check}' values (at or below the 5% threshold).")
else:
    print(f"Missing values in the '{column_to_check}' column exceed the 5% threshold, so the rows will be kept and imputed instead.")
# Histogram of the 'revenue' column to inspect its distribution
sales['revenue'].hist(bins=20)
plt.show()
# The distribution is skewed, so impute missing revenue with the median
sales['revenue'] = sales['revenue'].fillna(sales['revenue'].median())
Data Analysis and Imputation
- The histogram of the 'revenue' column showed that the data is skewed. As a result, we opted to use the median to fill the missing values.
- According to the data description, revenue varies depending on which products were sold, and there is no product-type column, so we cannot group the revenue by another column for the imputation.
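To illustrate why the median is the safer choice here, the sketch below uses a small hypothetical revenue sample with one large outlier (the values are illustrative, not from the dataset): the mean is pulled upward by the outlier while the median stays near the bulk of the data.

```python
import pandas as pd

# Hypothetical skewed revenue sample with one large outlier
revenue = pd.Series([30, 35, 40, 45, 50, 60, 70, 500])

mean_val = revenue.mean()      # pulled upward by the 500 outlier
median_val = revenue.median()  # robust to the outlier

print(mean_val, median_val)  # 103.75 vs 47.5
```

Imputing with the mean here would inflate every filled-in value; the median better represents a typical observation in a skewed distribution.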
# Get the list of column names in the 'sales' dataset.
columns = sales.columns
# Iterate through each column in the dataset and print the unique values
for col in columns:
    unique_values = sales[col].unique()
    num_of_unique_values = sales[col].nunique()
    print(f"Unique values in '{col}':")
    print(unique_values)
    print(num_of_unique_values)
    print("\n")
# Filter the DataFrame based on the condition 'years_as_customer' > 39
sales_out_of_range_rows = sales[sales['years_as_customer'] > 39]
# Display the rows whose tenure exceeds the company's 39-year age
display(sales_out_of_range_rows)
Unique Values Summary
- Sales Method Case: The 'sales_method' column contains values in both title case and lower case. Standardizing the case is necessary for consistency.
- Email and Call Duplication: The value 'email + call' also appears as 'em + call', indicating a need for data cleaning to address this duplication.
- Years As a Customer: Some rows have 'years_as_customer' greater than the company's age of 39 years stated in the data description; those rows need to be dropped.
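One way to fix the first two issues above is to lower-case the 'sales_method' column and map the shorthand label onto the full one. This is a sketch on a small hypothetical sample; only the column name and the label variants come from the dataset.

```python
import pandas as pd

# Hypothetical sample mirroring the inconsistent labels noted above
df = pd.DataFrame({'sales_method': ['Email', 'email', 'Call',
                                    'Email + Call', 'em + call']})

# Lower-case everything, then map the shorthand to the full label
df['sales_method'] = (df['sales_method']
                      .str.lower()
                      .replace({'em + call': 'email + call'}))

print(df['sales_method'].unique())  # ['email' 'call' 'email + call']
```

After this step the column should contain exactly three categories, matching the three sales strategies described in the brief.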
# Filter the DataFrame to keep only rows where 'years_as_customer' <= 39
sales = sales[sales['years_as_customer'] <= 39]
# Now, the 'sales' DataFrame contains only rows where 'years_as_customer' is not greater than 39
sales_out_of_range_rows = sales[sales['years_as_customer'] > 39]
print(sales_out_of_range_rows)
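As a final step, it can help to assert the invariants the cleaning was meant to establish: no missing revenue, no out-of-range tenure, and exactly three sales-method categories. This is a sketch on a hypothetical cleaned frame, assuming the case and label fixes noted above have been applied.

```python
import pandas as pd

# Hypothetical cleaned frame used to illustrate the final validation checks
sales = pd.DataFrame({
    'sales_method': ['email', 'call', 'email + call'],
    'revenue': [90.5, 120.0, 150.2],
    'years_as_customer': [2, 10, 39],
})

# After cleaning, these invariants should all hold
assert sales['revenue'].isna().sum() == 0
assert sales['years_as_customer'].max() <= 39
assert set(sales['sales_method'].unique()) == {'email', 'call', 'email + call'}
print('validation checks passed')
```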