Product Sales Approach Analysis
Project Overview
This repository contains the project that earned me my DataCamp Professional certification. In this project, I analyzed sales data from the company Pens and Printers.
Pens and Printers supplies high-quality office products, ranging from pens and notebooks to desk chairs and monitors, to large organizations. Although they do not manufacture their products, they specialize in selling items produced by other companies.
The company's focus is on providing tools that help customers enhance creativity and brainstorming. They recently launched a new line of office stationery and tested three different sales strategies: targeted emails, phone calls, and a combination of both.
Email: Customers in this group received an initial email when the product line launched, followed by another email three weeks later. This method required minimal effort from the sales team.
Call: Customers were contacted by a sales team member, with each call averaging around thirty minutes per customer.
Email and Call: Customers first received an email with product information and were subsequently called a week later. The initial email required little effort, while the follow-up call averaged ten minutes per customer.
The executive team sought an analysis of the effectiveness of these sales approaches for the new product line. They needed insights on:
- The number of customers reached by each approach.
- The overall and method-specific revenue distribution.
- Differences in revenue over time for each sales strategy.
- Recommendations on the most effective method, considering the time investment required from the sales team.
- Any additional insights into differences between customer groups that could inform future strategies.
The successful completion of this project and certification demonstrate my ability to deliver actionable insights and data-driven recommendations.
📝 Task List
- Data validation:
- Describe validation and cleaning steps for every column in the data
- Exploratory Analysis:
- Include two different graphics showing single variables only to demonstrate the characteristics of data
- Include at least one graphic showing two or more variables to represent the relationship between features
- Describe your findings
- Definition of a metric for the business to monitor
- How should the business use the metric to monitor the business problem
- Can you estimate initial value(s) for the metric based on the current data
- Final summary including recommendations that the business should undertake
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#importing dataset
df=pd.read_csv("product_sales.csv")
df.shape#Checking null values
print(df.isnull().sum())
# Count duplicate rows
df.duplicated().sum()
# Drop nulls from the 'revenue' column
df.dropna(inplace=True)
# Check unique values in sales_method column
print(df['sales_method'].unique())
# Replace 'em + call' to 'Email + Call' and 'email' to 'Email' in 'sales_method' column
df['sales_method'] = df['sales_method'].replace({'em + call': 'Email + Call', 'email': 'Email'})
#sales_method categorical counts
df['sales_method'].value_counts()
print(df[df['years_as_customer'] > 40])
# Delete the rows where 'years_as_customer' is greater than 40
df = df[df['years_as_customer'] <= 40]
df.nunique()Data Validation
The dataset contains 15000 rows and 8 columns before cleaning and validataion. I have validated all the columns against the criteria in the dataset table:
- week: 6 weeks without missing values. No cleaning is needed.
- sales_method: 5 sales methods found. Replaced wrong values with 'Email + Call' and 'Email'.
- customer_id: 15000 unique values without missing values. No cleaning is needed.
- nb_sold: 10 unique numeric values without missing values. No cleaning is needed.
- revenue: 13926 numeric values and 1074 missing values. Cleaned the rows with null values.
- years_as_customer: 42 unique numeric values without missing values. But two values found greater than 40 (company founded in 1984, now 2023 runnning). So we cleaned those rows. Now year column has 40 unique values without missing values.
- nb_site_visits: 27 unique numeric values without missing values. No cleaning is needed.
- state: 50 states without missing values,same as the description. No cleaning is needed.
No duplicate rows found in the dataset. After the data validation, the dataset contains 13926 rows and 8 columns without missing values.
How many customers were there for each approach?
There are 6921 sales where we contacted with customer through Email, 4780 through Call, and 2223 through both Email and Call. About 49.7% customer received Email, 34.3% received Call, and only 16% received from Email+Call.
import plotly.express as px
# Get the counts of each sales method
sales_method_counts = df['sales_method'].value_counts().reset_index()
# Create a bar chart with color as previous pie chart
fig = px.bar(sales_method_counts, x='index', y='sales_method', title='Sales Method Distribution', color='index')
fig.update_layout(showlegend=False, title_x=0.5)
fig.show()
# Create a pie chart
fig = px.pie(sales_method_counts, values='sales_method', names='index', title='Sales Method Distribution')
fig.update_layout(showlegend=False, title_x=0.5)
fig.update_traces(textposition='inside', textinfo='percent+label')
fig.show()What does the spread of the revenue look like overall? And for each method?
From the multiple boxplots below, we can see majority of the overall revenue we gain from sales range from 32.54 to 107.33. For individual sales methods, the majority of the sales revenue for Email+Call range from 155.77 to 191.12, for Call range 41.47 from to 52.68, for Email range from 87.88 to 105.17.
# Create a box plot to visualize the spread of revenue overall
fig = px.box(df, y='revenue', title='Spread of Revenue Overall')
fig.update_layout(showlegend=False, title_x=0.5)
fig.show()
# Create a box plot to visualize the spread of revenue for each sales method
fig = px.box(df, x='sales_method', y='revenue', color='sales_method', title='Spread of Revenue by Sales Method', color_discrete_map={
"Email":px.colors.qualitative.Plotly[0],
"Call":px.colors.qualitative.Plotly[1],
"Email + Call":px.colors.qualitative.Plotly[2]
})
fig.update_layout(showlegend=False, title_x=0.5)
fig.show()Was there any difference in revenue over time for each of the methods?
In the following line chart, we see that most revenue is gained from customers over time via Email. Email+Call is the closest revenue-gainer method compared with Email. Call is the lowest revenue-gainer sales method. We also see that in customers' first year, we obtain the most revenue. But it gradually decreased over time.
# Group the data by sales method and year, and sum the revenue
revenue_by_method_year = df.groupby(['sales_method', 'years_as_customer'])['revenue'].sum().reset_index()
# Create a line chart for each sales method
fig = px.line(revenue_by_method_year, x='years_as_customer', y='revenue', color='sales_method', title='Revenue Over Time by Sales Method',
color_discrete_map={
"Email":px.colors.qualitative.Plotly[0],
"Call":px.colors.qualitative.Plotly[1],
"Email + Call":px.colors.qualitative.Plotly[2]
})
# Add chart title and axis labels
fig.update_layout(title='Revenue Over Time by Sales Method', xaxis_title='years_as_customer', yaxis_title='Revenue', title_x=0.5,legend=dict(
yanchor="top",
y=0.99,
xanchor="left",
x=0.86))
# Show the chart
fig.show()# Group the data by sales method and calculate the mean revenue for each method
fig = px.bar(
data_frame=df.groupby(['sales_method']).mean().reset_index(),
x="sales_method",
y="revenue"
)
fig.show()