Analysing Product Sales
Data Analyst Professional Practical Exam Submission


    📝 Task List

    Your written report should include written text summaries and graphics of the following:

    • Data validation:
      • Describe validation and cleaning steps for every column in the data
    • Exploratory Analysis:
      • Include two different graphics showing single variables only to demonstrate the characteristics of data
      • Include at least one graphic showing two or more variables to represent the relationship between features
      • Describe your findings
    • Definition of a metric for the business to monitor
      • How should the business use the metric to monitor the business problem
      • Can you estimate initial value(s) for the metric based on the current data
    • Final summary including recommendations that the business should undertake


    First, let's import the dataset into a pandas DataFrame and get a view of what we're working with.

    # Importing the dataset into a DataFrame
    import pandas as pd
    sales = pd.read_csv('product_sales.csv')
    print(sales)
    print(sales.columns)

    Validating Data

    Here we can see that there are 15,000 rows with the following column names and issues:

    1. week - no issues. Integer values 1-6, no null values found.
    2. sales_method - two stray values, 'em + call' and 'email', were being counted separately from their proper labels. They were recoded with replace, leaving 3 unique values: 'Email', 'Email + Call', 'Call'.
    3. customer_id - no issues; all 15,000 values are unique and non-null.
    4. revenue - 1,074 missing values. It will be worth a deeper dive to look for patterns within the missing set.
    5. years_as_customer - two outliers: customers recorded as patrons for 40 and 63 years, which is implausible given the company was established in 1984. The histogram reveals a right-skewed distribution.
    6. nb_site_visits - no correction required. The histogram most closely follows a Normal distribution.
    7. state - no correction required.
    #Validating the data
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    #Finding columns with null values-  While we can see that the data types match the description provided, the revenue column has null values
    print(sales.info())
    
    #Validate Week- This refers to the weeks in question numbered 1-6. 
    print(sales['week'].unique())
    
    #Validate sales_method - 'em + call' and 'email' are recoded as 'Email + Call' and 'Email' to consolidate the unique values
    print(sales['sales_method'].unique())
    sales['sales_method'] = sales['sales_method'].replace(['em + call','email'],['Email + Call','Email'])
    print(sales['sales_method'].unique())
    
    #Validate customer_id- 15000 unique ids
    print(sales['customer_id'].unique().shape)
    
    #Validate nb_sold (new products sold) - 10 unique non-null values. No correction required
    print(sales['nb_sold'].unique())
    
    #Validate revenue- 1074 Null values found. Imputing missing data below
    print(sales['revenue'].describe())
    print(sales['revenue'].isnull().sum())
    
    #Validate years_as_customer- Describe shows that most values tend to be under 10. There are 2 outliers.
    print(sales['years_as_customer'].describe())
    sns.boxplot(data=sales['years_as_customer'])
    plt.show()
    print(sales[sales['years_as_customer']>40])
    sns.histplot(sales['years_as_customer'])
    plt.show()
    
    #Validate nb_site_visits and state - no correction required
    print(sales['nb_site_visits'].describe())
    print(sales['nb_site_visits'].info())
    print(sales['nb_site_visits'].unique())
    print(sales['state'].describe())
    print(sales['state'].info())
    print(sales['state'].unique())
    sns.histplot(sales['nb_site_visits'])
    plt.show()
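The two implausible years_as_customer values could be corrected rather than dropped. A minimal sketch of one option, capping tenure at the company's maximum possible age (DATA_YEAR is an assumption standing in for the actual collection year, and demo stands in for the real sales frame):

```python
import pandas as pd

# Hypothetical sample mirroring the two outliers flagged above
demo = pd.DataFrame({'years_as_customer': [3, 10, 40, 63]})

# The company was established in 1984, so tenure cannot exceed its age.
# DATA_YEAR is an assumption -- replace it with the year the data was collected.
COMPANY_FOUNDED = 1984
DATA_YEAR = 2024
max_tenure = DATA_YEAR - COMPANY_FOUNDED

# Cap any tenure above the maximum possible value
demo['years_as_customer'] = demo['years_as_customer'].clip(upper=max_tenure)
print(demo['years_as_customer'].tolist())
```

Whether to cap, drop, or leave these two rows is a judgment call; capping keeps the rows available for the revenue analysis.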

    Imputing missing revenue

    Further analysis shows a positive correlation between nb_sold and revenue within each sales method, so the missing revenue values can be predicted from these columns and imputed.

    # Check the relationship between revenue column and nb_sold column by sales_method Email
    sales_Email = sales[sales['sales_method'] == 'Email']
    print(sales_Email[['revenue','nb_sold']].corr())
    # Check the relationship between revenue column and nb_sold column by sales_method Email + Call
    sales_Email_Call = sales[sales['sales_method'] == 'Email + Call']
    print(sales_Email_Call[['revenue','nb_sold']].corr())
    # Check the relationship between revenue column and nb_sold column by sales_method Call
    sales_Call = sales[sales['sales_method'] == 'Call']
    print(sales_Call[['revenue','nb_sold']].corr())
    # Visualize the relationship between revenue column and nb_sold column classified by sales_method.
    sns.relplot(x='nb_sold', y='revenue', data=sales,hue='sales_method')
    plt.show()
    
    #method for imputing revenue data
    import statsmodels.formula.api as smf
    # Rows with known revenue, used to fit the model
    sales_model = sales[~sales['revenue'].isna()]
    # Rows with missing revenue, to be predicted
    sales_estimate = sales[sales['revenue'].isna()]
    # Fit an OLS model on nb_sold and sales_method, then predict the missing values
    x = smf.ols('revenue ~ nb_sold + C(sales_method)', data = sales_model).fit()
    estimates = x.predict(sales_estimate)
    print(estimates)
    
    # Fill missing values with results
    sales['revenue'] = sales['revenue'].fillna(estimates)
    print(sales.info())
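The three per-method correlation checks above can also be computed in one pass with groupby. A sketch on a small hypothetical sample (demo stands in for the real sales frame):

```python
import pandas as pd

# Hypothetical sample with the same columns as product_sales.csv
demo = pd.DataFrame({
    'sales_method': ['Email', 'Email', 'Call', 'Call', 'Email + Call', 'Email + Call'],
    'nb_sold':      [7, 9, 8, 10, 9, 12],
    'revenue':      [90.0, 110.0, 35.0, 45.0, 170.0, 210.0],
})

# One correlation matrix per sales method, replacing the three separate filters
corr_by_method = demo.groupby('sales_method')[['nb_sold', 'revenue']].corr()
print(corr_by_method)
```

The result is a MultiIndex frame with one 2x2 correlation matrix per method, which avoids the copy-paste slip fixed above.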

    Analysis of sales_method

    1. Email accounts for about 50% of total sales
    2. Email alone outperforms Email + Call and Call
    3. The missing revenue values are not concentrated in any one sales method
    ax = sales['sales_method'].value_counts().plot(kind='bar', title='Sales Method Distribution')
    ax.bar_label(ax.containers[0], padding=1)
    plt.show()
    
    print(sales['sales_method'].value_counts(normalize=True))
    
    print(sales['sales_method'][sales['revenue'].isnull()].value_counts())

    Analysis of revenue

    1. The majority of revenue values fall between 50 and 100, with slightly more customers toward the 100 end
    2. A smaller group of higher values sits above 180
    sns.boxplot(sales['revenue'])
    plt.show()
    
    sns.histplot(sales['revenue'])
    plt.show()
    

    Analysis of nb_site_visits

    1. The histogram shows that customers visit the site about 25 times on average.
    2. Roughly 5,000 customers visited the site 24-26 times in the last 6 months.
    3. The distribution most closely follows a Normal distribution, with about 90% of customers visiting the site between 15 and 35 times.
    sns.histplot(sales['nb_site_visits'])
    plt.title('Number of times customers visited (last 6 months)')
    plt.show()
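The "90% of customers visit between 15 and 35 times" claim can be checked directly with Series.between. A sketch on simulated data (the mean of 25 and spread of 5 are assumptions read off the histogram; on the real frame, sales['nb_site_visits'] would replace the simulated series):

```python
import numpy as np
import pandas as pd

# Simulated visit counts standing in for sales['nb_site_visits']
rng = np.random.default_rng(42)
visits = pd.Series(rng.normal(25, 5, 15000).round().astype(int))

# Empirical share of customers with 15-35 visits (inclusive)
share = visits.between(15, 35).mean()
print(f"{share:.1%} of customers visited the site 15-35 times")
```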
    
    

    Analysing revenue by sales method

    1. Email + Call stands out for its wide spread, with a high median of about 175 and a mean of 183
    2. Call has a narrow spread, with a median of 50 and a mean of 47
    3. Email has a wide spread, though not as wide as Email + Call, and sits in the middle with a median of 100 and a mean of 97
    4. A bar plot with the estimator set to sum shows that total revenue from Email is the highest, followed by Email + Call and then Call
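The figures behind these findings can be produced with seaborn's boxplot and barplot (a sketch on a small hypothetical sample; on the real data, sales replaces demo, and the totals differ):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical sample standing in for the full sales frame
demo = pd.DataFrame({
    'sales_method': ['Call'] * 3 + ['Email'] * 4 + ['Email + Call'] * 2,
    'revenue': [40.0, 45.0, 50.0, 90.0, 95.0, 100.0, 105.0, 170.0, 190.0],
})

# Spread and center of revenue per method
sns.boxplot(x='sales_method', y='revenue', data=demo)
plt.show()

# Total revenue per method -- estimator=sum aggregates instead of averaging
sns.barplot(x='sales_method', y='revenue', data=demo, estimator=sum)
plt.show()
```

Passing estimator=sum to barplot is what turns the default per-method mean into the total-revenue comparison described in point 4.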