Skip to content

Pens and Printers - A Report on Sales Analysis

Data

The sales rep has pulled some data from their sales tracking system for me. They haven’t included numbers for how much time was spent on each customer, but there may be some other useful customer information in here.

The data only relates to the new products sold. As there are multiple different products, the revenue will vary depending on which products were sold.

The data hasn’t been validated yet, so I'll make sure that to check it against all of the information in the table before I start the analysis. I'll provide a description of each processing and cleaning step.

Column NameDetails
weekWeek sale was made, counted as weeks since product launch
sales_methodCharacter, which of the three sales methods were used for that customer
customer_idCharacter, unique identifier for the customer
nb_soldNumeric, number of new products sold
revenueNumeric, revenue from the sales, rounded to 2 decimal places
years_as_customerNumeric, number of years customer has been buying from us (company founded in 1984)
nb_site_visitsNumeric, number of times the customer has visited our website in the last 6 months
stateCharacter, location of the customer i.e. where orders are shipped
Spinner
DataFrameas
df
variable
-- Explore the data in the table
SELECT *
FROM 'product_sales.csv'
LIMIT 5

Data Validation:

The data file has 15,000 rows and 8 columns (numeric and categorical data). Here are the cleaning steps taken for every column in the data:

After dropping the missing values from the revenue column, The Cleaned Dataset has 13,926 rows left.

import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 


product_sales = pd.read_csv('product_sales.csv')


product_sales.head()
product_sales.info()

With the above summary, we can see that the revenue column has 13,926 values with 1,074 missing values.

Before dealing with missing values, first, let's check and correct column with str or character type.

print(product_sales.sales_method.unique())
print(' ')
print(product_sales.state.unique())
print(' ')

Loading and Cleaning Data accordingly


Column NameDetails
weekNo change.
sales_methodCorrected. Misspelling such as em + call & email corrected to Email + Call & Email.
customer_idNo change.
nb_soldNo change.
revenueCorrected. Rows with missing values were dropped.
years_as_customerNo change.
nb_site_visitsNo change.
stateNo change.

Now to change the method and the misrepresented values em + call, email. It should be Email + Call & Email.

product_sales.loc[product_sales.sales_method=='em + call', 'sales_method'] = 'Email + Call'
product_sales.loc[product_sales.sales_method=='email', 'sales_method'] = 'Email'
product_sales.sales_method.value_counts()
product_sales.isna().sum()
counts_method = product_sales.sales_method.value_counts(dropna=False)
na_revenue = product_sales.groupby('sales_method')['revenue'].count().reindex(counts_method.index)

na_counts = pd.concat([counts_method, na_revenue], axis=1).assign(NaNs=counts_method-na_revenue)
na_counts.assign(NaNs_pct=round(na_counts.NaNs/na_counts.sales_method, 2))

Since sales methods each has null values in the revenue column less than 15%, I'll simply remove those rows.