Pens and Printers - A Report on Sales Analysis
Data
The sales rep has pulled some data from their sales tracking system for me. They haven’t included numbers for how much time was spent on each customer, but there may be some other useful customer information in here.
The data only relates to the new products sold. As there are multiple different products, the revenue will vary depending on which products were sold.
The data hasn’t been validated yet, so I'll make sure that to check it against all of the information in the table before I start the analysis. I'll provide a description of each processing and cleaning step.
| Column Name | Details |
|---|---|
| week | Week sale was made, counted as weeks since product launch |
| sales_method | Character, which of the three sales methods were used for that customer |
| customer_id | Character, unique identifier for the customer |
| nb_sold | Numeric, number of new products sold |
| revenue | Numeric, revenue from the sales, rounded to 2 decimal places |
| years_as_customer | Numeric, number of years customer has been buying from us (company founded in 1984) |
| nb_site_visits | Numeric, number of times the customer has visited our website in the last 6 months |
| state | Character, location of the customer i.e. where orders are shipped |
-- Explore the data in the table
SELECT *
FROM 'product_sales.csv'
LIMIT 5Data Validation:
The data file has 15,000 rows and 8 columns (numeric and categorical data). Here are the cleaning steps taken for every column in the data:
After dropping the missing values from the revenue column, The Cleaned Dataset has 13,926 rows left.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
product_sales = pd.read_csv('product_sales.csv')
product_sales.head()product_sales.info()With the above summary, we can see that the revenue column has 13,926 values with 1,074 missing values.
Before dealing with missing values, first, let's check and correct column with str or character type.
print(product_sales.sales_method.unique())
print(' ')
print(product_sales.state.unique())
print(' ')Loading and Cleaning Data accordingly
| Column Name | Details |
|---|---|
| week | No change. |
| sales_method | Corrected. Misspelling such as em + call & email corrected to Email + Call & Email. |
| customer_id | No change. |
| nb_sold | No change. |
| revenue | Corrected. Rows with missing values were dropped. |
| years_as_customer | No change. |
| nb_site_visits | No change. |
| state | No change. |
Now to change the method and the misrepresented values em + call, email. It should be Email + Call & Email.
product_sales.loc[product_sales.sales_method=='em + call', 'sales_method'] = 'Email + Call'
product_sales.loc[product_sales.sales_method=='email', 'sales_method'] = 'Email'product_sales.sales_method.value_counts()product_sales.isna().sum()counts_method = product_sales.sales_method.value_counts(dropna=False)
na_revenue = product_sales.groupby('sales_method')['revenue'].count().reindex(counts_method.index)
na_counts = pd.concat([counts_method, na_revenue], axis=1).assign(NaNs=counts_method-na_revenue)
na_counts.assign(NaNs_pct=round(na_counts.NaNs/na_counts.sales_method, 2))Since sales methods each has null values in the revenue column less than 15%, I'll simply remove those rows.