Skip to content

Product Sales: Optimizing Sales Strategy for New Product Line

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn grid style
sns.set_style('ticks')
sns.set_context('paper')
sns.set_palette('colorblind')
sns.set_style({'axes.grid': True, 'grid.color': '.8', 'grid.linestyle': '--'})
# Load the data
prod_sales = pd.read_csv('product_sales.csv')
prod_sales.head()
prod_sales.shape

Data Validation

The data set has 15000 rows and 8 columns. I have validated all the variables and have made changes after validation. Not all the columns were as described in the data dictionary.

  • week: 6 unique numeric values without missing values, same as the description. No cleaning is needed.
  • sales_method: 5 character (string) values without missing values. Replaced em + calland email with Email + Call and Email respectively, ensuring there are only 3 unique sales methods (Email, Call and Email + Call).
  • customer_id: Character values without missing values, same as the description. No cleaning is needed.
  • nb_sold: Numeric values without missing values, same as the description. No cleaning is needed.
  • revenue: Numeric values with 1074 missing values. Missing values imputed with the mean.
  • years_as_customer: Numeric values without missing values but with two outlier values 47 and 63. The two values are greater than 39, were considered outliers and deleted since the company was created in 1984 and we are in 2023.
  • nb_site_visits: Numeric values without missing values, same as the description. No cleaning is needed.
  • state: 50 unique character values without missing values, same as the description. No cleaning is needed.
# Check variable data types
prod_sales.info()
prod_sales['sales_method'].unique()
prod_sales['week'].unique()
# Check for missing values
prod_sales.isnull().sum()
# Percentage of missing values
print("Missing values for revenue: {:.2f}%".format(100 * prod_sales['revenue'].isnull().sum() / len(prod_sales)))
# Check for duplicates
prod_sales.duplicated().sum()
# Check for outliers
prod_sales.describe()
# Count number of records in years_as_customer column that are greater than 39
count = (prod_sales['years_as_customer'] > 39).sum()
count
# Identify outliers
outliers = prod_sales.loc[prod_sales['years_as_customer'] > 39, 'years_as_customer']

# Remove outliers from dataframe
prod_sales = prod_sales[~prod_sales['years_as_customer'].isin(outliers)]
prod_sales.shape
prod_sales['sales_method'].dtype