Skip to content
Product Sales: Optimizing Sales Strategy for New Product Line
Product Sales: Optimizing Sales Strategy for New Product Line
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set seaborn grid style
sns.set_style('ticks')
sns.set_context('paper')
sns.set_palette('colorblind')
sns.set_style({'axes.grid': True, 'grid.color': '.8', 'grid.linestyle': '--'})
# Load the data
prod_sales = pd.read_csv('product_sales.csv')
prod_sales.head()prod_sales.shapeData Validation
The data set has 15000 rows and 8 columns. I have validated all the variables and have made changes after validation. Not all the columns were as described in the data dictionary.
- week: 6unique numeric values without missing values, same as the description. No cleaning is needed.
- sales_method: 5character (string) values without missing values. Replacedem + callandemailwithEmail + CallandEmailrespectively, ensuring there are only3unique sales methods (Email,CallandEmail + Call).
- customer_id: Character values without missing values, same as the description. No cleaning is needed.
- nb_sold: Numeric values without missing values, same as the description. No cleaning is needed.
- revenue: Numeric values with 1074missing values. Missing values imputed with the mean.
- years_as_customer: Numeric values without missing values but with two outlier values 47and63. The two values are greater than 39, were considered outliers and deleted since the company was created in 1984 and we are in 2023.
- nb_site_visits: Numeric values without missing values, same as the description. No cleaning is needed.
- state: 50 unique character values without missing values, same as the description. No cleaning is needed.
# Check variable data types
prod_sales.info()prod_sales['sales_method'].unique()prod_sales['week'].unique()# Check for missing values
prod_sales.isnull().sum()# Percentage of missing values
print("Missing values for revenue: {:.2f}%".format(100 * prod_sales['revenue'].isnull().sum() / len(prod_sales)))# Check for duplicates
prod_sales.duplicated().sum()# Check for outliers
prod_sales.describe()# Count number of records in years_as_customer column that are greater than 39
count = (prod_sales['years_as_customer'] > 39).sum()
count# Identify outliers
outliers = prod_sales.loc[prod_sales['years_as_customer'] > 39, 'years_as_customer']
# Remove outliers from dataframe
prod_sales = prod_sales[~prod_sales['years_as_customer'].isin(outliers)]
prod_sales.shapeprod_sales['sales_method'].dtype