Project: Investigate Best Sales Approach
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.imputation import mice
from statsmodels.stats.diagnostic import lilliefors
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

product_sales = pd.read_csv('product_sales.csv')
df = product_sales
df.head(5)
# checking for missing values and finding each data type
df.info()
# checking for missing values
print(df.isnull().sum())
# Data validation: checking unique values to identify any unexpected or erroneous data
print(df.nunique())
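Because df.nunique() only reports counts, a quick look at the actual labels in the text columns helps surface erroneous categories. A minimal sketch (the column names are the ones from this dataset):
# Inspect the raw category labels of the text columns
for col in ['sales_method', 'state']:
    print(col, sorted(df[col].dropna().unique())[:10])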
# checking for outliers
print(df.describe().round(2))
df[df['years_as_customer'] > 40]
# Proportion of missing values in each column relative to the total entries
missing_values = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100
missing_data = pd.DataFrame({'Total Missing': missing_values, 'Percentage': missing_percentage})
print(missing_data)
# Visual Inspection:
sns.heatmap(df.isnull(), cmap='viridis')
plt.ylabel('Rows')
plt.xlabel('Columns')
plt.show()
# Statistical Tests:
# Approximate check using the Lilliefors test, since Little's MCAR test is not directly available
# Note: this is not a replacement for Little's MCAR test; it only tests normality of the observed numeric values
# Convert columns to numeric and drop columns that are entirely non-numeric (e.g. IDs and text);
# otherwise dropna() would remove every row and leave nothing to test
df_numeric = df.apply(pd.to_numeric, errors='coerce').dropna(axis=1, how='all')
flattened_values = df_numeric.dropna().values.flatten()
# Ensure there are at least 4 observations for the Lilliefors test
if len(flattened_values) >= 4:
    stat, p_value = lilliefors(flattened_values)
    lilliefors_result = f"Lilliefors test p-value: {p_value}"
else:
    lilliefors_result = "Not enough data for Lilliefors test. At least 4 observations are required."
# Correlation Analysis:
missing_corr = df.isnull().corr()
sns.heatmap(missing_corr, annot=True, cmap='coolwarm')
plt.title('Correlation of Missing Values')
plt.show()
# Display results
lilliefors_result, missing_corr
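Since Little's MCAR test is not available here, another rough diagnostic (a sketch, not part of the original pipeline) is to compare the numeric features between rows with and without missing revenue; clearly different distributions would point towards MAR rather than MCAR. Welch's t-test from scipy is used, and the column names are those of this dataset:
# Compare numeric features between rows with and without missing revenue
from scipy.stats import ttest_ind
missing_mask = df['revenue'].isnull()
for col in ['nb_sold', 'years_as_customer', 'nb_site_visits']:
    t_stat, p_val = ttest_ind(df.loc[missing_mask, col], df.loc[~missing_mask, col], equal_var=False)
    print(f"{col}: t = {t_stat:.2f}, p-value = {p_val:.4f}")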
# Determine if the missingness is correlated across columns (ignore the diagonal, which is always 1)
off_diagonal = missing_corr.where(~np.eye(len(missing_corr), dtype=bool))
is_correlated = off_diagonal.abs().max().max() > 0.5
is_correlated
# Values greater than 40 in years_as_customer are invalid and should be treated as outliers
df = df[df['years_as_customer'] <= 40]
df_drop = df.copy()
df_drop
# To test whether missing values in 'revenue' are related to 'sales_method', we can use a chi-square test of independence.
# The test requires two categorical variables, so we first create a categorical column that flags whether revenue is missing.
# Replace NaN values with 'Null' and non-NaN values with 'Not_Null' in the 'revenue' column
df_drop['revenue_category'] = df_drop['revenue'].apply(lambda x: 'Null' if pd.isna(x) else 'Not_Null')
df_drop.head(5)
# Validating sales_method
allowed_method = ['Email', 'Call', 'Email + Call']
not_allowed_method = df_drop[~df_drop['sales_method'].isin(allowed_method)]
# not_allowed_method
df_drop['sales_method'] = df_drop['sales_method'].replace({'em + call': 'Email + Call', 'email': 'Email'})

# Group by 'sales_method' and 'revenue_category' and display the count of 'Null' and 'Not_Null' for each group
contingency_table = pd.crosstab(df_drop['sales_method'], df_drop['revenue_category'])
print(contingency_table)
# Perform the chi-square test
chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
print(chi2_stat, '\n', p_value)
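To make the result actionable, the p-value can be compared against a significance level; the 0.05 threshold below is an assumption rather than something stated in the brief:
# Interpret the chi-square result at an assumed 0.05 significance level
alpha = 0.05
if p_value < alpha:
    print("Missingness in 'revenue' appears to depend on 'sales_method'")
else:
    print("No evidence that missingness in 'revenue' depends on 'sales_method'")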
df_drop.head(18)
# Validating revenue
# When the distribution of revenue for each sales method is not normal, it's generally advisable to use the median for imputation rather than the mean.
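# Optional sketch (not part of the original pipeline): support the non-normality claim by running
# the already-imported Lilliefors normality test on the observed revenue values of each sales method
for method, group in df_drop.groupby('sales_method'):
    observed = group['revenue'].dropna()
    if len(observed) >= 4:
        _, p_norm = lilliefors(observed)
        print(f"{method}: Lilliefors normality p-value = {p_norm:.4f}")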
# Function to perform stratified imputation
def stratified_imputation(group):
    # Fill missing revenue with the median revenue of that sales method
    group['revenue'] = group['revenue'].fillna(group['revenue'].median())
    return group
# Group the DataFrame by 'sales_method' and apply stratified imputation
df_imputed = df_drop.groupby('sales_method', group_keys=False).apply(stratified_imputation)
# Check if there are still any missing values in 'revenue' column
missing_values = df_imputed['revenue'].isnull().sum()
print(missing_values)
df_imputed
Data Validation
The dataset contains 15,000 rows and 8 columns before cleaning and validation. I have validated all the columns against the criteria in the table; a short programmatic sanity check appears after the list:
- week: 15,000 values with 6 unique values from 1 to 6, matching the description, and no missing values. No cleaning needed.
- sales_method: 15,000 values with no missing values, but 5 unique values instead of the expected 3. 33 values were corrected by recoding email and em + call to Email and Email + Call respectively.
- customer_id: 15,000 character values, matching the description, with no missing values. No cleaning needed.
- nb_sold: 15,000 numeric values with 10 unique values from 7 to 16 and no missing values, matching the dataset. No cleaning needed.
- revenue: 13,926 values with 1,074 missing, which is more than 5% of the column, so I checked whether the values are missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). The chi-square test of independence showed a relationship between missingness in revenue and sales_method, which argues against simply dropping the missing rows. Since the missingness is not correlated across columns and was treated as MCAR, the null values were replaced with the median revenue of each sales_method group, giving 15,000 values after cleaning.
- years_as_customer: 15,000 numeric values with no missing values, matching the dataset. The column contains outliers: the company was founded in 1984, so no customer can have more than 40 years as a customer, and the two rows with values greater than 40 were therefore dropped.
- nb_site_visits: 15,000 numeric values with no missing values, matching the dataset. No cleaning needed.
- state: 15,000 values with no missing values, matching the dataset. No cleaning needed.
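As a final step, the criteria above can be re-checked programmatically. A minimal sketch against the cleaned df_imputed frame, with the expected ranges taken from the notes above:
# Post-cleaning sanity checks against the validation criteria
assert df_imputed['week'].between(1, 6).all()
assert set(df_imputed['sales_method'].unique()) <= {'Email', 'Call', 'Email + Call'}
assert df_imputed['nb_sold'].between(7, 16).all()
assert df_imputed['revenue'].isnull().sum() == 0
assert df_imputed['years_as_customer'].le(40).all()
print(df_imputed.shape)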