Skip to content

DATA ANALYST PRACTICAL EXAM SUBMISSION

1.Data Validation

The data has 15000 records and 8 columns, 1,074 missing values exist within the 'revenue' column. There are no duplicated records. Based on the provided data information, the values in each column is reasonable and no need to further processing except for the revenue column and sales_method column that needed cleaning. I have validated all the columns against the criteria in the dataset table:

week:The column takes integer values between 1 and 6, no missing values and hence no cleaning is needed.

sales_method: 3 unique sales method ('Call','Email' and 'Email+ Call') are required in this column. Before data validation there were 5 unique sales method because of the misspelling of 'em + call', 'email' instead of 'Email + Call' and 'Email' respectively. I created a dictionary 'corrections containing all possible errors and matched them up to the correct sales method then I applied the replace () function to correct the error.

customer_id:15000 unique customer identifiers as the description, no duplication nor were there any missing values before and after data validation.

nb_sold:Numeric values, same as the description. No cleaning is needed.

Revenue 13926 rows before data validation. After validation I obtained 15000 rows for this column and is numeric. The 1,024 missing values which were replaced with 0 assuming that no sale was made. I cannot replace with mean, median or mode as this could lead to overstating the revenue thus give wrong EDA. I also applied the .round(2) function to ensure revenue from the sales is rounded to 2 decimal places.

years_as_customer:Contains numeric values between 0 and 40 no missing values. Any value above 40 was filtered. After validation the values were within the range of 0 and 39.

nb_site_visits:numeric values without missing values, same as the description. No cleaning is needed.

state:: 50 possible states without missing values, same as the description. No cleaning is needed.

2.BUSINESS FOCUS

Objective: Analyzing and Optimizing Sales Strategies for New Product Line

To identify the most effective sales strategies for the newly launched office stationery product line, ensuring maximum revenue generation while optimizing resource allocation.

Since launching a new product line is expensive and the sales team want to make sure they are using the best techniques to sell the new product effectively. The sales team need to know the following key areas:

A. Customer Distribution Across Sales Methods: Goal:Determine the number of customers associated with each sales method (Email, Call, Email + Call).**

B. Revenue Distribution: Goal: Assess the overall revenue distribution to understand the general performance of the new product line.

C. Revenue by Sales Method: Goal: Evaluate the effectiveness of each sales method in terms of revenue generation.

D. Revenue Trends Over Time: Goal: Analyze how revenue evolves over the six-week period for each sales method.

3.Data Cleaning and formatting

I imported the dataset into a dataframe using pandas and displayed the first 5 rows

#Importing all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#importing the product_sales data set
 #read file
df=pd.read_csv('product_sales.csv')

# df.head() display first 5 rows
#data.describe() describe about the data
#data.info() give information about the data
print(df.head(),df.describe(),df.info())

Data Cleaning

#number of missing values in the revenue column with 0
xnull = df['revenue'].isna().sum()
print('The Number of missing values in revenue column is:', xnull)
#filling the null values in the revenue column
#The revenue column contains 18 missing values which will be replaced with 0. This assumes that no sale was made.

df['revenue'].fillna(0, inplace=True)
df['revenue'] = df['revenue'].round(2)
[35]
#printing out the unique sales_method column before data validation
df['sales_method'].unique()
#creating a dictionary for all possible errors and corresponding corrections in the sales_method column
corrections = {
    'call': 'Call',
    'email': 'Email',
    'em + call': 'Email + Call',
    'Email+ Call': 'Email + Call',
    'Call': 'Call',
    'Email': 'Email',
    'emAil': 'Email',
    'Em+Call': 'Email+ Call'
}
#correcting the erros in the sales_method column using the replace() method
df['sales_method'] = df['sales_method'].replace(corrections)
[37]
#validating that there are only 3 unique sales method which are Call,Email and Email +  Call
df['sales_method'].unique()
[38]
#Validating there is exactly 15000 unique identifiers for each client and no duplication or missing values in the customer_id column 
df['customer_id'].nunique()
[39]
#validating that nb_sold has numeric values only to show the number of new products sold
df['nb_sold'].unique()
#validating the number of years customer has been buying from pens and printers since 1984 to date. any value below 1 nad any value above 40 will be replaced by the median of the  years_as_customer column

df_filtered=df[df['years_as_customer'] <= 40]
df_filtered['years_as_customer'].unique()
#validating the state column
df['state'].nunique()