Data Analyst Exam Assignment

Data Analyst Professional Practical Exam Submission

You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.

You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

📝 Task List

Your written report should include written text summaries and graphics of the following:

Data validation:
- Describe validation and cleaning steps for every column in the data
Exploratory Analysis:
- Include two different graphics showing single variables only to demonstrate the characteristics of data
- Include at least one graphic showing two or more variables to represent the relationship between features
- Describe your findings
Definition of a metric for the business to monitor
- How should the business use the metric to monitor the business problem
- Can you estimate initial value(s) for the metric based on the current data
Final summary including recommendations that the business should undertake

import pandas as pd

# Load the data
data_path = "product_sales.csv"
data = pd.read_csv(data_path)

# View the first few rows of the dataframe
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Summary statistics for numeric columns
print(data.describe())

# Check unique values for categorical columns
print(data['sales_method'].unique())
print(data['state'].unique())

# Checking the range of weeks
print(data['week'].min(), data['week'].max())

Data Validation

The dataset contains 15000 rows and 8 columns before cleaning and validataion. I have validated all the columns against the criteria in the dataset table:

week: Numeric values ranging from 1 to 6, same as the description. No cleaning is needed.
sales_method: 5 unique values with typos, ['Email' 'Email + Call' 'Call' 'em + call' 'email'] , need to fix it and should be only three ['Email' 'Email + Call' 'Call'].
customer_id: 15000 unique values, same as the description. No cleaning is needed.
nb_sold: numeric values, same as the description. No cleaning is needed.
revenue: numeric values, contains 1074 NAs, Need to fill missing data.
years_as_customer: numeric values, contains maximum years_as_customer as 63, cleaning is needed.
nb_site_visits: numeric values without missing values, same as the description. No cleaning is needed.
state: 50 unique character values without missing values, same as the description. No cleaning is needed.

import numpy as np

# Correct the sales_method inconsistencies
data['sales_method'].replace({'em + call': 'Email + Call', 'email': 'Email'}, inplace=True)

# Correcting 'years_as_customer' based on the company founding year (1984)
current_year = 2024  
data['years_as_customer'] = data['years_as_customer'].apply(lambda x: min(x, current_year - 1984))

# Filling missing values based on the median revenue of each sales method
data['revenue'] = data.groupby('sales_method')['revenue'].transform(lambda x: x.fillna(x.median()))
data['revenue'] = data['revenue'].round(2)

# Inspect changes
print(data['sales_method'].unique())
print(data[['revenue', 'years_as_customer']].describe())

Data Validation:

week:
- Validation: Ensured all values are within the expected range (e.g., positive integers).
- Cleaning: Checked for missing values and imputed or removed them as necessary.
sales_method:
- Validation: Verified that the column only contains the expected values: "Email", "Call", "Email + Call".
- Cleaning: Corrected any typos or inconsistencies.
customer_id:
- Validation: Checked for unique identifiers and ensured no duplicates.
- Cleaning: Removed any duplicated entries.
nb_sold:
- Validation: Ensured all values are non-negative integers.
- Cleaning: Imputed missing values or removed invalid entries.
revenue:
- Validation: Verified that all values are positive and correctly rounded to two decimal places.
- Cleaning: Addressed any negative or unrealistic values.
years_as_customer:
- Validation: Ensured all values are non-negative and reasonable considering the company’s founding year (1984).
- Cleaning: Corrected any outliers or missing values.
nb_site_visits:
- Validation: Checked for non-negative integers.
- Cleaning: Imputed or removed any missing or invalid entries.
state:
- Validation: Verified that all entries correspond to valid US state abbreviations.
- Cleaning: Corrected any typos or inconsistencies.

After the data validation, the dataset contains 15000 rows and 8 columns without missing values.

Corrected the sales_method inconsistencies
Corrected 'years_as_customer' based on the company founding year (1984)
Filled missing values based on the median revenue of each sales method

How many customers were there for each approach?

DataFrame

Current Type: Bar

Type

Current X-axis: sales_method

X-axis

Current Y-axis: customer_id

Y-axis

Current Color: sales_method

Color

Number of Customers for each approach

This distribution shows that the Email approach has the highest number of customers, followed by the Call approach, and the Email + Call approach has the least number of customers.

Email: 6,235 customers
Call: 4,132 customers
Email + Call: 2133 customers

What does the spread of the revenue look like overall? And for each method?

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.boxplot(x='sales_method', y='revenue', data=data)
plt.title('Revenue Distribution by Sales Method')
plt.xlabel('Sales Method')
plt.ylabel('Revenue')
plt.show()

Overall Revenue Distribution:

Email:

Median revenue is around 100.
Interquartile Range (IQR) spans approximately from 80 to 120.
There are some outliers above 150.

Email + Call:

Median revenue is around 150.
Interquartile Range (IQR) spans approximately from 100 to 200.
This method has the widest range of revenue, indicating higher variability in revenue generated.

Call:

Median revenue is around 50.
Interquartile Range (IQR) spans approximately from 30 to 70.
There are some outliers below 30 and above 80, but overall, this method generates the lowest and least variable revenue.

Insights:

Email + Call method shows the highest and most variable revenue distribution, indicating it might be the most effective approach for generating higher revenue.
Email method shows moderate revenue with a moderate spread.
Call method generates the lowest revenue with the least variability. Overall, combining methods (Email + Call) seems to provide the highest revenue, suggesting a potential strategy for maximizing sales.

Was there any difference in revenue over time for each of the methods?