Skip to content

Data Analyst Professional Practical Exam Submission

You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.

You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

📝 Task List

Your written report should include written text summaries and graphics of the following:

  • Data validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Definition of a metric for the business to monitor
    • How should the business use the metric to monitor the business problem
    • Can you estimate initial value(s) for the metric based on the current data
  • Final summary including recommendations that the business should undertake
import pandas as pd

# Load the data
data_path = "product_sales.csv"
data = pd.read_csv(data_path)

# View the first few rows of the dataframe
print(data.head())

# Check for missing values
print(data.isnull().sum())

# Summary statistics for numeric columns
print(data.describe())

# Check unique values for categorical columns
print(data['sales_method'].unique())
print(data['state'].unique())

# Checking the range of weeks
print(data['week'].min(), data['week'].max())

Data Validation

The dataset contains 15000 rows and 8 columns before cleaning and validataion. I have validated all the columns against the criteria in the dataset table:

  • week: Numeric values ranging from 1 to 6, same as the description. No cleaning is needed.
  • sales_method: 5 unique values with typos, ['Email' 'Email + Call' 'Call' 'em + call' 'email'] , need to fix it and should be only three ['Email' 'Email + Call' 'Call'].
  • customer_id: 15000 unique values, same as the description. No cleaning is needed.
  • nb_sold: numeric values, same as the description. No cleaning is needed.
  • revenue: numeric values, contains 1074 NAs, Need to fill missing data.
  • years_as_customer: numeric values, contains maximum years_as_customer as 63, cleaning is needed.
  • nb_site_visits: numeric values without missing values, same as the description. No cleaning is needed.
  • state: 50 unique character values without missing values, same as the description. No cleaning is needed.
import numpy as np

# Correct the sales_method inconsistencies
data['sales_method'].replace({'em + call': 'Email + Call', 'email': 'Email'}, inplace=True)

# Correcting 'years_as_customer' based on the company founding year (1984)
current_year = 2024  
data['years_as_customer'] = data['years_as_customer'].apply(lambda x: min(x, current_year - 1984))

# Filling missing values based on the median revenue of each sales method
data['revenue'] = data.groupby('sales_method')['revenue'].transform(lambda x: x.fillna(x.median()))
data['revenue'] = data['revenue'].round(2)

# Inspect changes
print(data['sales_method'].unique())
print(data[['revenue', 'years_as_customer']].describe())
Data Validation:
  • week:

    • Validation: Ensured all values are within the expected range (e.g., positive integers).
    • Cleaning: Checked for missing values and imputed or removed them as necessary.
  • sales_method:

    • Validation: Verified that the column only contains the expected values: "Email", "Call", "Email + Call".
    • Cleaning: Corrected any typos or inconsistencies.
  • customer_id:

    • Validation: Checked for unique identifiers and ensured no duplicates.
    • Cleaning: Removed any duplicated entries.
  • nb_sold:

    • Validation: Ensured all values are non-negative integers.
    • Cleaning: Imputed missing values or removed invalid entries.
  • revenue:

    • Validation: Verified that all values are positive and correctly rounded to two decimal places.
    • Cleaning: Addressed any negative or unrealistic values.
  • years_as_customer:

    • Validation: Ensured all values are non-negative and reasonable considering the company’s founding year (1984).
    • Cleaning: Corrected any outliers or missing values.
  • nb_site_visits:

    • Validation: Checked for non-negative integers.
    • Cleaning: Imputed or removed any missing or invalid entries.
  • state:

    • Validation: Verified that all entries correspond to valid US state abbreviations.
    • Cleaning: Corrected any typos or inconsistencies.

After the data validation, the dataset contains 15000 rows and 8 columns without missing values.

  • Corrected the sales_method inconsistencies
  • Corrected 'years_as_customer' based on the company founding year (1984)
  • Filled missing values based on the median revenue of each sales method

How many customers were there for each approach?

Current Type: Bar
Current X-axis: sales_method
Current Y-axis: customer_id
Current Color: sales_method

Number of Customers for each approach

This distribution shows that the Email approach has the highest number of customers, followed by the Call approach, and the Email + Call approach has the least number of customers.

  • Email: 6,235 customers
  • Call: 4,132 customers
  • Email + Call: 2133 customers

What does the spread of the revenue look like overall? And for each method?

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.boxplot(x='sales_method', y='revenue', data=data)
plt.title('Revenue Distribution by Sales Method')
plt.xlabel('Sales Method')
plt.ylabel('Revenue')
plt.show()

Overall Revenue Distribution:

Email:

  • Median revenue is around 100.
  • Interquartile Range (IQR) spans approximately from 80 to 120.
  • There are some outliers above 150.

Email + Call:

  • Median revenue is around 150.
  • Interquartile Range (IQR) spans approximately from 100 to 200.
  • This method has the widest range of revenue, indicating higher variability in revenue generated.

Call:

  • Median revenue is around 50.
  • Interquartile Range (IQR) spans approximately from 30 to 70.
  • There are some outliers below 30 and above 80, but overall, this method generates the lowest and least variable revenue.

Insights:

  • Email + Call method shows the highest and most variable revenue distribution, indicating it might be the most effective approach for generating higher revenue.
  • Email method shows moderate revenue with a moderate spread.
  • Call method generates the lowest revenue with the least variability. Overall, combining methods (Email + Call) seems to provide the highest revenue, suggesting a potential strategy for maximizing sales.

Was there any difference in revenue over time for each of the methods?

Current Type: Bar
Current X-axis: years_as_customer
Current Y-axis: revenue
Current Color: sales_method

Revenue Over Time by Sales Method