Data Analyst Professional Practical Exam Submission

You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.

You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.

📝 Task List

Your written report should include written text summaries and graphics of the following:

  • Data validation:
    • Describe validation and cleaning steps for every column in the data
  • Exploratory Analysis:
    • Include two different graphics showing single variables only to demonstrate the characteristics of data
    • Include at least one graphic showing two or more variables to represent the relationship between features
    • Describe your findings
  • Definition of a metric for the business to monitor
    • How should the business use the metric to monitor the business problem
    • Can you estimate initial value(s) for the metric based on the current data
  • Final summary including recommendations that the business should undertake

Start writing report here..

Task 1: Data Validation & Cleaning

The analysis uses the following libraries:

  • pandas: Data handling
  • numpy: Numerical computing
  • matplotlib.pyplot: Plotting graphs
  • matplotlib.style: Graph styling
  • plotly.express: Interactive plotting
  • seaborn: Statistical visualization
# Import the required libraries and obtain an overview to begin data cleaning.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as sty
import plotly.express as px
import seaborn as sns
# Load the dataset and inspect column types and non-null counts
df = pd.read_csv('product_sales.csv')
df.info()
# Display the first five rows of the table
df.head()

Before cleaning, the dataset contained 15,000 rows and 8 columns.

# Shape of the dataset before cleaning: (15000, 8)
df.shape
# The Revenue column is the only one with missing values, totaling 1,074 missing entries.
df.isna().sum()

The Revenue column contained 1,074 missing values. The rows holding them are dropped in the next step, which reduces the dataset from 15,000 to 13,926 rows; the number of columns stays at 8.
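
As a minimal sketch of this step, the snippet below uses a small made-up frame rather than product_sales.csv and restricts the drop to one column with `dropna(subset=...)`, so a row is removed only when its revenue is missing. The lowercase column name `revenue` and the toy values are assumptions for illustration. The `df.dropna` call in this report drops rows with a missing value in any column, which gives the same result because Revenue is the only column with gaps.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from product_sales.csv)
toy = pd.DataFrame({
    "sales_method": ["Email", "Call", "Email + Call", "Email"],
    "revenue": [52.5, np.nan, 180.0, np.nan],
})

rows_before = len(toy)

# Drop rows only where the (assumed) revenue column is missing
toy_clean = toy.dropna(subset=["revenue"])

# Report how much data the drop removed
rows_dropped = rows_before - len(toy_clean)
share_dropped = rows_dropped / rows_before

print(f"Dropped {rows_dropped} of {rows_before} rows ({share_dropped:.1%})")
```

Applied to the figures reported here, 1,074 of 15,000 rows is about 7.2% of the data.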

# Drop every row that has a missing value (only the Revenue column has any)
df.dropna(inplace=True)
df.isna().sum()
# Shape after cleaning and dropping missing values in the Revenue column: (13926, 8)
df.shape
# Inspect the unique values in the sales_method, state, and week columns
print(df['sales_method'].unique())
print(df['state'].unique())
print(df['week'].unique())

The sales_method column needs fixing because it contains inconsistent spellings of the same category, such as 'em + call' and 'email'. After cleaning, it should contain only three unique values: ['Email + Call' 'Call' 'Email']

# The unique values in the sales_method column include inconsistent variants, such as 'em + call' and 'email'.
# Standardize them to the correct labels.
df.loc[df.sales_method == 'em + call', 'sales_method'] = 'Email + Call'
df.loc[df.sales_method == 'email', 'sales_method'] = 'Email'
print(f'\nConfirming that only 3 options remain in the sales_method column after amending: {df.sales_method.unique()}')
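
The same standardization can also be sketched with a single mapping passed to `Series.replace`, which keeps all corrections in one place. The toy rows below are made up for illustration and the `label_fixes` name is hypothetical. Only the two variants named above are mapped.

```python
import pandas as pd

# Toy column with the inconsistent labels described above (hypothetical rows)
toy = pd.DataFrame({
    "sales_method": ["Email", "em + call", "Call", "email", "Email + Call"],
})

# Map each inconsistent label to its standard form; other values are left unchanged
label_fixes = {"em + call": "Email + Call", "email": "Email"}
toy["sales_method"] = toy["sales_method"].replace(label_fixes)

# Confirm that only the three expected categories remain
expected = {"Email", "Call", "Email + Call"}
remaining = set(toy["sales_method"].unique())
print(remaining == expected)
```

Comparing the remaining labels against an explicit expected set makes the check fail visibly if a new variant appears in later data.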