Skip to content
Certification - Data Analyst Professional - Product Sales (copy)
Data Analyst Professional Practical Exam Submission
You can use any tool that you want to do your analysis and create visualizations. Use this template to write up your summary for submission.
You can use any markdown formatting you wish. If you are not familiar with Markdown, read the Markdown Guide before you start.
📝 Task List
Your written report should include written text summaries and graphics of the following:
- Data validation:
- Describe validation and cleaning steps for every column in the data
- Exploratory Analysis:
- Include two different graphics showing single variables only to demonstrate the characteristics of data
- Include at least one graphic showing two or more variables to represent the relationship between features
- Describe your findings
- Definition of a metric for the business to monitor
- How should the business use the metric to monitor the business problem
- Can you estimate initial value(s) for the metric based on the current data
- Final summary including recommendations that the business should undertake
Start writing report here..
Import Libraries
# import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set style for seaborn
sns.set_style('dark')
sns.set_palette(['Green', 'Orange'])
Upload Dataset
url = 'product_sales.csv'
#assign url to pandas DataFrame: df
df = pd.read_csv(url)
Inspect DataFrame
- Check the shape of the datframe
- Quick preview of the Dataframe
- Check Missing Value
- Check Duplicates
- Validate the comsistency of the DataFrame with the given description
# shape of the DataFrame
df.shape
# Dataframe overview
df.head()
#Dataframe Information Overview
df.info()
#Numeric Data Description
df.describe()
Observation
From the summary statistics, we can make the following observations:
- The dataset contains 15,000 rows and 6 columns.
- The average revenue generated is 93.93, with a standard deviation of 47.44. The minimum revenue is 32.54, and the maximum revenue is 238.32. (currency is assumed due to the information on the state column)
- On average, the customers have been with the company for 4.96 years, and the minimum is 0 years (i.e., a new customer).
- The average number of site visits in the last 6 months is 24.99, with a standard deviation of 3.50.
- The minimum number of site visits in the last 6 months is 12, while the maximum is 41.
- There are missing values in the revenue column, with a count of 13,926 compared to the total count of 15,000. This means that 1,074 rows have missing revenue values.
- The minimum number of products sold is 7, while the maximum is 16.
def check_unique_missing(df):
"""Checks for unique values, missing values, and duplicated rows in a pandas DataFrame.
Args:
df: pandas DataFrame, the input data to check.
Returns:
None.
Prints the number of unique values for each column in the DataFrame, as well as
the number of missing values for each column that has them. If any duplicated rows
are found, prints the total number of duplicates.
"""
# Check unique values
unique_vals = {}
for col in df.columns:
unique_vals[col] = df[col].unique()
# Check missing values
missing_vals = {}
for col in df.columns:
missing_vals[col] = df[col].isnull().sum()
# Check for duplicated rows
num_duplicates = df.duplicated().sum()
# Print results
print("\nUnique values:")
for col, vals in unique_vals.items():
print(f"\n{col}: {len(vals)} unique values")
print("\nMissing values:")
for col, num_missing in missing_vals.items():
if num_missing > 0:
print(f"{col}: {num_missing} missing values")
print("\nDuplicate values:")
if num_duplicates > 0:
print(f"\n{num_duplicates} duplicated rows found")
else:
print("\nNo duplicated rows found")
check_unique_missing(df)
Data Cleaning
- identify the errorneous observations in the
years_at_company
column and drop them - investigate why the
sales_method
column has 5 values and make corrections - handle the missing values in the revenue column
-
check the relationship between the missing `revenue` values and `sales_method`
-
check the relationship between the missing `revenue` values and `state` column
-
check the relationship between the missing `revenue` values and the `nb_sold` column
years_as_customer
column
years_as_customer
column