Data Analyst (Associate) and Professional Certification Case Study

Data Analyst (Professional) Example Practical Exam

Author: Mauro César

Context

This notebook aims to reproduce the step by step to the [practical exam sample]. Further details and guidelines can be found in the Practical Hub

Imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Load Data

# Read data from 'toyota.csv' and store it in a dataset
car_sales = pd.read_csv('toyota.csv', index_col=None)

# Print numes of rows x columns
print(f"Number of rows/records: {car_sales.shape[0]}")
print(f"Number of columns/variables: {car_sales.shape[1]}")

# Print Head of dataframe
car_sales.head()

car_sales.describe()

car_sales.info()

# Check possible values in each column
nom_num = ['model', 'transmission', 'fuelType', 'engineSize']

# Check possible values in each column
for column in nom_num:
    possible_values = car_sales[column].unique()
    n_values = len(possible_values) 
    print(f"There are {n_values} possible values in column '{column}': {possible_values}")

Data Validation

The dataset contains 6738 rows and 9 columns before cleaning and validataion. I have validated all the columns against the criteria in the dataset table:

model: 18 models without missing values, same as the description. No cleaning is needed.
year: 23 unique values without missing values, from 1998 to 2020, same as the description. No cleaning is needed.
price: numeric values without missing values, same as the description. No cleaning is needed.
transmission: 4 categories without missing values, same as the description. No cleaning is needed.
mileage: numeric values, same as the description. No cleaning is needed.
fuelType: 4 categories without missing values, same as the description. No cleaning is needed.
mpg: numeric values without missing values, same as the description. No cleaning is needed.
engineSize: 16 possible values without missing values, same as the description. No cleaning is needed.

After the data validation, the dataset contains 6738 rows and 9 columns without missing values.

Exploratory Analysis

From the last 6 months' record, more than half of the used cars we sold are using petrol, followed by using hybrid. Also, we can see the number of cars sold in petrol is almost twice the number of cars sold in hybrid. Therefore, we can conclude that there is huge room to impove the sales of hybrid cars.

# Count the number of records for sales per year
sales_per_fuel_type= car_sales['fuelType'].value_counts().sort_values(ascending=False)
# .sort_index()  # Sort by year for better visualization

# Create a bar chart
plt.bar(sales_per_fuel_type.index.astype(str), sales_per_fuel_type.values, color='skyblue')  # Convert index to string for better x-axis labels

# Add labels and title
plt.xlabel('Fuel Type')
plt.ylabel('Sales Count')
plt.title('Sales by Fuel Type')

# Add count labels to each bar
for i, count in enumerate(sales_per_fuel_type.values):
    plt.text(i, count, str(count), ha='center', va='bottom')

# Display the chart
plt.xticks(rotation=45)  # Rotate x-axis labels if they overlap
plt.tight_layout()  # Adjust layout to prevent clipping of tick-labels
plt.show()

If we only look at hybrid cars. From the bar chart below, the best seller in car models is Yaris, followed by Auris and C-HR.

# Count the number of records for sales per year
hybrid_sales = car_sales[car_sales['fuelType'] == 'Hybrid']
sales_per_model = hybrid_sales['model'].value_counts().sort_values(ascending=False)
# .sort_index()  # Sort by year for better visualization

# Create a bar chart
plt.bar(sales_per_model.index.astype(str), sales_per_model.values, color='skyblue')  # Convert index to string for better x-axis labels

# Add labels and title
plt.xlabel('Model')
plt.ylabel('Sales Count')
plt.title('Sales per Model (hybrid only)')

# Add count labels to each bar
for i, count in enumerate(sales_per_model.values):
    plt.text(i, count, str(count), ha='center', va='bottom')

# Display the chart
plt.xticks(rotation=45)  # Rotate x-axis labels if they overlap
plt.tight_layout()  # Adjust layout to prevent clipping of tick-labels
plt.show()

We also want to compare the price of all the cars we sold last 6 months. I was looking at the distribution of price. The majority of used cars sold last 6 months is less than 30000 GBP. Used cars sold for more than 30000 GBP could be considered outliers.

# Create a histogram of the number of reviews, capping the number of reviews at 3000
values, bins, bars = plt.hist(car_sales['price'], bins=30)
# range=(0, 3000)

# Add labels and title
plt.xlabel('Number of Sales')
plt.ylabel('Frequency')
plt.suptitle('Distribution of Number of Sales')
plt.title('Price Distribution', fontsize = 10)
plt.bar_label(bars,  color='navy')

# Display the histogram
plt.show()

‌
‌
‌