DSP Test Exam Try 2

Sample Practical Exam: Professional Data Scientist

Notes from Assignment

Email from Head of Data Science

Task:

Perform analysis and write a short report for HODS (Head of Data Science)
Document code and thought processes
Prepare and deliver a presentation to the sales team

Goal:

Predict prices within 10% of the listed price (the team's estimates are currently about 30% off; this is a soft limit)
Document decisions along with work
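The 10% goal can be read in more than one way, so here is a minimal sketch of two possible metrics, not a definitive evaluation. It assumes the goal refers to the relative error against the listed price. The function names `share_within_tolerance` and `mean_abs_pct_error` and the sample prices are hypothetical and chosen only for illustration.

```python
import numpy as np

def share_within_tolerance(y_true, y_pred, tolerance=0.10):
    """Fraction of predictions whose relative error is at most `tolerance`."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    relative_error = np.abs(y_pred - y_true) / y_true
    return float(np.mean(relative_error <= tolerance))

def mean_abs_pct_error(y_true, y_pred):
    """Average relative error, comparable to the team's 'about 30% off'."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) / y_true))

# Made-up listed prices and predictions, for illustration only
listed = [10000, 8000, 12000, 5000]
predicted = [10500, 9500, 11000, 5100]

share = share_within_tolerance(listed, predicted)
mape = mean_abs_pct_error(listed, predicted)
print(f"Share within 10%: {share:.2f}, mean absolute percentage error: {mape:.3f}")
```

In this toy example three of the four predictions fall within 10% of the listed price. Which of the two metrics matches the Head of Data Science's intent is an open question worth confirming before reporting results.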

Email from Sales Team:

A team member is retiring; they're good at estimating car sales prices
Team estimates are usually around 30% off
Asking for initial thoughts
Presentation to two sales managers

Business question: "Can we predict prices accurately enough using Data Science?"

Company background

Sells used cars, offering great value at a low price
New cars will be required to have zero emissions; cars running on diesel and petrol are expected to have a lower value after 2030

Data Information

Information from online adverts
Buyers typically want to know the road tax (determined based on age, emissions, fuel type)
Electric cars do not pay road tax currently
The sales team has pulled listings only; we don't know how the cars actually sold

Import packages and define utility functions

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import missingno as msno

Data validation

Validation

6738 entries are included before validation.

Validating individual Columns

model: Parsed as object, 18 possible values are expected, this was verified. The unique values look good, no obvious spelling errors. Some model names might overlap (e.g. Verso, Verso-S, and PROACE VERSO) but verifying and fixing this needs more expertise than I have. I converted to a category type.
year: Parsed as integer, that's good. Ranges from 1998 to 2020. Cars from 2017 account for almost a third of the dataset.
price: Parsed as integer, converted to float. Ranges from 850 to ~60k GBP. Looks plausible.
transmission: Parsed as object, 4 unique values present as expected. Converted to category.
mileage: Parsed as integer, ranges from 2 to 174k. All values are plausible.
fuelType: Parsed as object, only the four expected values are present. Converted to category.
tax: Parsed as integer, ranges around 200 GBP, max 500. Seems plausible.
mpg: Parsed as float, ranges from 2 to 250. Both extremely low (below 20) and extremely high (above 100) values seem implausible and are set to null.
engineSize: Parsed as float, should have 16 different values. Values range from 0.0 to 4.5, which looks plausible.
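The per-column checks above could also be collected in one table. The following is a rough sketch under assumptions, not the method used in this notebook: `summarize_columns` is a hypothetical helper, and the small frame is made-up data that only borrows three column names from the dataset.

```python
import pandas as pd

def summarize_columns(df):
    """One row per column: dtype, missing count, unique count, min and max (numeric only)."""
    rows = []
    for name in df.columns:
        col = df[name]
        numeric = pd.api.types.is_numeric_dtype(col)
        rows.append({
            "column": name,
            "dtype": str(col.dtype),
            "missing": int(col.isna().sum()),
            "unique": int(col.nunique()),
            "min": col.min() if numeric else None,
            "max": col.max() if numeric else None,
        })
    return pd.DataFrame(rows).set_index("column")

# Tiny made-up frame, for illustration only
toy = pd.DataFrame({
    "model": ["Yaris", "Aygo", "Yaris"],
    "year": [2017, 2015, 2019],
    "mpg": [55.0, 2.8, 68.9],
})
summary = summarize_columns(toy)
print(summary)
```

Applied to the real data, such a table would make it quick to compare the observed unique counts and ranges against the expected ones (for example 18 models or 16 engine sizes).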

As this is a proof-of-concept setting, we'll throw out the 41 rows with implausible mpg values. In a production setting this should be handled differently. After data validation, 6697 datapoints remain.
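A minimal sketch of that removal step on made-up data, assuming the implausible values have been set to missing first and the affected rows are then dropped with `dropna`. The thresholds (below 20, above 100) are the ones used in the mpg validation code; the frame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up listings, for illustration only
toy = pd.DataFrame({
    "price": [7000.0, 12500.0, 9800.0, 15000.0],
    "mpg": [55.4, 2.8, 235.0, 68.9],
})

# Mark mpg values below 20 or above 100 as missing
implausible = (toy["mpg"] < 20) | (toy["mpg"] > 100)
toy.loc[implausible, "mpg"] = np.nan

# Drop every row whose mpg is missing and report how many were removed
rows_before = len(toy)
clean = toy.dropna(subset=["mpg"]).reset_index(drop=True)
print(f"Dropped {rows_before - len(clean)} of {rows_before} rows")
```

Dropping rows is the simplest option for a proof of concept. In production, imputing the value or correcting it at the source may be preferable, since every dropped row is a car the model cannot price.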

General notes

There are only 18 different models; the dataset file name (toyota.csv) suggests a single brand (Toyota).

df = pd.read_csv('toyota.csv')
df.info()

# Validate the "model" column
model = df['model']
unique_models = model.nunique()
display(f"There are {unique_models} unique models")
# Wrap in display() so the counts are shown even though this is not the last line
display(model.value_counts(dropna=False).sort_index())
df['model'] = df['model'].astype('category')

# Validate the "year" column
year = df['year']
display(year.describe())
sns.histplot(year,bins=np.arange(1990.5,2031.5,1))
plt.title("Listings per Year")

# Validate price column and store the float conversion back in the frame
df['price'] = df['price'].astype('float')
display(df['price'].describe())
sns.histplot(df['price'])

# Validate transmission column
display(df['transmission'].value_counts(dropna=False))
transmission = df['transmission'].astype('category')
sns.countplot(data=df,x='transmission')
df['transmission']=transmission
plt.xlabel("")
plt.title("Transmission types")

# Validate mileage column
mileage = df['mileage'].astype('float')
display(df['mileage'].describe())
sns.histplot(mileage)

# Validate fuelType column
display(df['fuelType'].value_counts(dropna=False))
fuelType = df['fuelType'].astype('category')
sns.countplot(data=df,x='fuelType')
df['fuelType']=fuelType
plt.xlabel("")
plt.title("Fuel types")

# Validate 'tax' column
tax = df['tax']
sns.boxplot(data=df, y='tax', x='fuelType')
plt.title("Road Tax by Fuel Type")
# Call the label functions; assigning to plt.xlabel would overwrite them
plt.xlabel('')
plt.ylabel('Tax / GBP')

fig = plt.figure(figsize=(10,6))
axes = fig.subplots(2,2)

for fuel_type, ax in zip(df.fuelType.unique(), axes.flatten()):
    sns.scatterplot(data=df[df.fuelType == fuel_type], y='tax', x='year', ax=ax)
    ax.set_title(f"Road Tax by Year (Fuel Type: {fuel_type})")
    # set_xlabel/set_ylabel are methods and must be called, not assigned
    ax.set_xlabel('Year')
    ax.set_ylabel('Tax / GBP')
fig.tight_layout()

# Validate 'mpg' column
mpg = df['mpg']
display(mpg.describe())
sns.histplot(mpg)
# Treat values below 20 or above 100 mpg as implausible
implausible = (mpg < 20) | (mpg > 100)
print(f'Marking {implausible.sum()} points as implausible')

# Set implausible values to missing
df.loc[implausible, 'mpg'] = np.nan

# Validate 'engineSize' column
engine_size = df['engineSize']
display(engine_size.describe())
display(engine_size.value_counts(dropna=False))
sns.histplot(engine_size)
