Sample Practical Exam: Professional Data Scientist

Notes from Assignment

Email from Head of Data Science

Task:

  • Perform analysis and write a short report for HODS (Head of Data Science)
  • Document Code and thought processes
  • Prepare and deliver a presentation to the sales team

Goal:

  • Predict prices within 10% of the listed price (the team is currently only at ~30%; this is a soft limit)
  • Document decisions along with work
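
The 10% goal can be read in more than one way, so it helps to pin it down before modelling. The sketch below shows two possible readings: the share of predictions that land within 10% of the listed price, and the mean absolute percentage error. The function names and the example prices are made up for illustration and do not come from the assignment.

```python
def share_within_tolerance(predicted, listed, tolerance=0.10):
    """Fraction of predictions whose error is at most `tolerance` of the listed price."""
    if len(predicted) != len(listed) or not listed:
        raise ValueError("predicted and listed must be non-empty and of equal length")
    hits = sum(abs(p - l) <= tolerance * l for p, l in zip(predicted, listed))
    return hits / len(listed)


def mean_abs_pct_error(predicted, listed):
    """Mean of |predicted - listed| / listed over all cars."""
    if len(predicted) != len(listed) or not listed:
        raise ValueError("predicted and listed must be non-empty and of equal length")
    return sum(abs(p - l) / l for p, l in zip(predicted, listed)) / len(listed)


# Made-up prices in GBP, for illustration only
listed_prices = [10000.0, 8000.0, 15000.0, 5000.0]
predicted_prices = [10500.0, 9500.0, 14000.0, 5600.0]

share = share_within_tolerance(predicted_prices, listed_prices)
mape = mean_abs_pct_error(predicted_prices, listed_prices)
print(f"Share within 10%: {share:.0%}, mean absolute percentage error: {mape:.1%}")
```

Which reading the Head of Data Science has in mind is an open question; the "around 30% off" figure from the sales team sounds closer to an average error.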

Email from Sales Team:

  • A team member who is good at estimating car sale prices is retiring
  • Team estimates are usually around 30% off
  • Asking for initial thoughts
  • Presentation to two sales managers

Business question: "Can we predict prices accurately enough using Data Science?"

Company background

  • Sells used cars, offering great value at a low price
  • New cars will be required to have zero emissions; cars running on diesel and petrol are expected to have a lower value after 2030

Data Information

  • Information from online adverts
  • Buyers typically want to know the road tax (determined based on age, emissions, fuel type)
  • Electric cars do not pay road tax currently
  • The sales team has pulled listings; we don't know how the cars sold

Import packages and define utility functions

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import missingno as msno

Data validation

Validation

6738 entries are included before validation.

Validating individual Columns

  • model: Parsed as object, 18 possible values are expected, this was verified. The unique values look good, no obvious spelling errors. Some model names might overlap (e.g. Verso, Verso-S, and PROACE VERSO) but verifying and fixing this needs more expertise than I have. I converted to a category type.
  • year: Parsed as integer, that's good. Ranges from 1998 to 2020. Cars from 2017 account for almost a third of the dataset.
  • price: Parsed as integer, converted to float. Ranges from 850 to ~60k GBP. Looks plausible.
  • transmission: Parsed as object, 4 unique values present as expected. Converted to category.
  • mileage: Parsed as integer, ranges from 2 to 174k. All values are plausible.
  • fuelType: Parsed as object, only the four expected values are present. Converted to category
  • tax: Parsed as integer. Values lie around 200 GBP, with a maximum of 500 GBP; this seems plausible.
  • mpg: Parsed as float, ranges from 2 to 250. Both extremely low (below 20) and extremely high (above 100) values seem implausible and are set to null.
  • engineSize: Parsed as float. 16 different values are expected; values range from 0.0 to 4.5, which looks plausible.
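
The per-column checks listed above could also be collected in one small helper. This is only a sketch: `check_column` is a hypothetical function, and the toy frame and its category names are invented for illustration rather than taken from the dataset. The range 20 to 100 mirrors the mpg plausibility thresholds used in this notebook.

```python
import pandas as pd


def check_column(series, expected_unique=None, min_value=None, max_value=None):
    """Return simple validation results for one column as a dict."""
    result = {"dtype": str(series.dtype), "n_missing": int(series.isna().sum())}
    if expected_unique is not None:
        # Compare the number of distinct non-null values with the expectation
        result["unique_ok"] = series.nunique() == expected_unique
    if min_value is not None and max_value is not None:
        # Count non-null values outside the closed interval [min_value, max_value]
        out_of_range = ~series.between(min_value, max_value) & series.notna()
        result["n_out_of_range"] = int(out_of_range.sum())
    return result


# Tiny invented frame, for illustration only
toy = pd.DataFrame({
    "transmission": ["Manual", "Automatic", "Manual", "Semi-Auto"],
    "mpg": [55.4, 2.8, 235.0, 67.3],
})

transmission_report = check_column(toy["transmission"], expected_unique=3)
mpg_report = check_column(toy["mpg"], min_value=20, max_value=100)
print(transmission_report)
print(mpg_report)
```

A helper like this keeps the expectations (number of categories, plausible ranges) in one place, which makes them easier to document in the report.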

As this is a proof-of-concept setting, we'll throw out the 41 implausible mpg values. In a production setting this should be handled differently. After data validation, 6697 data points remain.
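
The validation code in this notebook marks implausible mpg values as NaN, but the step that removes those rows is not shown. One way it might be done is sketched here on an invented toy frame; using `dropna` restricted to the mpg column is an assumption, not necessarily what was run.

```python
import numpy as np
import pandas as pd

# Invented listings, for illustration only
toy = pd.DataFrame({
    "price": [12000.0, 8500.0, 15000.0, 6000.0, 9900.0],
    "mpg": [55.4, 2.8, 235.0, 67.3, 48.7],
})

# Same plausibility rule as in the notebook: keep 20 <= mpg <= 100
implausible = (toy["mpg"] < 20) | (toy["mpg"] > 100)
toy.loc[implausible, "mpg"] = np.nan

# Drop only rows where mpg is missing, then confirm how many were removed
n_before = len(toy)
cleaned = toy.dropna(subset=["mpg"]).reset_index(drop=True)
n_dropped = n_before - len(cleaned)
print(f"Dropped {n_dropped} of {n_before} rows; {len(cleaned)} remain")
```

Printing the row counts before and after makes it easy to check the numbers quoted in the report against what the code actually did.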

General notes

  • There are only 18 different models; the dataset name suggests a single brand (Toyota)
# Load the listings; df.info() shows dtypes and non-null counts
df = pd.read_csv('toyota.csv')
df.info()
# Validate the "model" column
model = df['model']
unique_models = model.nunique()
print(f"There are {unique_models} unique models")
model.value_counts(dropna=False).sort_index()
df['model'] = df['model'].astype('category')
# Validate the "year" column
year = df['year']
display(year.describe())
sns.histplot(year,bins=np.arange(1990.5,2031.5,1))
plt.title("Listings per Year")
# Validate price column
# Store the float conversion in the dataframe so it persists
df['price'] = df['price'].astype('float')
price = df['price']
display(price.describe())
sns.histplot(price)
# Validate transmission column
display(df['transmission'].value_counts(dropna=False))
transmission = df['transmission'].astype('category')
sns.countplot(data=df,x='transmission')
df['transmission']=transmission
plt.xlabel("")
plt.title("Transmission types")
# Validate mileage column
mileage = df['mileage'].astype('float')
display(df['mileage'].describe())
sns.histplot(mileage)
# Validate fuelType column
display(df['fuelType'].value_counts(dropna=False))
fuelType = df['fuelType'].astype('category')
sns.countplot(data=df,x='fuelType')
df['fuelType']=fuelType
plt.xlabel("")
plt.title("Fuel types")
# Validate 'tax' column
tax = df.tax
sns.boxplot(data=df,y='tax', x='fuelType')
plt.title("Road Tax by Fuel Type")
# Call the label functions (assigning to them would overwrite them)
plt.xlabel('')
plt.ylabel('Tax / GBP')
fig = plt.figure(figsize=(10,6))
axes = fig.subplots(2,2)

# One panel per fuel type; the loop variable is named fuel_type so it does not shadow the fuelType series
for fuel_type, ax in zip(df.fuelType.unique(), axes.flatten()):
    sns.scatterplot(data=df[df.fuelType == fuel_type], y='tax', x='year', ax=ax)
    ax.set_title(f"Road Tax by Year (Fuel Type: {fuel_type})")
    ax.set_xlabel('Year')
    ax.set_ylabel('Tax / GBP')
fig.tight_layout()
# Validate mpg column
mpg = df['mpg']
display(mpg.describe())
sns.histplot(mpg)
implausible = np.logical_or((mpg < 20) , (mpg > 100))
print(f'Marking {implausible.sum()} points as implausible')

df.loc[implausible,'mpg'] = np.nan
# Validate engineSize column
engine_size = df['engineSize']
display(engine_size.describe())
display(engine_size.value_counts(dropna=False))
sns.histplot(engine_size)