DataCamp Certification Case Study

Project Brief

You have been hired as a data scientist at Discount Motors, a used car dealership in the UK. The dealership is expanding and has hired a large number of junior salespeople. Although promising, these junior employees have difficulties pricing used cars that arrive at the dealership. Sales have declined 18% in recent months, and management would like your help designing a tool to assist these junior employees.

To start with, they would like you to work with the Toyota specialist to test your idea(s). They have collected some data from other retailers on the price that a range of Toyota cars were listed at. It is known that cars that are more than £1500 above the estimated price will not sell. The sales team wants to know whether you can make predictions within this range.
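One way to make the £1500 requirement measurable is sketched below. It assumes the requirement means that a prediction should fall within £1500 of the listed price in either direction; the brief could also be read as penalising only overpricing. The function name share_within_tolerance and the toy numbers are illustrative and do not come from the data.

```python
def share_within_tolerance(actual, predicted, tolerance=1500):
    """Return the fraction of predictions within `tolerance` pounds of the actual price."""
    pairs = list(zip(actual, predicted))
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    # Count predictions whose absolute error does not exceed the tolerance
    hits = sum(abs(p - a) <= tolerance for a, p in pairs)
    return hits / len(pairs)


# Toy numbers for illustration only, not taken from data/toyota.csv
actual_prices = [10000, 12000, 8000, 15000]
predicted_prices = [11000, 14000, 8200, 13500]
share = share_within_tolerance(actual_prices, predicted_prices)
print(share)
```

A share close to 1 would suggest that the predictions are usable under this reading of the requirement.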

You will need to present your findings in two formats:

  • You must submit a written report summarising your analysis to your manager. As a data science manager, your manager has a strong technical background and wants to understand what you have done and why.
  • You will then need to share your findings with the head of sales in a 10-minute presentation. The head of sales has no data science background but is familiar with basic data-related terminology.

The data you will use for this analysis can be accessed here: "data/toyota.csv"

Initial Exploration

import pandas as pd
import numpy as np
cars = pd.read_csv("data/toyota.csv")
cars.head()

Looking at the data, there is a mix of categorical and continuous features. Price is the target variable. The features model, transmission, and fuelType are categorical, while price and mileage are continuous. The year feature can be treated as either, and further analysis is needed to decide how it will be processed. The features tax, mpg, and engineSize are numeric, but the sample shows many repeated values, which may indicate that they behave like categorical features. These features will be analyzed to determine their proper use. Note that price has no units. Because Discount Motors is based in the UK and provided the data, price is assumed to be in British pounds (£).
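A quick way to judge whether a numeric feature behaves like a categorical one is to count its distinct values. The sketch below is a minimal example on a made-up frame; the helper name summarize_columns and the toy values are assumptions for illustration, and in the notebook the helper would be called on cars instead.

```python
import pandas as pd


def summarize_columns(df):
    """Return the dtype and the number of distinct values for each column."""
    return pd.DataFrame({"dtype": df.dtypes.astype(str), "n_unique": df.nunique()})


# Toy frame for illustration only; column names follow the text, values are made up
toy = pd.DataFrame({
    "model": ["model_a", "model_b", "model_a", "model_c"],
    "price": [9000, 7000, 9500, 11000],
    "engineSize": [1.0, 1.0, 1.5, 1.8],
    "tax": [145, 145, 30, 145],
})
summary = summarize_columns(toy)
print(summary)
```

A numeric column with only a handful of distinct values relative to the number of rows is a candidate for categorical treatment.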

A possible reason tax appears categorical is that tax is determined by location. The location where these cars were sold is not listed. The mpg of a car is not tested for each individual car; it is practical to assign a standard mpg given the model, year, and other features, so this feature might be correlated with other features. Interpreting engineSize requires domain knowledge. An online article (Engine Size Explained by Charlie Harvey) claims that engineSize is a rounded number, which would explain the repeated values. The same article claims that a bigger engineSize usually leads to a higher price, and that diesel cars are more expensive. Checking whether our model captures these relationships would be a quick sanity check.

print(cars.shape)
print(cars.isna().sum())

There are 6738 data points and no missing values (NaNs).

Exploratory Data Analysis

#Year
import seaborn as sns
import matplotlib.pyplot as plt

# Pass the column by keyword; recent seaborn versions no longer accept it positionally
sns.violinplot(x="year", data=cars)

plt.title("Distribution Of Years")
plt.show()
print(cars.groupby("year").count()["model"].sort_values().head())

Beginning with the year feature, there are very few data points for older cars, sometimes as few as 1 per year. It would be reasonable to treat these as outliers and remove them.
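Removing the sparsely populated years could look like the sketch below. The cut-off min_count is an assumption that would need to be chosen after inspecting the counts printed above, and the helper name drop_sparse_years and the toy frame are illustrative only.

```python
import pandas as pd


def drop_sparse_years(df, min_count=10):
    """Keep only rows whose year appears at least `min_count` times."""
    # Map each row's year to the number of rows sharing that year
    counts = df["year"].map(df["year"].value_counts())
    return df[counts >= min_count].copy()


# Toy frame for illustration only; values are made up
toy = pd.DataFrame({
    "year": [1998, 2016, 2016, 2016, 2017, 2017],
    "price": [1500, 9000, 9500, 9200, 11000, 10500],
})
filtered = drop_sparse_years(toy, min_count=2)
print(filtered)
```

In the notebook this would be applied as drop_sparse_years(cars, min_count=...), with the threshold justified in the report.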

#Transmission, fuelType, and engineSize
plt.rcParams["figure.figsize"] = (10, 5)
fig_cat, ax_cat = plt.subplots(1, 2)
sns.countplot(x="transmission", data=cars, ax=ax_cat[0])
sns.countplot(x="fuelType", data=cars, ax=ax_cat[1])
fig_cat.suptitle('Counting Values in Categorical Features', fontsize=14)
# Label each bar with its count
for ax_i in ax_cat:
    for p in ax_i.patches:
        ax_i.annotate('{:.0f}'.format(p.get_height()), (p.get_x() + 0.2, p.get_height() + 10))
plt.show()
ax_1 = sns.countplot(x="engineSize", data=cars)
for p in ax_1.patches:
    ax_1.annotate('{:.0f}'.format(p.get_height()), (p.get_x() + 0.1, p.get_height() + 10))
plt.show()

Transmission and fuelType both have a small "other" category. In transmission, it contains few enough data points to be treated as an outlier and dropped. EngineSize also has some possible outliers. An engine size of 1 appears to be the most common, followed by 1.5 and 1.8. The 6 data points with an engineSize of 0 do not make much sense. An initial hypothesis is that these are fully electric cars.

cars.loc[cars["engineSize"]==0]

Inspecting these data points shows that only two are listed as hybrid. A petrol car with an engineSize of 0 is implausible, which makes these data points suspicious and candidates to be dropped.
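If these rows are dropped, the filtering could be sketched as follows. The label "Other" for the small transmission category is an assumption about how it is spelled in the data, and dropping every row with an engineSize of 0 (rather than only the petrol ones) is one possible choice. The helper name drop_suspect_rows and the toy frame are illustrative only.

```python
import pandas as pd


def drop_suspect_rows(df, other_label="Other"):
    """Drop rows in the small transmission category or with an engineSize of 0."""
    suspect = (df["transmission"] == other_label) | (df["engineSize"] == 0)
    return df[~suspect].copy()


# Toy frame for illustration only; values are made up
toy = pd.DataFrame({
    "transmission": ["Manual", "Other", "Automatic", "Manual"],
    "fuelType": ["Petrol", "Petrol", "Hybrid", "Diesel"],
    "engineSize": [1.0, 1.5, 0.0, 1.8],
})
cleaned = drop_suspect_rows(toy)
print(cleaned)
```

Printing the shape before and after the filter makes it easy to report how many data points were removed.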

#Price and Mileage
plt.rcParams["figure.figsize"] = (10,8)
fig_con, ax_con = plt.subplots(2,1)
sns.boxplot(x="price", data=cars, ax=ax_con[0])
sns.boxplot(x="mileage", data=cars, ax=ax_con[1])
fig_con.suptitle('Distribution of Continuous Features', fontsize=14)
plt.show()