Data Scientist Professional

Example Practical Exam Solution

You can find the project information that accompanies this example solution in the resource center, Practical Exam Resources.

Data Validation

The dataset has 6738 rows and 9 columns. I validated every variable against the data dictionary and made no changes after validation. All the columns are as described in the data dictionary:

  • model: 18 models without missing values, same as the description. No cleaning is needed.
  • year: 23 unique values without missing values, from 1998 to 2020, same as the description. No cleaning is needed.
  • price: numeric values without missing values, same as the description. No cleaning is needed.
  • transmission: 4 categories without missing values, same as the description. No cleaning is needed.
  • mileage: numeric values, same as the description. No cleaning is needed.
  • fuelType: 4 categories without missing values, same as the description. No cleaning is needed.
  • mpg: numeric values without missing values, same as the description. No cleaning is needed.
  • engineSize: 16 possible values without missing values, same as the description. No cleaning is needed.
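
The checks listed above could be reproduced with a short pandas summary. This is a minimal sketch rather than the code from the hidden cells: the `validate` helper and the sample values are hypothetical, and only the column names come from the data dictionary.

```python
import pandas as pd


def validate(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing-value count, and unique-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(),
    })


# Made-up rows standing in for the real data (illustrative values only).
sample = pd.DataFrame({
    "model": ["Yaris", "ModelB", "Yaris"],
    "year": [2016, 2017, 2016],
    "price": [9000, 7500, 9500],
    "transmission": ["Manual", "Manual", "Automatic"],
    "fuelType": ["Petrol", "Petrol", "Hybrid"],
})

summary = validate(sample)
print(summary)
```

On the real data, the `n_unique` and `n_missing` columns would be compared against the counts quoted in the list above.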

Exploratory Analysis

I investigated the target variable, the features of the car, and the relationship between the target variable and the features. After the analysis, I decided to apply the following changes to enable modeling:

  • Price: apply a log transformation
  • Tax: create a new ordinal variable from the tax variable

Target Variable - Price

Since we need to predict the price, the price variable is our target variable. The histogram on the left below shows a long right tail. Therefore, we apply a log transformation to the price variable; the resulting distribution, on the right below, is close to a normal distribution.
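
As a rough illustration of why the log transformation helps, the sketch below compares the skewness of a right-skewed series before and after taking the natural log. The price values are made up for the example, and the use of skewness as a check is an assumption; the original analysis relied on histograms.

```python
import numpy as np
import pandas as pd

# Made-up prices with one expensive car creating a long right tail.
prices = pd.Series([5000, 6000, 7000, 8000, 9000, 12000, 30000], name="price")

# Natural log compresses large values more than small ones.
log_price = np.log(prices)

skew_raw = prices.skew()
skew_log = log_price.skew()
print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

Predictions made on the log scale can be mapped back to prices with `np.exp`.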

Numeric Variables - Mileage, Tax, mpg

From the heatmap below, we can conclude that there is a moderate negative linear relationship in two pairs of variables: log-transformed price and mileage, and tax and mpg.
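
The correlations behind such a heatmap can be computed with pandas. The following is a sketch on made-up values, assuming Pearson correlation (the pandas default); the `log_price` column name is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows (illustrative only); column names follow the data dictionary.
cars = pd.DataFrame({
    "price": [12000, 10500, 9000, 8000, 6500, 5000],
    "mileage": [5000, 15000, 30000, 42000, 60000, 85000],
    "tax": [145, 145, 30, 30, 0, 0],
    "mpg": [55.0, 57.0, 65.0, 66.0, 70.0, 72.0],
})
cars["log_price"] = np.log(cars["price"])  # hypothetical column name

# Pairwise Pearson correlations between the numeric variables.
corr = cars[["log_price", "mileage", "tax", "mpg"]].corr()
print(corr.round(2))
```

The resulting matrix could then be passed to a plotting function such as seaborn's `heatmap` to draw the figure.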

Relationship between mpg, tax, mileage and price

To spot non-linear relationships, I made scatterplots to further investigate how mpg, tax, and mileage relate to our target variable, price. The scatterplots below show a linear relationship between mileage and price and no relationship between price and mpg. The scatterplot of price against tax shows clusters, so I decided to create a new ordinal variable from the tax variable.
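
One way to build an ordinal variable from clustered values is to bin them with `pd.cut`. The sketch below is only an illustration: the bin edges, the tax values, and the `tax_level` name are hypothetical, since the cut points actually used are not stated here.

```python
import pandas as pd

# Made-up tax values forming three loose clusters (illustrative only).
tax = pd.Series([0, 20, 30, 125, 145, 150, 260, 300], name="tax")

# Hypothetical bin edges; in practice they would be read off the scatterplot.
bins = [-1, 50, 200, float("inf")]

# labels=False returns the integer index of each bin: 0 = low, 1 = medium, 2 = high.
tax_level = pd.cut(tax, bins=bins, labels=False)
print(tax_level.tolist())
```

Because the codes preserve the order of the bins, the new variable can be treated as ordinal in a model.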

Categorical Variables - Year, Engine Size, Model, Transmission, fuelType

Characteristics of Year, Engine Size, Model, Transmission, and fuelType

Since year and engine size are the variables most related to price, I checked their characteristics. From the bar chart below, the most common manufacture year is 2016, and the most common engine size is 1.

The bar charts below show the most frequent categories of the model, transmission, and fuelType variables in the dataset: Yaris, Manual, and Petrol, respectively.
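
The most frequent category of each variable can also be read directly from value counts, which is the same information the bar charts display. This is a minimal sketch on made-up rows; the `most_common` name is hypothetical.

```python
import pandas as pd

# Made-up rows (illustrative only); column names follow the data dictionary.
cars = pd.DataFrame({
    "model": ["Yaris", "Yaris", "ModelB", "Yaris"],
    "transmission": ["Manual", "Manual", "Automatic", "Manual"],
    "fuelType": ["Petrol", "Hybrid", "Petrol", "Petrol"],
})

# value_counts() tallies each category; idxmax() picks the most frequent one.
most_common = {col: cars[col].value_counts().idxmax() for col in cars.columns}
print(most_common)
```

The same pattern applies to year and engine size when they are treated as categories.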
