Data Scientist Professional
Example Practical Exam Solution
You can find the project information that accompanies this example solution in the resource center, Practical Exam Resources.
Data Validation
The data set has 6738 rows and 9 columns. I validated every variable against the data dictionary and made no changes, because all the columns match their descriptions:
- model: 18 models without missing values, same as the description. No cleaning is needed.
- year: 23 unique values without missing values, from 1998 to 2020, same as the description. No cleaning is needed.
- price: numeric values without missing values, same as the description. No cleaning is needed.
- transmission: 4 categories without missing values, same as the description. No cleaning is needed.
- mileage: numeric values, same as the description. No cleaning is needed.
- fuelType: 4 categories without missing values, same as the description. No cleaning is needed.
- mpg: numeric values without missing values, same as the description. No cleaning is needed.
- engineSize: 16 possible values without missing values, same as the description. No cleaning is needed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import r2_score, mean_squared_error
plt.style.use('ggplot')
df = pd.read_csv('data/toyota.csv')
df.info()
# Validate that model has 18 possible values
df['model'].nunique()
# Validate that year of manufacture runs from 1998 to 2020
df['year'].unique()
# Validate the four types of transmission
df['transmission'].unique()
# Validate the four fuel types
df['fuelType'].unique()
# Validate that engineSize has 16 possible values
df['engineSize'].nunique()
# Check the numeric variables for negative values
df.describe()
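The individual checks above can also be collected into a single table. The following is a minimal sketch and not part of the original solution: `summarize_columns` is a hypothetical helper, and the small `toy` frame is made-up data standing in for `data/toyota.csv`.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count and unique-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
    })

# Hypothetical stand-in data; the real analysis would pass `df` instead.
toy = pd.DataFrame({
    "model": ["Yaris", "Aygo", "Yaris", "Corolla"],
    "year": [2016, 2018, 2016, 2020],
    "price": [7998, 8500, 8200, 21000],
})
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df)` on the real data would let the unique counts and missing-value counts be compared with the data dictionary in one place.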
Exploratory Analysis
I have investigated the target variable, the features of the car, and the relationship between the target variable and the features. After the analysis, I decided to apply the following changes to enable modeling:
- price: apply a log transformation
- tax: create a new ordinal variable from the tax variable
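As an illustration of the second change, one common way to turn a numeric column into an ordinal one is to bin it with `pd.cut`. This is only a sketch under stated assumptions: the tax values, the cut points in `bins` and the band names in `labels` are all hypothetical and are not taken from the original solution.

```python
import pandas as pd

# Hypothetical tax values and band edges, for illustration only.
tax = pd.Series([0, 20, 30, 125, 145, 150, 265, 330])
bins = [-1, 50, 150, 600]            # assumed cut points, not from the source
labels = ["low", "medium", "high"]   # assumed band names

# pd.cut returns an ordered categorical, so low < medium < high.
tax_band = pd.cut(tax, bins=bins, labels=labels)

# Integer codes keep that order and can be fed to a model.
tax_code = tax_band.cat.codes
print(pd.DataFrame({"tax": tax, "band": tax_band, "code": tax_code}))
```

The integer codes preserve the ordering of the bands, which is what distinguishes an ordinal encoding from an arbitrary label encoding.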
Target Variable - Price
Since we need to predict the price, the price variable is our target variable. The histogram on the left below shows a long right tail. Therefore, we apply a log transformation to the price variable; the resulting distribution, on the right below, is close to a normal distribution.
fig, axes = plt.subplots(1,2,figsize=(15,5))
sns.histplot(df['price'],ax=axes[0]).set(title='The Distribution of Target Variable - Price')
sns.histplot(df['price'],log_scale=True,ax=axes[1]).set(title='The Distribution of Target Variable - Price (Log Scale)');
df['price'] = np.log(df['price'])
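Because the target is now stored on the log scale, anything predicted from it is also on the log scale and needs `np.exp` to be read as a price again. A minimal sketch of the round trip, using made-up prices rather than the Toyota data:

```python
import numpy as np

# Hypothetical right-skewed prices (not taken from the data set).
prices = np.array([5000.0, 7500.0, 9000.0, 12000.0, 60000.0])

log_prices = np.log(prices)      # same transform as applied to df['price']
recovered = np.exp(log_prices)   # inverse transform back to the price scale

# The log scale compresses the long right tail: compare the
# max-to-min ratio before and after the transform.
ratio_raw = prices.max() / prices.min()
ratio_log = log_prices.max() / log_prices.min()
print(ratio_raw, ratio_log)
```

Note that error metrics computed on the log scale are not in price units; whether to report them before or after back-transformation is a choice to state explicitly.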
Numeric Variables - Mileage, Tax, mpg
From the heatmap below, we can conclude that there is a moderate negative linear relationship within two pairs of variables: log-transformed price and mileage, and tax and mpg.
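The numbers behind a correlation heatmap can also be inspected directly as a matrix. The sketch below uses a small hypothetical frame, not the Toyota data, with values chosen so that price falls as mileage rises:

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({
    "price": [9.9, 9.6, 9.4, 9.1, 8.8],            # log-scale prices
    "mileage": [5000, 20000, 35000, 60000, 90000],
    "mpg": [55.0, 60.0, 58.0, 65.0, 70.0],
})

# DataFrame.corr computes pairwise Pearson correlations by default.
corr = toy.corr()
print(corr.round(2))
```

A matrix like `corr` is what `sns.heatmap(corr, annot=True)` would draw; negative entries correspond to pairs where one variable tends to fall as the other rises.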