Sample Data Scientist Professional Solution (copy)

    Data Scientist Professional

    Example Practical Exam Solution

    You can find the project information that accompanies this example solution in the resource center, Practical Exam Resources.

    Data Validation

    The data set has 6738 rows and 9 columns. I validated every variable against the data dictionary and did not need to make any changes. All the columns are as described in the data dictionary:

    • model: 18 models without missing values, same as the description. No cleaning is needed.
    • year: 23 unique values without missing values, from 1998 to 2020, same as the description. No cleaning is needed.
    • price: numeric values without missing values, same as the description. No cleaning is needed.
    • transmission: 4 categories without missing values, same as the description. No cleaning is needed.
    • mileage: numeric values, same as the description. No cleaning is needed.
    • fuelType: 4 categories without missing values, same as the description. No cleaning is needed.
    • mpg: numeric values without missing values, same as the description. No cleaning is needed.
    • engineSize: 16 possible values without missing values, same as the description. No cleaning is needed.
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.style as style
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import PowerTransformer
    from sklearn.metrics import r2_score,mean_squared_error
    plt.style.use('ggplot')
    df = pd.read_csv('data/toyota.csv')
    df.info()
    # Validate that model has the expected 18 unique values
    df['model'].nunique()
    # Validate that year of manufacture ranges from 1998 to 2020
    df['year'].unique()
    # Validate the four transmission types
    df['transmission'].unique()
    # Validate the four fuel types
    df['fuelType'].unique()
    # Validate that engineSize has 16 unique values
    df['engineSize'].nunique()
    # Check the numeric variables for negative values via summary statistics
    df.describe()
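
The checks above look at unique values and summary statistics. The "without missing values" statements could also be confirmed with an explicit per-column count. A minimal sketch, assuming a small hypothetical frame in place of `data/toyota.csv`; the helper name `summarize_columns` and the sample rows are illustrative and not part of the original notebook:

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count and unique-value count per column."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),
    })

# Hypothetical sample rows, made up for illustration only
sample = pd.DataFrame({
    "model": ["Yaris", "Aygo", "Yaris"],
    "year": [2016, 2018, 2019],
    "price": [9000, 8000, 11000],
    "transmission": ["Manual", "Manual", "Automatic"],
})

summary = summarize_columns(sample)
print(summary)
```

Applied to the real data, the same helper would be called as `summarize_columns(df)`, and any column with `n_missing` above zero would need attention before modeling.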

    Exploratory Analysis

    I have investigated the target variable, the features of the car, and the relationships between the target variable and the features. After the analysis, I decided to apply the following changes to enable modeling:

    • price: apply a log transformation
    • tax: create a new ordinal variable from the tax variable
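
One common way to derive an ordinal variable from a numeric one is to bin it with `pd.cut`. The sketch below is only an illustration of that idea: the tax values, the cut points in `bin_edges`, and the name `tax_level` are all assumptions, since this section does not state which thresholds were used.

```python
import pandas as pd

# Hypothetical tax values and cut points, chosen only for illustration
tax = pd.Series([0, 20, 30, 125, 145, 150, 265], name="tax")
bin_edges = [-1, 0, 100, 200, float("inf")]  # assumed thresholds
labels = [0, 1, 2, 3]                        # ordinal codes, low to high

# pd.cut assigns each value to the right-closed interval it falls in
tax_level = pd.cut(tax, bins=bin_edges, labels=labels).astype(int)
print(tax_level.tolist())
```

Integer codes keep the ordering of the bins, so a model can treat a higher code as a higher tax band.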

    Target Variable - Price

    Since we need to predict the price, the price variable is our target variable. The histogram on the left below shows a long right tail. We therefore apply a log transformation to the price variable; the resulting distribution, on the right below, is close to a normal distribution.

    fig, axes = plt.subplots(1,2,figsize=(15,5))
    sns.histplot(df['price'],ax=axes[0]).set(title='The Distribution of Target Variable - Price')
    sns.histplot(df['price'],log_scale=True,ax=axes[1]).set(title='The Distribution of Target Variable - Price (Log Scale)');
    df['price'] = np.log(df['price'])
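
To illustrate why the log transformation helps, and how values on the log scale can be mapped back to the original unit, here is a minimal sketch. The prices are hypothetical numbers made up to have a long right tail; they are not taken from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices, made up for illustration only
prices = pd.Series([5000, 6000, 7000, 8000, 9000, 12000, 20000, 60000])
log_prices = np.log(prices)

# Skewness shrinks because the log compresses the long right tail
skew_raw = prices.skew()
skew_log = log_prices.skew()
print(f"skew before: {skew_raw:.2f}, after: {skew_log:.2f}")

# Values on the log scale are mapped back to the original unit with np.exp
recovered = np.exp(log_prices)
```

Because `df['price']` now holds log prices, any prediction made on that scale would need the same `np.exp` step before it is reported as a price.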

    Numeric Variables - Mileage, Tax, mpg

    From the heatmap below, we can conclude that two pairs of variables have a moderate negative linear relationship: log-transformed price and mileage, and tax and mpg.
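
A heatmap of this kind is typically drawn from a Pearson correlation matrix. The sketch below computes such a matrix on a few hypothetical rows; the values are made up for illustration and do not reproduce the correlations reported for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, made up for illustration only
sample = pd.DataFrame({
    "price": [16000, 12000, 9000, 7000, 5000],
    "mileage": [10000, 30000, 50000, 70000, 90000],
    "tax": [145, 30, 150, 20, 0],
    "mpg": [45, 60, 40, 65, 70],
})

# Match the notebook: correlations are computed on the log of price
sample["price"] = np.log(sample["price"])

# Pairwise Pearson correlations between the numeric columns
corr = sample.corr()
print(corr.round(2))
```

The matrix can then be passed to `sns.heatmap(corr, annot=True)` to display the coefficients as a heatmap.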