Enhancing Car Price Estimation in the Wake of a Key Sales Team Member's Retirement

Problem Definition

Objective

To develop a reliable and automated car price estimation model that can effectively replace the expertise of a retiring, highly experienced sales team member.

Background

Discount Motors faces a significant challenge as their most skilled sales team member, known for their exceptional ability to estimate car sales prices accurately, is retiring next month. This individual's departure poses a risk to their sales performance, as current team members' price estimates are typically about 30% off the mark, compared to the retiree's precise estimations.

The current process relies heavily on the retiring team member's expertise, where they assess car prices based on various information parameters. The goal is to transition from this manual, expert-dependent approach to a more systematic and automated process.

The new model should aim to estimate car prices with an accuracy within 10% of the actual selling price, significantly improving upon the current team's 30% deviation rate. Achieving this level of accuracy is critical for maintaining sales efficiency and competitiveness in the market.

Given the imminent retirement of the key team member, there is an urgency to develop a preliminary solution. This project's progress and findings will be shared in a presentation to the sales managers, facilitating informed decision-making for future strategies.

This project will leverage historical sales data, including car features such as model, year, mileage, fuel type, and others, to train a predictive model. This model aims to capture the retiring team member's expertise in a scalable, automated fashion.

Import & Cleaning

Loading and Inspecting the Toyota Dataset

Once the dataset has been loaded into a dataframe the info() method is used to print a concise summary of the dataframe. This summary includes the number of entries, the number of columns, the data types of each column, and the amount of memory used. This method is particularly useful for getting a quick overview of the dataframe, especially in understanding the types of data each column contains, and identifying if there are any missing values. In this case there are no missing values.

import pandas as pd

# Load the data into a DataFrame
df = pd.read_csv('toyota.csv')

#  Display information about the dataframe to better understand data
df.info()

Data Validation Checks for Toyota Dataset

Column Name	Details
model	Character, the model of the car, 18 possible values
year	Numeric, year of registration from 1998 to 2020
price	Numeric, listed value of the car in GBP
transmission	Character, one of "Manual", "Automatic", "Semi-Auto" or "Other"
mileage	Numeric, listed mileage of the car at time of sale
fuelType	Character, one of "Petrol", "Hybrid", "Diesel" or "Other"
tax	Numeric, road tax in GBP. Calculated based on CO2 emissions or a fixed price depending on the age of the car.
mpg	Numeric, miles per gallon as reported by manufacturer
engineSize	Numeric, listed engine size, one of 16 possible values

Next I will perform a series of data validation checks on the dataframe. Here's a breakdown of the validation process:

Model Validation:

Checks if the 'model' column has exactly 18 unique values, aligning with the expected variety of car models.

Year Validation:

Ensures all values in the 'year' column fall within the range of 1998 to 2020, which is the expected period of vehicle registration.

Price Validation:

Verifies that each entry in the 'price' column is a numeric value and non-negative, as prices cannot be negative.

Transmission Validation:

Confirms that each value in the 'transmission' column is one of the specified categories: "Manual", "Automatic", "Semi-Auto", or "Other".

Mileage Validation:

Ensures that all entries in the 'mileage' column are numeric and non-negative, as negative mileage is not plausible.

Fuel Type Validation:

Checks if the 'fuelType' column contains only the specified fuel types: "Petrol", "Hybrid", "Diesel", or "Other".

Tax Validation:

Validates that the 'tax' column has only numeric and non-negative values, fitting for a financial figure like tax.

Miles Per Gallon (MPG) Validation:

Ensures the 'mpg' column contains only numeric and non-negative values, as negative values for miles per gallon are not meaningful.

Engine Size Validation:

Checks if the 'engineSize' column contains exactly 16 unique values, as expected from the dataset specifications.

Finally, the results of these validations are aggregated into a dictionary, validation_results, providing a quick overview of whether each column meets its respective validation criteria. In this case everything everything meets the criteria provided by the table above. This step is crucial for ensuring data quality and integrity before proceeding with further analysis or modeling.

# Validation based on the provided table

# 1. Validate 'model': check for 18 unique values
model_validation = df['model'].nunique() == 18

# 2. Validate 'year': should be between 1998 and 2020
year_validation = df['year'].between(1998, 2020).all()

# 3. Validate 'price': should be numeric and non-negative
price_validation = df['price'].apply(lambda x: isinstance(x, (int, float)) and x >= 0).all()

# 4. Validate 'transmission': should be one of the specified categories
transmission_categories = ["Manual", "Automatic", "Semi-Auto", "Other"]
transmission_validation = df['transmission'].isin(transmission_categories).all()

# 5. Validate 'mileage': should be numeric and non-negative
mileage_validation = df['mileage'].apply(lambda x: isinstance(x, (int, float)) and x >= 0).all()

# 6. Validate 'fuelType': should be one of the specified categories
fuelType_categories = ["Petrol", "Hybrid", "Diesel", "Other"]
fuelType_validation = df['fuelType'].isin(fuelType_categories).all()

# 7. Validate 'tax': should be numeric and non-negative
tax_validation = df['tax'].apply(lambda x: isinstance(x, (int, float)) and x >= 0).all()

# 8. Validate 'mpg': should be numeric and non-negative
mpg_validation = df['mpg'].apply(lambda x: isinstance(x, (int, float)) and x >= 0).all()

# 9. Validate 'engineSize': check for 16 unique values
engineSize_validation = df['engineSize'].nunique() == 16

# Aggregating validation results
validation_results = {
    "model": model_validation,
    "year": year_validation,
    "price": price_validation,
    "transmission": transmission_validation,
    "mileage": mileage_validation,
    "fuelType": fuelType_validation,
    "tax": tax_validation,
    "mpg": mpg_validation,
    "engineSize": engineSize_validation
}

validation_results

Exploratory Data Analysis

Price Distribution Analysis

‌
‌
‌