Price Prediction Model (Using Python)

Car Price Prediction (Linear Regression) + Simple Dashboard

Hello and welcome to my first ever personal project since becoming an apprentice. I built it in my free time with the help of Kaggle and other online resources, and I can now present it.

In my final year of university, before I dropped out, I took a module on AI using MATLAB, which had great libraries and tools, so I took the same theory and applied it in Python for this project. In my role I use Tableau, but because I do not have a license for personal projects, I used Power BI for the dashboard.

First Steps

My first step was to clean the original data, which I got from Kaggle, using Power Query.

https://www.kaggle.com/datasets/muhammadawaistayyab/used-cars-prices-in-uk

  • Added an ID column
  • Split Name into Make and Model
  • Changed empty service history to "Part"
  • Removed the rest of the nulls
  • Removed a VW Beetle with 1 million miles (a huge outlier in the data, but surprisingly plausible, as the most miles registered on a non-commercial vehicle is about 3 million)
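For anyone without Power Query, the steps above could be roughly sketched in pandas. This is only an illustration on a tiny made-up sample: the column names (`Name`, `Service history`, and so on), the rule of splitting on the first space, and the 1,000,000 mile cut-off are assumptions, not the exact transformations used for this project.

```python
import pandas as pd

# Tiny made-up sample; the column names are assumptions, not the real export.
raw = pd.DataFrame({
    "Name": ["Ford Focus", "Volkswagen Beetle", "Toyota Yaris", "Hyundai i20"],
    "Price": [5000, 3000, 1150, None],
    "Mileage(miles)": [60000, 1_000_000, 86991, 87500],
    "Service history": ["Full", None, None, "Full"],
})


def clean(df: pd.DataFrame, max_mileage: int = 1_000_000) -> pd.DataFrame:
    """Apply the cleaning steps from the list above to a copy of df."""
    out = df.copy()

    # Add an ID column (1-based row number) at the front
    out.insert(0, "ID", list(range(1, len(out) + 1)))

    # Split Name into Make (first word) and Model (everything after it)
    parts = out["Name"].str.split(" ", n=1, expand=True)
    out["Make"] = parts[0]
    out["Model"] = parts[1]
    out = out.drop(columns="Name")

    # Empty service history becomes "Part"
    out["Service history"] = out["Service history"].fillna("Part")

    # Drop any rows that still contain nulls
    out = out.dropna()

    # Remove the extreme mileage outlier (hypothetical threshold)
    out = out[out["Mileage(miles)"] < max_mileage]

    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

In this sample the Beetle is dropped by the mileage filter and the row with a missing price is dropped by the null filter, leaving the Focus and the Yaris.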

Using Kaggle both for the dataset and as a starting point for understanding scikit-learn, I was able to build a linear regression model that works as a price prediction tool. However, it is quite slow to use, as it only takes one input at a time.

But before I did that, I created a very simple Power BI dashboard. In my apprenticeship I mainly use Tableau, so this was a welcome change.

Basic Dashboarding

Here is the main page of the dashboard.

Here is me about to drill down on Ford vehicles.

Here is the drilled-down Ford section, filtered to just Focuses. You can easily tell they make up more than 100 of the 400 Ford cars. They have a lower average mileage and a lower average price than the whole dataset.
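The same kind of drill-down comparison can be approximated outside Power BI with a few pandas filters. The sketch below uses made-up rows, and the column names and summary layout are assumptions for illustration only; it is not how the dashboard itself was built.

```python
import pandas as pd

# Made-up rows for illustration; the real dataset is much larger.
cars = pd.DataFrame({
    "Make": ["Ford", "Ford", "Ford", "Toyota"],
    "Model": ["Focus", "Focus", "Fiesta", "Yaris"],
    "Mileage(miles)": [40000, 60000, 80000, 100000],
    "Price": [4000, 6000, 7000, 11000],
})

# Drill down: all cars -> Ford only -> Ford Focus only
ford = cars[cars["Make"] == "Ford"]
focus = ford[ford["Model"] == "Focus"]

# One row per drill-down level with count and averages
levels = {"All cars": cars, "Ford": ford, "Ford Focus": focus}
summary = pd.DataFrame({
    "cars": {name: len(df) for name, df in levels.items()},
    "avg_mileage": {name: df["Mileage(miles)"].mean() for name, df in levels.items()},
    "avg_price": {name: df["Price"].mean() for name, df in levels.items()},
})

print(summary)
```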

Code

# Import modules and libraries
# pandas for data manipulation and analysis.
# scikit-learn for machine learning, including data preprocessing, model building, and evaluation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Load the dataset from "Sheet 1" using Pandas
file_path = 'Car_Used_UK_Clean.xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet 1')

# Selecting features (X) and target variable (y)
features = ['Make', 'Model', 'Mileage(miles)', 'Registration_Year', 'Fuel type', 'Engine']
X = df[features]
y = df['Price']

# Identifying categorical and numerical features
categorical_features = ['Make', 'Model', 'Fuel type']
numerical_features = ['Mileage(miles)', 'Registration_Year', 'Engine']

# Preprocessing for numerical data: impute missing values with median and scale
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data: impute missing values with the most frequent value and apply one-hot encoding (turning each category into its own 0/1 column)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Creating the preprocessing and modeling pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', LinearRegression())])

# Splitting the dataset into training and test sets (only 1% is held out, because I check the model against my own example cars further down)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)

# Training the model
pipeline.fit(X_train, y_train)

from joblib import dump

# Save the fitted pipeline to a file for future use and reuse
model_filename = 'car_price_model.joblib'
dump(pipeline, model_filename)
# print(f"Model saved to {model_filename}")

# Predicting the Test set results
y_pred = pipeline.predict(X_test)


def predict_price(make, model, mileage, year, fuel_type, engine):
    """Print the predicted price for a single car using the fitted pipeline."""
    # Creating a DataFrame for the input features
    input_data = pd.DataFrame({
        'Make': [make],
        'Model': [model],
        'Mileage(miles)': [mileage],
        'Registration_Year': [year],
        'Fuel type': [fuel_type],
        'Engine': [engine]
    })
    
    # Making prediction using the pipeline
    predicted_price = pipeline.predict(input_data)
    
    # Printing the predicted price
    print(f"The predicted price for the {make} {model} is: £{predicted_price[0]:,.2f}")

# Example usage (the comment on each line is the actual price found on AutoTrader)
predict_price('Hyundai', 'i20', 87500, 2009, 'Petrol', 1.2)   # £2,000
predict_price('Toyota', 'Yaris', 86991, 2004, 'Petrol', 1.0)  # £1,150
predict_price('Ford', 'Mustang', 50885, 2016, 'Petrol', 2.3)  # £20,250

The code above includes some preset cars, with the actual price found on AutoTrader in the comment next to each one. Please try it out yourself.

As you can see, the prediction for the Ford Mustang is very off, but in the raw data there was only one other Ford Mustang, so the model has very little to base it on.

This data project took me less than 4 hours (as can be seen) but was fun nonetheless.
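Since the tool only takes one car at a time, one possible way to speed things up is to pass a whole DataFrame of cars to the pipeline in a single `predict` call. The sketch below is an assumption about how that could look, not part of the original project: `predict_prices`, `toy_pipeline`, and the training rows are hypothetical and made up so the example runs on its own. With the real project, the already fitted `pipeline` could be passed in instead.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

FEATURES = ['Make', 'Model', 'Mileage(miles)', 'Registration_Year', 'Fuel type', 'Engine']

# Made-up training rows so the example is self-contained (not real data)
train = pd.DataFrame([
    ['Hyundai', 'i20', 90000, 2009, 'Petrol', 1.2, 2000],
    ['Hyundai', 'i20', 40000, 2015, 'Petrol', 1.2, 6000],
    ['Toyota', 'Yaris', 85000, 2005, 'Petrol', 1.0, 1200],
    ['Toyota', 'Yaris', 30000, 2016, 'Petrol', 1.3, 8000],
    ['Ford', 'Focus', 70000, 2012, 'Diesel', 1.6, 4000],
    ['Ford', 'Focus', 20000, 2018, 'Petrol', 1.0, 11000],
], columns=FEATURES + ['Price'])

# A small stand-in for the project's pipeline (same idea, fewer steps)
toy_pipeline = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['Mileage(miles)', 'Registration_Year', 'Engine']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Make', 'Model', 'Fuel type']),
    ])),
    ('model', LinearRegression()),
])
toy_pipeline.fit(train[FEATURES], train['Price'])


def predict_prices(cars: pd.DataFrame, fitted_pipeline) -> pd.DataFrame:
    """Return a copy of `cars` with a Predicted_Price column added."""
    result = cars.copy()
    # One predict call handles every row at once
    result['Predicted_Price'] = fitted_pipeline.predict(cars[FEATURES])
    return result


# Several cars predicted in a single call
cars = pd.DataFrame([
    ['Hyundai', 'i20', 87500, 2009, 'Petrol', 1.2],
    ['Toyota', 'Yaris', 86991, 2004, 'Petrol', 1.0],
    ['Ford', 'Focus', 50000, 2016, 'Petrol', 1.0],
], columns=FEATURES)

predictions = predict_prices(cars, toy_pipeline)
print(predictions[['Make', 'Model', 'Predicted_Price']])
```

The same approach would also work on a spreadsheet of cars loaded with `pd.read_excel`, provided it has the same feature columns.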
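To put a number on how far off predictions are, rather than checking a few cars by hand, one option is to hold out a larger share of the data and compute error metrics on it. The sketch below is only an illustration under stated assumptions: the data is synthetic (generated from a made-up linear formula plus noise), and the 20% test split is my choice for the example, not the 1% used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (not the Kaggle dataset): price falls with mileage, rises with year
rng = np.random.default_rng(0)
n = 200
makes = rng.choice(['Ford', 'Toyota', 'Hyundai'], size=n)
mileage = rng.uniform(10_000, 120_000, size=n)
year = rng.integers(2005, 2021, size=n)
make_offset = pd.Series(makes).map({'Ford': 0, 'Toyota': 1500, 'Hyundai': -1000}).to_numpy()
price = 20_000 - 0.08 * mileage + 600 * (year - 2005) + make_offset + rng.normal(0, 500, size=n)

data = pd.DataFrame({
    'Make': makes,
    'Mileage(miles)': mileage,
    'Registration_Year': year,
    'Price': price,
})

X = data[['Make', 'Mileage(miles)', 'Registration_Year']]
y = data['Price']

# Hold out 20% of the rows for evaluation (an assumption for this sketch)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['Mileage(miles)', 'Registration_Year']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Make']),
    ])),
    ('model', LinearRegression()),
])
model.fit(X_train, y_train)

# Compare predictions with the held-out actual prices
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean absolute error: £{mae:,.2f}")
print(f"R² score: {r2:.3f}")
```

On real data, a per-model row count such as `df['Model'].value_counts()` would also show which models, like the Mustang, have too few examples for the prediction to be trusted.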