Car Price Prediction (Linear Regression) + Simple Dashboard
Hello and welcome to my first ever self-directed project since becoming an apprentice. Using my free time, along with Kaggle and other online resources, I can now present this.
In my final year of university, before I dropped out, I did a module on AI using MATLAB, which had great libraries and tools, so I took the same theory and used Python for this project. In my role I use Tableau, but because I do not have a license for personal projects, I used Power BI for this.
First Steps
My first step was to clean the original data I got from Kaggle using Power Query (a rough pandas equivalent is sketched after this list):
https://www.kaggle.com/datasets/muhammadawaistayyab/used-cars-prices-in-uk
- Added ID Column
- Split Name into make and model
- Changed empty service history to "Part"
- Removed rest of nulls
- Removed the VW Beetle with 1 million miles (a huge outlier in the data, but surprisingly plausible, as the most miles registered on a non-commercial vehicle is around 3 million)
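For reference, here is a rough pandas equivalent of those Power Query steps. It is only a sketch: the raw file name and some column names (such as 'Name' and 'Service history') are assumptions about the Kaggle download.

import pandas as pd

# Rough pandas equivalent of the Power Query cleaning steps
raw = pd.read_csv('used_cars_UK.csv')  # hypothetical file name for the Kaggle download
raw.insert(0, 'ID', list(range(1, len(raw) + 1)))  # add an ID column
raw[['Make', 'Model']] = raw['Name'].str.split(' ', n=1, expand=True)  # split Name into Make and Model
raw['Service history'] = raw['Service history'].fillna('Part')  # empty service history -> "Part"
raw = raw.dropna()  # remove the rest of the nulls
raw = raw[raw['Mileage(miles)'] < 1_000_000]  # drop the 1-million-mile VW Beetle outlier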
Using Kaggle both to get the dataset and as the basis for understanding scikit-learn, I was able to create a linear model to use as a price prediction tool. However, it is quite slow to use, as it only takes one input at a time (a batch version is sketched after the code).
But before I did that, I created a very simple Power BI dashboard. In my apprenticeship I mainly use Tableau, so this was a welcome change.
Basic Dashboarding
Here is the main page of the dashboard
Here is me about to drill down on Ford vehicles
Here is the drilled-down Ford section, looking at just Focuses. You can easily tell they make up more than 100 of the 400 Ford cars, and they have both a lower average mileage and a lower average cost than the whole dataset.
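Those drill-down figures can also be reproduced in pandas as a sanity check. This is a minimal sketch, assuming the same cleaned workbook that the model code below reads:

import pandas as pd

df = pd.read_excel('Car_Used_UK_Clean.xlsx', sheet_name='Sheet 1')

# Compare Ford Focuses against all Fords and against the whole dataset
fords = df[df['Make'] == 'Ford']
focuses = fords[fords['Model'] == 'Focus']
print(f"{len(focuses)} Focuses out of {len(fords)} Fords")
print(f"Focus avg mileage: {focuses['Mileage(miles)'].mean():,.0f} vs dataset: {df['Mileage(miles)'].mean():,.0f}")
print(f"Focus avg price: £{focuses['Price'].mean():,.0f} vs dataset: £{df['Price'].mean():,.0f}")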
Code
#Import Modules and Libraries
#Pandas for data manipulation and analysis.
#Scikit-Learn for machine learning, including data preprocessing, model building, and evaluation.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Load the dataset from "Sheet 1" using Pandas
file_path = 'Car_Used_UK_Clean.xlsx'
df = pd.read_excel(file_path, sheet_name='Sheet 1')
# Selecting features (X) and target variable (Y)
features = ['Make', 'Model', 'Mileage(miles)', 'Registration_Year', 'Fuel type', 'Engine']
X = df[features]
y = df['Price']
# Identifying categorical and numerical features
categorical_features = ['Make', 'Model', 'Fuel type']
numerical_features = ['Mileage(miles)', 'Registration_Year', 'Engine']
# Preprocessing for numerical data: impute missing values with median and scale
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Preprocessing for categorical data: impute missing values with the most frequent value and apply one-hot encoding (turning categories to a number)
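# e.g. each distinct value in 'Fuel type' (such as Petrol) becomes its own 0/1 column,
# so the linear model can use the categorical features alongside the numeric ones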
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# Creating the preprocessing and modeling pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', LinearRegression())])
# Splitting the dataset into training and test sets (only 1% held back, since I test with my own example cars below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)
# Training the model
pipeline.fit(X_train, y_train)
from joblib import dump
# Save the model to a file
model_filename = 'car_price_model.joblib' #saving model for future use and reuse
dump(pipeline, model_filename)
#print(f"Model saved to {model_filename}")
# Predicting the Test set results
y_pred = pipeline.predict(X_test)
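# Optional quick check of the fit on the held-out rows (a minimal sketch using
# sklearn.metrics; with only a 1% test split these numbers will be noisy)
from sklearn.metrics import mean_absolute_error, r2_score
print(f"MAE: £{mean_absolute_error(y_test, y_pred):,.2f}")
print(f"R^2: {r2_score(y_test, y_pred):.2f}")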
def predict_price(make, model, mileage, year, fuel_type, engine):
    # Creating a DataFrame for the input features
    input_data = pd.DataFrame({
        'Make': [make],
        'Model': [model],
        'Mileage(miles)': [mileage],
        'Registration_Year': [year],
        'Fuel type': [fuel_type],
        'Engine': [engine]
    })
    # Making the prediction using the pipeline
    predicted_price = pipeline.predict(input_data)
    # Printing the predicted price
    print(f"The predicted price for the {make} {model} is: £{predicted_price[0]:,.2f}")
# Example usage
predict_price('Hyundai', 'i20', 87500, 2009, 'Petrol', 1.2) #£2000
predict_price('Toyota', 'Yaris', 86991, 2004, 'Petrol', 1.0) #£1150
predict_price('Ford', 'Mustang', 50885, 2016, 'Petrol', 2.3) #£20250
Here is the code with some preset cars, along with the actual prices (found on Auto Trader) in the comments. Please try it out yourself.
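Because predict_price scores one car per call, it gets slow if you want to check lots of cars. The pipeline itself will happily score a whole DataFrame in one go; here is a minimal sketch reusing the trained pipeline and the same three example cars:

# Score several cars at once by passing a multi-row DataFrame to the pipeline
cars = pd.DataFrame({
    'Make': ['Hyundai', 'Toyota', 'Ford'],
    'Model': ['i20', 'Yaris', 'Mustang'],
    'Mileage(miles)': [87500, 86991, 50885],
    'Registration_Year': [2009, 2004, 2016],
    'Fuel type': ['Petrol', 'Petrol', 'Petrol'],
    'Engine': [1.2, 1.0, 2.3]
})
for (_, car), price in zip(cars.iterrows(), pipeline.predict(cars)):
    print(f"{car['Make']} {car['Model']}: £{price:,.2f}")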
As you can see, the Ford Mustang prediction is very far off, but in the raw data there was only one other Ford Mustang, so the model had very little to base it on.
This data project took me less than 4 hours (as can probably be seen) but was fun nonetheless.
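A quick way to see how thin the data is for a given model before trusting its prediction is to count the rows, using the df already loaded above:

# Count how many examples of each make/model the model was trained on
counts = df.groupby(['Make', 'Model']).size()
print(counts.loc[('Ford', 'Mustang')])  # very few Mustangs, so this prediction is shaky
print(counts.loc[('Ford', 'Focus')])    # plenty of Focuses by comparison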