Analysis of Used Car Sales
Company Background
Discount Motors is a used car dealership in the UK. They want to lead the way in used cars, selling to customers who want the latest and greatest features without the price tag of a brand-new car.
Objective
The purpose of this analysis is to improve the accuracy of car price estimation and to automate the process. The company's most experienced sales team member, who currently estimates prices, is retiring.
Data Info
The sales team has pulled some data from the website listings from the last 6 months. They haven't told us whether the cars sold, or how long a sale took; we only know that the cars were listed and the prices they were listed at.
# Importing necessary Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.style as style
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.metrics import r2_score,mean_squared_error
plt.style.use('ggplot')
# First few rows of the dataset
df = pd.read_csv('toyota.csv')
df.head()
Data Validation
This data set has 6738 rows and 9 columns. After validation, all variables were consistent with the data dictionary and no modifications were needed:
- model: 18 models without missing values, same as the description. No cleaning is needed.
- year: 23 unique values without missing values, from 1998 to 2020, same as the description. No cleaning is needed.
- price: numeric values without missing values, same as the description. No cleaning is needed.
- transmission: 4 categories without missing values, same as the description. No cleaning is needed.
- mileage: numeric values, same as the description. No cleaning is needed.
- fuelType: 4 categories without missing values, same as the description. No cleaning is needed.
- tax: numeric values without missing values, same as the description. No cleaning is needed.
- mpg: numeric values without missing values, same as the description. No cleaning is needed.
- engineSize: 16 possible values without missing values, same as the description. No cleaning is needed.
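The checks above can also be expressed programmatically. Below is a minimal sketch of such a validation helper; the function name `validate_listings` and the tiny synthetic frame are illustrative stand-ins for the real `toyota.csv`, not part of the original notebook:

```python
import pandas as pd

def validate_listings(df: pd.DataFrame) -> list[str]:
    """Return a list of validation problems (empty list means the data passes)."""
    problems = []
    # No missing values anywhere, per the data dictionary
    if df.isnull().any().any():
        problems.append("missing values present")
    # Registration years should fall in the documented 1998-2020 range
    if not df["year"].between(1998, 2020).all():
        problems.append("year outside 1998-2020")
    # Numeric measurements should never be negative
    for col in ("price", "mileage", "mpg", "engineSize"):
        if (df[col] < 0).any():
            problems.append(f"negative values in {col}")
    return problems

# Tiny synthetic frame as a stand-in for toyota.csv
sample = pd.DataFrame({
    "year": [2015, 2019],
    "price": [9000, 15000],
    "mileage": [42000, 12000],
    "mpg": [55.4, 49.6],
    "engineSize": [1.0, 1.8],
})
print(validate_listings(sample))  # prints []
```

Running the same function on the full listings data would surface any rows that violate the data dictionary.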
# Check all variables in the data against the criteria
df.info()
Check the missing values in the columns
# Missing data counts and percentages
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, round(percent,2)], axis=1, keys=['Total', 'Percent'])
display(missing_data)
# Number of unique values
variables = pd.DataFrame(columns=['Variable','Number of unique values','Values', 'Variable types'])
for i, var in enumerate(df.columns):
    variables.loc[i] = [var, df[var].nunique(), sorted(df[var].unique().tolist()), df[var].dtype]
variables.set_index('Variable', inplace=True)
display(variables,variables.shape[0])
# Check for negative values in the numeric variables
df.describe()
Exploratory Analysis and Data Transformations
Target Variable Investigation
To start the analysis, I focused on the target variable, the price column, and examined its distribution with a histogram (left panel), which revealed a right-skewed shape. To address this, a log transformation is applied after the plots below. The data were explored further with an empirical cumulative distribution function (ECDF) plot, which shows how far the data deviate from a theoretical normal distribution, with selected percentiles plotted as black diamonds.
fig, axes = plt.subplots(1,2,figsize=(15,5))
# Define ecdf function
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y
# Extract price
price = df['price']
# Compute mean and standard deviation
mean_price = np.mean(price)
std_price = np.std(price)
# Compute selected percentiles
percentiles = np.array([2.5, 25, 50, 75, 97.5])
ptiles = np.percentile(price, percentiles)
# Generate theoretical normal distribution with the same mean and spread
samples = np.random.normal(mean_price, std_price, size=10000)
x_theor, y_theor = ecdf(samples)
# Generate empirical distribution
x_emp, y_emp = ecdf(price)
# Histogram of price on the left panel
sns.histplot(df['price'], ax=axes[0]).set(title='The Distribution of Target Variable - Price')
# Plot theoretical and empirical CDFs on the right panel
sns.lineplot(x=x_theor, y=y_theor, ax=axes[1])
sns.scatterplot(x=x_emp, y=y_emp, marker='.', color='blue', ax=axes[1])
sns.scatterplot(x=ptiles, y=percentiles / 100, marker='D', color='black', ax=axes[1])
# Set plot labels and legend
axes[1].set_xlabel('Price')
axes[1].set_ylabel('CDF')
axes[1].legend(('Theoretical', 'Empirical', 'Percentiles'), loc='lower right')
plt.show()
df['price'] = np.log(df['price'])
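A quick way to see why the log transformation helps is to compare skewness before and after. The sketch below uses synthetic log-normal samples as a stand-in for the right-skewed prices (the parameters of `rng.lognormal` are illustrative, not fitted to the Toyota data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Log-normal samples mimic a right-skewed price distribution
prices = pd.Series(rng.lognormal(mean=9.5, sigma=0.4, size=10_000))

# The log compresses the long right tail, pulling skewness toward zero
print(f"skew before log: {prices.skew():.2f}")
print(f"skew after  log: {np.log(prices).skew():.2f}")
```

On the real price column the same before/after comparison can confirm that the transformed variable is much closer to symmetric.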
Numeric Variables Relationships
The heatmap below shows two pairs of variables with a negative linear relationship: log-transformed price and mileage, and tax and mpg.
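A correlation heatmap of this kind can be produced with `seaborn.heatmap` on the numeric columns. The sketch below builds a small synthetic frame with the same numeric columns so it runs stand-alone; on the real data, `df.select_dtypes(include="number").corr()` would be passed instead:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in with the listings' numeric columns (illustrative values)
rng = np.random.default_rng(0)
mileage = rng.uniform(1_000, 100_000, size=500)
df = pd.DataFrame({
    # Log price falls as mileage rises, mimicking the negative relationship
    "price": np.log(30_000 - 0.2 * mileage + rng.normal(0, 1_000, 500)),
    "mileage": mileage,
    "mpg": rng.uniform(30, 70, size=500),
    "tax": rng.uniform(0, 300, size=500),
})

# Pairwise correlations of the numeric variables, annotated on the heatmap
corr = df.corr()
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation of Numeric Variables")
plt.tight_layout()
```

With the synthetic data the price/mileage cell comes out strongly negative, matching the pattern described for the real listings.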