Detect Outliers in Dataset

Detect Outliers

Datasets used to train or test a model sometimes contain outliers. Because such values are unlikely to occur in real-world data, leaving them in can degrade the trained model or distort test results. To prevent this, you can detect and then remove these outliers with the Isolation Forest method.

# Load packages
import numpy as np 
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Load data from the csv file
df = pd.read_csv("housing.csv", index_col=False)
df.head()

After the data is loaded, you can build and apply the Isolation Forest classifier with the help of the IsolationForest algorithm from the sklearn package.

CONTAMINATION = 0.2   # Expected proportion of outliers in the data set
BOOTSTRAP = False     # True to sample each tree's training data with replacement

# Extract the data without column names from the dataframe
data = df.values

# We will start by splitting the data into X/y train/test data
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Next, we will set up the Isolation Forest classifier that will detect the outliers
i_forest = IsolationForest(contamination=CONTAMINATION, bootstrap=BOOTSTRAP)
is_inlier = i_forest.fit_predict(X_train)    # +1 if inlier, -1 if outlier

# Finally, we will build a mask that is True for the inlier rows
mask = is_inlier != -1
# and keep only those rows in the train data
X_train, y_train = X_train[mask, :], y_train[mask]
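
To get a feel for what fit_predict returns and how the contamination setting drives the number of flagged rows, the following is a minimal sketch on synthetic data. The data, the variable names and the fixed random_state are assumptions made for illustration and are not part of the housing workflow.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration): 200 points near the origin
# plus 10 points placed far away from them.
rng = np.random.default_rng(0)
near_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
far_points = rng.uniform(low=8.0, high=10.0, size=(10, 2))
X_demo = np.vstack([near_points, far_points])

flagged_counts = {}
for contamination in (0.05, 0.1, 0.2):
    # random_state is fixed only to make the sketch reproducible
    forest = IsolationForest(contamination=contamination, random_state=0)
    labels = forest.fit_predict(X_demo)    # +1 for inliers, -1 for outliers
    flagged_counts[contamination] = int((labels == -1).sum())

for contamination, n_flagged in flagged_counts.items():
    print(f"contamination={contamination}: {n_flagged} of {len(X_demo)} rows flagged")
```

The number of flagged rows tracks the contamination value closely, because the setting fixes the score threshold rather than discovering how many outliers exist. It is therefore worth checking the result visually before trusting it.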

Before you write the data without outliers to a new csv file, you should first check the result for the average number of rooms (RM) column by comparing two boxplots.

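# Rebuild a dataframe from the filtered train features (the last column, the target, is excluded)
df_wo_outliers = pd.DataFrame(X_train, columns=df.columns[:-1])
# Put the original and the filtered RM values side by side; the shorter column is padded with NaN
df_boxplot = pd.DataFrame(data={'With outliers': df['RM'], 'Without outliers': df_wo_outliers['RM']})
fig = px.box(df_boxplot)
fig.show()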
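
The boxplot comparison can be backed up with a number. The sketch below counts the values that fall outside the usual 1.5 * IQR fences. The helper count_beyond_whiskers and the toy values are hypothetical, and the quartiles use the pandas default interpolation, so the fences may differ slightly from the whiskers plotly draws.

```python
import pandas as pd

def count_beyond_whiskers(values: pd.Series) -> int:
    """Count values outside the 1.5 * IQR fences (hypothetical helper)."""
    values = values.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((values < lower) | (values > upper)).sum())

# Toy stand-in for an RM-like column: seven typical values and two extremes
rm_demo = pd.Series([5.9, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 3.5, 8.8])
n_beyond = count_beyond_whiskers(rm_demo)
print(n_beyond)  # → 2
```

Applied to df['RM'] and df_wo_outliers['RM'], the same helper would give a before and after count to compare.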

As you can see, the vast majority of the outliers have been removed from the RM column. You can run the same check for other columns. To remove more or fewer outliers, increase or decrease CONTAMINATION.
You can now write this dataframe, which holds the train features without outliers, to a csv file.

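# index=False keeps the row index out of the file, so it has the same columns as the dataframe
df_wo_outliers.to_csv('housing_wo_outliers.csv', index=False)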
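
Note that the dataframe written above contains only the feature columns. If the target values should be stored alongside them, one possible approach is sketched below. The arrays, the column names LSTAT and MEDV, and the variable names are hypothetical stand-ins for the filtered X_train, y_train and the real column names.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the filtered train features and targets
feature_names = ["RM", "LSTAT"]
target_name = "MEDV"
X_kept = np.array([[6.5, 4.9], [6.4, 9.1], [7.1, 4.0]])
y_kept = np.array([24.0, 21.6, 34.7])

# Rebuild the feature columns, then append the matching target column
df_clean = pd.DataFrame(X_kept, columns=feature_names)
df_clean[target_name] = y_kept

# Called without a path, to_csv returns the CSV text instead of writing a file
csv_text = df_clean.to_csv(index=False)
print(csv_text)
```

This works because the same mask was applied to X_train and y_train, so their rows still line up.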

Dataset source

housing.csv