Detect Outliers in Dataset

Detect Outliers

Datasets used to train or test a model often contain outliers. Because these points are unlikely to appear in real-world data, they can distort the fitted model or skew the test results. To prevent this, you can detect and then remove the outliers with the Isolation Forest method.

# Load packages
import numpy as np
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

After the packages are loaded, read the data into a DataFrame stored in the df variable.

# Load data from the csv file
df = pd.read_csv("housing.csv", index_col=False)
df.head()

After the data is loaded, you can build and apply the Isolation Forest model with the IsolationForest class from the sklearn package.

CONTAMINATION = 0.2   # The expected proportion of outliers in the data set
BOOTSTRAP = False     # True to sample the training data for each tree with replacement

# Extract the data without column names from the dataframe
data = df.values

# We will start by splitting the data into X/y train/test data
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Next, we will set up the Isolation Forest classifier that will detect the outliers
i_forest = IsolationForest(contamination=CONTAMINATION, bootstrap=BOOTSTRAP)
is_inlier = i_forest.fit_predict(X_train)    # +1 if inlier, -1 if outlier

# Finally, we will build a mask that selects the inlier rows
mask = is_inlier != -1
# and keep only those rows in the train data
X_train, y_train = X_train[mask, :], y_train[mask]
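
Before filtering, it can help to see how many rows the forest flags. The sketch below is only an illustration on synthetic data: X_demo, the 0.05 contamination value, and the fixed random_state are assumptions, not part of the housing example. It counts the flagged rows and retrieves the scores that decision_function returns, where negative values correspond to outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for X_train (an assumption for this sketch):
# 200 points near the origin plus 10 points far away from them
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(200, 2)),
    rng.normal(loc=8.0, scale=0.5, size=(10, 2)),
])

# random_state is fixed here only to make the sketch reproducible
forest = IsolationForest(contamination=0.05, random_state=0)
labels = forest.fit_predict(X_demo)         # +1 if inlier, -1 if outlier
scores = forest.decision_function(X_demo)   # negative scores correspond to outliers

n_outliers = int((labels == -1).sum())
print(f"Flagged {n_outliers} of {len(X_demo)} rows as outliers")
```

The same count on the real data would be (is_inlier == -1).sum(), computed right after fit_predict.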

Before you write this data without outliers to a new csv file, you should first check the result for the average number of rooms (RM) column with two boxplots.

df_wo_outliers = pd.DataFrame(X_train, columns=df.columns[:-1])
df_boxplot = pd.DataFrame(data={'With outliers': df['RM'], 'Without outliers': df_wo_outliers['RM']})
fig = px.box(df_boxplot)
fig.show()

As you can see, the vast majority of the outliers are removed from the RM column. You can check other columns in the same way. If you want more or fewer outliers to be removed, you can increase or decrease CONTAMINATION.
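
To illustrate how the contamination setting drives the share of removed rows, here is a minimal sketch on synthetic data. X_demo, the three contamination values, and the fixed random_state are assumptions chosen for the illustration. The fraction of rows flagged as outliers should track the contamination value closely.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))   # synthetic stand-in for X_train (assumption)

removed = {}
for contamination in (0.05, 0.1, 0.2):
    # random_state is fixed only so that the runs are comparable
    forest = IsolationForest(contamination=contamination, random_state=1)
    labels = forest.fit_predict(X_demo)   # +1 if inlier, -1 if outlier
    removed[contamination] = float((labels == -1).mean())
    print(f"contamination={contamination}: removed {removed[contamination]:.1%} of rows")
```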
You can now write this dataframe, which holds the feature columns of the filtered train data, to a csv file.

df_wo_outliers.to_csv('housing_wo_outliers.csv')
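
The call above writes only the feature columns and also stores the row index as an extra column. If you would rather keep the target column and leave the index out, one possible variant is sketched below. The small arrays, the column names, and the temporary output path are hypothetical stand-ins for the filtered X_train and y_train, not values from housing.csv.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny hypothetical stand-ins for the filtered X_train / y_train
X_clean = np.array([[6.5, 15.3], [6.4, 17.8], [7.2, 17.8]])
y_clean = np.array([24.0, 21.6, 34.7])
columns = ["RM", "PTRATIO", "MEDV"]   # assumed layout: features first, target last

df_clean = pd.DataFrame(X_clean, columns=columns[:-1])
df_clean[columns[-1]] = y_clean       # put the target column back

# Write to a temporary directory so the sketch does not touch real files
out_path = os.path.join(tempfile.mkdtemp(), "housing_wo_outliers.csv")
df_clean.to_csv(out_path, index=False)   # index=False leaves out the row index

# Read the file back to confirm the column layout
df_check = pd.read_csv(out_path)
print(df_check.columns.tolist())
```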

Dataset source

housing.csv
