Detect Outliers in Dataset

    Detect Outliers

    Some datasets used for training or testing a model contain outliers. Because such points are unlikely to appear in real-world data, they can distort the fitted model or skew the test results. To prevent this, you can detect and then remove these outliers with the Isolation Forest method.
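
    As a minimal sketch of what the method does, the example below fits an Isolation Forest to synthetic two-dimensional data. The data, the five hand-placed distant points, and the variable names are assumptions made for illustration; they are not part of the housing dataset.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (an assumption for illustration, not the housing dataset):
# 200 points clustered around the origin plus 5 hand-placed distant points.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
distant = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0], [-8.0, -9.0], [0.0, 12.0]])
X = np.vstack([cluster, distant])

# contamination is the assumed fraction of outliers in the data.
forest = IsolationForest(contamination=5 / 205, random_state=0)
labels = forest.fit_predict(X)   # +1 for inliers, -1 for outliers

n_flagged = int((labels == -1).sum())
n_distant_flagged = int((labels[-5:] == -1).sum())
print(f"Flagged {n_flagged} rows; {n_distant_flagged} of the 5 distant points are among them")
```

    Points far from the bulk of the data can be separated by very few random splits, which is why the model tends to flag them first.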

    # Load packages
    import numpy as np 
    import pandas as pd
    import plotly.express as px
    import matplotlib.pyplot as plt
    from sklearn.ensemble import IsolationForest
    from sklearn.model_selection import train_test_split

    # Load data from the csv file
    df = pd.read_csv("housing.csv", index_col=False)
    df.head()

    After the data is loaded, you can build and apply an Isolation Forest outlier detector using the IsolationForest class from the scikit-learn (sklearn) package.

    CONTAMINATION = 0.2   # Expected proportion of outliers in the data, in the range (0, 0.5]
    BOOTSTRAP = False     # True to sample the training data for each tree with replacement
    
    # Extract the data without column names from the dataframe
    data = df.values
    
    # We will start by splitting the data into X/y train/test data
    X, y = data[:, :-1], data[:, -1]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    
    # Next, we will set up the Isolation Forest classifier that will detect the outliers
    i_forest = IsolationForest(contamination=CONTAMINATION, bootstrap=BOOTSTRAP)
    is_inlier = i_forest.fit_predict(X_train)    # +1 if inlier, -1 if outlier
    
    # Finally, we will build a mask that selects the inlier rows
    mask = is_inlier != -1
    # and keep only those rows in the train data
    X_train, y_train = X_train[mask, :], y_train[mask]
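
    fit_predict returns only the +1/-1 labels. As a hedged sketch, the example below uses random stand-in arrays (X_demo and y_demo are hypothetical names, not the housing data) to show that the same mask keeps features and target aligned, and that decision_function exposes the underlying scores, where negative values correspond to rows labelled as outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical stand-ins for X_train and y_train: 100 rows, 3 feature columns.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

forest = IsolationForest(contamination=0.2, random_state=1)
labels = forest.fit_predict(X_demo)          # +1 inlier, -1 outlier
scores = forest.decision_function(X_demo)    # negative scores are labelled as outliers

# The same boolean mask filters features and target, so rows stay aligned.
mask = labels == 1
X_clean, y_clean = X_demo[mask], y_demo[mask]

# Indices of the five rows the model considers most anomalous.
most_anomalous = np.argsort(scores)[:5]
print(X_clean.shape, y_clean.shape, most_anomalous)
```

    Looking at the scores rather than only the labels can help when deciding whether borderline rows should really be dropped.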

    Before you write the data without outliers to a new CSV file, you should first check the result by comparing two boxplots of the average number of rooms (RM) column, one with and one without the outliers.

    df_wo_outliers = pd.DataFrame(X_train, columns=df.columns[:-1])
    df_boxplot = pd.DataFrame(data={'With outliers': df['RM'], 'Without outliers': df_wo_outliers['RM']})
    fig = px.box(df_boxplot)
    fig.show()

    As you can see, the vast majority of the outliers have been removed from the RM column. You can run the same check on other columns. If you want more or fewer outliers to be removed, increase or decrease CONTAMINATION.
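
    The effect of the contamination setting can be sketched on synthetic data. The random feature matrix and the three values tried below are assumptions chosen for illustration; substitute your own data and values.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (an assumption; substitute your own feature matrix).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(500, 4))

flagged = {}
for contamination in (0.05, 0.1, 0.2):
    forest = IsolationForest(contamination=contamination, random_state=42)
    labels = forest.fit_predict(X_demo)
    # Share of rows labelled -1 (outlier) for this setting.
    flagged[contamination] = float(np.mean(labels == -1))

for contamination, share in flagged.items():
    print(f"contamination={contamination:.2f} -> {share:.1%} of rows flagged")
```

    The share of flagged training rows tracks the contamination value closely, so the parameter acts as a direct control over how many rows are removed, regardless of how anomalous those rows actually are.
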
    You can now write this dataframe, which holds the feature columns of the cleaned training split, to a CSV file.

    # index=False avoids writing the row index as an extra unnamed column
    df_wo_outliers.to_csv('housing_wo_outliers.csv', index=False)

    Dataset source

    housing.csv