My project for Data Science Exam (final touch)

KNN Imputation for Missing Values

Overview

KNN imputation fills each missing value using the K nearest neighbors (the most similar records) of the observation that contains it. scikit-learn's KNNImputer, used here, averages the neighbors' values for that feature, either uniformly or weighted by distance. Because the fill value comes from similar records rather than from the whole column, it can preserve relationships between features that simple mean or median imputation ignores.
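
To make the idea concrete, here is a minimal sketch on a made-up four-row array (the values are illustrative and not taken from the recipe dataset). Row 1 lacks its second feature; its two nearest rows on the observed feature are rows 0 and 2, so the gap is filled with the mean of their second-feature values.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data (hypothetical): row 1 is missing its second feature.
X = np.array([
    [1.0, 2.0],
    [1.1, np.nan],
    [0.9, 4.0],
    [8.0, 9.0],
])

# With n_neighbors=2, the gap is filled with the mean of that feature
# over the two rows closest to row 1 on the feature it does have.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[1, 1])  # mean of 2.0 and 4.0
```

The distant row (8.0, 9.0) does not contribute, which is the difference from column-mean imputation.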

Steps to Implement KNN Imputation

1. Prepare the Data

Import necessary libraries.
Load the dataset.
Identify numerical and categorical columns.
Convert non-numeric columns to numeric as needed.

2. Preprocess the Data

Encode categorical variables.
Standardize numerical features.

3. Apply KNN Imputation

Use KNNImputer from sklearn.impute.
Impute missing values in the dataset.

4. Inverse Transform Encoded Variables (if necessary)

Convert any previously encoded variables back to their original form, if needed for EDA.

5. Proceed to EDA

Analyze the imputed dataset.
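
Step 2 comes before Step 3 because KNNImputer picks neighbors by distance, and an unscaled feature with a large range can dominate that distance. The sketch below illustrates this on a made-up three-row array (an assumption for illustration, not the recipe data). It also relies on the fact that StandardScaler ignores NaN when fitting and leaves it in place when transforming.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: column 0 has a large range, column 1 a small one,
# and the last row is missing column 2.
X = np.array([
    [100.0, 1.0, 10.0],
    [110.0, 9.0, 90.0],
    [500.0, 1.1, np.nan],
])

# Unscaled: the large-range column 0 dominates the distance,
# so the single nearest neighbor is chosen almost entirely by it.
raw_fill = KNNImputer(n_neighbors=1).fit_transform(X)[2, 2]

# Scaled: each column contributes on a comparable scale. The NaN is
# skipped when the scaler is fitted and is still NaN after transform.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_imputed = KNNImputer(n_neighbors=1).fit_transform(X_scaled)
scaled_fill = scaler.inverse_transform(X_imputed)[2, 2]

print(raw_fill, scaled_fill)
```

On this toy array the unscaled run borrows the value from row 1, while the scaled run borrows it from row 0, whose column 1 value is almost identical to the incomplete row's.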

Detailed Implementation

Step 1: Prepare the Data

a. Import Necessary Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

b. Load the Dataset

# If data is in a CSV file

df = pd.read_csv('recipe_site_traffic_2212.csv')

df

df.isnull().sum()
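
Raw counts are easier to judge next to percentages. The following is a small sketch on a hypothetical stand-in frame (the column names mirror the recipe data, the values are made up) that reports both and lists only the columns that actually have gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are made up.
toy = pd.DataFrame({
    "calories": [35.5, np.nan, 914.3, 97.0],
    "sugar": [0.7, np.nan, np.nan, 38.6],
    "category": ["Pork", "Potato", "Breakfast", "Beverages"],
})

# Count and percentage of missing entries per column.
missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean().mul(100).round(1),
})

# Keep only columns that have gaps, worst first.
missing = missing[missing["n_missing"] > 0].sort_values("n_missing", ascending=False)
print(missing)
```

The same two expressions can be applied to df directly once it is loaded.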

c. Identify Numerical and Categorical Columns

# List of numerical columns
numeric_cols = ['calories', 'carbohydrate', 'sugar', 'protein']

# List of categorical columns
categorical_cols = ['category', 'servings', 'high_traffic']

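# Clean 'servings': keep the first run of digits in each entry and convert to float
# (the raw string avoids an invalid-escape warning; expand=False returns a Series)
df['servings'] = df['servings'].astype(str).str.extract(r'(\d+)', expand=False).astype(float)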

Step 2: Preprocess the Data

# Encode 'category' column
label_encoder_category = LabelEncoder()
df['category_encoded'] = label_encoder_category.fit_transform(df['category'])
print(df)

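# Encode 'high_traffic' column
# LabelEncoder assigns integer codes in sorted label order, so 'High' becomes 0
# and any label that sorts after it (e.g. 'Low') gets a higher code.
# Note: astype(str) turns missing entries into the string 'nan', which is
# encoded as its own class, so these entries are not imputed.
label_encoder_traffic = LabelEncoder()
df['high_traffic_encoded'] = label_encoder_traffic.fit_transform(df['high_traffic'].astype(str))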

# Now we can consider 'category_encoded', 'servings', and 'high_traffic_encoded' as numerical features.
numeric_cols.extend(['category_encoded', 'servings', 'high_traffic_encoded'])
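
The codes that LabelEncoder assigns are worth checking on a small example, because they follow sorted label order rather than any intended meaning. This sketch uses a hypothetical four-entry column; the explicit mapping at the end is an alternative shown for comparison, not something the code above does.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column with one missing entry.
s = pd.Series(["High", np.nan, "High", "Low"])

# astype(str) turns the missing entry into the string 'nan';
# LabelEncoder then sorts the labels and numbers them in that order.
le = LabelEncoder()
codes = le.fit_transform(s.astype(str))
print(list(le.classes_))
print(codes.tolist())

# Alternative (assumption, not the approach used above): an explicit
# mapping fixes the codes and keeps the missing entry as NaN.
explicit = s.map({"Low": 0, "High": 1})
print(explicit.tolist())
```

With the encoder, the missing entry becomes an ordinary class and is left untouched by KNNImputer; with the explicit mapping it stays NaN and would be imputed. Which behavior is appropriate depends on what a missing entry means in the data.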

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
df_scaled = df.copy()
df_scaled[numeric_cols] = scaler.fit_transform(df[numeric_cols])

Step 3: Apply KNN Imputation

# Initialize KNNImputer
imputer = KNNImputer(n_neighbors=5)

# Apply imputer to the scaled data
df_imputed_scaled = df_scaled.copy()
df_imputed_scaled[numeric_cols] = imputer.fit_transform(df_scaled[numeric_cols])

Step 4: Inverse Transform Encoded Variables

# Reverse scaling
df_imputed = df_imputed_scaled.copy()
df_imputed[numeric_cols] = scaler.inverse_transform(df_imputed_scaled[numeric_cols])

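# Reverse encoding for 'category' and 'high_traffic'.
# Round before casting: the scale/inverse-scale round trip can leave a code at,
# say, 2.9999999, which astype(int) alone would truncate to 2.
df_imputed['category'] = label_encoder_category.inverse_transform(df_imputed['category_encoded'].round().astype(int))

df_imputed['high_traffic'] = label_encoder_traffic.inverse_transform(df_imputed['high_traffic_encoded'].round().astype(int))

# Servings are whole numbers, so round any imputed averages
df_imputed['servings'] = df_imputed['servings'].round().astype(int)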

# Ensure no negative values in the nutrient columns after inverse transformation
# (numeric_cols is reset here to the four original nutrient columns)
numeric_cols = ['calories', 'carbohydrate', 'sugar', 'protein']
df_imputed[numeric_cols] = df_imputed[numeric_cols].clip(lower=0)
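
Before moving on to EDA, it can help to confirm that the scale, impute, inverse-scale round trip only touched the cells that were missing. The following is a minimal sketch of such a check on a hypothetical five-row frame; the names before and after and the tolerance 1e-9 are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the original frame; the values are made up.
before = pd.DataFrame({
    "calories": [100.0, 120.0, np.nan, 400.0, 380.0],
    "protein": [5.0, 6.0, 5.5, np.nan, 30.0],
    "sugar": [1.0, 1.5, 1.2, 9.0, 8.0],
})

# Same sequence as above: scale, impute, reverse the scaling.
scaler = StandardScaler()
scaled = scaler.fit_transform(before)
filled = KNNImputer(n_neighbors=2).fit_transform(scaled)
after = pd.DataFrame(scaler.inverse_transform(filled),
                     columns=before.columns, index=before.index)

observed = before.notna()

# 1. No gaps remain.
no_gaps = not after.isna().any().any()

# 2. Cells that were observed are unchanged up to floating-point round-off.
unchanged = np.allclose(after.where(observed), before, equal_nan=True)

# 3. Every value lies within the observed range of its column
#    (a small tolerance absorbs round-off from the round trip).
tol = 1e-9
in_range = ((after >= before.min() - tol) & (after <= before.max() + tol)).all().all()

print(no_gaps, unchanged, in_range)
```

The third check holds for mean-based KNN imputation because an average of observed values cannot fall outside their range; a failure would point to a problem in the scaling or inverse-scaling step.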
