Feature Selection and Model Interpretation (XGBClassifier, ELI5, SHAP)
Imagine you are part of a data science team working for an educational institution. The team is tasked with developing a predictive model that identifies students who are likely to pass or fail their final grade. Such a model can provide valuable insight into student performance and help design targeted interventions to support struggling students.
Get the data from https://archive.ics.uci.edu/dataset/320/student+performance
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo
# fetch dataset
student_performance = fetch_ucirepo(id=320)
# data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets
# metadata
print(student_performance.metadata)
# variable information
print(student_performance.variables)
A little glimpse of the data:
import pandas as pd
# Combine the features and targets into a single DataFrame
df = pd.concat([X, y], axis=1)
df.head()
# There are no missing values, but many categorical variables must be encoded before they can be used in a model
df.info()
# The min and max of every numeric column fall within the expected range
df.describe()
Correlation Analysis
# No strong correlations appear among the numeric explanatory variables.
import seaborn as sns
# 'number' selects every numeric dtype, regardless of platform integer width
numeric_corr = X.select_dtypes(include='number').corr()
sns.heatmap(numeric_corr)
numeric_corr
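A heatmap can hide which pairs are actually the strongest, so it may help to rank the pairwise correlations explicitly. The following is a minimal sketch on a small made-up frame (the values in `toy` are hypothetical and only stand in for the numeric student features); the same steps could be applied to the numeric columns of `X`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the numeric student features
toy = pd.DataFrame({
    "age": [15, 16, 17, 18, 19, 20],
    "studytime": [1, 2, 2, 3, 4, 4],
    "absences": [10, 2, 7, 1, 8, 3],
})

# Absolute correlations, so strong negative relationships rank high too
corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
```

If the largest value in `pairs` is small, that supports the reading that no pair of numeric features is strongly correlated.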
Getting the dummies for the model
import warnings
# Ignore all future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
# One-hot encode the categorical columns as 0/1 integers,
# dropping the first level of each variable to avoid redundant columns
X_dummies = pd.get_dummies(X, drop_first=True, dtype=int)
X_dummies.head()
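To make the effect of `drop_first=True` concrete, here is a small illustrative sketch. The frame `toy` is hypothetical; its column names mirror a few columns of the dataset, but the rows are made up.

```python
import pandas as pd

# Hypothetical rows; only the column names are borrowed from the dataset
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "sex": ["F", "M", "M"],
    "age": [15, 16, 17],
})

# Each two-level categorical becomes a single 0/1 column;
# the dropped first level ("GP", "F") is implied when that column is 0
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded)
```

Numeric columns such as `age` pass through unchanged, while each categorical column with k levels yields k - 1 indicator columns.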
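# Binary target: 1 if the final grade G3 is above 12, otherwise 0
G3 = (y['G3'] > 12).astype(int)
# Class balance of the binary target
print(G3.value_counts(normalize=True))
G3 = G3.values
G3[:5]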
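The binarization step can be checked on a few hand-picked values. This is a sketch with hypothetical grades, using the same strict `> 12` threshold as the code above, so a grade of exactly 12 falls in the negative class.

```python
import pandas as pd

# Hypothetical final grades on a 0-20 scale
grades = pd.Series([8, 12, 13, 15, 10, 18], name="G3")

# Strictly greater than 12 is the positive class
passed = (grades > 12).astype(int)

# Share of each class, useful for spotting class imbalance before modeling
balance = passed.value_counts(normalize=True)

print(passed.tolist())
print(balance)
```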