Imagine you are part of a data science team working for an educational institution. The team is tasked with developing a predictive model that identifies whether a student is likely to pass or fail. Such a model can provide valuable insight into student performance and help design targeted interventions to support struggling students.

Get the data from https://archive.ics.uci.edu/dataset/320/student+performance

!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
student_performance = fetch_ucirepo(id=320)

# data (as pandas dataframes)
X = student_performance.data.features
y = student_performance.data.targets

# metadata
print(student_performance.metadata)

# variable information
print(student_performance.variables)

A little glimpse of the data:

import pandas as pd

# Combine the features and targets into a single DataFrame
df = pd.concat([X, y], axis=1)
df.head()
# There are no NA values in the data, but many variables are categorical and must be encoded before they can be used in a model
df.info()
# The min and max values of every numerical column fall within the expected ranges
df.describe()
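
The two observations in the comments above can also be checked programmatically instead of read off the `info()` output. The following is a minimal sketch on a small stand-in frame: the column names are borrowed from the dataset, but the values are invented for illustration, and `toy`, `n_missing`, and `to_encode` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "sex": ["F", "M", "F"],
    "age": [18, 17, 15],
    "G3": [11, 13, 9],
})

# Total number of missing values across the whole frame
n_missing = int(toy.isna().sum().sum())

# Non-numeric columns, i.e. the ones that need encoding before modeling
to_encode = toy.select_dtypes(exclude="number").columns.tolist()

print(n_missing)
print(to_encode)
```

Running the same two expressions on `df` should give the count of missing values and the list of columns to encode.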

Correlation Analysis

# Correlation matrix of the numeric explanatory variables.
# There are no strong correlations among them.
import seaborn as sns
numeric_corr = X.select_dtypes(include='number').corr()
sns.heatmap(numeric_corr)
numeric_corr
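
A heatmap is easy to misread when there are many columns, so it can help to rank the variable pairs by the strength of their correlation. This is a sketch under assumptions: the frame below is a made-up stand-in for the numeric part of `X` (real column names, invented values), and `toy`, `pairs`, and `strongest` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data standing in for X; the values are invented
toy = pd.DataFrame({
    "age": [15, 16, 17, 18, 19, 20],
    "studytime": [2, 1, 3, 2, 1, 2],
    "absences": [0, 4, 2, 6, 10, 8],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 also drops the diagonal),
# so every pair of variables appears exactly once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first
strongest = pairs.abs().sort_values(ascending=False)
print(strongest.head())
```

If the largest absolute value in this ranking is modest, that supports the reading that no pair of explanatory variables is strongly correlated.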

Getting the dummies for the model

import warnings

# Ignore all future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# One-hot encode the categorical columns, dropping the first level of each;
# dtype=int produces 0/1 columns directly instead of booleans
X_dummies = pd.get_dummies(X, drop_first=True, dtype=int)
X_dummies.head()
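# Binary target: 1 if the final grade G3 is above 12, otherwise 0
G3 = (y['G3'] > 12).astype(int)
# Proportion of each class
G3.value_counts(normalize=True)
# Keep the target as a NumPy array
G3 = G3.values
G3[:5]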
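
To see what the encoding and the threshold do, it may help to run both steps on a tiny example. The sketch below uses invented values, and `toy_X`, `toy_grades`, `encoded`, and `labels` are hypothetical names. It assumes `dtype=int` is passed to `pd.get_dummies` so the indicator columns come out as 0/1.

```python
import pandas as pd

# Hypothetical features and grades; the values are invented for illustration
toy_X = pd.DataFrame({
    "school": ["GP", "MS", "GP", "MS"],
    "age": [15, 16, 17, 18],
})
toy_grades = pd.Series([10, 13, 12, 15], name="G3")

# drop_first=True keeps a single indicator for a two-level category:
# "school" becomes "school_MS", and "GP" is the implicit baseline
encoded = pd.get_dummies(toy_X, drop_first=True, dtype=int)

# Strict inequality: a grade of exactly 12 is labeled 0
labels = (toy_grades > 12).astype(int)

print(encoded.columns.tolist())
print(labels.tolist())
```

Dropping the first level avoids carrying a redundant column, since the omitted level is implied when all of its sibling indicators are 0.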