Certification - Data Scientist Associate - University Enrollment

University Enrollment Analysis and Model Training

Task 1

Before cleaning, the dataset contains 1850 rows and 8 columns, some of which have missing values. I validated every column against the criteria in the dataset table:

course_id: Same as description with no missing values.
course_type: Same as description with no missing values.
year: Same as description with no missing values.
enrollment_count: Same as description with no missing values.
pre_score: Same as description, but some missing values are represented as '-', so I replaced them with 0.
post_score: 150+ missing values, so I replaced them with 0.
pre_requirement: 50+ missing values, so I replaced them with "None".
department: Contains 5 unique values instead of 4. The extra value, 'Math', is a variant of the existing 'Mathematics', so I replaced 'Math' with 'Mathematics'.

After validation, the data contains 1850 rows and 8 columns.
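The checks behind this validation can be sketched roughly as follows. This is a minimal illustration on a small made-up DataFrame (`toy` and all of its values are assumptions, not the real enrollment data); it shows how NaN counts, the '-' placeholder, and near-duplicate category labels might be detected before applying the replacements described above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (all values are illustrative only)
toy = pd.DataFrame({
    'course_type': ['online', 'classroom', 'online', 'online'],
    # object dtype mimics a numeric column polluted by a '-' placeholder
    'pre_score': pd.Series(['55.1', '-', '70.3', '62.0'], dtype=object),
    'post_score': [73.0, np.nan, 81.0, np.nan],
    'pre_requirement': ['Beginner', np.nan, 'Intermediate', np.nan],
    'department': ['Math', 'Mathematics', 'Science', 'Technology'],
})

# Count true missing values (NaN) per column
missing_per_column = toy.isna().sum()

# Count placeholder strings, which isna() does not catch
dash_count = (toy['pre_score'] == '-').sum()

# List the distinct categories to spot near-duplicates such as 'Math'
departments_before = sorted(toy['department'].unique())

# Apply the same cleaning steps described in the validation notes
toy['department'] = toy['department'].replace('Math', 'Mathematics')
toy['pre_requirement'] = toy['pre_requirement'].fillna('None')
toy['post_score'] = toy['post_score'].fillna(0)
toy['pre_score'] = toy['pre_score'].replace('-', 0)

departments_after = sorted(toy['department'].unique())

print(missing_per_column)
print(dash_count, departments_before, departments_after)
```

Running the same three checks on the real DataFrame before and after cleaning is one way to confirm that the row count is unchanged and that no NaN or '-' values remain.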

Task 2

Enrollment counts follow a bimodal distribution.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

ue = pd.read_csv('university_enrollment_2306.csv')
ue['department'] = ue['department'].replace('Math', 'Mathematics')
ue['pre_requirement'] = ue['pre_requirement'].fillna("None")
ue['post_score'] = ue['post_score'].fillna(0)
ue['pre_score'] = ue['pre_score'].replace('-', 0)

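# Plot the distribution of enrollment counts as a histogram
sns.set_style('darkgrid')
# histplot replaces the deprecated distplot(..., kde=False)
sns.histplot(ue['enrollment_count'], color='blue')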

Task 3

Online courses have the most observations.
The observations are not balanced: online courses make up a significantly higher proportion than classroom courses.

sns.countplot(x='course_type', data=ue)
plt.title('Number of courses for each course type')
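To put a number on the imbalance rather than reading it off the bars, the class shares can be computed directly. The sketch below uses a small made-up Series (`course_type` and its counts are assumptions for illustration, not the real data); on the actual DataFrame the same calls would be applied to `ue['course_type']`.

```python
import pandas as pd

# Toy stand-in for ue['course_type'] (counts are illustrative, not the real data)
course_type = pd.Series(['online'] * 3 + ['classroom'], name='course_type')

# Absolute counts per course type, the same numbers countplot draws as bars
counts = course_type.value_counts()

# Share of each course type, which makes the imbalance explicit
proportions = course_type.value_counts(normalize=True)

print(counts)
print(proportions.round(2))
```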

Task 4

Median enrollment_count is higher for the online course_type.

Inspecting the relationship between course type and enrollment count.

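The original cell for this task is not shown, so the following is only a sketch of one way the comparison might be made, not the author's code. It uses a made-up DataFrame (`toy` and its values are assumptions) and compares medians per course type; on the real data, a call such as `sns.boxplot(x='course_type', y='enrollment_count', data=ue)` would show the same comparison graphically.

```python
import pandas as pd

# Toy stand-in for the enrollment data (values are illustrative only)
toy = pd.DataFrame({
    'course_type': ['online', 'online', 'online',
                    'classroom', 'classroom', 'classroom'],
    'enrollment_count': [250, 260, 245, 170, 180, 175],
})

# Median enrollment per course type: the line a boxplot draws inside each box
median_by_type = toy.groupby('course_type')['enrollment_count'].median()

# Fuller summary (count, mean, quartiles) for each course type
summary = toy.groupby('course_type')['enrollment_count'].describe()

print(median_by_type)
print(summary)
```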

Task 5

Since predicting the number of students enrolling in a course involves predicting a continuous value (a number), it falls under the regression category.

Make changes to enable modeling

Finally, to enable model fitting, I have made the following changes:

Used label encoding to convert the course_type nominal categorical variable into numerical format.
Converted the pre_score column to float dtype using astype(float) so that its values are stored as floating-point numbers.
Used a manual integer mapping to convert the pre_requirement ordinal categorical variable into numerical format.
The department column contains four values (math, tech, science, engineering), so I used one-hot encoding to convert this categorical variable into numerical format.

from sklearn.preprocessing import LabelEncoder

# Create a label encoder instance
label_encoder = LabelEncoder()

# Perform label encoding for the 'course_type' column
ue['course_type_encoded'] = label_encoder.fit_transform(ue['course_type'])

# Convert the 'pre_score' column to float dtype
ue['pre_score'] = ue['pre_score'].astype(float)

# Map 'pre_requirement' to integer codes that follow its natural order
# (None < Beginner < Intermediate)
pre_requirement_mapping = {'None': 0, 'Beginner': 1, 'Intermediate': 2}
ue['pre_requirement_encoded'] = ue['pre_requirement'].map(pre_requirement_mapping)

# One-hot encoding to convert `department` column
df_ue_encoded = pd.get_dummies(ue, columns=['department'])

# Import ML models and performance metrics
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split 
from sklearn.metrics import mean_squared_error
import numpy as np

X = df_ue_encoded.drop(['course_id', 'course_type', 'enrollment_count', 'pre_requirement'], axis=1)
y = df_ue_encoded['enrollment_count']  # 1-D target, as scikit-learn estimators expect

# Split dataset into 80% training set and 20% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1848)
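Before fitting, it can be worth confirming that every feature is numeric and free of NaN, because `Series.map` returns NaN for any label that is absent from the mapping. The sketch below illustrates that check on a made-up DataFrame; `toy`, the extra label 'Advanced', and the integer codes in `mapping` are all assumptions chosen to trigger the problem, and the codes themselves do not matter for the check.

```python
import pandas as pd

# Toy stand-in for the encoded data; 'Advanced' is a hypothetical label
# that is deliberately missing from the mapping below
toy = pd.DataFrame({
    'pre_requirement': ['None', 'Beginner', 'Intermediate', 'Advanced'],
    'department': ['Mathematics', 'Science', 'Mathematics', 'Technology'],
})

# Illustrative mapping; the integer codes are irrelevant to this check
mapping = {'None': 0, 'Beginner': 1, 'Intermediate': 2}
toy['pre_requirement_encoded'] = toy['pre_requirement'].map(mapping)

# Labels that were not in the mapping come back as NaN
unmapped = toy.loc[toy['pre_requirement_encoded'].isna(),
                   'pre_requirement'].unique().tolist()

# One-hot encode department and drop the original text column
encoded = pd.get_dummies(toy, columns=['department'])
X_check = encoded.drop(columns=['pre_requirement'])

# Columns that are neither numeric nor boolean would break model fitting
non_numeric_columns = X_check.select_dtypes(exclude=['number', 'bool']).columns.tolist()

# Columns that still contain NaN after encoding
nan_columns = X_check.columns[X_check.isna().any()].tolist()

print(unmapped, non_numeric_columns, nan_columns)
```

If either list is non-empty on the real feature matrix, the encoding step needs another look before the data is passed to a model.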

Task 6

Baseline Model - Linear Regression Model

# Fit the baseline linear regression model and predict on the test set
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
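The imports above include `mean_squared_error` and `numpy`, which suggests the predictions are scored with a root mean squared error (RMSE); that is an assumption, since the scoring step is not shown here. The sketch below demonstrates the calculation on synthetic data (`X_demo`, `y_demo`, and the coefficients are made up) and compares it with the RMSE of always predicting the training mean as a reference point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for X and y (not the enrollment data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)

# Same 80/20 split proportion as used for the enrollment data
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit the linear model and predict on the held-out rows
model = LinearRegression()
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# RMSE: square root of the mean squared error, in the units of the target
rmse = np.sqrt(mean_squared_error(y_te, y_hat))

# Reference point: RMSE of always predicting the training-set mean
mean_prediction = np.full_like(y_te, y_tr.mean())
rmse_mean_only = np.sqrt(mean_squared_error(y_te, mean_prediction))

print(rmse, rmse_mean_only)
```

On the real data, the equivalent call would be `np.sqrt(mean_squared_error(y_test, y_pred_lr))`, and the same metric can then be used to compare the baseline against any other model.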
