Certification - Data Scientist Associate - University Enrollment
University Enrollment Analysis and Model Training
Task 1
The dataset contains 1850 rows and 8 columns with missing values before cleaning. I have validated all the columns against the criteria in the dataset table:
- course_id: Same as description with no missing values.
- course_type: Same as description with no missing values.
- year: Same as description with no missing values.
- enrollment_count: Same as description with no missing values.
- pre_score: Same as description, but some missing values are represented as '-', so I replaced them with 0.
- post_score: 150+ missing values, so I replaced them with 0.
- pre_requirement: 50+ missing values, so I replaced them with "None".
- department: Contains 5 unique values instead of 4. The extra value, 'Math', duplicates the existing 'Mathematics', so I replaced 'Math' with 'Mathematics'.
After validation, the data contains 1850 rows and 8 columns.
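The checks and replacements above can be sketched on a small frame. This is a minimal illustration, not the exact validation code: the toy values are invented, the real CSV is not loaded, and it assumes missing pre_score entries appear as the string '-'.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; all values are invented for illustration.
toy = pd.DataFrame({
    "pre_score": ["71.5", "-", "80.0"],
    "post_score": [75.0, np.nan, 82.0],
    "pre_requirement": ["Beginner", np.nan, "Intermediate"],
    "department": ["Math", "Mathematics", "Science"],
})

# Count the problems before cleaning.
missing_before = toy.isna().sum()
dash_count = (toy["pre_score"] == "-").sum()

# Apply the replacements described in the list above.
clean = toy.copy()
# '-' becomes "0" and the column is then cast to float, which gives the
# same result as replacing with 0 and casting afterwards.
clean["pre_score"] = clean["pre_score"].replace("-", "0").astype(float)
clean["post_score"] = clean["post_score"].fillna(0)
clean["pre_requirement"] = clean["pre_requirement"].fillna("None")
clean["department"] = clean["department"].replace("Math", "Mathematics")

print(missing_before)
print(clean["department"].unique())
print(clean.shape)
```

Comparing the shape before and after is a quick way to confirm that the cleaning replaced values rather than dropping rows.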
Task 2
- Enrollment counts follow a bimodal distribution.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
ue = pd.read_csv('university_enrollment_2306.csv')
ue['department'] = ue['department'].replace('Math', 'Mathematics')
ue['pre_requirement'] = ue['pre_requirement'].fillna("None")
ue['post_score'] = ue['post_score'].fillna(0)
ue['pre_score'] = ue['pre_score'].replace('-', 0)
# ..
sns.set_style('darkgrid')
# histplot replaces distplot, which is deprecated in recent seaborn releases
sns.histplot(ue['enrollment_count'], kde=False, color='blue')
Task 3
- Online class has the most observations.
- The observations are not balanced. Online courses make up a significantly higher proportion than classroom courses.
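To put a number on the imbalance, the share of each course type can be computed directly. A minimal sketch on invented labels; the category names 'online' and 'classroom' are assumptions about how the column is coded.

```python
import pandas as pd

# Invented labels standing in for the real course_type column.
course_type = pd.Series(["online"] * 3 + ["classroom"], name="course_type")

# Absolute counts per category.
counts = course_type.value_counts()
# Relative shares per category (fractions that sum to 1).
shares = course_type.value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data, the same two calls on ue['course_type'] would give the counts behind the bar chart and the proportions behind the balance claim.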
sns.countplot(x='course_type', data=ue)
plt.title('No of course for each course type')
Task 4
- Median enrollment_count is higher in online course_type.
Inspecting the relationship between course type and enrollment count.
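The code for this step is not shown, so the following is only a sketch of one way to check the median claim. The toy values and the category names 'online' and 'classroom' are invented for illustration.

```python
import pandas as pd

# Invented data standing in for the real course_type and enrollment_count columns.
toy = pd.DataFrame({
    "course_type": ["online", "online", "online",
                    "classroom", "classroom", "classroom"],
    "enrollment_count": [250, 260, 240, 170, 180, 160],
})

# Median enrollment per course type.
medians = toy.groupby("course_type")["enrollment_count"].median()
print(medians)
```

A box plot, for example sns.boxplot(x='course_type', y='enrollment_count', data=ue), would show the same comparison visually.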
Task 5
- Predicting the number of students enrolling in a course means predicting a numeric value, so this is a regression problem.
Make changes to enable modeling
Finally, to enable model fitting, I have made the following changes:
- Used label encoding to convert the course_type categorical variable into numerical format.
- Converted the pre_score column to float dtype using astype(float) to ensure the numerical values are stored as floating-point numbers.
- Used label encoding to convert the pre_requirement ordinal categorical variable into numerical format.
- The department column contains four values (math, tech, science, engineering), so I used one-hot encoding to convert this categorical variable into numerical format.
from sklearn.preprocessing import LabelEncoder
# Create a label encoder instance
label_encoder = LabelEncoder()
# Perform label encoding for the 'course_type' column
ue['course_type_encoded'] = label_encoder.fit_transform(ue['course_type'])
# Convert the 'pre_score' column to float dtype
ue['pre_score'] = ue['pre_score'].astype(float)
# Perform label encoding for the 'pre_requirement' column
pre_requirement_mapping = {'Beginner': 0, 'None': 1, 'Intermediate': 2}
ue['pre_requirement_encoded'] = ue['pre_requirement'].map(pre_requirement_mapping)
# One-hot encoding to convert `department` column
df_ue_encoded = pd.get_dummies(ue, columns=['department'])
# Import ML models and performance metrics
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
X = df_ue_encoded.drop(['course_id', 'course_type', 'enrollment_count', 'pre_requirement'], axis=1)
y = df_ue_encoded[['enrollment_count']]
# Split dataset into 80% training set and 20% test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1848)
Task 6
Baseline Model - Linear Regression Model
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
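The imports above include mean_squared_error and numpy, which suggests the baseline is scored with RMSE; that is an assumption, since this section does not state the metric. A minimal sketch of that evaluation on synthetic data (X_demo, y_demo and the other names are hypothetical and unrelated to the enrollment data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression problem: linear signal plus small Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Same 80/20 split pattern as above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit the baseline and predict on the held-out set.
model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# RMSE is the square root of the mean squared error,
# expressed in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_te, y_hat))
print(f"RMSE: {rmse:.3f}")
```

With the real data, the same last step would be applied to y_test and y_pred_lr, giving an error in units of enrolled students.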