## Data Scientist Associate Practical Exam Submission

Use this template to complete your analysis and write up your summary for submission.

### Task 1

Before cleaning, the dataset contains 1850 rows and 8 columns, three of which have missing values. I validated every column against the criteria in the dataset table:

- Course Id: Matches the description; no missing values.
- Course type: Matches the description; no missing values, 2 course types.
- Year: Matches the description; no missing values, 12 years.
- Enrollment Count: Matches the description; no missing values.
- Pre score: 50+ missing values recorded as '-', which I replaced with 0.
- Post score: 50+ missing values, which I replaced with 0.
- Pre requirement: 50+ missing values, which I replaced with "None".
- Department: Matches the description; no missing values, 5 departments.

After validation and cleaning, the dataset still contains 1850 rows and 8 columns.
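The cleaning steps listed above can be sketched on a tiny made-up sample. The values below are hypothetical and only mimic the three columns with missing entries; the column names are assumed to match those used in the modelling code.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample (NOT the exam data) with the three problem columns.
sample = pd.DataFrame({
    "pre_score": ["71.5", "-", "48.0"],
    "post_score": [80.0, np.nan, 65.0],
    "pre_requirement": ["Beginner", np.nan, "Intermediate"],
})

# '-' marks a missing pre_score: replace it, then convert the column to numbers.
sample["pre_score"] = pd.to_numeric(sample["pre_score"].replace("-", "0"))
# Missing post scores become 0, missing prerequisites become the label "None".
sample["post_score"] = sample["post_score"].fillna(0)
sample["pre_requirement"] = sample["pre_requirement"].fillna("None")

# After cleaning, no column should contain missing values.
missing_after = sample.isna().sum()
print(missing_after)
print(sample.dtypes)
```

Converting `pre_score` with `pd.to_numeric` matters because a column that contains '-' is read as text, and replacing the marker alone does not make it numeric.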

### Task 2

From the distribution of enrollment counts graph, the enrollment counts are right-skewed (positively skewed): most courses fall in the lower enrollment range, while relatively few courses have high enrollment counts.
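A visual reading of skewness can be backed by a number. The sketch below uses synthetic, deliberately right-skewed values (a log-normal draw, not the exam data) to show two quick checks: a positive sample skewness and a mean that exceeds the median.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts (log-normal draw); NOT the exam data.
rng = np.random.default_rng(0)
counts = pd.Series(rng.lognormal(mean=5.0, sigma=0.5, size=2000)).round()

# A positive skewness indicates a longer right tail.
skewness = counts.skew()
# In a right-skewed distribution the mean is pulled above the median.
mean_minus_median = counts.mean() - counts.median()

print(f"skewness: {skewness:.2f}")
print(f"mean - median: {mean_minus_median:.2f}")
```

On the real data, the same two checks could be applied to the `enrollment_count` column.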

### Task 3

a) From the count of courses by type graph, the online course type has the most observations.

b) In the same bar chart, the online bar is longer than the classroom bar, so the observations are not balanced across the course types.
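The balance between categories can also be expressed as counts and proportions rather than bar lengths. This is a minimal sketch on hypothetical labels; the 3-to-1 split is made up for illustration and says nothing about the real proportions.

```python
import pandas as pd

# Hypothetical course_type labels for illustration only.
course_type = pd.Series(["online"] * 3 + ["classroom"] * 1, name="course_type")

# Absolute counts per category (what the bar chart displays).
counts = course_type.value_counts()
# Share of each category; values far from an even split indicate imbalance.
shares = course_type.value_counts(normalize=True)

print(counts)
print(shares)
```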

### Task 4

In the visualization, the boxplot for the online course type sits higher on the vertical axis than the boxplot for the classroom course type. Online courses therefore tend to have higher enrollment counts, with a higher median enrollment than classroom courses.
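The medians that a boxplot displays can be computed directly with a group-by. The numbers below are invented purely to show the mechanics, and the column names are assumed to match those in the modelling code.

```python
import pandas as pd

# Made-up enrollment counts for illustration only.
df = pd.DataFrame({
    "course_type": ["online", "online", "online",
                    "classroom", "classroom", "classroom"],
    "enrollment_count": [250, 260, 240, 170, 180, 160],
})

# Median enrollment per course type: the centre line of each box.
medians = df.groupby("course_type")["enrollment_count"].median()
print(medians)
```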

### Task 5

Predicting the number of students who will enroll in a course is a **regression problem** in machine learning, because the target (enrollment count) is a numeric quantity rather than a category.

### Task 6

Baseline model - Linear Regression Model

```
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the data and apply the cleaning steps from Task 1
univ_df = pd.read_csv('university_enrollment_2306.csv')
# '-' marks a missing pre_score; replace it, then convert the column to numeric
univ_df['pre_score'] = pd.to_numeric(univ_df['pre_score'].replace('-', '0'))
univ_df['pre_requirement'] = univ_df['pre_requirement'].fillna('None')
univ_df['post_score'] = univ_df['post_score'].fillna(0)

# Convert categorical variables into numerical using one-hot encoding
df = pd.get_dummies(univ_df, columns=["course_type", "pre_requirement", "department"])

# Split the data into features (X) and target (y)
X = df.drop("enrollment_count", axis=1)
y = df["enrollment_count"]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Linear Regression Model Mean Squared Error: {mse}")
```

### Task 7

Comparison model - Gradient Boosting model

```
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Load the data and apply the cleaning steps from Task 1
univ_df = pd.read_csv('university_enrollment_2306.csv')
# '-' marks a missing pre_score; replace it, then convert the column to numeric
univ_df['pre_score'] = pd.to_numeric(univ_df['pre_score'].replace('-', '0'))
univ_df['pre_requirement'] = univ_df['pre_requirement'].fillna('None')
univ_df['post_score'] = univ_df['post_score'].fillna(0)

# Convert categorical variables into numerical using one-hot encoding
df = pd.get_dummies(univ_df, columns=["course_type", "pre_requirement", "department"])

# Split the data into features (X) and target (y)
X = df.drop("enrollment_count", axis=1)
y = df["enrollment_count"]

# Split the data into training and testing sets (same split as the baseline)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and fit a Gradient Boosting Regressor model
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the Mean Squared Error (MSE) of the predictions
mse = mean_squared_error(y_test, y_pred)
print(f"Gradient Boosting Model Mean Squared Error: {mse}")
```

### Task 8

I am choosing the **Linear Regression model** as the baseline model because it is simple, fast to train, and easy to interpret.
The comparison model I am choosing is the **Gradient Boosting model** because it can capture non-linear relationships and interactions between features, and so may provide better predictive performance.
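One way to keep the comparison fair is to fit both models on the same train/test split and collect their errors side by side. The sketch below does this on synthetic data from `make_regression`, which stands in for the course features as an assumption; the resulting numbers say nothing about how the models rank on the real dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the course features (assumption).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "linear_regression": LinearRegression(),
    "gradient_boosting": GradientBoostingRegressor(n_estimators=100, random_state=42),
}

# Fit both models on the same split so their test errors are comparable.
test_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

for name, mse in test_mse.items():
    print(f"{name}: MSE = {mse:.2f}")
```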

### Task 9

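I am choosing Mean Squared Error (MSE) to evaluate both models because it is a commonly used regression metric. The MSE measures the average squared difference between actual and predicted values, so a lower value means a better fit and large errors are penalized more heavily. Because MSE is expressed in squared units, its square root (RMSE) gives an error in the same units as the enrollment count.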
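As a small worked example, the metric can be computed by hand and checked against scikit-learn. The three actual and predicted values below are made up; the square root of the MSE (RMSE) is included because it is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted enrollment counts for illustration only.
y_true = np.array([100, 200, 300])
y_pred = np.array([110, 190, 330])

# MSE by hand: mean of the squared differences (100 + 100 + 900) / 3.
mse_manual = np.mean((y_true - y_pred) ** 2)
# The same quantity from scikit-learn.
mse = mean_squared_error(y_true, y_pred)
# RMSE is in the same units as the target (students).
rmse = np.sqrt(mse)

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
```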