Practical Exam - Fitness class participation analysis

Task 1

In this task, I validated the data and confirmed whether each column of the data matches its description.

Columns:

1. booking_id: There are no duplicate or missing values, which matches the description.

2. months_as_member: The values are discrete, the minimum is 1, and there are no missing values, all of which matches the description. The average is about 16 months (15.69...).

3. weight: The values are continuous and rounded to 2 decimal places, as described. However, the minimum value is 55.41, which differs from the description. I also found 20 missing values and replaced them with the average weight (about 82.61 kg) using the pandas fillna method.

4. days_before: The values are stored as strings, and some carry the suffix " days". I removed the suffix and used the pandas astype method to convert the column to integers, matching the description. The minimum value is 1 and there are no missing values.

5. day_of_week: The values are ordinal with no missing values, but three labels (Wednesday, Fri., Monday) do not match the description. They can reasonably be read as Wed, Fri and Mon, so I replaced them accordingly.

6. time: The values are either "AM" or "PM" with no missing values, which matches the description.

7. category: The pandas isna method found no missing values, but as a precaution I checked the distinct values with value_counts. The values look fine except for 13 entries of "-", which are effectively missing; I replaced them with "unknown" according to the description.

8. attended: There are no missing values and only the values 1 and 0 occur, which matches the description.
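
As a minimal sketch of the checks described above, the snippet below runs them on a small made-up sample rather than the real file. The sample values and the set of expected three-letter day labels are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature of the raw file; the values are illustrative only.
raw = pd.DataFrame({
    "booking_id": [1, 2, 3, 4, 5],
    "weight": [79.56, None, 74.53, 86.12, 69.29],
    "days_before": ["8", "2 days", "14", "10 days", "8"],
    "day_of_week": ["Wed", "Fri.", "Monday", "Wednesday", "Sun"],
    "category": ["Strength", "-", "HIIT", "Cycling", "HIIT"],
})

# 1. Identifier column: every value should be unique.
ids_unique = raw["booking_id"].is_unique

# 2. Missing values per column (placeholders such as "-" are NOT counted here).
missing = raw.isna().sum()

# 3. Labels outside the expected set reveal inconsistent spellings.
#    The expected set is an assumption about the data description.
expected_days = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}
bad_days = sorted(set(raw["day_of_week"]) - expected_days)

# 4. value_counts exposes placeholder categories that isna() misses.
category_counts = raw["category"].value_counts()

print(ids_unique)
print(missing)
print(bad_days)
print(category_counts)
```

The point of step 4 is that a placeholder such as "-" is an ordinary string, so it only becomes visible when the distinct values are listed.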

Task 2

In this task, I did further exploration of category and its relationship with attended.

A count plot of the category variable shows that "HIIT" has the most observations, about 670. Splitting each category by attended shows that the attended variable is imbalanced within each category, at a ratio of about 1:2.
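
The numbers behind such a count plot can also be tabulated directly. The following is a small sketch on made-up data (the rows are hypothetical, not taken from the real file), using a cross-tabulation and a per-category attendance rate.

```python
import pandas as pd

# Hypothetical sample; the real counts come from the cleaned DataFrame.
sample = pd.DataFrame({
    "category": ["HIIT", "HIIT", "HIIT", "Yoga", "Yoga", "Yoga",
                 "Cycling", "Cycling", "Cycling"],
    "attended": [1, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Rows: category, columns: attended (0/1), cells: number of bookings.
counts = pd.crosstab(sample["category"], sample["attended"])

# The mean of a 0/1 column is the share of bookings that were attended.
rates = sample.groupby("category")["attended"].mean()

print(counts)
print(rates.round(2))
```

A table like this makes the imbalance explicit per category instead of relying on bar heights.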

Task 3

In this task, I visualized the distribution of the "months as member" variable to understand it better.

The histogram shows a right-skewed distribution, and most members have belonged to this fitness club for 5 to 15 months.
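
Right skew can also be confirmed numerically. This is a minimal sketch on made-up tenure values (an assumption, not the real column): a positive sample skewness and a mean above the median both point to a long right tail.

```python
import pandas as pd

# Hypothetical tenures in months; a few long-standing members form the right tail.
months = pd.Series([3, 5, 6, 8, 8, 9, 10, 11, 12, 12, 13, 15, 18, 24, 40, 90])

skewness = months.skew()        # > 0 indicates a right-skewed distribution
mean_months = months.mean()
median_months = months.median()

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_months}, median: {median_months}")
```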

Task 4

In this task, I examined the relationship between "months as member" and "attended".

Colouring the histogram by the values of "attended" suggests that those who have been members for at least 24 months have higher attendance.
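
To put a number on this visual impression, the attendance rate can be compared on either side of the 24-month mark. The sketch below uses made-up rows, and the group labels are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical sample of bookings.
sample = pd.DataFrame({
    "months_as_member": [3, 8, 12, 15, 20, 24, 30, 36, 48, 60],
    "attended":         [0, 0, 0, 1, 0, 1, 1, 0, 1, 1],
})

# Label each booking by whether the member has reached 24 months.
long_term = sample["months_as_member"] >= 24
tenure = long_term.map({True: "24+ months", False: "under 24 months"}).rename("tenure")

# The mean of a 0/1 column is the attendance rate within each group.
rate_by_tenure = sample.groupby(tenure)["attended"].mean()
print(rate_by_tenure)
```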

Task 5

Given the business background, the goal is to predict whether a member will attend the class or not, which is a binary classification problem in machine learning.

Task 6

In this task, I fit a DecisionTreeClassifier with the default parameter settings and used it to predict on the test set.

# import modules
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


# import data file
df = pd.read_csv('fitness_class_2212.csv')

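# data manipulation
# weight: impute missing values with the mean (assign back instead of inplace on a column)
df['weight'] = df['weight'].fillna(df['weight'].mean())
# days_before: strip the " days" suffix, then convert to int
df['days_before'] = df['days_before'].str.replace(" days", "", regex=False)
df['days_before'] = df['days_before'].astype('int')
# day_of_week: map the inconsistent labels to the three-letter form
df['day_of_week'] = df['day_of_week'].replace({'Wednesday': 'Wed', 'Fri.': 'Fri', 'Monday': 'Mon'})
# category: "-" marks a missing category
df['category'] = df['category'].replace('-', 'unknown')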

# get dummies for category variables
df_clean = pd.get_dummies(df, drop_first=True) # drop first to avoid multicollinearity

# set random_seed
SEED = 123

# split the data into train and test sets
X = df_clean.drop(columns=['attended'])  # features
y = df_clean['attended']  # label (1-D Series, as scikit-learn expects)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED, stratify=y)

# DecisionTreeClassifier model initialization 
dt = DecisionTreeClassifier(random_state=SEED)

# fit the model
dt.fit(X_train, y_train)

# predict 
y_pred_dt = dt.predict(X_test)

Task 7

In this task, I chose a random forest classifier to fit the data and make predictions.

# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
# Instantiate rf
rf = RandomForestClassifier(random_state=SEED)
# Fit rf to the training set    
rf.fit(X_train, y_train)
# Predict the test set labels
y_pred_rf = rf.predict(X_test)

Task 8

In Task 6, I chose DecisionTreeClassifier as the predictive model for two reasons. First, some variables in the data are on different scales (e.g. weight and months_as_member), and decision trees are largely unaffected by feature scale. Second, the earlier look at the "months_as_member" variable revealed some outliers, and decision trees are less sensitive to outliers than linear models.

In Task 7, I chose RandomForestClassifier because, compared with a single DecisionTreeClassifier, it has a lower risk of overfitting: it averages many trees, each trained on a bootstrap sample. Thanks to this ensembling and randomization, it often achieves higher performance, which is our primary objective.
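
The claim that decision trees are insensitive to feature scale can be checked empirically. The sketch below uses synthetic data from make_classification rather than the fitness data, and the scaling factors are arbitrary assumptions: two trees are trained on the same features before and after multiplying each column by a positive constant, and their test predictions are compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the fitness dataset.
X_demo, y_demo = make_classification(n_samples=400, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Multiply each column by a different positive constant (a change of units).
factors = np.array([1.0, 2.0, 8.0, 0.5, 1024.0])

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_tr * factors, y_tr)

# Share of test rows on which the two trees give the same prediction.
agreement = np.mean(tree_raw.predict(X_te) == tree_scaled.predict(X_te * factors))
print(f"Prediction agreement after rescaling: {agreement:.2%}")
```

Because a tree split only depends on the ordering of values within a feature, the two trees are expected to agree on essentially every row, which is why no scaler was applied before modelling.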

Task 9

Based on the earlier check of "attended", the classes 0 and 1 are imbalanced, so accuracy alone may be a misleading measure of model performance. I therefore computed the ROC AUC (roc_auc_score) for the DecisionTreeClassifier and the RandomForestClassifier to evaluate model performance, since it is less sensitive to class imbalance than accuracy.

Scoring the previously trained models on the test set, the AUC of the decision tree was approximately 64.29%, while the AUC of the random forest was 78.36%.
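
To illustrate why accuracy can mislead here, consider a deliberately useless model that gives every booking the same score. The sketch below uses made-up labels, and the 2:1 split of not attended to attended is an assumption based on the ratio noted earlier.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical labels: 200 not attended (0) and 100 attended (1).
y_true = np.array([0] * 200 + [1] * 100)

# A "model" that always predicts the majority class with one constant score.
always_zero_labels = np.zeros_like(y_true)
constant_scores = np.full(y_true.shape, 0.3)

acc = accuracy_score(y_true, always_zero_labels)
auc = roc_auc_score(y_true, constant_scores)

print(f"accuracy: {acc:.3f}")  # looks respectable despite learning nothing
print(f"ROC AUC:  {auc:.3f}")  # no better than random ranking
```

Accuracy rewards the model for the class ratio alone, while the AUC stays at chance level because the scores cannot rank attended bookings above the others.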

The ROC curve also clearly shows that the area under the curve of the random forest is greater than the area under the curve of the decision tree, indicating that the random forest performs better in this analysis.

To further validate the evaluation results, I ran 5-fold cross-validation and found them consistent: the mean AUC was 58.62% for the decision tree and 77.33% for the random forest, again indicating that the random forest is the better model in this analysis.

# Import necessary modules
from sklearn.metrics import roc_auc_score

# Compute dt predicted probabilities: 
# y_pred_prob_dt
y_pred_prob_dt = dt.predict_proba(X_test)[:,1]
# y_pred_prob_rf 
y_pred_prob_rf = rf.predict_proba(X_test)[:,1]

# Compute and print AUC scores
print("DecisionTreeClassifier AUC: {}".format(roc_auc_score(y_test, y_pred_prob_dt)))
print("RandomForestClassifier AUC: {}".format(roc_auc_score(y_test, y_pred_prob_rf)))

Plot ROC Curve

# Import necessary modules
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob_dt)
fpr2, tpr2, thresholds2 = roc_curve(y_test, y_pred_prob_rf)
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr, label='Decision Tree')
plt.plot(fpr2, tpr2, label='Random Forest')
plt.legend()
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
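
The 5-fold cross-validation mentioned in the evaluation above is not shown in the original cells. A minimal sketch of how it could be run is given below; it assumes cross_val_score with scoring="roc_auc" and uses synthetic data from make_classification as a stand-in for the cleaned features, so the printed numbers will not match the reported ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

SEED = 123

# Synthetic stand-in for X and y; the class weights mimic the imbalance.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.67, 0.33], random_state=SEED
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=SEED),
    "Random Forest": RandomForestClassifier(random_state=SEED),
}

mean_auc = {}
for name, model in models.items():
    # cv=5 uses stratified 5-fold splitting for classifiers.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    mean_auc[name] = scores.mean()
    print(f"{name} mean AUC: {mean_auc[name]:.4f}")
```

Averaging over five folds reduces the dependence on one particular train/test split, which is why it serves as a consistency check on the single-split scores.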
