Data Scientist Associate Practical Exam Submission

Use this template to complete your analysis and write up your summary for submission.

Task 1

a. State whether the values match the description given in the table above.

  • booking_id - Matches
  • months_as_member - Matches
  • weight - Missing value in the data.
  • days_before - Does not match. Needs to clean up the string 'days' and change dtype to int.
  • day_of_week - Does not match the categories described. Re-mapping and/or cleaning is required.
  • time - Matches
  • category - Missing value '-' in the data. Replacing with 'unknown'
  • attended - Matches

b. State the number of missing values in the column.

  • The weight variable has 20 missing values.

c. Describe what you did to make values match the description if they did not match.

  • days_before: removed the string 'days' from the values and converted the dtype to int.
  • day_of_week: the variable originally contained categories inconsistent with the documentation. On inspection, taking the first three characters of each value corrects the inconsistencies.
  • weight: filled the missing values with the overall mean of the variable.
import pandas as pd
import matplotlib.pyplot as plt
fitness = pd.read_csv('fitness_class_2212.csv')

# Display the first 5 rows of the dataset
print(fitness.head()) 
print(fitness.info())
print(fitness['category'].dtype)
# Observe the number of unique values in each feature variable
fitness.nunique()

# Variable day_of_week should be 7 but we get 10. Use value_counts() to investigate further
#print(fitness['day_of_week'].value_counts())
fitness['day_of_week'] = fitness['day_of_week'].str[0:3]

# Variable weight has missing values and should be replaced with overall average weight as described in the document
mean_weight = fitness['weight'].mean()
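# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
fitness['weight'] = fitness['weight'].fillna(mean_weight)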

# Replacing missing value in the category variable from '-' to 'unknown'
fitness['category'] = fitness['category'].replace('-', 'unknown')
#fitness['category'].value_counts()

# Clean up 'days_before' variable since some data contain ' days' string while converting dtype to int.
fitness['days_before'] = fitness['days_before'].str.replace(' days','')
fitness['days_before'] = fitness['days_before'].astype('int')
fitness['days_before'].value_counts().sort_index()

fitness.describe()
fitness.isna().sum()
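
The cleaning steps above can be sanity-checked on a small example. This is a minimal sketch on hypothetical toy rows (the values are made up to mimic the issues described, not taken from fitness_class_2212.csv), with assertions that fail loudly if a step misses a case.

```python
import pandas as pd

# Hypothetical toy rows mimicking the issues described above (not the real data)
toy = pd.DataFrame({
    "days_before": ["8", "12 days", "3"],
    "day_of_week": ["Wednesday", "Fri.", "Mon"],
    "weight": [80.0, None, 70.0],
    "category": ["HIIT", "-", "Yoga"],
})

# Strip the ' days' suffix, then cast to integer
toy["days_before"] = (
    toy["days_before"].str.replace(" days", "", regex=False).str.strip().astype(int)
)
# Keep the first three characters so 'Wednesday' and 'Fri.' become 'Wed' and 'Fri'
toy["day_of_week"] = toy["day_of_week"].str[:3]
# Fill missing weights with the column mean (assignment form, no inplace)
toy["weight"] = toy["weight"].fillna(toy["weight"].mean())
# Treat '-' as an unknown category
toy["category"] = toy["category"].replace("-", "unknown")

# Checks on the cleaned columns
valid_days = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}
assert set(toy["day_of_week"]) <= valid_days
assert toy["weight"].isna().sum() == 0
assert "-" not in set(toy["category"])
print(toy)
```

The same three assertions could be run against the cleaned fitness DataFrame, assuming the documented day labels are the three-letter abbreviations used here.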

Task 2

Create a visualization that shows how many bookings attended the class. Use the visualization to:

a. State which category of the variable attended has the most observations

  • HIIT training has the most attendance observations, at over 200.

b. Explain whether the observations are balanced across categories of the variable attended

  • No. The observations are concentrated in the HIIT, Cycling and Strength classes, so they are not balanced across categories.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
xtab = pd.crosstab(fitness['category'], fitness['attended'])
xtab.plot(kind = 'bar')
plt.ylabel('# of Bookings')
plt.xlabel('Category')
plt.xticks(rotation = 0)
plt.show()
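
The balance of the attended variable itself can also be read directly from its counts and proportions. A minimal sketch, assuming attended is coded 0/1; the Series below holds hypothetical toy values, not the real bookings.

```python
import pandas as pd

# Hypothetical toy values for the binary 'attended' column (not the real data)
attended = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="attended")

counts = attended.value_counts()                # bookings per class
shares = attended.value_counts(normalize=True)  # proportion per class

majority_class = counts.idxmax()
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(2))
print(f"Majority class: {majority_class}, ratio: {imbalance_ratio:.2f}")
```

On the real data, replacing the toy Series with fitness['attended'] would give the same summary.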

Task 3

Describe the distribution of the number of months as a member. Your answer must include a visualization that shows the distribution.

  • The distribution of the months_as_member variable appears to be right-skewed, centered at around 15 months of membership.
  • It also appears that there are a few loyal customers having been members for more than 60 months.
  • Should these outliers be treated?
import seaborn as sns
sns.displot(fitness, x = 'months_as_member', bins = 20)
#considering removing outliers
import numpy as np
q3 = np.quantile(fitness['months_as_member'], .75)
q1 = np.quantile(fitness['months_as_member'], .25)
iqr = q3 - q1
upper = q3 + 1.5 * iqr

temp = fitness[fitness['months_as_member'] < upper]
#sns.boxplot(x = temp['months_as_member'])
sns.displot(temp, x = 'months_as_member', bins = 20)
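
To make the outlier cutoff concrete, here is a small worked example of the same 1.5 × IQR rule. The durations are hypothetical toy values, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical toy membership durations in months (not the real data)
months = np.array([2, 5, 8, 10, 12, 14, 16, 18, 22, 30, 75])

q1, q3 = np.quantile(months, [0.25, 0.75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr  # upper fence of the IQR rule

outliers = months[months > upper]
kept = months[months <= upper]

print(f"q1={q1}, q3={q3}, iqr={iqr}, upper fence={upper}")
print("flagged as outliers:", outliers)
```

Here q1 = 9, q3 = 20 and the fence is 36.5, so only the 75-month member is flagged. Whether such members should be dropped is a separate judgment, since long tenures may be genuine rather than errors.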

Task 4

Describe the relationship between attendance and number of months as a member.

Your answer must include a visualization to demonstrate the relationship.

  • Attendance and number of months as a member are somewhat positively correlated: the longer someone has been a member, the more likely they are, in general, to attend the class they booked.
# Checking for linearity between continuous independent variables vs. response variable.
sns.regplot(data = fitness, y = 'attended', x = 'months_as_member', logistic= True)
#sns.regplot(data = temp, y = 'attended', x = 'weight', logistic= True)
#sns.regplot(data = temp, y = 'attended', x = 'days_before', logistic= True)
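# Restrict to numeric columns; corr() raises on string columns in recent pandas versions
correlations = fitness.select_dtypes('number').corr()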
sns.heatmap(correlations, annot=True)
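
The direction of the relationship can also be summarised numerically. A minimal sketch on hypothetical toy rows, assuming attended is coded 0/1: it compares the median tenure of the two groups and computes the Pearson correlation, which for a binary variable is the point-biserial correlation.

```python
import pandas as pd

# Hypothetical toy rows (not the real data)
toy = pd.DataFrame({
    "months_as_member": [3, 5, 8, 10, 12, 15, 20, 25, 30, 40],
    "attended":         [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
})

# Median tenure for non-attendees (0) versus attendees (1)
medians = toy.groupby("attended")["months_as_member"].median()

# Pearson correlation with a 0/1 variable is the point-biserial correlation
r = toy["months_as_member"].corr(toy["attended"])

print(medians)
print(f"point-biserial r = {r:.2f}")
```

A clearly higher median for attendees together with a positive r would support the "somewhat positive" reading of the plots.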

Task 5

The business wants to predict whether members will attend using the data provided.

State the type of machine learning problem that this is (regression/classification/clustering).

The business is looking to predict attendance, which is a binary variable (Yes or No), hence, this business problem is a classification problem.

Task 6

Fit a baseline model to predict whether members will attend using the data provided. You must include your code.

# Start coding here... 
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve

# Create dummy variables for the categorical columns 'day_of_week', 'time' and 'category'
dummies_time = pd.get_dummies(fitness['time'], drop_first= True)
dummies_dow = pd.get_dummies(fitness['day_of_week'], drop_first=True)
dummies_cat = pd.get_dummies(fitness['category'], drop_first=True)

fitness_clean = pd.concat([fitness, dummies_cat, dummies_dow, dummies_time], axis=1)
# Drop the original categorical columns and the identifier, which carries no predictive signal
fitness_clean = fitness_clean.drop(columns = ['day_of_week', 'time', 'category', 'booking_id'])
fitness_clean.head()

# initializing data to model
X = fitness_clean.drop('attended', axis = 1).values
y = fitness_clean['attended'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, random_state = 20)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg_y_pred = logreg.predict(X_test)
logreg_score = logreg.score(X_test, y_test)
print(f'Accuracy score using logistic regression: {logreg_score:.3f}')
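
Accuracy alone can hide how a classifier treats the minority class, so the baseline could also be inspected with a confusion matrix and a per-class report. This is a sketch on synthetic data from make_classification, used as a stand-in for the fitness features; the 70/30 class split, max_iter=1000 and the stratified split are assumptions made for illustration, not settings from the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fitness features, with an assumed 70/30 class split
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.7, 0.3], random_state=20
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=20, stratify=y
)

# max_iter is raised from the default as a precaution against convergence warnings
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

With the real data, logreg_y_pred and y_test from the cell above could be passed to the same two functions.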



Task 7

Fit a comparison model to predict whether members will attend using the data provided. You must include your code.

Implementing classification tree as a comparison model.
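
A classification tree comparison might look like the following. This is only a sketch on synthetic data from make_classification, not the submitted model; max_depth=3 is an assumed, untuned setting chosen to limit overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fitness features (not the real data)
X, y = make_classification(n_samples=500, n_features=8, random_state=20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=20
)

# max_depth is an assumed setting to limit overfitting, not a tuned value
tree = DecisionTreeClassifier(max_depth=3, random_state=20)
tree.fit(X_train, y_train)
tree_score = tree.score(X_test, y_test)
print(f'Accuracy score using classification tree: {tree_score:.3f}')
```

Reusing the X_train, X_test, y_train and y_test split from Task 6 in place of the synthetic data would keep the comparison with the logistic regression baseline like for like.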