Data Scientist Associate Practical Exam Submission

Use this template to complete your analysis and write up your summary for submission.

Task 1

Original data

  • booking_id : Matches the description; no missing values.
  • months_as_member : Matches the description; no missing values. The minimum is 1 month.
  • weight : 20 missing values, which I replaced with the overall average weight. The minimum weight is 55.41 kg, not 40.00 kg.
  • days_before : Matches the description; no missing values. I removed the text " days" and converted the whole column to type int.
  • day_of_week : Matches the description; no missing values. Some labels that should be identical differed, such as "Wednesday" and "Wed", so I shortened every label to its first 3 letters.
  • time : Matches the description; no missing values.
  • category : 13 missing values recorded as "-", which I replaced with "unknown".
  • attended : Matches the description; no missing values.
import pandas as pd 
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns
df = pd.read_csv("fitness_class_2212.csv")
print(df.info())
print("The minimum number of months as a member:", df["months_as_member"].min())
# Replace missing weights with the mean weight (weight column only), then check the minimum
df_mod = df.copy()
df_mod["weight"] = df_mod["weight"].fillna(df_mod["weight"].mean())
print(df_mod.info())
print("The minimum weight is", df_mod["weight"].min())
# Remove " days" from days_before and convert the column to type int
df_mod["days_before"] = df_mod["days_before"].str.replace(" days", "", regex=False)
df_mod["days_before"] = df_mod["days_before"].astype("int")
print(df_mod.info())
# Keep the first 3 letters of each value so the day labels are consistent
df_mod["day_of_week"] = df_mod["day_of_week"].str[:3]
# Replace "-" with "unknown"
df_mod["category"] = df_mod["category"].replace("-", "unknown")
print(df_mod.iloc[55])
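
The cleaning steps above can be checked on a few toy rows before running them on the real file. The sketch below is illustrative only: the `toy` frame and the `clean` helper are hypothetical names, and the row values are invented rather than taken from fitness_class_2212.csv. It fills missing values on the weight column alone, so the mean is never written into any other column.

```python
import numpy as np
import pandas as pd

# Toy rows invented for illustration; they are not taken from the real data file.
toy = pd.DataFrame({
    "weight": [70.0, np.nan, 80.0],
    "days_before": ["8", "2 days", "14"],
    "day_of_week": ["Wednesday", "Wed", "Fri."],
    "category": ["Yoga", "-", "HIIT"],
})


def clean(frame):
    """Apply the cleaning steps described above to a copy of the frame."""
    out = frame.copy()
    # Fill only the weight column with its own mean.
    out["weight"] = out["weight"].fillna(out["weight"].mean())
    # Drop the " days" suffix, then convert to integer.
    out["days_before"] = (
        out["days_before"].str.replace(" days", "", regex=False).astype(int)
    )
    # Keep the first three letters so "Wednesday" and "Wed" match.
    out["day_of_week"] = out["day_of_week"].str[:3]
    # Treat "-" as an unknown category.
    out["category"] = out["category"].replace("-", "unknown")
    return out


cleaned = clean(toy)
print(cleaned)
```

Because `clean` works on a copy, the original frame keeps its missing value and can be compared with the cleaned result.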

Task 2

The bookings are not balanced across attendance: 1046 bookings were not attended, compared with only 454 that were attended.

#a
df_mod.select_dtypes(['object','bool']).nunique()

Of the categorical columns, day_of_week has the most unique values.

# Count attended (1) and not attended (0) bookings
attendance = (df_mod["attended"] == 1).sum()
absence = (df_mod["attended"] == 0).sum()
print("Total bookings:", len(df_mod))
print("Attendance:", attendance)
print("Absence:", absence)
# Bar chart comparing the two counts
x = ["Attendance", "Absence"]
y = [attendance, absence]
c = ["orange", "green"]
plt.bar(x, y, color=c, alpha=0.5)
plt.title("Number of attended and not attended bookings")
plt.xlabel("attended")
plt.ylabel("Number")
plt.show()
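
The imbalance can also be stated as proportions. This is a minimal sketch that uses only the two counts reported in the write-up; the variable names are illustrative, and the total is assumed to be the sum of the two counts.

```python
# Counts as reported in the write-up above.
attended = 454
not_attended = 1046
total = attended + not_attended  # assumed: every booking is in one of the two groups

share_attended = attended / total
share_not_attended = not_attended / total
ratio = not_attended / attended

print(f"attended: {share_attended:.1%}")              # attended: 30.3%
print(f"not attended: {share_not_attended:.1%}")      # not attended: 69.7%
print(f"not attended per attended: {ratio:.2f}")      # not attended per attended: 2.30
```

On the cleaned frame, `df_mod["attended"].value_counts(normalize=True)` should give the same shares directly.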

Task 3

The number of months as a member has a right-skewed distribution. Most members have been members for between 8 and 19 months, and the median is 12 months. There are still many outliers above 40 months.
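
One way to back up a description like this is to compute the quartiles, an outlier cutoff, and the sample skewness. The sketch below is hedged: `describe_skew` is a hypothetical helper, the 1.5 × IQR upper fence is the usual boxplot convention and only an assumption about how the outliers were identified here, and the sample values are invented for illustration.

```python
import pandas as pd


def describe_skew(values):
    """Return quartiles, the 1.5*IQR upper fence, outlier count, and skewness."""
    s = pd.Series(values, dtype="float64")
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    # Boxplot convention: points above q3 + 1.5 * IQR count as outliers.
    upper_fence = q3 + 1.5 * (q3 - q1)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_fence": upper_fence,
        "n_outliers": int((s > upper_fence).sum()),
        "skew": s.skew(),  # positive values indicate a longer right tail
    }


# Invented sample with a long right tail; not taken from the real data.
sample_months = [2, 8, 10, 12, 14, 19, 20, 45, 100]
summary = describe_skew(sample_months)
print(summary)
```

With the real data, the same helper could be called as `describe_skew(df_mod["months_as_member"])`, and a histogram or boxplot of that column would show the shape visually.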