Data Scientist Associate Practical Exam Submission
Use this template to complete your analysis and write up your summary for submission.
Task 1
Original data
- booking_id : Matches the description; no missing values.
- months_as_member : Matches the description; no missing values. The minimum is 1 month.
- weight : 20 missing values, which I replaced with the overall average weight. The minimum weight is 55.41 kg, not 40.00 kg.
- days_before : Matches the description; no missing values. I removed the text "days" and converted the whole column to type "int".
- day_of_week : Matches the description; no missing values. Some labels that should be the same differ, such as "Wednesday" and "Wed", so I shortened every label to its first 3 letters.
- time : Matches the description; no missing values.
- category : 13 missing values recorded as "-", which I replaced with "unknown".
- attended : Matches the description; no missing values.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
df = pd.read_csv("fitness_class_2212.csv")
print(df.info())
print("The minimum number of months as a member:",min(df["months_as_member"]))
# Replace missing weights with the mean weight (only the weight column is filled),
# then check the minimum weight
df_mod = df.copy()
df_mod["weight"] = df_mod["weight"].fillna(df_mod["weight"].mean())
print(df_mod.info())
print("The minimum weight is", df_mod["weight"].min())
# Remove " days" from days_before and convert the column to type int
df_mod["days_before"] = df_mod["days_before"].str.replace(" days", "", regex=False)
df_mod["days_before"] = df_mod["days_before"].astype("int")
print(df_mod.info())
# Keep the first 3 letters of each value so the labels are consistent
df_mod["day_of_week"] = df_mod["day_of_week"].str[:3]
# Replace "-" with "unknown"
df_mod["category"] = df_mod["category"].replace("-", "unknown")
print(df_mod.iloc[55])
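The cleaning steps above can be sanity-checked on a small made-up frame. This is only a sketch: the helper name clean_bookings and the toy rows are invented for illustration, and it assumes the same column names as the dataset.

```python
import numpy as np
import pandas as pd

def clean_bookings(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to a copy of the raw frame."""
    out = raw.copy()
    # Fill missing weights with the overall mean weight
    out["weight"] = out["weight"].fillna(out["weight"].mean())
    # Strip the " days" suffix and convert to integers
    out["days_before"] = (
        out["days_before"].astype(str).str.replace(" days", "", regex=False).astype(int)
    )
    # Keep the first three letters so "Wednesday" and "Wed" match
    out["day_of_week"] = out["day_of_week"].str[:3]
    # Treat "-" as an unknown category
    out["category"] = out["category"].replace("-", "unknown")
    return out

# Made-up rows, not taken from the real data
toy = pd.DataFrame({
    "weight": [80.0, np.nan, 60.0],
    "days_before": ["8", "10 days", "2"],
    "day_of_week": ["Wednesday", "Wed", "Fri."],
    "category": ["HIIT", "-", "Yoga"],
})
cleaned = clean_bookings(toy)
print(cleaned)
```

Wrapping the steps in one function makes it easy to confirm that only the weight column is filled and that the original frame is left unchanged.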
Task 2
The bookings are not balanced across the two attendance outcomes: 1046 bookings were not attended, compared with 454 that were attended.
#a
df_mod.select_dtypes(['object','bool']).nunique()
day_of_week has the most observations
# Count attended and not-attended bookings
attendance = df_mod[df_mod["attended"]==1]["attended"].count()
absence = df_mod[df_mod["attended"]==0]["attended"].count()
print("Total bookings:", len(df_mod))
print("Attendance:", attendance)
print("Absence:", absence)
# Bar chart comparing the two counts
x = ["Attendance", "Absence"]
y = [attendance, absence]
c = ["orange", "green"]
plt.bar(x, y, color=c, alpha=0.5)
plt.title("Number of attended and absent bookings")
plt.xlabel("attended")
plt.ylabel("Number of bookings")
plt.show()
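The imbalance reported in Task 2 can also be expressed as shares of all bookings. The sketch below rebuilds a stand-in attended column from the two counts given above (454 and 1046) rather than reading the real file, so the series itself is an assumption made for illustration.

```python
import pandas as pd

# Stand-in column built from the counts reported above:
# 454 attended bookings (1) and 1046 not attended (0)
attended = pd.Series([1] * 454 + [0] * 1046, name="attended")

# normalize=True returns proportions instead of raw counts
shares = attended.value_counts(normalize=True).sort_index()
print((shares * 100).round(1))
```

With these counts, about 30.3% of bookings were attended and about 69.7% were not. On the cleaned data the same call would be applied to df_mod["attended"].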
Task 3
The number of months as a member has a right-skewed distribution. Most members have been members for between 8 and 19 months, and the median is 12 months. There are still many outliers above 40 months.
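One way to back up the description above with numbers is to compute the median, the quartiles, and a count of high outliers. The sketch below is not the original analysis: the helper name summarize_months and the toy values are invented, and the 1.5 x IQR upper fence is an assumed outlier rule (the usual boxplot convention).

```python
import pandas as pd

def summarize_months(months: pd.Series) -> dict:
    """Return quartiles, the upper outlier fence, an outlier count, and skewness."""
    q1, median, q3 = months.quantile([0.25, 0.5, 0.75])
    # Assumed rule: values above Q3 + 1.5 * IQR count as outliers
    upper_fence = q3 + 1.5 * (q3 - q1)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_fence": upper_fence,
        "n_outliers": int((months > upper_fence).sum()),
        "skew": months.skew(),  # positive values indicate a right skew
    }

# Made-up values for illustration only
toy_months = pd.Series([1, 5, 8, 10, 12, 12, 15, 19, 22, 60, 90])
summary = summarize_months(toy_months)
print(summary)
```

With the cleaned data, the same function could be called as summarize_months(df_mod["months_as_member"]).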