DSA Practical
Data Science Associate
Task 1
This dataset has 1,500 rows with 8 columns. Before cleaning, some columns contain either missing values or inconsistent data entries that don't comply with the descriptions in the dataset table:
- booking_id: Same as the description.
- months_as_member: Same as the description with no missing values.
- weight: Contains 20 missing values, which were replaced with the mean weight.
- days_before: Values are stored as strings where some entries contain the word "days" after the number. The strings were reduced to just the numeric value then converted to integer data type.
- day_of_week: Some values contain the full day name; these were converted to the abbreviated name and all values were stripped of any other characters (some values had an extra period).
- time: Same as the description with no missing values.
- category: Contains values that don't match the description ("-") which were replaced with the value "Unknown".
- attended: Same as the description with no missing values.
Original dataset
#Loading necessary libraries for analysis
library(readr)
library(dplyr)
library(stringr)
library(tidyr)
#Importing the dataset and viewing the structure
df <- read_csv("fitness_class_2212.csv", show_col_types = FALSE)
str(df)
Finding number of missing values for each column
print(colSums(is.na(df)))
Examining categorical variables
cat_vars <- c("day_of_week", "time", "category", "attended")
for (x in cat_vars) {
  print(df %>% count(.data[[x]]))
}
Examining numeric variables
summary(df %>% select(!all_of(cat_vars)))
#Taking a closer look at days_before, since it was read in as a character column
max(nchar(df$days_before))
df %>%
  filter(nchar(days_before) > 5) %>%
  select(days_before)
Cleaning the categorical variables
#Cleaning the day_of_week variable, starting by stripping away extra periods
df$day_of_week <- str_replace_all(df$day_of_week, "\\.", "")
#Converting long day names to their abbreviated version
df <- df %>%
  mutate(day_of_week = case_when(day_of_week == "Monday" ~ "Mon",
                                 day_of_week == "Wednesday" ~ "Wed",
                                 TRUE ~ day_of_week))
print(count(df, day_of_week))
#Cleaning the category variable by replacing the hyphens ("-") with "Unknown"
df$category <- str_replace(df$category, "-", "Unknown")
print(count(df, category))
Cleaning numeric variables
#Replacing all NA values in the weight column with the average weight
df$weight <- replace_na(df$weight, mean(df$weight, na.rm = TRUE))
df$weight <- round(df$weight, 2)
#Stripping away all non-numeric values in the days_before column then converting it to class integer
df$days_before <- as.integer(str_replace_all(df$days_before, "[^0-9]", ""))
summary(df$days_before)
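A quick sanity check can confirm that the cleaned columns match the descriptions. The sketch below is illustrative only: `check_clean` is a hypothetical helper, the toy data frame uses made-up values rather than rows from the CSV, and the list of valid day abbreviations is an assumption about the dataset description.

```r
# Hypothetical helper: returns one TRUE/FALSE per check on a cleaned data frame
check_clean <- function(d) {
  # Assumed set of valid abbreviated day names
  valid_days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
  c(
    no_missing_weight  = !anyNA(d$weight),
    days_before_is_int = is.integer(d$days_before) && !anyNA(d$days_before),
    valid_day_names    = all(d$day_of_week %in% valid_days),
    no_hyphen_category = !any(d$category == "-")
  )
}

# Toy cleaned data (made-up values, not taken from fitness_class_2212.csv)
toy <- data.frame(
  weight      = c(79.56, 82.61, 74.53),
  days_before = c(8L, 10L, 2L),
  day_of_week = c("Wed", "Mon", "Fri"),
  category    = c("Strength", "Unknown", "HIIT"),
  stringsAsFactors = FALSE
)

print(check_clean(toy))
```

Calling `check_clean(df)` on the cleaned data frame would report the same four checks; any FALSE points to a column that still needs attention.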
Task 2
From Graph 1 we see that the HIIT category has the highest number of observations that attended the class, with Cycling having the second most. Observations are not balanced; the total number attended differs across categories.
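The counts behind a graph like this can be sketched as follows. This is a minimal example on made-up toy data, assuming `attended` is coded as 1 (attended) and 0 (did not attend); it uses base R's `aggregate` to stay self-contained rather than the dplyr pipeline used earlier.

```r
# Toy data (made-up values for illustration only)
toy <- data.frame(
  category = c("HIIT", "HIIT", "HIIT", "Cycling", "Cycling", "Yoga", "HIIT"),
  attended = c(1, 1, 1, 1, 1, 0, 0),
  stringsAsFactors = FALSE
)

# Sum the attended flag within each category
attended_by_cat <- aggregate(attended ~ category, data = toy, FUN = sum)

# Sort so the category with the most attendees comes first
attended_by_cat <- attended_by_cat[order(-attended_by_cat$attended), ]
print(attended_by_cat)

# A bar chart of these counts could then be drawn with, for example:
# barplot(attended_by_cat$attended, names.arg = attended_by_cat$category)
```

Running the same aggregation on the cleaned `df` would give the per-category totals that the comparison above relies on.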