DSA Practical

Data Science Associate

Task 1

This dataset has 1,500 rows with 8 columns. Before cleaning, some columns contain either missing values or inconsistent data entries that don't comply with the descriptions in the dataset table:

booking_id: Same as the description.
months_as_member: Same as the description with no missing values.
weight: Contains 20 missing values, which were replaced with the average weight.
days_before: Values are stored as strings, and some entries have the word "days" after the number. All non-numeric characters were stripped and the column was converted to the integer data type.
day_of_week: Some values use the full day name and some have an extra period. The periods were stripped and the full names were converted to their abbreviated form.
time: Same as the description with no missing values.
category: Contains a value that doesn't match the description ("-"), which was replaced with the value "Unknown".
attended: Same as the description with no missing values.

Original dataset

#Loading necessary libraries for analysis
library(readr)
library(dplyr)
library(stringr)
library(tidyr)

#Importing the dataset and viewing the structure
df <- read_csv("fitness_class_2212.csv", show_col_types = FALSE)
str(df)

Finding the number of missing values for each column

print(colSums(is.na(df)))

Examining categorical variables

cat_vars <- c("day_of_week", "time", "category", "attended")

for(x in cat_vars){
	print(df %>% count(.data[[x]]))
}

Examining numeric variables

summary(df %>% select(!all_of(cat_vars)))

#Taking a closer look at days_before, since it is stored as a character column
max(nchar(df$days_before))
df %>%
	filter(nchar(days_before) > 5)%>%
	select(days_before)

Cleaning the categorical variables

#Cleaning the day_of_week variable, starting by stripping away extra periods
df$day_of_week <- str_replace_all(df$day_of_week, "\\.", "")

#Converting long day names to their abbreviated version
df <- df %>% 
	mutate(day_of_week = case_when(day_of_week == "Monday" ~ "Mon",
								   day_of_week == "Wednesday" ~ "Wed",
								   TRUE ~ day_of_week))
print(count(df, day_of_week))

#Cleaning the category variable by replacing values that are exactly a hyphen ("-") with "Unknown"
df$category <- str_replace(df$category, "^-$", "Unknown")
print(count(df, category))

Cleaning numeric variables

#Replacing all NA values in the weight column with the average weight
df$weight <- replace_na(df$weight, mean(df$weight, na.rm = TRUE))
df$weight <- round(df$weight, 2)

#Stripping away all non-numeric values in the days_before column then converting it to class integer
df$days_before <- as.integer(str_replace_all(df$days_before, "[^0-9]", ""))
summary(df$days_before)
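
A quick way to confirm that the cleaning steps above had the intended effect is to run a few checks on the result. The sketch below is only an illustration: `check_clean` is a hypothetical helper, and `toy` is a small made-up data frame standing in for the cleaned `df`, not rows from the real dataset.

```r
# Hypothetical stand-in for the cleaned `df`; the values are made up for illustration.
toy <- data.frame(
  weight      = c(79.56, 82.61, 74.12),
  days_before = c(8L, 2L, 14L),
  day_of_week = c("Wed", "Mon", "Sun"),
  category    = c("Strength", "HIIT", "Unknown"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns a named logical vector, one entry per check.
check_clean <- function(data) {
  valid_days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
  c(
    no_missing_weight   = !anyNA(data$weight),
    days_before_integer = is.integer(data$days_before),
    day_abbreviated     = all(data$day_of_week %in% valid_days),
    no_hyphen_category  = !any(data$category == "-")
  )
}

checks <- check_clean(toy)
print(checks)
```

In the notebook itself, the same helper could be called as `check_clean(df)` after the cleaning cells have run.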

Task 2

Graph 1 shows that the HIIT category has the highest number of attended bookings, with Cycling second. The observations are not balanced: the number of attended bookings differs across categories.
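
A chart of this kind could be produced along the following lines. This is a minimal sketch, not necessarily how Graph 1 was built: it assumes `attended` is coded as 1 for attended and 0 otherwise, and it uses a small made-up data frame (`toy`) in place of the cleaned `df`. The names `attended_counts` and `graph_1` and the axis labels are illustrative.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy rows standing in for the cleaned `df`; the counts are made up.
toy <- data.frame(
  category = c("HIIT", "HIIT", "HIIT", "Cycling", "Cycling", "Yoga"),
  attended = c(1, 1, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)

# Keep only attended bookings (assumes attended == 1 means the member showed up),
# then count them per category.
attended_counts <- toy %>%
  filter(attended == 1) %>%
  count(category, name = "n_attended")

# Bar chart with categories ordered from most to fewest attended bookings.
graph_1 <- ggplot(attended_counts,
                  aes(x = reorder(category, -n_attended), y = n_attended)) +
  geom_col() +
  labs(x = "Category", y = "Attended bookings",
       title = "Attended bookings by class category")

print(graph_1)
```

Replacing `toy` with the cleaned `df` would give the per-category counts for the real data.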

