DSA Practical

    Task 1

    This dataset has 1,500 rows with 8 columns. Before cleaning, some columns contain either missing values or inconsistent data entries that don't comply with the descriptions in the dataset table:

    • booking_id: Same as the description.
    • months_as_member: Same as the description with no missing values.
    • weight: Contains 20 missing values, which were replaced with the average weight.
    • days_before: Values are stored as strings, and some entries contain the word "days" after the number. The strings were reduced to just the numeric value and then converted to the integer data type.
    • day_of_week: Some values contain the full day name; these were converted to the abbreviated name and all values were stripped of any other characters (some values had an extra period).
    • time: Same as the description with no missing values.
    • category: Contains "-" entries that don't match the description; these were replaced with the value "Unknown".
    • attended: Same as the description with no missing values.
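    The rules in the list above can also be written as explicit checks. The following is a minimal sketch in base R, run on a small made-up data frame rather than the real dataset; the `cleaned` object, its values, and the check names are hypothetical and only illustrate the cleaned format described above.

```r
# Made-up rows in the cleaned format; values are illustrative, not from the dataset
cleaned <- data.frame(
  weight      = c(79.56, 82.61, 74.53),
  days_before = c(8L, 2L, 14L),
  day_of_week = c("Wed", "Mon", "Sun"),
  category    = c("Strength", "HIIT", "Unknown")
)

# Each entry is TRUE when the column satisfies the rule described in the list
checks <- c(
  weight_no_na     = !anyNA(cleaned$weight),
  days_before_int  = is.integer(cleaned$days_before),
  day_abbreviated  = all(grepl("^[A-Za-z]{3}$", cleaned$day_of_week)),
  category_no_dash = !any(cleaned$category == "-")
)
print(checks)

# Stops with an error if any rule is violated
stopifnot(all(checks))
```

    Running the same checks on the real data frame after cleaning would give a quick confirmation that each column matches its description.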

    Original dataset

    #Loading necessary libraries for analysis
    library(readr)
    library(dplyr)
    library(stringr)
    library(tidyr)
    
    #Importing the dataset and viewing the structure
    df <- read_csv("fitness_class_2212.csv", show_col_types = FALSE)
    str(df)

    Finding number of missing values for each column

    print(colSums(is.na(df)))

    Examining categorical variables

    cat_vars <- c("day_of_week", "time", "category", "attended")
    
    for(x in cat_vars){
    	print(df %>% count(.data[[x]]))
    }

    Examining numeric variables

    summary(df %>% select(!all_of(cat_vars)))
    
    #Taking a closer look at days_before, since it is stored as character
    max(nchar(df$days_before))
    df %>%
    	filter(nchar(days_before) > 5) %>%
    	select(days_before)
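
    The length filter above relies on the malformed entries being the longest strings. A more direct alternative, sketched here on made-up values, flags any entry that contains a non-digit character. The `toy` tibble is hypothetical and only imitates the mixed formats found in days_before.

```r
library(dplyr)
library(stringr)

# Made-up values imitating the mixed formats described for days_before
toy <- tibble(days_before = c("8", "12 days", "3", "14 days"))

# Keep only the rows whose value contains anything other than digits
non_numeric <- toy %>%
  filter(str_detect(days_before, "[^0-9]"))
print(non_numeric)
```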

    Cleaning the categorical variables

    #Cleaning the day_of_week variable, starting by stripping away extra periods
    df$day_of_week <- str_replace_all(df$day_of_week, "\\.", "")
    
    #Converting long day names to their abbreviated version
    df <- df %>% 
    	mutate(day_of_week = case_when(day_of_week == "Monday" ~ "Mon",
    								   day_of_week == "Wednesday" ~ "Wed",
    								   TRUE ~ day_of_week))
    print(count(df, day_of_week))
    
    #Cleaning the category variable by replacing the hyphens ("-") with "Unknown"
    df$category <- str_replace(df$category, "-", "Unknown")
    print(count(df, category))
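
    The case_when() call above handles the two full day names it lists. As a hedged alternative, assuming other full English day names could also appear, truncating every value to its first three letters covers all seven days at once. The sketch below uses made-up vectors, not the dataset, and also shows an anchored pattern for the category replacement so that only values that are exactly "-" are changed.

```r
library(stringr)

# Made-up values covering the patterns described: periods and full day names
days <- c("Mon", "Wednesday", "Fri.", "Monday", "Sun", "Thursday")

# Drop periods, then keep the first three letters of each name
days_clean <- str_sub(str_remove_all(days, fixed(".")), 1, 3)
print(days_clean)

# Made-up category values; "^-$" matches only a value that is exactly "-"
category <- c("HIIT", "-", "Cycling")
category_clean <- str_replace(category, "^-$", "Unknown")
print(category_clean)
```

    Truncation only works because every English day abbreviation is the first three letters of the full name; it would not catch misspellings.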

    Cleaning numeric variables

    #Replacing all NA values in the weight column with the average weight
    df$weight <- replace_na(df$weight, mean(df$weight, na.rm = TRUE))
    df$weight <- round(df$weight, 2)
    
    #Stripping away all non-digit characters in the days_before column, then converting it to integer
    df$days_before <- as.integer(str_replace_all(df$days_before, "[^0-9]", ""))
    summary(df$days_before)
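
    To make the two numeric cleaning steps concrete, here is a small sketch on made-up vectors (not values from the dataset) showing what mean imputation and digit extraction produce.

```r
library(tidyr)
library(stringr)

# Made-up values; not taken from the dataset
weight <- c(80, NA, 70, 90)
days_before <- c("8", "12 days", "3 days")

# Mean of the observed weights is (80 + 70 + 90) / 3 = 80, which fills the NA
weight_filled <- replace_na(weight, mean(weight, na.rm = TRUE))

# Remove every non-digit character, then convert to integer
days_int <- as.integer(str_replace_all(days_before, "[^0-9]", ""))

print(weight_filled)  # values: 80 80 70 90
print(days_int)       # values: 8 12 3
```

    Note that stripping every non-digit character would also remove a minus sign or decimal point, so this approach is only suitable when the values are non-negative whole numbers.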
    

    Task 2

    From Graph 1 we see that the HIIT category has the highest number of observations that attended the class, with Cycling having the second most. The observations are not balanced: the total number attended differs across categories.
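
    The counts behind a comparison like this can be tabulated directly. The sketch below uses a made-up tibble and assumes attended is coded 1 for attended and 0 otherwise; the rows and the `attended_by_category` name are hypothetical and do not reproduce Graph 1.

```r
library(dplyr)

# Made-up rows; assumes attended is coded 1 (attended) and 0 (did not attend)
toy <- tibble(
  category = c("HIIT", "HIIT", "HIIT", "Cycling", "Cycling", "Yoga"),
  attended = c(1, 1, 0, 1, 0, 0)
)

# Count bookings and attendances per category, largest attendance first
attended_by_category <- toy %>%
  group_by(category) %>%
  summarise(n_bookings = n(), n_attended = sum(attended), .groups = "drop") %>%
  arrange(desc(n_attended))
print(attended_by_category)
```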