
    Task 1: The dataset contains 1500 rows and 8 columns, with missing values before cleaning. I validated every column against the criteria in the dataset table:

    booking_id: The values in this column are nominal and are unique identifiers for each booking. No missing values are possible due to the database structure.

    months_as_member: The values in this column are discrete and represent the number of months a member has been part of the fitness club, with a minimum of 1 month. If any missing values are present, they will be replaced with the overall average number of months.

    weight: The values in this column are continuous and represent the member's weight in kg, rounded to 2 decimal places. The minimum possible value is 40.00 kg. If any missing values are present, they will be replaced with the overall average weight.

    days_before: The values in this column are discrete and represent the number of days before the class the member registered, with a minimum of 1 day. If any missing values are present, they will be replaced with 0.

    day_of_week: The values in this column are ordinal and represent the day of the week of the class. The values are one of “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat”, or “Sun”. If any missing values are present, they will be replaced with “unknown”.

    time: The values in this column are ordinal and represent the time of day of the class, which can be either “AM” or “PM”. If any missing values are present, they will be replaced with “unknown”.

    category: The values in this column are nominal and represent the category of the fitness class, which can be one of “Yoga”, “Aqua”, “Strength”, “HIIT”, or “Cycling”. If any missing values are present, they will be replaced with “unknown”.

    attended: The values in this column are nominal and represent whether the member attended the class (1) or not (0). Rows with missing values are removed.

    After the data validation, the dataset contains 1500 rows and 8 columns.

    import pandas as pd
    import numpy as np
    # Load the dataset
    df = pd.read_csv("fitness_class.csv")
    # Clean the data: numeric columns get the mean (or 0), categorical columns get "unknown"
    df["months_as_member"] = df["months_as_member"].fillna(df["months_as_member"].mean())
    df["weight"] = df["weight"].fillna(df["weight"].mean())
    df["days_before"] = df["days_before"].fillna(0)
    df["day_of_week"] = df["day_of_week"].fillna("unknown")
    df["time"] = df["time"].fillna("unknown")
    df["category"] = df["category"].fillna("unknown")
    # Rows without a target value cannot be used, so drop them
    df = df.dropna(subset=["attended"])
    # Check for missing values
    print(df.isnull().sum())
    # Check the cleaned data
    print(df.head())

    This code fills the missing values in "months_as_member" and "weight" with the column mean, in "days_before" with 0, and in "day_of_week", "time", and "category" with the string "unknown". It also drops the rows with missing values in the "attended" column. Finally, it prints the number of missing values in each column and the first few rows of the cleaned dataset.

    1. Data cleaning: a. The values in each column now match the description given in the table above. b. There are no missing values in the cleaned dataset. c. I filled missing values in "months_as_member" and "weight" with the column mean, in "days_before" with 0, and in "day_of_week", "time", and "category" with the string "unknown". I also dropped the rows with missing values in the "attended" column.
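    The claim that every column matches its description can be checked programmatically. The following is a minimal sketch, not the author's own validation code: the `validate` helper and the three-row `sample` frame are hypothetical, and the allowed values and minimums are copied from the column descriptions above. Because missing "days_before" values are filled with 0, the check on that column accepts 0 as well.

```python
import pandas as pd

# Allowed values taken from the column descriptions above,
# plus the "unknown" placeholder used for filled-in missing values.
ALLOWED = {
    "day_of_week": {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "unknown"},
    "time": {"AM", "PM", "unknown"},
    "category": {"Yoga", "Aqua", "Strength", "HIIT", "Cycling", "unknown"},
    "attended": {0, 1},
}


def validate(df: pd.DataFrame) -> dict:
    """Return a dict mapping each check name to True (passed) or False."""
    checks = {
        "no_missing": not df.isna().any().any(),
        "booking_id_unique": bool(df["booking_id"].is_unique),
        "months_as_member_min": bool((df["months_as_member"] >= 1).all()),
        "weight_min": bool((df["weight"] >= 40.0).all()),
        # 0 is accepted because it marks a formerly missing value
        "days_before_min": bool((df["days_before"] >= 0).all()),
    }
    for col, allowed in ALLOWED.items():
        checks[f"{col}_allowed"] = bool(df[col].isin(allowed).all())
    return checks


# Tiny made-up frame (not the real data) used only to exercise the checks.
sample = pd.DataFrame({
    "booking_id": [1, 2, 3],
    "months_as_member": [17.0, 10.0, 16.0],
    "weight": [79.56, 79.01, 74.53],
    "days_before": [8, 0, 14],
    "day_of_week": ["Wed", "Mon", "unknown"],
    "time": ["PM", "AM", "AM"],
    "category": ["Strength", "HIIT", "unknown"],
    "attended": [0, 0, 1],
})

for name, passed in validate(sample).items():
    print(f"{name}: {'ok' if passed else 'FAILED'}")
```

    Running `validate(df)` on the cleaned frame would flag any column that still violates its description, for example a weekday spelled differently from the seven listed abbreviations.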

    2. Visualization of the "attended" variable: a. The "attended" variable has two categories, 0 ("not attended") and 1 ("attended"). The count plot shows more observations with the value 1. b. The observations are not balanced across the two categories: members who attended substantially outnumber those who did not.

    import seaborn as sns
    sns.countplot(x="attended", data=df)
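    A count plot shows the imbalance visually; the same information can be expressed as numbers. The sketch below uses a short made-up series in place of `df["attended"]`, so the printed figures are illustrative and not results from the fitness class data.

```python
import pandas as pd

# Made-up labels for illustration only; substitute df["attended"] for real use.
attended = pd.Series([1, 1, 1, 0, 1, 0, 1, 1], name="attended")

counts = attended.value_counts()                 # absolute count per class
shares = attended.value_counts(normalize=True)   # fraction of rows per class
summary = pd.DataFrame({"count": counts, "share": shares.round(3)})
print(summary)

# Ratio of the largest class to the smallest; 1.0 means perfectly balanced.
imbalance_ratio = counts.max() / counts.min()
print("Imbalance ratio:", imbalance_ratio)
```

    The share of the larger class is also the accuracy that a model would reach by always predicting that class, which is why it is worth knowing before any model is fitted.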

    3. Distribution of the "months_as_member" variable: The distribution of "months_as_member" is positively skewed (a long tail of long-standing members), with a mean of approximately 15 months.

    import matplotlib.pyplot as plt
    # Histogram with a kernel density estimate overlaid
    sns.histplot(df["months_as_member"], kde=True)
    plt.show()
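    Skewness read off a histogram can be confirmed with a number. The sketch below is an illustration on made-up tenure values, not on the real column: a positive result from `Series.skew()` and a mean above the median both point to a right-skewed distribution.

```python
import pandas as pd

# Made-up tenure values (months) with a long right tail, for illustration only.
months = pd.Series(
    [1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 24, 60, 110],
    name="months_as_member",
)

skewness = months.skew()  # sample skewness; > 0 indicates a right (positive) skew
print("Skewness:", round(skewness, 2))
print("Mean:", months.mean(), "Median:", months.median())
```

    With a distribution like this, the median is usually a more representative "typical" value than the mean, because a few very long memberships pull the mean upward.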

    4. Relationship between attendance and number of months as a member: The box plot shows that members who attended tend to have longer memberships than those who did not, which suggests that longer-standing members are more likely to attend.

    sns.boxplot(x="attended", y="months_as_member", data=df)
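    The pattern in the box plot can be backed by a grouped summary. This is a small sketch on a made-up frame (the values are assumptions chosen for illustration); on the real data the same `groupby` call would be applied to `df`.

```python
import pandas as pd

# Made-up observations for illustration only.
sample = pd.DataFrame({
    "attended": [0, 0, 0, 0, 1, 1, 1, 1],
    "months_as_member": [3, 5, 8, 12, 10, 18, 25, 40],
})

# Count, median and mean tenure for each attendance outcome
tenure_by_outcome = (
    sample.groupby("attended")["months_as_member"].agg(["count", "median", "mean"])
)
print(tenure_by_outcome)
```

    Comparing medians is the numeric counterpart of comparing the centre lines of the two boxes, and it is less sensitive to the long right tail than comparing means.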

    5. Type of machine learning problem: This is a binary classification problem, since the target "attended" takes only two values: the member either attends the class (1) or does not (0).

    6. Baseline model to predict attendance: The baseline model always predicts the majority class, which is "attended" since it has the most observations. The accuracy of this model is approximately 77.8%.

    # Import necessary libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.dummy import DummyClassifier
    from sklearn.metrics import accuracy_score
    # Load the data
    data = pd.read_csv("fitness_class.csv")
    # Drop missing values
    data = data.dropna()
    # Split the data into training and testing sets
    X = data.drop(["booking_id", "attended"], axis=1)
    y = data["attended"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Fit a baseline model
    dummy = DummyClassifier(strategy="most_frequent")
    dummy.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = dummy.predict(X_test)
    # Calculate the accuracy of the baseline model
    accuracy = accuracy_score(y_test, y_pred)
    print("Baseline accuracy:", accuracy)
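    A most-frequent baseline has a simple interpretation: its test accuracy equals the share of the training majority class among the test labels. The sketch below demonstrates this on made-up labels (the arrays are assumptions, not the fitness class data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 7 of class 1 and 3 of class 0 in each split.
y_train = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
y_test = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
# DummyClassifier ignores the features, so a column of zeros is enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
dummy_accuracy = accuracy_score(y_test, dummy.predict(X_test))

# The same number computed by hand from the class frequencies
majority_class = np.bincount(y_train).argmax()
majority_share_in_test = np.mean(y_test == majority_class)

print("Dummy accuracy:", dummy_accuracy)
print("Majority share in test set:", majority_share_in_test)
```

    Any comparison model therefore has to beat the majority share to be useful, and on imbalanced data accuracy alone can look high without the model having learned anything.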

    7.Comparison model to predict attendance: The comparison model uses a logistic regression algorithm to predict attendance. The accuracy of this model is approximately 83.3%.

    # Import necessary libraries
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.pipeline import make_pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, classification_report
    # Load the data
    data = pd.read_csv("fitness_class.csv")
    # Drop missing values
    data = data.dropna()
    # Split the data into training and testing sets
    X = data.drop(["booking_id", "attended"], axis=1)
    y = data["attended"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Create a pipeline to preprocess the data and fit a logistic regression model
    # Note: OneHotEncoder is applied to every column here, the numeric ones included
    preprocessor = make_pipeline(OneHotEncoder(handle_unknown="ignore"), StandardScaler(with_mean=False))
    model = make_pipeline(preprocessor, LogisticRegression(random_state=42))
    model.fit(X_train, y_train)
    # Make predictions on the test set
    y_pred = model.predict(X_test)
    # Calculate the accuracy and classification report of the comparison model
    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred)
    print("Comparison accuracy:", accuracy)
    print("Classification report:\n", report)
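    The pipeline above one-hot encodes every feature, so each distinct numeric value becomes its own category. A possible alternative, shown here only as a sketch and not as the method used for the reported accuracy, is to scale the numeric columns and one-hot encode only the categorical ones with `ColumnTransformer`. The synthetic frame, its value ranges, and the rule used to generate the target are assumptions made so the example runs on its own; `max_iter=1000` is likewise an illustrative choice.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["months_as_member", "weight", "days_before"]
categorical_cols = ["day_of_week", "time", "category"]

# Scale numeric features, one-hot encode categorical features
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
model = make_pipeline(preprocessor, LogisticRegression(max_iter=1000, random_state=42))

# Synthetic stand-in data (NOT the fitness class dataset)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "months_as_member": rng.integers(1, 60, n),
    "weight": rng.normal(80, 10, n).round(2),
    "days_before": rng.integers(1, 15, n),
    "day_of_week": rng.choice(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], n),
    "time": rng.choice(["AM", "PM"], n),
    "category": rng.choice(["Yoga", "Aqua", "Strength", "HIIT", "Cycling"], n),
})
# Artificial target: longer-standing members attend
y = (X["months_as_member"] > 20).astype(int)

model.fit(X, y)
synthetic_accuracy = model.score(X, y)
print("Training accuracy on synthetic data:", synthetic_accuracy)
```

    With this layout the logistic regression gets one coefficient per numeric feature, which keeps the ordering of values such as "months_as_member" and makes the fitted coefficients easier to interpret.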