Sleep Health and Lifestyle

    This synthetic dataset contains sleep and cardiovascular metrics, as well as lifestyle factors, for close to 400 fictitious people.

    The workspace is set up with one CSV file, data.csv, with the following columns:

    • Person ID
    • Gender
    • Age
    • Occupation
    • Sleep Duration: Average number of hours of sleep per day
    • Quality of Sleep: A subjective rating on a 1-10 scale
    • Physical Activity Level: Average number of minutes the person engages in physical activity daily
    • Stress Level: A subjective rating on a 1-10 scale
    • BMI Category
    • Blood Pressure: Indicated as systolic pressure over diastolic pressure
    • Heart Rate: In beats per minute
    • Daily Steps
    • Sleep Disorder: One of None, Insomnia or Sleep Apnea
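
Blood Pressure is stored as text (systolic over diastolic), so it cannot be used as a number directly. The following is a minimal sketch of splitting it into two numeric columns. It assumes the two readings are separated by a slash; the sample values and the column names `Systolic` and `Diastolic` are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Made-up sample values, assuming a "systolic/diastolic" text format.
sample = pd.DataFrame({"Blood Pressure": ["126/83", "140/90", "120/80"]})

# Split on "/" into two columns and convert both to integers.
# "Systolic" and "Diastolic" are hypothetical column names.
pressure = sample["Blood Pressure"].str.split("/", expand=True).astype(int)
pressure.columns = ["Systolic", "Diastolic"]
sample = pd.concat([sample, pressure], axis=1)

print(sample)
```

With the readings split this way, both values can enter a correlation matrix or a classifier as ordinary numeric features.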

    Check out the guiding questions or the scenario described below to get started with this dataset! Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.

    Source: Kaggle

    🌎 Some guiding questions to help you explore this data:

    1. Which factors could contribute to a sleep disorder?
    2. Does an increased physical activity level result in a better quality of sleep?
    3. Does the presence of a sleep disorder affect the subjective sleep quality metric?

    📊 Visualization ideas

    • Boxplot: show the distribution of sleep duration or quality of sleep for each occupation.
    • Show the link between age and sleep duration with a scatterplot. Consider including information on the sleep disorder.
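
The two visualization ideas above could be sketched roughly as follows. The rows in `demo` are made up to keep the example self-contained; only the column names come from the dataset, and in the workspace the frame loaded from data.csv would take its place.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows standing in for data.csv; only the column names come from the dataset.
demo = pd.DataFrame({
    "Occupation": ["Nurse", "Nurse", "Engineer", "Engineer", "Teacher", "Teacher"],
    "Age": [29, 45, 31, 52, 38, 41],
    "Sleep Duration": [6.1, 6.5, 7.8, 8.2, 6.9, 7.1],
    "Sleep Disorder": ["Insomnia", "Sleep Apnea", "None", "None", "Insomnia", "None"],
})

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))

# Distribution of sleep duration for each occupation.
sns.boxplot(data=demo, x="Occupation", y="Sleep Duration", ax=ax_box)
ax_box.set_title("Sleep duration by occupation")

# Age against sleep duration, colored by sleep disorder.
sns.scatterplot(data=demo, x="Age", y="Sleep Duration", hue="Sleep Disorder", ax=ax_scatter)
ax_scatter.set_title("Age vs. sleep duration")

fig.tight_layout()
```

Outside a notebook, a call to `plt.show()` is needed to display the figure.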

    🔍 Scenario: Automatically identify potential sleep disorders

    This scenario helps you develop an end-to-end project for your portfolio.

    Background: You work for a health insurance company and are tasked with identifying whether a potential client is likely to have a sleep disorder. The company wants to use this information to determine the premium the client will pay.

    Objective: Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.

    Check out our Linear Classifiers course (Python) or Supervised Learning course (R) for a quick introduction to building classifiers.

    You can query the pre-loaded CSV files using SQL directly. Here’s a sample query:

    SELECT *
    FROM 'data.csv'
    LIMIT 10
    import pandas as pd
    
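    sleep_data = pd.read_csv('data.csv')
    # Depending on the pandas version, the label "None" may be parsed as a missing value; restore it
    sleep_data["Sleep Disorder"] = sleep_data["Sleep Disorder"].fillna("None")
    sleep_data.head()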
    import matplotlib.pyplot as plt
    import seaborn as sns
    sleep_data.info()
    sleep_data.describe()
    # Which factors could contribute to a sleep disorder?
    sns.pairplot(
        data=sleep_data,
        hue="Sleep Disorder",
        vars=["Age", "Sleep Duration", "Quality of Sleep", "Physical Activity Level",
              "Stress Level", "Heart Rate", "Daily Steps"],
        kind="reg",
        corner=True,
        diag_kind="hist",
    )
    plt.show()
    
    # Focus on the features whose regression lines differ from those of people without a sleep disorder: Daily Steps, Physical Activity Level, Age, Stress Level and Heart Rate.
    # Does an increased physical activity level result in a better quality of sleep?
    # No. Increased physical activity has no significant impact on quality of sleep. The Pearson correlation coefficient is weak, so the two features are almost linearly independent. Even a second-order (quadratic) regression fit is nearly flat and fits the observed data poorly.
    
    # Quality of Sleep is strongly correlated with Stress Level, Sleep Duration and Heart Rate.
    sleep_data_drop = sleep_data.drop(columns="Person ID")
    # Correlations are only defined for numeric columns, so skip the text columns
    sns.heatmap(data = sleep_data_drop.corr(numeric_only=True), annot=True, cbar=False)
    plt.title("Heatmap with correlations among features affecting sleep")
    plt.show()
    
    sns.regplot(data = sleep_data, x="Physical Activity Level", y="Quality of Sleep", order=2)
    plt.show()
    print("Correlation coefficient for Quality of Sleep and Physical Activity Level is: {}".format(sleep_data["Quality of Sleep"].corr(sleep_data["Physical Activity Level"])))
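
Quality of Sleep is a 1-10 rating, so a rank-based coefficient is a reasonable cross-check on the Pearson value printed above. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks. The values in `demo` are made up for illustration and stand in for the two columns of data.csv.

```python
import pandas as pd

# Made-up values standing in for the two columns of data.csv.
demo = pd.DataFrame({
    "Physical Activity Level": [30, 45, 60, 75, 90],
    "Quality of Sleep": [6, 7, 6, 8, 8],
})

# Pearson measures linear association between the raw values.
pearson = demo["Quality of Sleep"].corr(demo["Physical Activity Level"])

# Spearman is the Pearson correlation of the ranks (ties get average ranks),
# so it only assumes a monotonic relationship.
ranks = demo.rank()
spearman = ranks["Quality of Sleep"].corr(ranks["Physical Activity Level"])

print("Pearson: {:.2f}, Spearman: {:.2f}".format(pearson, spearman))
```

If both coefficients are weak on the real data, the conclusion does not depend on the linearity assumption.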
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix
    import sklearn.tree as tree
    from sklearn.model_selection import GridSearchCV
    
    # Data preprocessing
    sleep_data_drop = sleep_data.drop(labels=("Person ID"), axis=1)
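    # Copy the filtered rows so the later column assignment does not write to a slice of the original frame
    sleep_data_insomnia = sleep_data_drop[sleep_data_drop["Sleep Disorder"] != "Sleep Apnea"].copy()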
    X = sleep_data_insomnia.drop("Sleep Disorder", axis=1)
    X_preprocessed = pd.get_dummies(X, drop_first=False)
    sleep_data_insomnia["Sleep Disorder"] = sleep_data_insomnia["Sleep Disorder"].replace("None", "No disorder")
    y = sleep_data_insomnia["Sleep Disorder"].values
    y_preprocessed = pd.DataFrame(y, columns=["Sleep Disorder"])
    
    # Data split into training and testing groups
    X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y_preprocessed, test_size=0.25, random_state=123, shuffle=True, stratify=y_preprocessed)
    
    # Instantiate Decision Tree Classifier for predicting Insomnia
    dt = DecisionTreeClassifier(max_depth=1, criterion="entropy")
    model = dt.fit(X_train, y_train)
    print(model.score(X_train, y_train))
    # Checking accuracy metrics 
    y_pred = dt.predict(X_test)
    print(accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred, labels=["No disorder", "Insomnia"]))
    fig, ax = plt.subplots()
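    # Pass plain lists; some scikit-learn versions reject a pandas Index or NumPy array here
    tree.plot_tree(model, feature_names=list(X_train.columns), class_names=list(dt.classes_), filled=True)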
    plt.title("Decision tree map for insomnia classification")
    plt.show()
    
    # Fine tuning model
    params = {"max_depth" : [2, 3, 4],
             "criterion" : ["gini", "entropy"],
             "splitter" : ["best", "random"]}
    # Fix random_state so that the "random" splitter gives reproducible results
    grid = GridSearchCV(estimator = DecisionTreeClassifier(random_state=123), param_grid = params)
    model_tuned = grid.fit(X_train, y_train)
    print(model_tuned.best_score_)
    print(model_tuned.best_params_)
    best_model = model_tuned.best_estimator_
    y_pred_tuned = best_model.predict(X_test)
    print(accuracy_score(y_test, y_pred_tuned))
    print(confusion_matrix(y_test, y_pred_tuned, labels=["No disorder", "Insomnia"]))
    fig, ax = plt.subplots(figsize=(10,10))
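    # Pass plain lists; some scikit-learn versions reject a pandas Index or NumPy array here
    tree.plot_tree(best_model, feature_names=list(X_train.columns), class_names=list(best_model.classes_), filled=True)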
    plt.title("Decision tree map for insomnia classification")
    plt.show()
    
    # Create plot with top important features linked with insomnia
    fig, ax = plt.subplots()
    importances = pd.Series(best_model.feature_importances_, index = X_train.columns)
    importances_sorted = importances.sort_values(ascending=False)[:6]
    ax = importances_sorted.plot(kind="barh", color="lightgreen")
    ax.invert_yaxis()
    ax.set(xlabel="Importance", ylabel="Feature_value", title="Top features influencing insomnia")
    plt.show()
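
Accuracy alone can hide poor recall when one class is much smaller than the other, so per-class metrics are a useful complement to the scores printed above. The sketch below uses made-up labels in place of `y_test` and `y_pred_tuned`; `classification_report` is part of scikit-learn.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true and predicted labels; in the notebook these would be y_test and y_pred_tuned.
y_true = ["No disorder"] * 6 + ["Insomnia"] * 2
y_hat = ["No disorder"] * 6 + ["No disorder", "Insomnia"]

labels = ["No disorder", "Insomnia"]

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)

# Precision, recall and F1 score for each class.
print(classification_report(y_true, y_hat, labels=labels, zero_division=0))
```

In this toy case 7 of 8 predictions are correct, yet only one of the two insomnia cases is found, which the recall column makes visible.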
    
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, confusion_matrix
    import sklearn.tree as tree
    from sklearn.model_selection import GridSearchCV
    
    # Data preprocessing
    sleep_data_drop = sleep_data.drop(labels=("Person ID"), axis=1)
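    # Copy the filtered rows so the later column assignment does not write to a slice of the original frame
    sleep_data_apnea = sleep_data_drop[sleep_data_drop["Sleep Disorder"] != "Insomnia"].copy()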
    X = sleep_data_apnea.drop("Sleep Disorder", axis=1)
    X_preprocessed = pd.get_dummies(X, drop_first=False)
    sleep_data_apnea["Sleep Disorder"] = sleep_data_apnea["Sleep Disorder"].replace("None", "No disorder")
    y = sleep_data_apnea["Sleep Disorder"].values
    y_preprocessed = pd.DataFrame(y, columns=["Sleep Disorder"])
    
    # Data split into training and testing groups
    X_train, X_test, y_train, y_test = train_test_split(X_preprocessed, y_preprocessed, test_size=0.25, random_state=123, shuffle=True, stratify=y_preprocessed)
    
    # Instantiate Decision Tree Classifier for predicting Sleep apnea
    dt = DecisionTreeClassifier(max_depth=1, criterion="entropy")
    model = dt.fit(X_train, y_train)
    print(model.score(X_train, y_train))
    # Checking accuracy metrics 
    y_pred = dt.predict(X_test)
    
    print(accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred, labels=["No disorder", "Sleep Apnea"]))
    fig, ax = plt.subplots()
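    # Pass plain lists; some scikit-learn versions reject a pandas Index or NumPy array here
    tree.plot_tree(model, feature_names=list(X_train.columns), class_names=list(dt.classes_), filled=True)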
    plt.title("Decision tree map for sleep apnea classification")
    plt.show()
    
    # Fine tuning model
    params = {"max_depth" : [2, 3, 4],
             "criterion" : ["gini", "entropy"],
             "splitter" : ["best", "random"]}
    # Fix random_state so that the "random" splitter gives reproducible results
    grid = GridSearchCV(estimator = DecisionTreeClassifier(random_state=123), param_grid = params)
    model_tuned = grid.fit(X_train, y_train)
    print(model_tuned.best_score_)
    print(model_tuned.best_params_)
    best_model = model_tuned.best_estimator_
    y_pred_tuned = best_model.predict(X_test)
    print(accuracy_score(y_test, y_pred_tuned))
    print(confusion_matrix(y_test, y_pred_tuned, labels=["No disorder", "Sleep Apnea"]))
    fig, ax = plt.subplots(figsize=(10,10))
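    # Pass plain lists; some scikit-learn versions reject a pandas Index or NumPy array here
    tree.plot_tree(best_model, feature_names=list(X_train.columns), class_names=list(best_model.classes_), filled=True)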
    plt.title("Decision tree map for sleep apnea classification")
    plt.show()
    
    # Create plot with top important features linked with sleep apnea
    fig, ax = plt.subplots()
    importances = pd.Series(best_model.feature_importances_, index = X_train.columns)
    importances_sorted = importances.sort_values(ascending=False)[:3]
    ax = importances_sorted.plot(kind="barh", color="lightblue")
    ax.invert_yaxis()
    ax.set(xlabel="Importance", ylabel="Feature_value", title="Top features influencing sleep apnea")
    plt.show()
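
The insomnia and sleep apnea cells above differ only in the disorder they keep, so the shared steps could be collected in one function. The sketch below is one possible way to do that, not the notebook's own code: `fit_binary_tree` is a hypothetical helper, the rows in `demo` are made up, and it assumes that Person ID has already been dropped and that "None" has been relabeled as "No disorder".

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def fit_binary_tree(data, disorder, max_depth=1, random_state=123):
    """Fit a tree separating `disorder` from "No disorder"; return (model, test accuracy)."""
    # Keep only rows with the chosen disorder or with no disorder.
    subset = data[data["Sleep Disorder"].isin([disorder, "No disorder"])]
    # One-hot encode the text columns, as the cells above do.
    X = pd.get_dummies(subset.drop(columns="Sleep Disorder"))
    y = subset["Sleep Disorder"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state, stratify=y
    )
    model = DecisionTreeClassifier(
        max_depth=max_depth, criterion="entropy", random_state=random_state
    )
    model.fit(X_train, y_train)
    return model, model.score(X_test, y_test)


# Made-up rows standing in for data.csv (hypothetical values).
demo = pd.DataFrame({
    "Sleep Duration": [6.0, 6.1, 5.9, 6.2, 7.8, 7.9, 8.0, 8.1, 7.0, 7.1, 7.2, 6.9],
    "Occupation": ["Nurse", "Teacher"] * 6,
    "Sleep Disorder": ["Insomnia"] * 4 + ["No disorder"] * 4 + ["Sleep Apnea"] * 4,
})

for disorder in ["Insomnia", "Sleep Apnea"]:
    model, test_accuracy = fit_binary_tree(demo, disorder)
    print(disorder, list(model.classes_), test_accuracy)
```

Keeping the preprocessing in one place means a change such as a different test size only has to be made once for both classifiers.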