Sleep Health and Lifestyle
This synthetic dataset contains sleep and cardiovascular metrics, as well as lifestyle factors, for close to 400 fictitious people.
The workspace is set up with one CSV file, data.csv, with the following columns:
- Person ID
- Gender
- Age
- Occupation
- Sleep Duration: Average number of hours of sleep per day
- Quality of Sleep: A subjective rating on a 1-10 scale
- Physical Activity Level: Average number of minutes the person engages in physical activity daily
- Stress Level: A subjective rating on a 1-10 scale
- BMI Category
- Blood Pressure: Indicated as systolic pressure over diastolic pressure
- Heart Rate: In beats per minute
- Daily Steps
- Sleep Disorder: One of None, Insomnia or Sleep Apnea
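Because Blood Pressure combines two readings in one field, it can help to split it into two numeric columns before modelling. The sketch below assumes the values are stored as strings such as "126/83"; the rows are made up for illustration, and the column names `Systolic` and `Diastolic` are hypothetical.

```python
import pandas as pd

# Toy rows in an assumed "systolic/diastolic" string format (values are made up).
toy = pd.DataFrame({"Blood Pressure": ["126/83", "125/80", "140/90"]})

# Split on "/" into two columns and convert both to integers.
bp = toy["Blood Pressure"].str.split("/", expand=True).astype(int)
bp.columns = ["Systolic", "Diastolic"]  # hypothetical column names

# Replace the combined column with the two numeric ones.
toy = pd.concat([toy.drop(columns="Blood Pressure"), bp], axis=1)
print(toy)
```

The same split could be applied to the full DataFrame once data.csv is loaded, provided the column really uses this format.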
Check out the guiding questions or the scenario described below to get started with this dataset! Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.
Source: Kaggle
Some guiding questions to help you explore this data:
- Which factors could contribute to a sleep disorder?
- Does an increased physical activity level result in a better quality of sleep?
- Does the presence of a sleep disorder affect the subjective sleep quality metric?
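One possible starting point for the last two questions is a grouped summary and a simple correlation. The sketch below runs on a handful of made-up rows, not on data.csv, and the variable names are hypothetical. Note that a correlation only indicates association; it cannot show that more activity results in better sleep.

```python
import pandas as pd

# Made-up rows that reuse the dataset's column names (illustrative only).
toy = pd.DataFrame({
    "Sleep Disorder": ["None", "None", "Insomnia", "Insomnia", "Sleep Apnea", "Sleep Apnea"],
    "Quality of Sleep": [8, 7, 6, 5, 6, 7],
    "Physical Activity Level": [60, 45, 30, 35, 75, 40],
})

# Sleep quality summarised per sleep disorder group.
summary = toy.groupby("Sleep Disorder")["Quality of Sleep"].agg(["mean", "median", "count"])
print(summary)

# Pearson correlation between activity level and sleep quality.
activity_corr = toy["Physical Activity Level"].corr(toy["Quality of Sleep"])
print(activity_corr)
```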
Visualization ideas
- Boxplot: show the distribution of sleep duration or quality of sleep for each occupation.
- Show the link between age and sleep duration with a scatterplot. Consider including information on the sleep disorder.
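Both ideas can be sketched with seaborn, which the notebook already uses. The rows below are made up so that the example runs on its own; with the real data you would pass the loaded DataFrame instead of `toy` (a hypothetical name).

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up rows that reuse the dataset's column names (illustrative only).
toy = pd.DataFrame({
    "Occupation": ["Nurse", "Nurse", "Nurse", "Engineer", "Engineer", "Engineer", "Teacher", "Teacher"],
    "Age": [29, 45, 52, 31, 38, 44, 35, 41],
    "Sleep Duration": [6.1, 6.5, 8.0, 7.2, 7.8, 8.1, 6.6, 6.8],
    "Sleep Disorder": ["Sleep Apnea", "Insomnia", "None", "None", "None", "None", "Insomnia", "None"],
})

fig, axs = plt.subplots(1, 2, figsize=(12, 4))

# Distribution of sleep duration for each occupation.
sns.boxplot(data=toy, x="Sleep Duration", y="Occupation", ax=axs[0])
axs[0].set_title("Sleep Duration by Occupation")

# Age against sleep duration, coloured by sleep disorder.
sns.scatterplot(data=toy, x="Age", y="Sleep Duration", hue="Sleep Disorder", ax=axs[1])
axs[1].set_title("Age vs. Sleep Duration")

fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a plain script, a call to plt.show() would be needed.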
Scenario: Automatically identify potential sleep disorders
This scenario helps you develop an end-to-end project for your portfolio.
Background: You work for a health insurance company and are tasked with identifying whether a potential client is likely to have a sleep disorder. The company wants to use this information to determine the premium the client will pay.
Objective: Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.
Check out our Linear Classifiers course (Python) or Supervised Learning course (R) for a quick introduction to building classifiers.
Project Scope:
- Explore the data, looking for features that affect the likelihood of having a sleep disorder
- Per the scenario above, create a classification model to predict if a client might have a sleep disorder.
- Use pipelines and the TPOT autoML library
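As a rough illustration of the pipeline step, here is a minimal baseline sketch. It swaps in a plain scikit-learn LogisticRegression rather than a TPOT search, and it runs on randomly generated toy rows, so the scores it prints carry no meaning. The feature subset and all variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Randomly generated toy rows (NOT the real data; labels are random too).
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n),
    "Age": rng.integers(27, 60, size=n),
    "Sleep Duration": rng.uniform(5.5, 8.5, size=n).round(1),
    "Stress Level": rng.integers(3, 9, size=n),
    "BMI Category": rng.choice(["Normal", "Overweight", "Obese"], size=n),
    "Sleep Disorder": rng.choice(["None", "Insomnia", "Sleep Apnea"], size=n),
})

X = toy.drop(columns="Sleep Disorder")
y = toy["Sleep Disorder"]
numeric = ["Age", "Sleep Duration", "Stress Level"]
categorical = ["Gender", "BMI Category"]

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Preprocessing and classifier chained so both are fit inside each CV fold.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(model, X, y, cv=3)  # accuracy per fold
print(scores.mean())
```

Keeping the preprocessing inside the pipeline avoids leaking information from the validation folds into the scaler and encoder. One caveat to check on the real data: some pandas versions parse the literal string None in a CSV as a missing value, in which case the Sleep Disorder column may need its missing entries filled back in before training.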
SELECT *
FROM 'data.csv'
LIMIT 10

import pandas as pd
import numpy as np
sleep_data = pd.read_csv('data.csv')
sleep_data.head()

sleep_data.info()

Notes: No nulls! Dtypes make sense too. Next, some descriptive statistics.
sleep_data.describe(include='all')

Nothing jumps out here. The data looks pretty well behaved. Now I will split my analysis into numeric and categorical features.
person_id = sleep_data.pop('Person ID')
cat_cols = sleep_data.drop('Sleep Disorder', axis=1).select_dtypes(include='object').columns
num_cols = sleep_data.select_dtypes(include=np.number).columns

Numerical Features
import matplotlib.pyplot as plt
import seaborn as sns
for col in num_cols:
    # One figure per numeric feature: histogram on top, boxplot below
    fig, axs = plt.subplots(2, 1, figsize=(8, 4))
    sns.boxplot(data=sleep_data[[col, 'Sleep Disorder']], x=col, ax=axs[1])
    sns.histplot(data=sleep_data[[col, 'Sleep Disorder']], x=col, hue='Sleep Disorder', kde=True, alpha=0.5, ax=axs[0])
    axs[0].set_title(col, loc='center')
    plt.show()

Notes: Several of the features show visibly different distributions across the Sleep Disorder groups. This suggests these features will be useful in predicting the sleep disorder outcome. Heart Rate appears to have some outliers.
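To put a number on the outlier observation, one option is the 1.5 × IQR rule, which is also what seaborn's boxplot whiskers use by default. The sketch below applies it to made-up heart rate values; the variable names are hypothetical.

```python
import pandas as pd

# Made-up heart rate values in beats per minute (illustrative only).
heart_rate = pd.Series([65, 68, 70, 70, 72, 72, 75, 77, 86], name="Heart Rate")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = heart_rate.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are flagged as potential outliers.
outliers = heart_rate[(heart_rate < lower) | (heart_rate > upper)]
print(outliers.tolist())
```

Whether flagged values should be dropped, capped or kept depends on the model; tree-based classifiers, for example, are usually less sensitive to them than linear ones.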
Next, let's look at the correlation of the numeric features with each other using a heatmap.
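A heatmap of this kind might look like the following sketch. It uses randomly generated toy columns so that it runs on its own; with the notebook's variables, the correlation matrix would presumably come from sleep_data[num_cols].corr() instead.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated toy columns (NOT the real data).
rng = np.random.default_rng(0)
sleep = rng.uniform(5.5, 8.5, size=50)
toy = pd.DataFrame({
    "Sleep Duration": sleep,
    "Stress Level": 10 - sleep + rng.normal(0, 0.5, size=50),  # built to correlate negatively
    "Heart Rate": rng.normal(70, 4, size=50),
})

# Pairwise Pearson correlations between the numeric columns.
corr = toy.corr()

fig, ax = plt.subplots(figsize=(6, 5))
# Fix the colour scale to [-1, 1] so that zero correlation sits at the midpoint.
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation of numeric features")
fig.tight_layout()
```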