Sleep Health and Lifestyle

This synthetic dataset contains sleep and cardiovascular metrics, as well as lifestyle factors, for close to 400 fictional people.

The workspace is set up with one CSV file, data.csv, with the following columns:

  • Person ID
  • Gender
  • Age
  • Occupation
  • Sleep Duration: Average number of hours of sleep per day
  • Quality of Sleep: A subjective rating on a 1-10 scale
  • Physical Activity Level: Average number of minutes the person engages in physical activity daily
  • Stress Level: A subjective rating on a 1-10 scale
  • BMI Category
  • Blood Pressure: Indicated as systolic pressure over diastolic pressure
  • Heart Rate: In beats per minute
  • Daily Steps
  • Sleep Disorder: One of None, Insomnia or Sleep Apnea
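
The Blood Pressure column combines two readings in one string, so it usually has to be split before it can be used numerically. Here is a minimal sketch, assuming the values look like "126/83" (the "/" separator is an assumption based on the "systolic over diastolic" description) and using invented toy rows rather than data.csv:

```python
import pandas as pd

# Toy rows invented for illustration; not taken from data.csv
toy = pd.DataFrame({"Blood Pressure": ["126/83", "140/90", "120/80"]})

# Split "systolic/diastolic" strings into two integer columns
bp = toy["Blood Pressure"].str.split("/", expand=True).astype(int)
toy["systolic"] = bp[0]
toy["diastolic"] = bp[1]

print(toy)
```

The two new column names, systolic and diastolic, are hypothetical; any names work as long as later code uses them consistently.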

Check out the guiding questions or the scenario described below to get started with this dataset! Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.

Source: Kaggle

🌎 Some guiding questions to help you explore this data:

  1. Which factors could contribute to a sleep disorder?
  2. Does an increased physical activity level result in a better quality of sleep?
  3. Does the presence of a sleep disorder affect the subjective sleep quality metric?

📊 Visualization ideas

  • Boxplot: show the distribution of sleep duration or quality of sleep for each occupation.
  • Show the link between age and sleep duration with a scatterplot. Consider including information on the sleep disorder.

🔍 Scenario: Automatically identify potential sleep disorders

This scenario helps you develop an end-to-end project for your portfolio.

Background: You work for a health insurance company and are tasked with identifying whether a potential client is likely to have a sleep disorder. The company wants to use this information to determine the premium the client will pay.

Objective: Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.
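
One possible starting point for this objective is sketched below, assuming scikit-learn is available. The helper name build_disorder_classifier, the choice of features, and the toy data are all illustrative assumptions; the toy rows are randomly generated and the label follows a made-up rule, so the score says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Feature subset chosen only for illustration
NUMERIC = ["Age", "Sleep Duration", "Stress Level"]
CATEGORICAL = ["Gender", "BMI Category"]


def build_disorder_classifier():
    """Hypothetical helper: scaling/encoding plus logistic regression in one pipeline."""
    preprocess = ColumnTransformer([
        ("num", StandardScaler(), NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([
        ("prep", preprocess),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


# Toy data invented for illustration; replace with the real data.csv columns
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.integers(25, 60, n),
    "Sleep Duration": rng.uniform(5.5, 8.5, n).round(1),
    "Stress Level": rng.integers(3, 9, n),
    "Gender": rng.choice(["Male", "Female"], n),
    "BMI Category": rng.choice(["Normal", "Overweight", "Obese"], n),
})
# Made-up labelling rule so the toy problem is learnable
toy["Sleep Disorder"] = np.where(toy["Sleep Duration"] < 6.5, "Insomnia", "None")

# Binary target: 1 = any sleep disorder, 0 = none
y = (toy["Sleep Disorder"] != "None").astype(int)
X = toy[NUMERIC + CATEGORICAL]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = build_disorder_classifier()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy on toy data: {accuracy:.2f}")
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fitted on the training split only, which avoids leaking information from the held-out rows.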

Check out our Linear Classifiers course (Python) or Supervised Learning course (R) for a quick introduction to building classifiers.

You can also query the pre-loaded CSV file using SQL directly.

# import libs to use:
import statsmodels as stm
from copy import deepcopy
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pingouin
# standardize column names:

sleep_data = pd.read_csv('data.csv')
sleep_data.columns = ['person_id','gender','age','occupation','sleep_duration','quality_of_sleep','physical_activity_level','stress_level','bmi_category','blood_pressure','heart_rate','daily_steps','sleep_disorder']
sleep_data2 = deepcopy(sleep_data)
# label missing sleep-disorder entries as 'none' (an elementwise `== None` comparison never matches)
sleep_data['sleep_disorder'] = sleep_data['sleep_disorder'].fillna('none')
## Does short sleep duration bring a sleeping disorder?
print("Now let's check the specific relations we were interested in from the provided questions: \n How does sleep duration affect having a sleeping disorder?")

## one-way ANOVA of quality_of_sleep across sleep_disorder groups, then pairwise tests
subset_data = sleep_data[['quality_of_sleep','sleep_disorder']].copy()  # copy avoids SettingWithCopyWarning
subset_data['sleep_disorder'] = subset_data['sleep_disorder'].astype('category')
anovaFrame = pingouin.anova(data=subset_data, dv='quality_of_sleep', between='sleep_disorder')
display(anovaFrame)

print('quality of sleep by sleep disorder group:')
sns.boxplot(x='sleep_disorder', y='quality_of_sleep', data=subset_data)
plt.show()

pairsFrame = pingouin.pairwise_tests(data=subset_data, dv='quality_of_sleep', between='sleep_disorder', padjust='bonf')
display(pairsFrame)  

sleep_disorder_hyp = sleep_data2[['person_id','gender','quality_of_sleep','stress_level','sleep_duration','sleep_disorder']].copy()
# 1 = any sleep disorder, 0 = none ('None' may be read as NaN depending on the pandas version)
has_disorder = sleep_disorder_hyp['sleep_disorder'].notna() & (sleep_disorder_hyp['sleep_disorder'] != 'None')
sleep_disorder_hyp['sleep_disorder'] = has_disorder.astype(int)
##
yp1 = np.median(sleep_disorder_hyp[sleep_disorder_hyp['sleep_disorder'] == 0]['sleep_duration'])
yp2 = np.median(sleep_disorder_hyp[sleep_disorder_hyp['sleep_disorder'] == 1]['sleep_duration'])
##
plt.figure()
#plt.axline(xy1=(0,yp1), xy2=(1,yp2))
sns.boxplot(x='sleep_disorder', y='sleep_duration', data=sleep_disorder_hyp, hue='gender')
plt.xticks(ticks=[0,1], labels=['No','Yes'])
plt.xlabel('sleep disorder')
plt.ylabel('sleep_duration (h)')
plt.show()
print('Sleep duration seems to trend with whether a person has a sleep disorder (so far). \n Longer sleep durations fall mostly on the "No" side, while the shortest sleep durations concentrate on the "Yes" side.')
print('-------------------------------------------------------------- \n')
## is there a trend of stress by lower sleep duration?
## is there a trend of stress by lower quality of sleep?

print("Now check out the stress level as a function of sleep duration and quality")
sns.regplot(x='sleep_duration', y='stress_level', data=sleep_disorder_hyp)
plt.show()

sns.regplot(x='quality_of_sleep', y='stress_level', data=sleep_disorder_hyp)
plt.show()
print("As expected, stress level seems to drop with more sleeping time and with better quality of that sleep. Now let's see how these two correlate")

sns.regplot(x='sleep_duration', y='quality_of_sleep', data=sleep_disorder_hyp)
plt.show()
print('the correlation coefficient is:')
print(np.corrcoef(sleep_disorder_hyp['sleep_duration'], sleep_disorder_hyp['quality_of_sleep']))
print("So, in this context you could say that more sleep time goes with better quality of sleep, which goes with a lower stress level. Altogether, it could also help you avoid a sleeping disorder")
print("------------------------------------------------------------- \n")

## Check physical activity level for quality of sleep

print("Finally, it is classic advice to exercise for better health and even better rest; let's see if that holds in our comparisons")
sns.boxplot(x='physical_activity_level', y='quality_of_sleep', data=sleep_data)
plt.show()
## hypothesis testing (H0: more physical activity does not improve quality of sleep; HA: more physical activity improves quality of sleep)
obs_mean= sleep_data[['physical_activity_level','quality_of_sleep']].mean()

pa_boot_s=[]
qos_boot_s=[]
for _ in range(5000):
    
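    # resample rows with replacement and record the mean of each column
    boot_ap = sleep_data[['physical_activity_level', 'quality_of_sleep']].sample(frac=1, replace=True).mean()
    pa_boot_s.append(boot_ap['physical_activity_level'])
    qos_boot_s.append(boot_ap['quality_of_sleep'])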

plt.hist(x=pa_boot_s, bins=50)
plt.axvline(x=obs_mean['physical_activity_level'], color='red')  # observed mean
print('distribution of physical activity level bootstrap means vs observed (red)')
plt.show()
print('distribution of quality of sleep bootstrap means vs observed (green)')
plt.hist(x=qos_boot_s, bins=50)
plt.axvline(x=obs_mean['quality_of_sleep'], color='green')  # observed mean
plt.show()

print("Pearson r correlation test: \n Null: a higher physical activity level is not associated with better quality of sleep \n Alternative: a higher physical activity level is associated with better quality of sleep \n Alpha of 0.1")
P_Qos_stats = pingouin.corr(x=sleep_data['physical_activity_level'], y=sleep_data['quality_of_sleep'], alternative='greater')
display(P_Qos_stats)
print("The p-value is lower than alpha, so the null hypothesis is rejected; the correlation indicates that a higher activity level goes with better quality of sleep")

print("Now that's a weird trend: you could do barely anything and sleep very well, but if you start to do some physical activity, it had better be at a fairly high level to get better quality of sleep, yet you have to be careful not to over-exhaust yourself. Very unexpected results. My feeling is that you should aim for the sleep duration that best fits your activity level and your occupation, which most likely influences that activity, and keep in mind that a rather big chunk of observations still falls outside the expected pattern. \nIt's important to note that this dataset is very limited in the number of observations it has, and it is probably not a good fit for a large target population.")
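
The bootstrap above resamples the mean of each column separately, which describes the two averages but not how the columns move together. A possible complement, sketched here under assumptions, is to bootstrap the Pearson correlation itself by resampling whole rows. The helper name bootstrap_corr_ci, the 90% level, and the toy data are all illustrative; the toy values are randomly generated and are not from data.csv.

```python
import numpy as np
import pandas as pd


def bootstrap_corr_ci(df, x, y, n_boot=2000, ci=0.90, seed=0):
    """Hypothetical helper: percentile bootstrap interval for the Pearson r of df[x] and df[y]."""
    rng = np.random.default_rng(seed)
    values = df[[x, y]].to_numpy(dtype=float)
    n = len(values)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample whole rows with replacement
        sample = values[idx]
        stats[i] = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]
    tail = (1 - ci) / 2
    lower, upper = np.quantile(stats, [tail, 1 - tail])
    return lower, upper


# Toy data invented for illustration, with a built-in positive relationship
rng = np.random.default_rng(1)
activity = rng.uniform(30, 90, 300)
quality = 5 + 0.03 * activity + rng.normal(0, 1, 300)
toy = pd.DataFrame({"physical_activity_level": activity, "quality_of_sleep": quality})

low, high = bootstrap_corr_ci(toy, "physical_activity_level", "quality_of_sleep")
print(f"90% bootstrap interval for r: [{low:.2f}, {high:.2f}]")
```

With the real data, the call would take sleep_data in place of toy. An interval that stays above zero would point in the same direction as the one-sided test, although neither shows that activity causes better sleep.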