
Sleep Health and Lifestyle

This synthetic dataset contains sleep and cardiovascular metrics, as well as lifestyle factors, for close to 400 fictional people.

The workspace is set up with one CSV file, data.csv, with the following columns:

  • Person ID
  • Gender
  • Age
  • Occupation
  • Sleep Duration: Average number of hours of sleep per day
  • Quality of Sleep: A subjective rating on a 1-10 scale
  • Physical Activity Level: Average number of minutes the person engages in physical activity daily
  • Stress Level: A subjective rating on a 1-10 scale
  • BMI Category
  • Blood Pressure: Indicated as systolic pressure over diastolic pressure
  • Heart Rate: In beats per minute
  • Daily Steps
  • Sleep Disorder: One of None, Insomnia or Sleep Apnea
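Because Blood Pressure is stored as text, it usually needs to be split before it can be used numerically. The following is a minimal sketch, assuming the two readings are separated by a slash (for example "126/83"); the sample rows and the new column names "Systolic BP" and "Diastolic BP" are illustrative, not part of the dataset.

```python
import pandas as pd

# Illustrative rows only; the values are made up, not taken from data.csv.
sample = pd.DataFrame({
    "Person ID": [1, 2, 3],
    "Blood Pressure": ["126/83", "125/80", "140/90"],
})

# Split "systolic/diastolic" on the slash (assumed separator) and convert to integers.
bp_parts = sample["Blood Pressure"].str.split("/", expand=True).astype(int)
sample["Systolic BP"] = bp_parts[0]   # hypothetical column name
sample["Diastolic BP"] = bp_parts[1]  # hypothetical column name

print(sample)
```

Keeping the original column alongside the two numeric ones makes it easy to check the split against the raw values.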

Check out the guiding questions or the scenario described below to get started with this dataset! Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.

Source: Kaggle

🌎 Some guiding questions to help you explore this data:

  1. Which factors could contribute to a sleep disorder?
  2. Does an increased physical activity level result in a better quality of sleep?
  3. Does the presence of a sleep disorder affect the subjective sleep quality metric?
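A first pass at questions 2 and 3 could look like the sketch below. It uses a handful of made-up rows in place of data.csv, and the variable names are illustrative. Note that a correlation only describes an association; on its own it cannot show that more activity results in better sleep.

```python
import pandas as pd

# Made-up rows for illustration; swap in the DataFrame loaded from data.csv.
sample = pd.DataFrame({
    "Sleep Disorder": ["None", "None", "Insomnia", "Insomnia", "Sleep Apnea", "Sleep Apnea"],
    "Quality of Sleep": [8, 7, 5, 6, 6, 7],
    "Physical Activity Level": [60, 45, 30, 40, 75, 50],
})

# Question 3: compare the subjective sleep-quality rating across disorder groups.
quality_by_disorder = (
    sample.groupby("Sleep Disorder")["Quality of Sleep"]
    .agg(["mean", "median", "count"])
)
print(quality_by_disorder)

# Question 2: a first look at activity versus sleep quality (Pearson correlation).
activity_quality_corr = sample["Physical Activity Level"].corr(sample["Quality of Sleep"])
print(f"Pearson correlation: {activity_quality_corr:.2f}")
```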

📊 Visualization ideas

  • Boxplot: show the distribution of sleep duration or quality of sleep for each occupation.
  • Show the link between age and sleep duration with a scatterplot. Consider including information on the sleep disorder.
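Both ideas can be sketched with plain matplotlib, as below. The rows are made up for illustration, and the figure layout and variable names are arbitrary choices; seaborn would work just as well.

```python
import matplotlib
matplotlib.use("Agg")  # Non-interactive backend so this also runs as a script; omit in a notebook.
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration; swap in the DataFrame loaded from data.csv.
sample = pd.DataFrame({
    "Occupation": ["Nurse", "Nurse", "Doctor", "Doctor", "Engineer", "Engineer"],
    "Age": [45, 50, 30, 33, 38, 41],
    "Sleep Duration": [6.1, 6.5, 7.6, 6.0, 7.2, 7.8],
    "Sleep Disorder": ["Sleep Apnea", "Insomnia", "None", "Insomnia", "None", "None"],
})

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))

# Boxplot: one box of sleep duration per occupation.
groups = {name: grp["Sleep Duration"].to_numpy() for name, grp in sample.groupby("Occupation")}
ax_box.boxplot(list(groups.values()))
ax_box.set_xticks(range(1, len(groups) + 1))
ax_box.set_xticklabels(list(groups.keys()))
ax_box.set_ylabel("Sleep Duration (hours)")

# Scatterplot: age versus sleep duration, one colour per sleep disorder.
for disorder, grp in sample.groupby("Sleep Disorder"):
    ax_scatter.scatter(grp["Age"], grp["Sleep Duration"], label=disorder)
ax_scatter.set_xlabel("Age")
ax_scatter.set_ylabel("Sleep Duration (hours)")
ax_scatter.legend(title="Sleep Disorder")

fig.tight_layout()
```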

🔍 Scenario: Automatically identify potential sleep disorders

This scenario helps you develop an end-to-end project for your portfolio.

Background: You work for a health insurance company and are tasked with identifying whether or not a potential client is likely to have a sleep disorder. The company wants to use this information to set the premium the client pays.

Objective: Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.

Check out our Linear Classifiers course (Python) or Supervised Learning course (R) for a quick introduction to building classifiers.
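One possible shape for such a classifier is sketched below. It is an outline under stated assumptions, not a tuned solution: the data is randomly generated to stand in for data.csv, only a few of the columns are included, and the choice of a random forest with one-hot encoded categoricals is just one reasonable starting point. Because the toy labels are random, the printed accuracy carries no meaning.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for data.csv: random values, for illustration only.
rng = np.random.default_rng(0)
n_rows = 120
toy = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n_rows),
    "Age": rng.integers(27, 60, size=n_rows),
    "Sleep Duration": rng.uniform(5.5, 8.5, size=n_rows).round(1),
    "Stress Level": rng.integers(3, 9, size=n_rows),
    "Sleep Disorder": rng.choice(["None", "Insomnia", "Sleep Apnea"], size=n_rows),
})

# Separate the target from the features.
X = toy.drop(columns="Sleep Disorder")
y = toy["Sleep Disorder"]

# Hold out a stratified test set so every class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# One-hot encode the categorical column; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[("categorical", OneHotEncoder(handle_unknown="ignore"), ["Gender"])],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=42)),
])

model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy on toy data: {accuracy:.2f}")
```

Wrapping the preprocessing and the model in a single pipeline keeps the encoding fitted on the training split only, which avoids leaking information from the test set.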

You can query the pre-loaded CSV files using SQL directly. Here's a sample query:

SELECT *
FROM 'data.csv'
LIMIT 10

# Import necessary libraries
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical operations
import matplotlib.pyplot as plt  # For data visualization
import seaborn as sns  # For enhanced data visualization
from sklearn.model_selection import train_test_split  # For splitting data into training and testing sets
from sklearn.preprocessing import StandardScaler  # For feature scaling
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # For model evaluation
from sklearn.tree import DecisionTreeClassifier  # For building decision tree models
from sklearn.ensemble import RandomForestClassifier  # For building random forest models


sleep_data = pd.read_csv('data.csv')
sleep_data.head()
sleep_data.info()
sleep_data.describe()

Check for missing values and handle them appropriately (e.g., impute or remove rows with missing data).

# Count the missing values in each column of the DataFrame.
missing_values = sleep_data.isnull().sum()
print("Missing values before handling:")
print(missing_values)

# To handle missing values:
# 1. Impute missing values in specific columns.
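# Example: Impute missing values in 'Sleep Duration' with the mean.
# Assign the result back: fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions.
sleep_data['Sleep Duration'] = sleep_data['Sleep Duration'].fillna(sleep_data['Sleep Duration'].mean())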

# 2. Remove rows with missing values.
# Example: Remove rows with missing values in 'Quality of Sleep'.
sleep_data.dropna(subset=['Quality of Sleep'], inplace=True)

# Recount the missing values in each column after handling them.
missing_values_after_handling = sleep_data.isnull().sum()

# Print the number of missing values in each column after handling.
print("Missing values after handling:")
print(missing_values_after_handling)

# Display the first rows of the updated DataFrame.
sleep_data.head()
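One thing worth checking at this step: the Sleep Disorder column uses the literal label None, and depending on the pandas version, read_csv may parse that string as a missing value. The sketch below uses a tiny made-up CSV in place of data.csv and shows one way to restore the label; whether this applies depends on your environment, so inspect the missing-value counts first.

```python
import io
import pandas as pd

# Tiny inline CSV standing in for data.csv (made-up rows).
csv_text = """Person ID,Quality of Sleep,Sleep Disorder
1,6,None
2,4,Sleep Apnea
3,7,Insomnia
4,8,None
"""
df = pd.read_csv(io.StringIO(csv_text))

# Depending on the pandas version, "None" may have been read as missing (NaN).
print("Missing in 'Sleep Disorder':", df["Sleep Disorder"].isnull().sum())

# Restore the label so "no disorder" stays a regular category instead of a gap.
df["Sleep Disorder"] = df["Sleep Disorder"].fillna("None")
disorder_counts = df["Sleep Disorder"].value_counts()
print(disorder_counts)
```

Doing this before any blanket dropna call avoids discarding every person without a sleep disorder.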

Encode categorical variables like "Gender" and "Occupation" into numerical format (e.g., one-hot encoding or label encoding).

# Method 1: One-Hot Encoding

# Use Pandas' get_dummies to perform one-hot encoding on 'Gender' and 'Occupation'.
df_encoded = pd.get_dummies(sleep_data, columns=['Gender', 'Occupation'], drop_first=True)

# Print the DataFrame with one-hot encoded variables.
print("DataFrame with One-Hot Encoding:")
print(df_encoded.head())
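The label-encoding alternative mentioned above could be sketched as follows, using pandas category codes. The sample rows and the "Occupation Code" column name are illustrative. Integer codes imply an ordering that the categories do not really have, which tree-based models generally tolerate but linear models may misread, so one-hot encoding is often the safer default.

```python
import pandas as pd

# Made-up rows for illustration; swap in the DataFrame loaded from data.csv.
sample = pd.DataFrame({"Occupation": ["Nurse", "Doctor", "Nurse", "Engineer"]})

# Label encoding: map each distinct category to an integer code.
occupation = sample["Occupation"].astype("category")
sample["Occupation Code"] = occupation.cat.codes  # hypothetical column name

# Keep the mapping so codes can be translated back to labels later.
code_to_label = dict(enumerate(occupation.cat.categories))

print(sample)
print(code_to_label)
```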