Sleep Health and Lifestyle

This synthetic dataset contains sleep and cardiovascular metrics, as well as lifestyle factors, for close to 400 fictional people.

The workspace is set up with one CSV file, data.csv, with the following columns:

  • Person ID
  • Gender
  • Age
  • Occupation
  • Sleep Duration: Average number of hours of sleep per day
  • Quality of Sleep: A subjective rating on a 1-10 scale
  • Physical Activity Level: Average number of minutes the person engages in physical activity daily
  • Stress Level: A subjective rating on a 1-10 scale
  • BMI Category
  • Blood Pressure: Indicated as systolic pressure over diastolic pressure
  • Heart Rate: In beats per minute
  • Daily Steps
  • Sleep Disorder: One of None, Insomnia or Sleep Apnea
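One loading caveat for the Sleep Disorder column: depending on the pandas version, `read_csv` may interpret the literal text "None" as a missing value rather than as a category. The sketch below uses three made-up rows (not taken from data.csv) to show one way to restore the label after loading; it assumes that every missing entry in this column really means "no disorder".

```python
import io

import pandas as pd

# Three made-up rows in the same layout as the 'Sleep Disorder' column.
csv_text = "Person ID,Sleep Disorder\n1,None\n2,Insomnia\n3,Sleep Apnea\n"

toy = pd.read_csv(io.StringIO(csv_text))

# Recent pandas versions treat the literal "None" as missing. Filling it back
# is harmless on versions that already keep it as a string.
toy["Sleep Disorder"] = toy["Sleep Disorder"].fillna("None")

print(toy["Sleep Disorder"].value_counts())
```

The same `fillna("None")` call can be applied to the real DataFrame right after `pd.read_csv('data.csv')`.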

Check out the guiding questions or the scenario described below to get started with this dataset! Feel free to make this workspace yours by adding and removing cells, or editing any of the existing cells.

Source: Kaggle

🌎 Some guiding questions to help you explore this data:

  1. Which factors could contribute to a sleep disorder?
  2. Does an increased physical activity level result in a better quality of sleep?
  3. Does the presence of a sleep disorder affect the subjective sleep quality metric?
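As a starting point for question 3, the sketch below compares the Quality of Sleep rating across the Sleep Disorder groups. The helper name `compare_sleep_quality` and the tiny frame are made up for illustration. The Kruskal-Wallis test is one possible choice (picked here because the rating is an ordinal 1-10 scale), not the only valid one.

```python
import pandas as pd
from scipy import stats


def compare_sleep_quality(df: pd.DataFrame):
    """Summarise 'Quality of Sleep' per 'Sleep Disorder' group and run a
    Kruskal-Wallis H-test (a rank-based comparison of several groups)."""
    # pandas may parse the label "None" as missing, so restore it as an
    # explicit category before grouping.
    disorder = df["Sleep Disorder"].fillna("None")
    grouped = df.groupby(disorder)["Quality of Sleep"]
    summary = grouped.agg(["count", "mean", "median"])
    samples = [values.to_numpy() for _, values in grouped]
    _, p_value = stats.kruskal(*samples)
    return summary, p_value


# Made-up rows, only to demonstrate the call; use the loaded data.csv instead.
toy = pd.DataFrame({
    "Sleep Disorder": [None, None, None, "Insomnia", "Insomnia", "Insomnia",
                       "Sleep Apnea", "Sleep Apnea", "Sleep Apnea"],
    "Quality of Sleep": [8, 7, 9, 5, 6, 4, 6, 7, 5],
})
summary, p_value = compare_sleep_quality(toy)
print(summary)
print(f"Kruskal-Wallis p-value: {p_value:.4f}")
```

A small p-value would only indicate that the groups differ in their ratings; it does not show that the disorder causes the difference.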

📊 Visualization ideas

  • Boxplot: show the distribution of sleep duration or quality of sleep for each occupation.
  • Show the link between age and sleep duration with a scatterplot. Consider including information on the sleep disorder.
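The two ideas above could be sketched as follows with seaborn. The function name `plot_sleep_overview` and the demo rows are hypothetical; the column names come from the list at the top of this page.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_sleep_overview(df: pd.DataFrame):
    """Draw the two suggested plots side by side and return the figure."""
    plot_df = df.copy()
    # pandas may read the label "None" as missing; restore it for the legend.
    plot_df["Sleep Disorder"] = plot_df["Sleep Disorder"].fillna("None")

    fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(14, 5))

    # Boxplot: distribution of sleep duration for each occupation
    sns.boxplot(data=plot_df, x="Occupation", y="Sleep Duration", ax=ax_box)
    ax_box.tick_params(axis="x", rotation=45)
    ax_box.set_title("Sleep Duration by Occupation")

    # Scatterplot: age vs. sleep duration, coloured by sleep disorder
    sns.scatterplot(data=plot_df, x="Age", y="Sleep Duration",
                    hue="Sleep Disorder", ax=ax_scatter)
    ax_scatter.set_title("Age vs. Sleep Duration")

    fig.tight_layout()
    return fig


# Made-up rows, only to demonstrate the call; use the loaded data.csv instead.
toy = pd.DataFrame({
    "Occupation": ["Nurse", "Nurse", "Engineer", "Engineer", "Teacher", "Teacher"],
    "Age": [30, 45, 28, 39, 50, 33],
    "Sleep Duration": [6.1, 6.5, 7.8, 7.2, 6.9, 7.0],
    "Sleep Disorder": ["Sleep Apnea", "Insomnia", None, None, "Insomnia", None],
})
fig = plot_sleep_overview(toy)
```

Swapping `y="Sleep Duration"` for `y="Quality of Sleep"` in the boxplot gives the other variant mentioned in the first bullet. Call `plt.show()` afterwards if the figure is not displayed automatically.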

🔍 Scenario: Automatically identify potential sleep disorders

This scenario helps you develop an end-to-end project for your portfolio.

Background: You work for a health insurance company and are tasked with identifying whether a potential client is likely to have a sleep disorder. The company wants to use this information to set the client's premium.

Objective: Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.

Check out our Linear Classifiers course (Python) or Supervised Learning course (R) for a quick introduction to building classifiers.

You can query the pre-loaded CSV files using SQL directly. Here's a sample query:

-- The query result is stored as a DataFrame in the variable df
SELECT *
FROM 'data.csv'
import pandas as pd
sleep_data = pd.read_csv('data.csv')
sleep_data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline  # needed for the Pipeline built below
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore")

# Load dataset
data = pd.read_csv("data.csv")

print(data)
# Exploratory Data Analysis (EDA)
sns.countplot(x='Sleep Disorder', data=data)
plt.title("Distribution of Sleep Disorders")
plt.show()

# Parse 'Blood Pressure' column if present
if 'Blood Pressure' in data.columns:
    data[['Systolic_BP', 'Diastolic_BP']] = data['Blood Pressure'].str.split('/', expand=True).astype(float)
    data.drop(columns=['Blood Pressure'], inplace=True)
# Define numerical and categorical columns
categorical_cols = ['Gender', 'BMI Category', 'Occupation', 'Sleep Disorder']
numerical_cols = data.select_dtypes(include=np.number).columns.tolist()

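# Keep the target and the row identifier out of the numerical features.
# Columns not listed in the ColumnTransformer below are dropped by default.
for col in ('Sleep Disorder', 'Person ID'):
    if col in numerical_cols:
        numerical_cols.remove(col)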

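# Handling missing values
# pandas may read the label "None" in 'Sleep Disorder' as missing, so restore it
# explicitly; filling it with the mode would mislabel people without a disorder.
data['Sleep Disorder'] = data['Sleep Disorder'].fillna('None')
data.fillna(data.median(numeric_only=True), inplace=True)
for col in categorical_cols:
    data[col] = data[col].fillna(data[col].mode()[0])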

# Encode target variable
label_encoder = LabelEncoder()
data['Sleep Disorder'] = label_encoder.fit_transform(data['Sleep Disorder'])

# Define features and target
X = data.drop(columns=['Sleep Disorder'])
y = data['Sleep Disorder']
# Preprocessing Pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender', 'BMI Category', 'Occupation'])
    ]
)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Build Pipeline with Model
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train Model
pipeline.fit(X_train, y_train)

# Predictions
y_pred = pipeline.predict(X_test)

# Evaluate Model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title("Confusion Matrix")
plt.show()
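With close to 400 rows, a single 80/20 split leaves fewer than 100 test samples, so the reported accuracy can vary noticeably with the random seed. One way to get a more stable estimate is stratified k-fold cross-validation. The sketch below rebuilds the same kind of pipeline inside a hypothetical helper, `cross_validated_accuracy`, and runs it on randomly generated demo data (not the real dataset), so the printed scores carry no meaning beyond showing the call.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def cross_validated_accuracy(X, y, numeric_cols, categorical_cols, n_splits=5):
    """Return one accuracy score per stratified fold."""
    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])
    model = Pipeline([
        ("preprocessor", preprocessor),
        ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
    ])
    # Stratification keeps the class proportions similar in every fold.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy")


# Randomly generated demo data; replace with the real features and target.
rng = np.random.default_rng(0)
n = 90
toy_X = pd.DataFrame({
    "Age": rng.integers(25, 60, size=n),
    "Stress Level": rng.integers(1, 11, size=n),
    "Gender": rng.choice(["Male", "Female"], size=n),
})
toy_y = pd.Series(np.repeat(["None", "Insomnia", "Sleep Apnea"], n // 3))

scores = cross_validated_accuracy(toy_X, toy_y, ["Age", "Stress Level"], ["Gender"])
print(f"Accuracy per fold: {np.round(scores, 2)}")
print(f"Mean accuracy: {scores.mean():.2f} (std {scores.std():.2f})")
```

Because the preprocessing lives inside the pipeline, the scaler and encoder are refit on each training fold, which avoids leaking information from the held-out fold.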
import seaborn as sns
import matplotlib.pyplot as plt

# `df` is the DataFrame produced by the SQL cell above (original string labels)
plt.figure(figsize=(6,4))
sns.countplot(x=df['Sleep Disorder'], palette='coolwarm')
plt.title("Distribution of Sleep Disorders")
plt.show()
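# Correlation analysis (numeric columns only; 'Sleep Disorder' was label-encoded above,
# while 'Gender', 'BMI Category' and 'Occupation' are still strings)
corr = data.corr(numeric_only=True)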
print(corr['Sleep Disorder'].sort_values(ascending=False))

# Visualizing the stress level distribution for each sleep disorder class
# (the x-axis shows the label-encoded classes; label_encoder.classes_ lists the names)
plt.figure(figsize=(12, 6))
sns.boxplot(x=data['Sleep Disorder'], y=data['Stress Level'])
plt.title("Stress Level Distribution by Sleep Disorder Class")
plt.show()

# Feature importance using Random Forest
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

label_encoders = {}
categorical_cols = ['Gender', 'BMI Category', 'Occupation', 'Sleep Disorder']

data_encoded = data.copy()
for col in categorical_cols:
    le = LabelEncoder()
    data_encoded[col] = le.fit_transform(data[col])
    label_encoders[col] = le 

# Drop the target and the row identifier, which carries no predictive meaning
X = data_encoded.drop(columns=['Sleep Disorder', 'Person ID'])
y = data_encoded['Sleep Disorder']

rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

plt.figure(figsize=(12,6))
importances.plot(kind='bar')
plt.title("Feature Importance in Predicting Sleep Disorder")
plt.xlabel("Features")
plt.ylabel("Importance Score")
plt.show()

# Scatter plot
plt.figure(figsize=(10, 5))
sns.scatterplot(x=data['Physical Activity Level'], y=data['Quality of Sleep'])
plt.title("Physical Activity Level vs. Sleep Quality")
plt.xlabel("Physical Activity Level")
plt.ylabel("Quality of Sleep")
plt.show()

# Compute correlation
corr_value = data[['Physical Activity Level', 'Quality of Sleep']].corr().iloc[0, 1]
print(f"Correlation between Physical Activity Level and Sleep Quality: {corr_value:.2f}")

# Perform ANOVA test
import scipy.stats as stats

groups = [data['Quality of Sleep'][data['Physical Activity Level'] == level] for level in data['Physical Activity Level'].unique()]
anova_result = stats.f_oneway(*groups)
print(f"ANOVA test result: p-value = {anova_result.pvalue:.4f}")
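The ANOVA above treats every distinct activity value as its own group, which can leave some groups with very few observations. As a complementary check for question 2, a rank-based Spearman correlation measures whether sleep quality tends to rise with activity without assuming a linear relationship. The helper name `activity_quality_spearman` and the demo rows below are made up for illustration.

```python
import pandas as pd
from scipy import stats


def activity_quality_spearman(df: pd.DataFrame):
    """Return Spearman's rho and its p-value for activity vs. sleep quality."""
    rho, p_value = stats.spearmanr(
        df["Physical Activity Level"], df["Quality of Sleep"]
    )
    return rho, p_value


# Made-up rows, only to demonstrate the call; use the loaded data.csv instead.
toy = pd.DataFrame({
    "Physical Activity Level": [30, 45, 60, 75, 90],
    "Quality of Sleep": [5, 6, 7, 8, 9],
})
rho, p_value = activity_quality_spearman(toy)
print(f"Spearman rho = {rho:.2f}, p-value = {p_value:.4f}")
```

Either test can only show an association. Whether more activity actually results in better sleep cannot be settled from observational (and here synthetic) data alone.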