Project: Sleep Health and Lifestyle

Sleep Health and Lifestyle

This synthetic dataset contains sleep and cardiovascular metrics, as well as lifestyle factors, for close to 400 fictitious people.

The workspace is set up with one CSV file, data.csv, with the following columns:

  • Person ID: Unique identifier for each individual
  • Gender: Male or Female
  • Age: Age of the person
  • Occupation: Job or profession
  • Sleep Duration: Average number of hours of sleep per day
  • Quality of Sleep: A subjective rating on a 1-10 scale
  • Physical Activity Level: Average number of minutes the person engages in physical activity daily
  • Stress Level: A subjective rating on a 1-10 scale
  • BMI Category: Body Mass Index category (e.g., Overweight, Normal, Obese)
  • Blood Pressure: Indicated as systolic pressure over diastolic pressure
  • Heart Rate: In beats per minute
  • Daily Steps: Number of steps taken per day
  • Sleep Disorder: One of None, Insomnia or Sleep Apnea

Source: Kaggle
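
Because Blood Pressure is stored as a single "systolic over diastolic" value, it usually has to be split before it can be used numerically. The following is a minimal sketch, assuming the column holds strings of the form "126/83"; the example values and the new column names Systolic and Diastolic are illustrative and not part of data.csv.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, not taken from data.csv
example = pd.DataFrame({"Blood Pressure": ["126/83", "120/80", "140/90"]})

# Split "systolic/diastolic" on the slash and cast both parts to integers
bp_parts = example["Blood Pressure"].str.split("/", expand=True).astype(int)

# Hypothetical column names for the two components
example["Systolic"] = bp_parts[0]
example["Diastolic"] = bp_parts[1]

print(example)
```

The same two lines could be applied to the full dataframe once it is loaded, provided every row follows the assumed format.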

Scenario

Background: You work for a health insurance company and are tasked with identifying whether a potential client is likely to have a sleep disorder. The company wants to use this information to determine the premium the client will pay.

Objective: Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.

1. Setup

# Loading necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV

# Loading dataset and storing in 'sleep_df' variable
sleep_df = pd.read_csv('data.csv')
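
One thing worth checking after loading: the Sleep Disorder column uses the literal text None for people without a disorder, and, depending on the pandas version, read_csv may treat that text as a missing value. The sketch below uses a hypothetical three-row stand-in for data.csv to show how the parsing can be made explicit; whether this is needed for the real file depends on the installed pandas version.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for data.csv
csv_text = "Person ID,Sleep Disorder\n1,None\n2,Insomnia\n3,Sleep Apnea\n"

# Default parsing: depending on the pandas version, "None" may become NaN
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["Sleep Disorder"].isna().sum())

# Keep "None" as a literal string by turning off the default NA markers
# and treating only empty fields as missing
explicit_df = pd.read_csv(
    io.StringIO(csv_text), keep_default_na=False, na_values=[""]
)
print(explicit_df["Sleep Disorder"].value_counts())
```

An alternative with the same effect is to load the file as usual and then fill missing values in that column with the string "None".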

2. Exploring

sleep_df.head()
sleep_df.info()
sleep_df.describe()

3. Visualizing

# Setting the aesthetic style of the plots
sns.set_style("whitegrid")

# Creating a boxplot of sleep duration by occupation, split by sleep disorder
plt.figure(figsize=(15, 8))
sns.boxplot(x='Occupation', y='Sleep Duration', hue='Sleep Disorder', data=sleep_df)
plt.xticks(rotation=45)
plt.title('Sleep Duration by Occupation and Sleep Disorder')
plt.show()

# Creating a scatterplot for age and sleep duration
plt.figure(figsize=(15, 8))
sns.scatterplot(x='Age', y='Sleep Duration', hue='Sleep Disorder', data=sleep_df)
plt.title('Age vs. Sleep Duration with Sleep Disorder Information')
plt.show()

The visualizations provide some interesting insights into the dataset:

Sleep Duration by Occupation and Sleep Disorder (Boxplot): This plot shows the distribution of sleep duration across occupations, with separate boxes for each sleep disorder category. It can help identify whether certain occupations are more prone to sleep disorders and how sleep duration differs for those affected.

Age vs. Sleep Duration with Sleep Disorder Information (Scatterplot): This plot illustrates the relationship between age and sleep duration, with color coding to indicate the presence of a sleep disorder. This visualization can help determine if there's a trend in sleep duration with age and how sleep disorders might be distributed across different ages.
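
The question of whether some occupations are more prone to sleep disorders can also be checked numerically rather than read off the boxplot. A minimal sketch, using a made-up six-row frame in place of sleep_df, computes the share of each disorder category within each occupation:

```python
import pandas as pd

# Made-up rows for illustration only; not taken from data.csv
toy = pd.DataFrame({
    "Occupation": ["Nurse", "Nurse", "Engineer", "Engineer", "Nurse", "Engineer"],
    "Sleep Disorder": ["Sleep Apnea", "Insomnia", "None", "None", "Sleep Apnea", "Insomnia"],
})

# Row-normalized cross-tabulation: each row sums to 1
disorder_share = pd.crosstab(
    toy["Occupation"], toy["Sleep Disorder"], normalize="index"
)
print(disorder_share.round(2))
```

With the real data, occupations that have only a handful of rows would give unstable shares, so the raw counts are worth inspecting alongside the proportions.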

4. Statistical Analysis

# Analyzing correlations between different variables
correlation_data = sleep_df.copy()

# Converting categorical data to numerical for correlation analysis
correlation_data['Gender'] = correlation_data['Gender'].map({'Male': 0, 'Female': 1})
correlation_data['BMI Category'] = correlation_data['BMI Category'].astype('category').cat.codes
correlation_data['Sleep Disorder'] = correlation_data['Sleep Disorder'].astype('category').cat.codes
correlation_data['Occupation'] = correlation_data['Occupation'].astype('category').cat.codes

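# Calculating the correlation matrix on numeric columns only
# ('Blood Pressure' is stored as text, so it cannot be correlated directly)
correlation_matrix = correlation_data.corr(numeric_only=True)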

# Displaying the correlation matrix with a heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

The correlation matrix heatmap provides insights into how different variables in the dataset are related to each other. Here are some key observations:

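Sleep Disorder and Other Variables: The variable 'Sleep Disorder' shows some degree of correlation with other factors such as age, BMI category, stress level, and quality of sleep. This suggests these factors might be important in predicting sleep disorders. Because 'Sleep Disorder', 'BMI Category', and 'Occupation' were converted to integer codes assigned in alphabetical order, the sign and size of their coefficients depend on that arbitrary ordering and should be read as rough indications only.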
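
One way to make the relationship with 'Sleep Disorder' easier to interpret is to correlate the numeric columns with a binary "has any disorder" indicator instead of a multi-valued code. The sketch below is an illustration on made-up rows; the column name Has Disorder is hypothetical, and it assumes the no-disorder category is stored as the string "None".

```python
import pandas as pd

# Made-up rows for illustration only; not taken from data.csv
toy = pd.DataFrame({
    "Age": [28, 35, 44, 52, 31, 58],
    "Stress Level": [4, 6, 8, 7, 3, 8],
    "Sleep Disorder": ["None", "None", "Insomnia", "Sleep Apnea", "None", "Sleep Apnea"],
})

# 1 if any disorder is present, 0 otherwise
# (assumes "None" is a literal string, not a missing value)
toy["Has Disorder"] = (toy["Sleep Disorder"] != "None").astype(int)

# Pearson correlation of each numeric column with the binary indicator
corr_with_target = toy.corr(numeric_only=True)["Has Disorder"].drop("Has Disorder")
print(corr_with_target)
```

With a 0/1 target, a positive coefficient simply means the feature tends to be higher among people with a disorder, which is easier to reason about than a coefficient against alphabetical category codes.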

Physical Activity and Quality of Sleep: There is a correlation between physical activity level and quality of sleep, although it's not very strong. This indicates that increased physical activity could be associated with better sleep quality, but other factors might also play a significant role.

Age and Sleep Duration: There's a correlation between age and sleep duration. This suggests that sleep duration tends to vary with age, which is also visible in the scatterplot we created earlier.

These insights can guide the construction of a classifier for predicting sleep disorders. Factors like BMI category, stress level, physical activity, and age could be particularly relevant. Additionally, exploring interactions between these variables might reveal more complex patterns.
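
As a rough illustration of how these factors could feed into a classifier, the sketch below fits a baseline logistic regression on a synthetic stand-in for the dataset. The data are random and carry no real signal, so the predictions are meaningless; only the structure matters. ColumnTransformer, OneHotEncoder, StandardScaler, and Pipeline are choices made for this sketch and are not among the packages loaded in the Setup section.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for data.csv; values are random, not real measurements
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "Age": rng.integers(27, 60, size=n),
    "Stress Level": rng.integers(3, 9, size=n),
    "Physical Activity Level": rng.integers(30, 91, size=n),
    "BMI Category": rng.choice(["Normal", "Overweight", "Obese"], size=n),
    "Sleep Disorder": rng.choice(["None", "Insomnia", "Sleep Apnea"], size=n),
})

numeric_cols = ["Age", "Stress Level", "Physical Activity Level"]
categorical_cols = ["BMI Category"]

X = toy[numeric_cols + categorical_cols]
y = toy["Sleep Disorder"]

# Stratified split keeps the class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale numeric features and one-hot encode the nominal category
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(predictions[:5])
```

One-hot encoding is used here instead of integer codes so that the model does not treat the BMI categories as points on an arbitrary numeric scale.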