===============================================================================
Sleep Health and Lifestyle
===============================================================================
1. Data Understanding
Source: Kaggle
Data: This dataset contains sleep and cardiovascular metrics as well as lifestyle factors of close to 400 fictitious people.
Background: A health insurance company needs to identify whether or not a potential client is likely to have a sleep disorder. The company wants to use this information to determine the premium the client should pay.
Objective: Automatically identify potential sleep disorders.
Problem Solution: Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.
The data contains the following columns:
- Person ID
- Gender
- Age
- Occupation
- Sleep Duration: Average number of hours of sleep per day
- Quality of Sleep: A subjective rating on a 1-10 scale
- Physical Activity Level: Average number of minutes the person engages in physical activity daily
- Stress Level: A subjective rating on a 1-10 scale
- BMI Category
- Blood Pressure: Indicated as systolic pressure over diastolic pressure
- Heart Rate: In beats per minute
- Daily Steps
- Sleep Disorder: One of None, Insomnia or Sleep Apnea
Let's start with the first step: Data Understanding. We'll load the data and check the first few rows to understand its structure. We'll also look at the data types and check for any missing values.
🌎 Some guiding questions to help you explore this data:
- Which factors could contribute to a sleep disorder?
- Does an increased physical activity level result in a better quality of sleep?
- Does the presence of a sleep disorder affect the subjective sleep quality metric?
import pandas as pd
data = pd.read_csv('data.csv')
data.head()
data.info()
The dataset contains 374 entries and 13 columns. Each entry represents a fictive individual's health and sleep-related metrics. There are no missing values in the dataset, which is good as it means we have a complete dataset to work with. The data types are also consistent with the data description provided.
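One caveat on the missing-value check: depending on the pandas version, read_csv may interpret the literal string 'None' as a missing value, which would silently turn the 'None' class of Sleep Disorder into NaN. The sketch below uses a small made-up CSV held in memory (not the real file; the rows are assumptions for illustration) to show how keep_default_na=False keeps the label as text.

```python
import io
import pandas as pd

# Tiny made-up sample standing in for data.csv (values are illustrative only)
csv_text = """Person ID,Sleep Duration,Sleep Disorder
1,6.1,None
2,7.8,Insomnia
3,5.9,Sleep Apnea
"""

# keep_default_na=False stops pandas from converting strings such as 'None' to NaN
sample = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

# Count missing values per column and list the labels that survived parsing
missing_per_column = sample.isna().sum()
labels = sample['Sleep Disorder'].unique().tolist()
print(missing_per_column)
print(labels)
```

If the real file were loaded this way, data.isna().sum() would be a direct check of the "no missing values" statement. Note that with this setting empty cells are also kept as empty strings rather than NaN.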
2. Exploratory Data Analysis (EDA)
Next, we'll perform Exploratory Data Analysis (EDA) to understand the data distributions and look for any anomalies or interesting patterns. We'll start by generating descriptive statistics to understand the central tendency, dispersion, and shape of the dataset's distribution.
We will then analyze the distribution of key numerical and categorical variables and their relationship with the presence of a sleep disorder, paying particular attention to the "Sleep Disorder" column, as it is our target variable for prediction.
# descriptive statistics
descriptive_stats = data.describe()
# unique values in each categorical column
unique_values = {}
cat_columns = ['Gender', 'Occupation', 'BMI Category', 'Sleep Disorder', 'Blood Pressure']
for col in cat_columns:
    unique_values[col] = data[col].unique()
display(descriptive_stats, unique_values)
The descriptive statistics provide the following insights:
- The average age of individuals in the dataset is approximately 42 years, with a minimum of 27 and a maximum of 59 years.
- The average sleep duration is approximately 7.13 hours, with a minimum of 5.8 and a maximum of 8.5 hours.
- The average quality of sleep rating is 7.31 on a scale of 1 to 10.
- On average, individuals engage in physical activity for about 59 minutes per day.
- The average stress level is around 5.38 on a scale of 1 to 10.
- The average heart rate is approximately 70 beats per minute.
- The average number of daily steps is around 6,817.
When it comes to categorical data, there seems to be a redundancy between 'Normal' and 'Normal Weight' in BMI Category; this requires cleaning.
Also, the variable Blood Pressure, stored in the format 'systolic/diastolic' (e.g., '126/83'), is a composite of two numeric measurements. For most machine learning models, it is more effective to split this variable into its individual components rather than keeping it as a single string: the Systolic and Diastolic readings can have different implications for health, and models can only leverage this information if they are separate features.
We will address both of these cases in the next step so that the modified columns are included in the rest of the EDA.
# Correcting the inconsistency in 'BMI Category'
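# (assigning the result back avoids an in-place replace on a column selection, which recent pandas versions warn about)
data['BMI Category'] = data['BMI Category'].replace({'Normal Weight': 'Normal'})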
# Splitting the 'Blood Pressure' column into 'Systolic' and 'Diastolic' columns
data['Systolic'] = data['Blood Pressure'].str.split('/').str[0].astype(int)
data['Diastolic'] = data['Blood Pressure'].str.split('/').str[1].astype(int)
data.drop(['Blood Pressure','Person ID'], axis=1, inplace=True)
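As an alternative sketch of the split above, str.split with expand=True returns both parts in a single pass. The toy frame below is made up for illustration; only the column names follow the dataset.

```python
import pandas as pd

# Made-up readings in the dataset's 'systolic/diastolic' string format
toy = pd.DataFrame({'Blood Pressure': ['126/83', '120/80', '140/90']})

# expand=True returns a DataFrame with one column per part of the split
parts = toy['Blood Pressure'].str.split('/', expand=True).astype(int)
toy['Systolic'] = parts[0]
toy['Diastolic'] = parts[1]

# The original string column is no longer needed once both parts are numeric
toy = toy.drop(columns=['Blood Pressure'])
print(toy)
```

Both approaches give the same two integer columns; this variant simply parses each string once instead of twice.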
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="white")
sns.set_palette(palette='Set3')
# List of key numerical variables
num_vars = data.select_dtypes(include=['int64', 'float64']).columns.tolist()
# PairGrid instance, mapping a histogram+KDE to the diagonal and regplot to the off-diagonal elements to show the bivariate distributions with a regression line
pair_grid = sns.PairGrid(data=data[num_vars], diag_sharey=False)
pair_grid.map_diag(sns.histplot, kde=True)
pair_grid.map_offdiag(sns.regplot, scatter_kws={'s':50, 'alpha':0.5}, line_kws={'color':'red'})
# Correlation matrix
corr_matrix = data[num_vars].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.title('Correlation matrix of numerical variables', fontsize=16)
plt.show()
# Box plots to identify any outliers
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(20, 15))
for i, var in enumerate(num_vars):
    row = i // 3
    col = i % 3
    sns.boxplot(x=data[var], ax=axes[row, col])
    axes[row, col].set_title(f'Box Plot of {var}', fontsize=14)
    axes[row, col].set_xlabel(var, fontsize=12)
plt.tight_layout()
plt.show()
2.1 Numerical variables
The visualizations provide multiple insights into the dataset:
Histograms:
- The distribution of Age seems to be slightly right-skewed, suggesting that there are more individuals in the younger age range than in the older range.
- Quality of Sleep is left-skewed, meaning that more individuals in our data report a better quality of sleep.
- Heart Rate is roughly bell-shaped but right-skewed (a longer tail towards high values), suggesting that most individuals have lower heart rates.
- The other numerical variables show no clear pattern; about half of them look roughly balanced.
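The skew described above is read off the histograms by eye. As a rough numerical check, pandas can compute the sample skewness; the two series below are made up purely to show the sign convention and are not taken from the dataset.

```python
import pandas as pd

# Made-up samples: one with a long right tail, one with a long left tail
right_tailed = pd.Series([1, 1, 2, 2, 3, 10])
left_tailed = pd.Series([1, 8, 9, 9, 10, 10])

# Series.skew() returns the sample skewness:
# positive for a right skew, negative for a left skew
right_skew = right_tailed.skew()
left_skew = left_tailed.skew()
print(right_skew, left_skew)
```

On the real data, data[num_vars].skew() would give one value per numerical column to compare against the visual impression.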
Correlation:
Reading the scatter plots together with the heat map gives the following insights into the relationships between variables:
- Sleep Duration has a strong positive linear relationship with Quality of Sleep (correlation 0.88), suggesting that as sleep duration increases, the quality of sleep tends to improve.
- Sleep Duration has clear negative linear relationships with Stress Level (-0.81) and Heart Rate (-0.52). The latter correlation is weaker, which might be influenced by outliers; further analysis is required.
- Quality of Sleep has strong negative linear relationships with Stress Level (-0.90) and Heart Rate (-0.66), suggesting that individuals with high stress and high heart rates tend to rate their quality of sleep lower.
- Stress Level has a relatively strong positive linear relationship with Heart Rate (0.67), implying that higher stress levels are associated with higher heart rates.
- There is also a positive linear trend between Age and Quality of Sleep (0.47), suggesting that older individuals tend to report better quality of sleep.
- Age has positive linear relationships with both Systolic and Diastolic (0.61 and 0.59 respectively). Systolic and Diastolic are also strongly correlated with each other (0.91).
Outliers:
The box plots reveal outliers in Heart Rate. However, these "outliers" may be natural extremes rather than errors, as they are within conceivable ranges for individuals.
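For reference, seaborn's box plots by default place the whiskers within 1.5 times the interquartile range (IQR) of the quartiles and draw any point beyond that as an outlier. A minimal sketch of that rule follows, using made-up heart-rate values rather than the dataset.

```python
import pandas as pd

# Made-up heart rates (beats per minute); 86 is an intentionally extreme value
heart_rate = pd.Series([65, 68, 68, 70, 70, 70, 72, 72, 75, 86])

# Quartiles and interquartile range
q1 = heart_rate.quantile(0.25)
q3 = heart_rate.quantile(0.75)
iqr = q3 - q1

# Points outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = heart_rate[(heart_rate < lower_fence) | (heart_rate > upper_fence)]
print(outliers.tolist())
```

The rule is purely statistical: it flags values that are far from the bulk of the sample, which is why a physiologically plausible heart rate can still show up as an outlier.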
# Distribution of categorical variables
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16, 5))
for i, var in enumerate(['Gender', 'BMI Category', 'Sleep Disorder']):
    col = i
    order = data[var].value_counts().index
    sns.countplot(x=var, data=data, ax=axes[col], order=order)
    axes[col].set_title(f'Distribution of {var}', fontsize=14)
    axes[col].set_xlabel(var, fontsize=12)
    axes[col].set_ylabel('Count', fontsize=12)
    axes[col].set_xticklabels(axes[col].get_xticklabels(), rotation=45)
plt.tight_layout()
plt.show()
# Distribution of Sleep Disorder per Occupation variable
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))
order = data['Occupation'].value_counts().index
sns.countplot(x='Occupation', data=data, ax=ax1, order=order)
ax1.set_title('Distribution of Occupations', fontsize=14)
ax1.set_xlabel('Occupation', fontsize=12)
ax1.set_ylabel('Count', fontsize=12)
ax1.set_xticklabels(ax1.get_xticklabels(), rotation=45)
# Proportion of each sleep disorder within every occupation
occup_dis = data.groupby('Occupation')['Sleep Disorder'].value_counts(normalize=True).unstack().sort_values(by='None', ascending=False)
order_sleep_disorder = ['None', 'Sleep Apnea', 'Insomnia']
occup_dis[order_sleep_disorder].plot(kind='barh',stacked=True, ax=ax2)
ax2.set_title('Sleep Disorder per Occupation', fontsize=14)
ax2.set_xlabel('Proportions', fontsize=12)
ax2.set_ylabel('Occupation', fontsize=12)
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=3)
plt.tight_layout()
plt.show()
plt.figure(figsize=(10, 5))
sns.boxplot(data=data,y='Sleep Duration', x='Occupation')
plt.xticks(rotation=90)
plt.show()
2.2 Categorical variables
There are more instances of 'None' in the Sleep Disorder category, meaning that we are dealing with an imbalanced target variable in our classification problem. We will address this later by using algorithms that support class weights.
The Gender distribution seems fairly balanced.
The Occupation distribution is varied.
Key Insights:
- Dominance of the 'None' category in Sleep Disorder: For almost all occupations, the majority of individuals do not suffer from a sleep disorder, as indicated by the prevalence of the 'None' category.
- Nurses at Risk: A significant portion of nurses seem to be suffering from 'Sleep Apnea' compared to other occupations. This could suggest unique stressors or lifestyle factors that predispose them to this disorder.
- Teachers' and Salespersons' Insomnia: The incidence of 'Insomnia' in both these occupations is higher than in most other occupations.
- Occupational Rarity vs. Sleep Disorder: While 'Sales Representative' is one of the least common occupations in the dataset, it has a relatively high proportion of sleep disorders for its size.
Given these insights, the next steps involve encoding categorical variables and standardizing the data. After these preprocessing steps, we can proceed with feature engineering and model building.
Let's start with data preprocessing.
3. Data Preprocessing
As the data was already cleaned, the only step left is encoding the categorical variables. The categorical variables (Gender, Occupation, BMI Category, and Sleep Disorder) have to be encoded into numerical formats. This transformation is essential because the machine learning algorithms used here require numerical input.
# importing required libraries
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
X = data.drop(['Sleep Disorder'], axis=1)
y = data['Sleep Disorder']
# Label encoding for categorical variables in X
label_encoders = {} # To store the encoder objects for potential inverse transformations later
for col in X.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le
# Encoding the target variable
le_target = LabelEncoder()
y = le_target.fit_transform(y)
X.head(), y[:5]
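To make the encoded values easier to interpret, the sketch below shows how LabelEncoder assigns integer codes (labels are sorted, and each label's position becomes its code) and how a stored encoder maps codes back to labels. The short label list is made up; only the class names follow the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up target values using the dataset's three class names
labels = ['None', 'Sleep Apnea', 'Insomnia', 'None']

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ lists the labels in sorted order; a label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original labels from the integer codes
decoded = encoder.inverse_transform(codes)
print(mapping)
print(list(decoded))
```

One design consideration: label encoding gives nominal features such as Occupation an arbitrary numeric order, which linear models may treat as meaningful. One-hot encoding is a common alternative in that case, though it is not what this notebook uses.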
4. Model building
Now, we're ready to move on to the model building phase. Here's the plan:
4.1 Model Selection
- Given the nature of the data (its multicollinearity and class imbalance), we will start with Logistic Regression as a baseline model, as this algorithm supports class weights. We'll set class_weight to "balanced", which automatically adjusts weights inversely proportional to class frequencies in the input data. In combination with the class_weight parameter, we will set the stratify parameter of train_test_split to y to ensure that the minority classes are adequately represented in both the training and test sets.
- To address the multicollinearity concern, we will fit a regularized linear model such as Logistic Regression with L2 (Ridge) regularization, which constrains the magnitude of the coefficients. This prevents any one feature from having too much influence on the model due to multicollinearity.
- We will also try an SVM, as it can be effective in the presence of multicollinearity.
- Next, we will fit a Random Forest Classifier, as the predictions of this ensemble model are generally robust to multicollinearity. The trees consider random subsets of features when splitting, which reduces the impact of multicollinear features, and averaging over many trees makes the model robust to individual feature relationships. It also provides an indication of feature importance, which can be insightful.
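The "balanced" class weights and the stratified split mentioned in the plan can be sketched as follows. The toy target below is made up (12, 4 and 4 samples per class) and is not the real data; the formula in the comment is the one scikit-learn documents for class_weight="balanced".

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced target: 12 samples of class 0, 4 of class 1, 4 of class 2
y_toy = np.array([0] * 12 + [1] * 4 + [2] * 4)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# 'balanced' weights follow n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (3 * np.bincount(y_toy))
weights = compute_class_weight(
    class_weight='balanced', classes=np.unique(y_toy), y=y_toy
)

# stratify=y keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
print(weights, np.bincount(y_tr), np.bincount(y_te))
```

Rarer classes receive larger weights, so misclassifying them costs more during training, while stratification makes sure each class appears in the test set in roughly the same proportion as in the full data.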
As regularized linear models tend to show better results when fitted on standardized data, we will apply StandardScaler() to the numerical features. But first, we split our data into training and test sets to avoid data leakage.
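A minimal sketch of the split-then-scale order described above, assuming made-up numerical features (the random values and the random_state settings are illustrative choices, not taken from the notebook): the scaler learns its mean and standard deviation from the training split only and then reuses them on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up numerical features and target, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=70, scale=5, size=(40, 2))
y_toy = np.array([0, 1] * 20)

# Split first so that no test-set statistics reach the scaler
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42
)

# Fit the scaler on the training split only, then reuse its statistics on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

Calling fit_transform on the full dataset before splitting would let the test rows influence the scaling parameters, which is the leakage the text warns about.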