Sleep Health and Lifestyle

===============================================================================

1. Data Understanding

Data: This dataset contains sleep and cardiovascular metrics as well as lifestyle factors of close to 400 fictive persons.

Background: A health insurance company needs to identify whether a potential client is likely to have a sleep disorder. The company wants to use this information to determine the premium the client will pay.

Objective: Automatically identify potential sleep disorders.

Problem Solution: Construct a classifier to predict the presence of a sleep disorder based on the other columns in the dataset.

The data contains the following columns:

Person ID
Gender
Age
Occupation
Sleep Duration: Average number of hours of sleep per day
Quality of Sleep: A subjective rating on a 1-10 scale
Physical Activity Level: Average number of minutes the person engages in physical activity daily
Stress Level: A subjective rating on a 1-10 scale
BMI Category
Blood Pressure: Indicated as systolic pressure over diastolic pressure
Heart Rate: In beats per minute
Daily Steps
Sleep Disorder: One of None, Insomnia or Sleep Apnea

Let's start with the first step: Data Understanding. We'll load the data and check the first few rows to understand its structure. We'll also look at the data types and check for any missing values.

🌎 Some guiding questions to help you explore this data:

Which factors could contribute to a sleep disorder?
Does an increased physical activity level result in a better quality of sleep?
Does the presence of a sleep disorder affect the subjective sleep quality metric?

import pandas as pd

data = pd.read_csv('data.csv')
data.head()

data.info()

The dataset contains 374 entries and 13 columns. Each entry represents a fictive individual's health and sleep-related metrics. There are no missing values in the dataset, which is good as it means we have a complete dataset to work with. The data types are also consistent with the data description provided.
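The completeness check can also be made explicit instead of being read off the `data.info()` output. The following is a minimal sketch, assuming a small hypothetical stand-in frame (`sample`, with illustrative values) because the real `data.csv` is not reproduced here:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the dataset (values are illustrative only)
sample = pd.DataFrame({
    "Person ID": [1, 2, 3],
    "Gender": ["Male", "Female", "Male"],
    "Sleep Duration": [6.1, 7.4, 5.9],
    "Blood Pressure": ["126/83", "120/80", "140/90"],
    "Sleep Disorder": ["None", "Insomnia", "Sleep Apnea"],
})

# Number of missing values per column
missing_per_column = sample.isna().sum()

# True when no column contains a missing value
is_complete = bool(missing_per_column.sum() == 0)

# Person ID should identify each row uniquely
ids_unique = sample["Person ID"].is_unique

print(missing_per_column)
print(is_complete, ids_unique)
```

One caveat worth checking on the real file: recent pandas versions include the literal string `None` in the default list of missing-value markers for `read_csv`, so the Sleep Disorder column may load with missing values unless the file is read with something like `keep_default_na=False`.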

2. Exploratory Data Analysis (EDA)

Next, we'll perform Exploratory Data Analysis (EDA) to understand the data distributions and look for anomalies or interesting patterns. We'll start by generating descriptive statistics to understand the central tendency, dispersion, and shape of the dataset's distribution.

We will then analyze the distribution of key numerical and categorical variables and their relationship with the presence of a sleep disorder, paying particular attention to the "Sleep Disorder" column, as it is our target variable for prediction.

# descriptive statistics
descriptive_stats = data.describe()

# unique values in each categorical column
unique_values = {}
cat_columns = ['Gender', 'Occupation', 'BMI Category', 'Sleep Disorder', 'Blood Pressure']
for col in cat_columns:
    unique_values[col] = data[col].unique()

display(descriptive_stats, unique_values)

The descriptive statistics provide the following insights:

The average age of individuals in the dataset is approximately 42 years, with a minimum of 27 and a maximum of 59 years.
The average sleep duration is approximately 7.13 hours, with a minimum of 5.8 and a maximum of 8.5 hours.
The average quality of sleep rating is 7.31 on a scale of 1 to 10.
On average, individuals engage in physical activity for about 59 minutes per day.
The average stress level is around 5.38 on a scale of 1 to 10.
The average heart rate is approximately 70 beats per minute.
The average number of daily steps is around 6,817.

Among the categorical data, BMI Category contains two redundant labels, 'Normal' and 'Normal Weight', which need to be merged. Also, the Blood Pressure variable, given in the format 'systolic/diastolic' (e.g., '126/83'), is a composite of two numeric measurements. For most machine learning models, it is more effective to split this variable into its individual components rather than keep it as a single string: systolic and diastolic readings can have different implications for health, and models can only exploit that if they are separate features.

We will address both issues in the next step so that the modified columns are included in the rest of the EDA.

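# Correcting the inconsistency in 'BMI Category'
# (assigning the result back avoids the chained in-place replace, which newer pandas versions warn about)
data['BMI Category'] = data['BMI Category'].replace({'Normal Weight': 'Normal'})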

# Splitting the 'Blood Pressure' column into 'Systolic' and 'Diastolic' columns
data['Systolic'] = data['Blood Pressure'].str.split('/').str[0].astype(int)
data['Diastolic'] = data['Blood Pressure'].str.split('/').str[1].astype(int)

data.drop(['Blood Pressure','Person ID'], axis=1, inplace=True)

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="white")
sns.set_palette(palette='Set3')

# List of key numerical variables (any numeric dtype, so Systolic and Diastolic are included regardless of the platform's default integer size)
num_vars = data.select_dtypes(include='number').columns.tolist()

# PairGrid instance, mapping a histogram+KDE to the diagonal and regplot to the off-diagonal elements to show the bivariate distributions with a regression line
pair_grid = sns.PairGrid(data=data[num_vars], diag_sharey=False)

pair_grid.map_diag(sns.histplot, kde=True)

pair_grid.map_offdiag(sns.regplot, scatter_kws={'s':50, 'alpha':0.5}, line_kws={'color':'red'})

# Correlation matrix
corr_matrix = data[num_vars].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
plt.title('Correlation matrix of numerical variables', fontsize=16)
plt.show()

# Box plots to identify any outliers
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(20, 15))

for i, var in enumerate(num_vars):
    row = i // 3
    col = i % 3
    sns.boxplot(x=data[var], ax=axes[row, col])
    axes[row, col].set_title(f'Box Plot of {var}', fontsize=14)
    axes[row, col].set_xlabel(var, fontsize=12)
    
plt.tight_layout()
plt.show()

2.1 Numerical variables

The visualizations provide multiple insights into the dataset:

Histograms:

The distribution of Age seems to be slightly right-skewed, suggesting that there are more individuals in the younger age range than in the older range.
The Quality of Sleep is left-skewed, meaning there are more individuals in our data with better quality of sleep.
Heart Rate has a roughly bell-shaped distribution with a right skew, suggesting that most individuals have lower heart rates.
The other numerical variables show no clear pattern, and about half of them are fairly evenly distributed.

Correlation:

From the scatter plots, read together with the heat map, we can extract the following insights about the relationships between variables:

Sleep Duration has a strong positive linear relationship with Quality of Sleep (correlation 0.88), suggesting that as sleep duration increases, the quality of sleep tends to improve. Sleep Duration also has clear negative linear relationships with Stress Level (-0.81) and Heart Rate (-0.52); the latter is weaker, which might be influenced by outliers, so further analysis is required.
Quality of Sleep has strong negative linear relationships with Stress Level (-0.90) and Heart Rate (-0.66), suggesting that individuals with high stress and high heart rates tend to rate their quality of sleep lower.
Stress Level has a relatively strong positive linear relationship with Heart Rate (0.67), implying that higher stress levels are associated with higher heart rates.
There is also a positive linear trend between Age and Quality of Sleep (0.47), suggesting that older individuals tend to report better quality of sleep.
Positive linear relationships can also be seen between Age and both Systolic and Diastolic, with correlations of 0.61 and 0.59 respectively.
Systolic and Diastolic are also strongly correlated (0.91).
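Reading the strongest relationships off a heat map by eye gets tedious as the number of variables grows. As a rough sketch, the pairs can be ranked by absolute correlation instead. The `toy` frame below is hypothetical data for illustration; in the notebook one would presumably use `data[num_vars]` in its place:

```python
from itertools import combinations

import pandas as pd

# Hypothetical toy data; not taken from the real dataset
toy = pd.DataFrame({
    "Sleep Duration": [6.0, 6.5, 7.0, 7.5, 8.0, 8.5],
    "Quality of Sleep": [5, 6, 7, 7, 8, 9],
    "Stress Level": [8, 7, 6, 5, 4, 3],
    "Daily Steps": [4000, 9000, 5000, 8000, 6000, 7000],
})

corr = toy.corr()

# One entry per unordered pair of variables, so each pair appears once
pairs = pd.Series(
    {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
)

# Rank pairs by absolute correlation, strongest first
ranked = pairs.sort_values(key=abs, ascending=False)
print(ranked)
```

Sorting on the absolute value keeps strong negative relationships, such as the one between sleep duration and stress, next to the strong positive ones.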

Outliers:

The box plots reveal outliers in Heart Rate. However, these "outliers" may be natural extremes rather than errors, as they are within conceivable ranges for individuals.
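The points a box plot draws as outliers follow the 1.5 × IQR rule, which is seaborn's default whisker setting. A minimal sketch of the same rule applied numerically, using hypothetical heart-rate readings rather than values from the dataset:

```python
import pandas as pd

# Hypothetical heart-rate readings (bpm); illustrative values only
heart_rate = pd.Series([65, 68, 70, 70, 72, 72, 74, 75, 78, 86], name="Heart Rate")

q1, q3 = heart_rate.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = heart_rate[(heart_rate < lower) | (heart_rate > upper)]
print(outliers.tolist())
```

Listing the flagged values this way makes it easier to judge whether they are plausible measurements or likely data errors.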

# Distribution of categorical variables
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(16, 5))

for ax, var in zip(axes, ['Gender', 'BMI Category', 'Sleep Disorder']):
    # Order the bars from the most to the least frequent category
    order = data[var].value_counts().index
    sns.countplot(x=var, data=data, ax=ax, order=order)
    ax.set_title(f'Distribution of {var}', fontsize=14)
    ax.set_xlabel(var, fontsize=12)
    ax.set_ylabel('Count', fontsize=12)
    ax.tick_params(axis='x', labelrotation=45)
plt.tight_layout()
plt.show()


# Distribution of Sleep Disorder per Occupation variable 
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

order = data['Occupation'].value_counts().index
sns.countplot(x='Occupation', data=data, ax=ax1, order=order)
ax1.set_title('Distribution of Occupations', fontsize=14)
ax1.set_xlabel('Occupation', fontsize=12)
ax1.set_ylabel('Count', fontsize=12)
ax1.tick_params(axis='x', labelrotation=45)


# Proportion of each sleep disorder within each occupation
occup_dis = data.groupby('Occupation')['Sleep Disorder'].value_counts(normalize=True).unstack().sort_values(by='None', ascending=False)
order_sleep_disorder = ['None', 'Sleep Apnea', 'Insomnia']

occup_dis[order_sleep_disorder].plot(kind='barh', stacked=True, ax=ax2)
ax2.set_title('Sleep Disorder per Occupation', fontsize=14)
ax2.set_xlabel('Proportions', fontsize=12)
ax2.set_ylabel('Occupation', fontsize=12)
# Attach the legend to ax2 explicitly instead of relying on the current axes
ax2.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=3)

plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 5)) 
sns.boxplot(data=data,y='Sleep Duration', x='Occupation')
plt.xticks(rotation=90)
plt.show()

2.2 Categorical variables

'None' is the most frequent value of Sleep Disorder, meaning we are dealing with an imbalanced target variable in our classification problem, which we will address later by using algorithms that support class weights. The Gender distribution seems fairly balanced. The Occupation distribution is varied.
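To make the idea of class weights concrete, the sketch below computes "balanced" weights with scikit-learn's `compute_class_weight`. The label vector `y_demo` is hypothetical and only mimics an imbalanced target; it does not reproduce the actual class counts of the dataset:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector with a dominant 'None' class
y_demo = np.array(["None"] * 6 + ["Insomnia"] * 2 + ["Sleep Apnea"] * 2)

classes = np.unique(y_demo)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# "balanced" weight for a class = n_samples / (n_classes * class_count)
class_weights = dict(zip(classes, weights))
print(class_weights)
```

The rarer a class, the larger its weight, so misclassifying a minority-class sample costs the model more during training.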

Key Insights:

Dominance of 'None' Category in Sleep Disorders: For almost all occupations, a majority don't suffer from sleep disorders, as indicated by the prevalence of the 'None' category.
Nurses at Risk: A significant portion of nurses seem to be suffering from 'Sleep Apnea' compared to other occupations. This could suggest unique stressors or lifestyle factors that predispose them to this disorder.
Teachers' and Salespersons' Insomnia: The incidence of 'Insomnia' in both of these occupations is higher than in most other occupations.
Occupational Rarity vs. Sleep Disorder: While 'Sales Representative' is one of the least common occupations in the dataset, it has a relatively high proportion of sleep disorders for its size.

Given these insights, the next steps involve encoding categorical variables and standardizing the data. After these preprocessing steps, we can proceed with feature engineering and model building.

Let's start with data preprocessing.

3. Data Preprocessing

As the data has already been cleaned, the only step left is encoding the categorical variables. The categorical variables (Gender, Occupation, BMI Category, and Sleep Disorder) have to be encoded into numerical formats. This transformation is essential for machine learning algorithms, which require numerical input.

# Importing required libraries
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

X = data.drop(['Sleep Disorder'], axis=1) 
y = data['Sleep Disorder']

# Label encoding for categorical variables in X
label_encoders = {}  # To store the encoder objects for potential inverse transformations later

for col in X.select_dtypes(include=['object']).columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])
    label_encoders[col] = le

# Encoding the target variable
le_target = LabelEncoder()
y = le_target.fit_transform(y)

X.head(), y[:5]
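Because the fitted encoders are kept, the integer codes can be mapped back to readable labels later, for example when reporting predictions. A small sketch of that round trip, assuming a hypothetical series `target` that stands in for `data['Sleep Disorder']`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values, standing in for data['Sleep Disorder']
target = pd.Series(["None", "Insomnia", "Sleep Apnea", "None"])

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(target)

# LabelEncoder assigns integers following the sorted order of the class labels
mapping = {label: int(code) for code, label in enumerate(le_demo.classes_)}

# Recover the original labels from the integer codes
decoded = le_demo.inverse_transform(encoded)
print(mapping, list(decoded))
```

Note that the codes follow alphabetical order, so for a nominal feature such as Occupation they carry no real ranking; models that treat inputs as ordered quantities may read more into them than intended.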

4. Model building

Now, we're ready to move on to the model building phase. Here's the plan:

4.1 Model Selection

Given the nature of the data, namely its multicollinearity and class imbalance, we will start with Logistic Regression as a baseline model, as this algorithm supports class weights. We'll set class_weight to "balanced", which automatically adjusts the weights to be inversely proportional to the class frequencies in the input data. In combination with the class_weight parameter, we will set the stratify parameter of train_test_split to y to ensure that the minority classes are adequately represented in both the training and test sets.
To address the multicollinearity concern, we will fit a regularized linear model, such as Logistic Regression with L2 (Ridge) regularization, which constrains the magnitude of the coefficients. This prevents any one feature from having too much influence on the model due to multicollinearity.
We will also try an SVM, as it can be effective in the presence of multicollinearity.
Next, we will fit a Random Forest classifier, as this ensemble model naturally handles multicollinearity well. Each decision tree in the forest considers a subset of features, reducing the impact of multicollinear features. The ensemble approach of a random forest also makes the model robust to individual feature relationships. It also provides an indication of feature importance, which can be insightful.

As regularized linear models tend to perform better when fitted on standardized data, we will apply StandardScaler() to the numerical features. But first, we split our data into training and test sets to avoid data leakage.
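The order of operations described here (stratified split first, then a scaler fitted on the training split only, then a class-weighted baseline) could look roughly as follows. This is a sketch under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins for the encoded `X` and `y`, the split size and `random_state` are arbitrary choices, and for simplicity every column is scaled rather than only the numerical ones:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (hypothetical values, three imbalanced classes)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([0] * 20 + [1] * 60 + [2] * 20)

# Stratify on y so each class keeps its share in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Baseline: logistic regression (L2-regularized by default) with balanced class weights
baseline = LogisticRegression(class_weight="balanced", max_iter=1000)
baseline.fit(X_train_scaled, y_train)
print(baseline.score(X_test_scaled, y_test))
```

Fitting the scaler before splitting would let the test set's mean and variance influence the training features, which is the leakage the split-first order avoids. The score printed here is meaningless because the features are random noise.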


