Understanding Hair Loss With Data

Do you want to know why you lose hair?

📖 Executive Summary

This report provides a comprehensive analysis of factors contributing to hair loss based on a dataset that includes various health, lifestyle, and genetic variables. The analysis is divided into three levels: descriptive statistics, data visualization, and machine learning modeling. The goal is to understand which factors are associated with hair loss, identify subgroups at risk, and provide actionable insights for targeted health interventions.

The findings highlight several potential contributing factors such as age, medical conditions, nutritional deficiencies, and stress levels. Additionally, ensemble machine learning models are employed to predict the likelihood of hair loss. Lastly, clustering techniques are used to identify distinct subgroups of individuals.

Introduction
Data Overview
Descriptive Statistics
Data Visualization
Machine Learning Analysis
Key Findings and Insights
Recommendations
Conclusion

1. Introduction

Hair loss is a significant health concern that affects both appearance and overall well-being. This analysis applies data science techniques to uncover potential contributing factors, including medical, genetic, nutritional, and lifestyle factors.

2. Data Overview

The dataset contains information on individuals' medical history, lifestyle choices, genetic factors, and more. Each row represents one individual, and the key features include:

Medical Conditions: Various conditions like alopecia, thyroid problems, dermatitis, etc.
Lifestyle Factors: Smoking status, stress levels, weight loss, etc.
Genetic Factors: Family history of baldness.
Nutritional Deficiencies: Deficiencies in essential vitamins and minerals.

3. Descriptive Statistics

Average Age and Age Distribution

The average age of individuals in the dataset was calculated to understand the typical age of participants. The average age of study participants is 34.19 years. Among men with hair loss, the average age is 33.6 years; among men with no hair loss, it is 34.8 years.
The age distribution was plotted to visualize the spread across different age groups.

Common Medical Conditions

The prevalence of each medical condition was determined, showing that Alopecia Areata and Psoriasis are among the most common.
Alopecia Areata and Androgenetic Alopecia are observed to have higher percentages among individuals with hair loss.
The frequency of each medical condition was calculated to provide insight into which conditions are most closely associated with hair loss.
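
The prevalence calculation described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the rows in `toy` are made up, and the helper `condition_summary` is a hypothetical name. The column names `Conditions` and `Hair Loss` are assumed to match the cleaned data frame, with `Hair Loss` coded as 0/1.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset.
toy = pd.DataFrame({
    'Conditions': ['Alopecia Areata', 'Psoriasis', 'Alopecia Areata',
                   'Dermatitis', 'Psoriasis', 'Alopecia Areata'],
    'Hair Loss': [1, 0, 1, 0, 1, 0],
})

def condition_summary(frame, col='Conditions', target='Hair Loss'):
    """Count each category and compute the share of its rows with hair loss."""
    summary = frame.groupby(col)[target].agg(count='size', hair_loss_rate='mean')
    # Prevalence: the fraction of all rows that fall in each category.
    summary['prevalence'] = summary['count'] / len(frame)
    return summary.sort_values('count', ascending=False)

summary = condition_summary(toy)
print(summary)
```

The same helper could be pointed at any other categorical column by changing the `col` argument.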

Nutritional Deficiencies

The dataset was analyzed to determine the types of nutritional deficiencies present, such as iron deficiency and vitamin D deficiency.
Deficiencies in Magnesium, Protein, and Vitamin A have slightly higher percentages among individuals with hair loss than among those without.
The occurrence of each deficiency was calculated to identify which deficiencies are more prevalent.
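
One way to compare how deficiencies are distributed among people with and without hair loss is a column-normalized cross-tabulation. The sketch below uses invented rows, and the column names `Deficiencies` and `Hair Loss` (coded 0/1) are assumptions about the cleaned data frame.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset is not reproduced here.
toy = pd.DataFrame({
    'Deficiencies': ['Magnesium', 'Protein', 'Vitamin A', 'Iron',
                     'Magnesium', 'Iron', 'Protein', 'Iron'],
    'Hair Loss': [1, 1, 1, 0, 1, 0, 0, 1],
})

# normalize='columns' makes each hair-loss group sum to 100%,
# so the two columns can be compared directly.
share = pd.crosstab(toy['Deficiencies'], toy['Hair Loss'], normalize='columns') * 100
share.columns = ['No Hair Loss (%)', 'Hair Loss (%)']

# Difference in percentage points between the two groups.
share['Difference (pp)'] = share['Hair Loss (%)'] - share['No Hair Loss (%)']
print(share.round(1))
```

A positive difference means the deficiency makes up a larger share of the hair-loss group than of the comparison group. It says nothing about whether the gap is statistically meaningful.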

4. Data Visualization

Proportion of Patients with Hair Loss by Age Group

The proportion of patients experiencing hair loss was plotted across different age groups to understand whether age is a significant factor. The results were inconclusive.
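
The age-group proportions can be sketched as below. The bin edges are assumptions chosen for illustration (only the 31-40 group is named in this report), and the rows in `toy` are invented.

```python
import pandas as pd

# Hypothetical toy rows; 'Hair Loss' is assumed to be coded 0/1.
toy = pd.DataFrame({
    'Age': [19, 24, 28, 33, 37, 39, 42, 45, 48, 50],
    'Hair Loss': [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
})

# Assumed right-inclusive bins: (17, 30], (30, 40], (40, 50].
bins = [17, 30, 40, 50]
labels = ['18-30', '31-40', '41-50']
toy['Age Group'] = pd.cut(toy['Age'], bins=bins, labels=labels)

# The mean of a 0/1 column is the proportion of rows with hair loss.
proportion = (
    toy.groupby('Age Group', observed=False)['Hair Loss']
       .agg(hair_loss_rate='mean', n='size')
)
print(proportion)
```

Reporting the group size `n` next to each rate helps show when a difference between groups rests on very few observations.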

Factors Associated with Hair Loss

A correlation heatmap was generated to explore relationships between medical conditions, lifestyle factors, and hair loss.
The analysis indicated that factors such as Age Group 31-40, Smoking status, and specific medical conditions have weak to moderate associations with hair loss.
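
A correlation matrix of this kind requires numeric inputs, so categorical columns are typically one-hot encoded first. The following is a minimal sketch under that assumption. The rows are invented, and the column names `Smoking` and `Stress` are hypothetical stand-ins for the dataset's columns.

```python
import pandas as pd

# Hypothetical toy rows; 'Hair Loss' is assumed to be coded 0/1.
toy = pd.DataFrame({
    'Smoking': ['Yes', 'No', 'Yes', 'No', 'Yes', 'No'],
    'Stress': ['Low', 'High', 'Moderate', 'Low', 'High', 'Moderate'],
    'Age': [25, 34, 41, 29, 38, 45],
    'Hair Loss': [1, 0, 1, 0, 1, 0],
})

# One-hot encode the categorical columns so every column is numeric.
encoded = pd.get_dummies(toy, columns=['Smoking', 'Stress'], dtype=int)

# Pearson correlation between every pair of columns.
corr = encoded.corr()

# Rank features by the absolute strength of their correlation with the target.
target_corr = corr['Hair Loss'].drop('Hair Loss').sort_values(key=abs, ascending=False)
print(target_corr.round(2))
```

The matrix `corr` can then be passed to `sns.heatmap(corr, annot=True)` to draw the heatmap. In this toy data `Smoking_Yes` matches the target exactly, which is an artifact of the invented rows rather than a finding.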

Hair Loss Under Different Stress Levels

The dataset was visualized to understand how stress levels (low, moderate, high) relate to hair loss.
The visualizations showed a trend in which moderate stress levels exhibited increased occurrences of hair loss. However, the results were inconclusive.
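
A rough way to judge whether differences between stress levels are conclusive is to put a standard error next to each group's hair-loss rate. The sketch below uses invented rows and assumes a `Stress` column with the levels Low, Moderate, and High. The normal-approximation standard error sqrt(p(1 - p) / n) is a swapped-in illustration, not a step taken from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; 'Hair Loss' is assumed to be coded 0/1.
toy = pd.DataFrame({
    'Stress': ['Low', 'Moderate', 'High', 'Moderate', 'Low', 'High', 'Moderate', 'Low'],
    'Hair Loss': [0, 1, 1, 1, 1, 0, 0, 0],
})

# An ordered categorical keeps the levels in Low < Moderate < High order.
order = ['Low', 'Moderate', 'High']
toy['Stress'] = pd.Categorical(toy['Stress'], categories=order, ordered=True)

by_stress = (
    toy.groupby('Stress', observed=False)['Hair Loss']
       .agg(rate='mean', n='size')
)

# Standard error of a proportion (normal approximation).
by_stress['std_err'] = np.sqrt(by_stress['rate'] * (1 - by_stress['rate']) / by_stress['n'])
print(by_stress.round(3))
```

When the standard errors are large relative to the gaps between rates, as they are with so few rows here, an apparent trend should be treated as inconclusive.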

5. Machine Learning Analysis

Classification Model for Hair Loss Prediction

Both linear and non-linear classification models were built to predict whether an individual will experience hair loss based on given factors.
The models were evaluated for accuracy, precision, and recall to determine their effectiveness.
Most models scored low on accuracy (46-53%), roughly equivalent to a coin flip, and showed signs of underfitting the training dataset. Possible remedies include collecting more data, adding features to increase model complexity, and deriving new features.
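
The evaluation loop can be sketched as follows: one linear and one non-linear model, each scored on accuracy, precision, and recall. The data here is synthetic, with labels deliberately unrelated to the features, so near-chance test accuracy is expected. The model choices and hyperparameters are assumptions, not the report's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: labels are independent of the features by construction.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 6))
y = rng.integers(0, 2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),                    # linear
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),  # non-linear
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        'train_accuracy': accuracy_score(y_train, model.predict(X_train)),
        'test_accuracy': accuracy_score(y_test, pred),
        'precision': precision_score(y_test, pred, zero_division=0),
        'recall': recall_score(y_test, pred, zero_division=0),
    }

for name, scores in results.items():
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

Comparing `train_accuracy` with `test_accuracy` is one way to check an underfitting claim. Low scores on both suggest underfitting or a lack of signal in the features, while a high training score paired with a low test score points to overfitting instead.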

Cluster Analysis for Hair Loss Subgroups

K-Means clustering was applied to identify potential subgroups among individuals based on their medical and lifestyle factors.
The clustering results indicated distinct groups that could be useful for targeted health interventions.
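
A minimal K-Means sketch under stated assumptions: the features are synthetic, the column names are hypothetical, and three clusters are requested to match the three groups reported in the findings. Features are standardized first because K-Means relies on Euclidean distance, so unscaled columns such as age would otherwise dominate.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features; column names are illustrative only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Age': rng.integers(18, 51, size=150),
    'Stress Level': rng.integers(0, 3, size=150),   # 0 = low, 1 = moderate, 2 = high
    'Genetics': rng.integers(0, 2, size=150),
    'Hormonal Changes': rng.integers(0, 2, size=150),
})

# Put all features on a comparable scale before clustering.
scaled = StandardScaler().fit_transform(toy)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
toy['Cluster'] = kmeans.fit_predict(scaled)

# Per-cluster feature means give a rough profile of each subgroup.
profile = toy.groupby('Cluster').mean().round(2)
print(profile)
```

Cluster profiles like these are descriptive. Whether the groups are stable or clinically meaningful would need separate validation, for example with silhouette scores or repeated runs.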

Key Factors Predicting Hair Loss

Random Forests were used to identify the key factors that best predict hair loss.
Feature importance was calculated to determine which factors (e.g., age, medical conditions) were the most significant predictors.
The most significant predictors of hair loss were the presence of: Hormonal Changes, Environmental Factors, Poor Hair Care Habits, Weight Loss, and Smoking. In addition, instances of Low and Moderate Stress Levels were also strong predictors of hair loss.
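
Feature importance from a Random Forest can be extracted as sketched below. The data is synthetic: the target is built to depend mostly on a hypothetical `Hormonal Changes` column with 10% label noise, so that column should rank first. None of the values come from the report's dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in features; column names are illustrative only.
rng = np.random.default_rng(1)
n = 300
X = pd.DataFrame({
    'Hormonal Changes': rng.integers(0, 2, size=n),
    'Smoking': rng.integers(0, 2, size=n),
    'Weight Loss': rng.integers(0, 2, size=n),
    'Age': rng.integers(18, 51, size=n),
})

# Target copies 'Hormonal Changes', with roughly 10% of labels flipped at random.
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - X['Hormonal Changes'], X['Hormonal Changes'])

forest = RandomForestClassifier(n_estimators=200, random_state=1)
forest.fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = (
    pd.Series(forest.feature_importances_, index=X.columns)
      .sort_values(ascending=False)
)
print(importances.round(3))
```

Impurity-based importances can favor features with many distinct values, such as age. When the underlying model predicts close to chance, the ranking should be read with caution.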

6. Key Findings and Insights

Age and specific medical conditions like Alopecia Areata are associated with a higher likelihood of hair loss, but the relationships are not strong.
Stress levels and smoking status appear to have complex interactions with hair loss, suggesting the need for further exploration.
Cluster analysis identified distinct groups, which could be targeted for personalized health interventions.
The three clusters are generalized as follows:
- Group 1: Younger Individuals with Moderate Stress and Genetic Factors
- Group 2: Middle-Aged Individuals with Moderate to High Stress and Hormonal Influences
- Group 3: Middle-Aged Individuals with High Stress and Lifestyle Influences
Several machine learning algorithms were implemented to predict Hair Loss. Ensemble methods (Random Forest, Gradient Boosting, and XGBoost) performed best, but their accuracy was still quite low (46-53%). The best-performing model was a stacked model that combined several decision-tree-based models; it achieved an accuracy of 57%.
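
A stacked model of tree-based learners can be sketched with scikit-learn's `StackingClassifier`. This is an illustration under assumptions, not the report's exact model: the data is synthetic, XGBoost is left out to keep the example dependency-free, and the logistic-regression meta-learner is an assumed choice.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a few informative features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7, stratify=y
)

# Base learners are tree ensembles; their cross-validated predictions
# become the inputs of the final (meta) estimator.
stack = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=50, random_state=7)),
        ('gb', GradientBoostingClassifier(random_state=7)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

test_accuracy = stack.score(X_test, y_test)
print(f'Stacked model test accuracy: {test_accuracy:.2f}')
```

The accuracy printed here reflects the synthetic data only and is not comparable to the 57% reported above.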

7. Recommendations

The sample size is too small for the machine learning models to predict hair loss with high accuracy, and there is evidence that the models are underfitting the dataset. Improving on the current performance benchmark will require more features and more observations.
Health interventions could be targeted based on identified clusters, focusing on individuals at higher risk.
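
Whether more observations would help can be probed with a learning curve, which tracks training and validation accuracy as the training set grows. The sketch below uses synthetic data and an assumed Random Forest configuration. It illustrates the diagnostic rather than reproducing a step from the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in data; the real dataset is not reproduced here.
X, y = make_classification(n_samples=400, n_features=8, n_informative=3, random_state=3)

# Fit the model on 20%, 40%, ..., 100% of each training fold (5-fold CV).
sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=3),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring='accuracy',
)

# Average the scores over the cross-validation folds for each training size.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f'{int(n):4d} rows: train={tr:.2f}, validation={va:.2f}')
```

If validation accuracy is still rising at the largest training size, more observations are likely to help. If both curves are flat and low, new or better features are the more promising direction.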

8. Conclusion

This report provides a comprehensive analysis of factors contributing to hair loss. Although the correlations identified are weak, they provide a foundation for further exploration using more sophisticated modeling techniques. The insights derived from machine learning and clustering can help inform targeted health interventions and personalized care.

Code and Analysis

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import seaborn as sns

import scipy.stats as stats
import statsmodels.api as sm

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

from itertools import combinations

from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

# import original dataset
data = pd.read_csv('data/Predict Hair Fall.csv')

# create duplicate
df = data.copy()

# revise column formatting
df.columns = [x.strip() for x in df.columns]

# drop id
df = df.drop('Id', axis = 1)

# reformat Nutritional Deficiencies: drop the word "deficiency" (any case) and trim whitespace
df['Nutritional Deficiencies'] = (
    df['Nutritional Deficiencies']
    .str.replace('deficiency', '', case=False, regex=True)
    .str.strip()
)

# clean Medical Conditions values (.str.strip() leaves missing values untouched)
df['Medical Conditions'] = df['Medical Conditions'].str.strip()

# shorten the Medical Conditions, Medications & Treatments, and Nutritional Deficiencies column names
mapping = {
            'Medical Conditions': 'Conditions', 
            'Medications & Treatments': 'Medications', 
            'Nutritional Deficiencies': 'Deficiencies'
          }
# rename only the mapped columns; all other columns keep their names
df = df.rename(columns=mapping)

# # Remap Yes/No Columns
# mapping = {'Yes': 1, 'No': 0}
# for col in ['Genetics', 'Hormonal Changes', 'Poor Hair Care Habits', 'Environmental Factors', 'Smoking', 'Weight Loss']:
#     df[col] = df[col].map(mapping)

df_copy = df.copy()

# preview data
df.head(10)

df.info()

df.isnull().sum()

Level 1: Descriptive Statistics

1A. What is the average age of observations?

# Age Analysis
average_age = df['Age'].mean()
median_age = df['Age'].median()
min_age = df['Age'].min()
max_age = df['Age'].max()

print(f'Average age is {round(average_age, 2)} years')
print(f'Median age is {round(median_age, 2)} years')
print(f'Youngest age is {min_age} years')
print(f'Oldest age is {max_age} years')

average_age_no_hair_loss = df[df['Hair Loss'] == 0]['Age'].mean()
average_age_hair_loss = df[df['Hair Loss'] == 1]['Age'].mean()

print(f'The average age of men with hair loss is {round(average_age_hair_loss, 2)}')
print(f'The average age of men with no hair loss is {round(average_age_no_hair_loss, 2)}')

1B. What is the age distribution?

plt.figure(figsize = (16, 6))
sns.boxplot(data = df, hue = 'Hair Loss', x = 'Age')
plt.title("Age Distribution")
plt.tight_layout()

fig, axs = plt.subplots(2, 2, figsize = (16, 8))
axs = axs.flatten()

sns.histplot(data = df, x = 'Age', kde=True, ax = axs[0])
axs[0].set_title('Age Distribution')
axs[0].set_xlabel('Age')
axs[0].set_ylabel('Frequency')

# Age distribution of observations with and without hair loss
sns.histplot(data=df, x='Age', hue='Hair Loss', kde = True, bins = 15, ax = axs[1], alpha = 0.5)
axs[1].set_title('Age Distribution of Participants with and without Hair Loss')
axs[1].set_xlabel('Age')
axs[1].set_ylabel('Frequency')

sns.boxplot(data = df, x = 'Age', ax = axs[2])
axs[2].set_title('Age Distribution')
axs[2].set_xlabel('Age')
axs[2].set_ylabel('')  # a horizontal box plot has no frequency axis

# Age distribution of observations with and without hair loss
sns.boxplot(data=df, x='Age', hue='Hair Loss', ax = axs[3])
axs[3].set_title('Age Distribution of Participants with and without Hair Loss')
axs[3].set_xlabel('Age')
axs[3].set_ylabel('')  # a horizontal box plot has no frequency axis
axs[3].legend(loc='upper right')


plt.tight_layout()
plt.show()

2A. Which medical conditions are the most common? How often do they occur?
