Hair Loss Analysis: Machine Learning for Prediction and Explainable Insights: A Comprehensive Analysis
Executive Summary
This project aimed to build a predictive model to assess the likelihood of hair loss based on a dataset encompassing diverse features related to medical conditions, medications, treatments, and nutritional deficiencies. We followed a systematic approach to preprocess, analyze, model, and interpret the data. Here’s a comprehensive analysis of each stage in the project and the insights derived.
Introduction
Hair loss is a common issue influenced by a range of factors, including genetics, medical conditions, treatments, and nutrition. The goal is a machine learning model that predicts the likelihood of hair loss from these attributes. Through systematic data preprocessing, exploratory data analysis, model training, and interpretability tools, we look for insights into the factors behind hair loss and assess how reliable the predictions are. The sections below analyze each stage in turn, highlighting key findings and recommendations for further improvement.
💾 The data
The survey data is provided in the file Predict Hair Fall.csv in the data folder.
The dataset contains information on the people surveyed. Each row represents one person.
- "Id" - A unique identifier for each person.
- "Genetics" - Whether the person has a family history of baldness.
- "Hormonal Changes" - Indicates whether the individual has experienced hormonal changes (Yes/No).
- "Medical Conditions" - Medical history that may lead to baldness; alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.
- "Medications & Treatments" - History of medications that may cause hair loss; chemotherapy, heart medications, antidepressants, steroids, etc.
- "Nutritional Deficiencies" - Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega-3 fatty acid deficiency, etc.
- "Stress" - Indicates the stress level of the individual (Low/Moderate/High).
- "Age" - Represents the age of the individual.
- "Poor Hair Care Habits" - Indicates whether the individual practices poor hair care habits (Yes/No).
- "Environmental Factors" - Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).
- "Smoking" - Indicates whether the individual smokes (Yes/No).
- "Weight Loss" - Indicates whether the individual has experienced significant weight loss (Yes/No).
- "Hair Loss" - Binary variable indicating the presence (1) or absence (0) of baldness in the individual.
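The data dictionary above can be turned into a quick schema check before any cleaning. The following is a minimal sketch, not part of the original analysis: `check_schema` and `EXPECTED_COLUMNS` are hypothetical names, and the comparison strips whitespace from headers on the assumption that the raw CSV may contain trailing spaces in some column names.

```python
import pandas as pd

# Column names taken from the data dictionary. Headers are compared after
# stripping whitespace, since the raw file may carry trailing spaces.
EXPECTED_COLUMNS = [
    "Id", "Genetics", "Hormonal Changes", "Medical Conditions",
    "Medications & Treatments", "Nutritional Deficiencies", "Stress",
    "Age", "Poor Hair Care Habits", "Environmental Factors",
    "Smoking", "Weight Loss", "Hair Loss",
]


def check_schema(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> dict:
    """Report expected columns that are missing and columns that are unexpected."""
    actual = [str(c).strip() for c in df.columns]
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
    }


# Toy frame for illustration only; the real data comes from the CSV file.
toy = pd.DataFrame({"Id": [1], "Genetics": ["Yes"], "Weight Loss ": ["No"], "Extra": [0]})
report = check_schema(toy)
print("Missing:", report["missing"])
print("Unexpected:", report["unexpected"])
```

Running the same check on the loaded DataFrame would flag renamed or dropped columns early, before the cleaning step fails on a missing key.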
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, classification_report, silhouette_score
from lime import lime_tabular
import warnings
warnings.filterwarnings('ignore')
Data Loading and Initial Exploration
print("1️⃣ Loading and Validating Data...")
file_path = 'data/Predict Hair Fall.csv'
# Load the data; re-raise on failure so later cells do not run without df
try:
    df = pd.read_csv(file_path)
except FileNotFoundError:
    print(f"Error: File '{file_path}' not found")
    raise
except Exception as e:
    print(f"Error loading file: {e}")
    raise
print("\nBasic Validation Checks:")
print("-" * 50)
print(f"Rows: {df.shape[0]}, Columns: {df.shape[1]}")
print(f"Missing values per column:\n{df.isnull().sum()}")
df.head()
Data Cleaning and Preprocessing
def clean_data(df):
    """
    Clean and preprocess the dataset.

    Parameters:
        df (pd.DataFrame): Raw dataset
    Returns:
        pd.DataFrame: Cleaned dataset
    """
    print("\n2️⃣ Cleaning and Preprocessing Data...")
    df_clean = df.copy()
    # Convert Hair Loss to int
    print("Converting 'Hair Loss' to integer values...")
    df_clean['Hair Loss'] = df_clean['Hair Loss'].astype(int)
    # Handle 'No Data' values (some raw column names end with a trailing space)
    columns_with_no_data = ['Medical Conditions', 'Medications & Treatments',
                            'Nutritional Deficiencies ']
    print("Replacing 'No Data' values with the mode of each column...")
    for col in columns_with_no_data:
        df_clean[col] = df_clean[col].replace('No Data', df_clean[col].mode()[0])
    # Apply One-Hot Encoding to specific categorical columns
    print("\nApplying One-Hot Encoding to 'Medical Conditions', 'Medications & Treatments', and 'Nutritional Deficiencies'...")
    df_clean = pd.get_dummies(df_clean, columns=columns_with_no_data, drop_first=True)
    # Encode the 'Stress' column with ordinal values
    ordinal_mapping = {'Low': 1, 'Moderate': 2, 'High': 3}
    print("Encoding 'Stress' as ordinal values (Low=1, Moderate=2, High=3)...")
    df_clean['Stress'] = df_clean['Stress'].map(ordinal_mapping)
    # Convert binary columns
    binary_columns = ['Genetics', 'Hormonal Changes', 'Poor Hair Care Habits ',
                      'Environmental Factors', 'Smoking', 'Weight Loss ']
    print("\nConverting binary columns ('Yes'/'No') to 1/0...")
    for col in binary_columns:
        df_clean[col] = df_clean[col].map({'Yes': 1, 'No': 0})
    print("\nSample of Cleaned Data:")
    print(df_clean.head())
    return df_clean
EDA with Data Validation
# Clean first so the header is printed directly above the statistics
df_clean = clean_data(df)
print("\nSummary statistics of cleaned dataset:")
print(df_clean.describe())
# Distribution of Hair Loss
plt.figure(figsize=(10, 6))
sns.countplot(data=df_clean, x='Hair Loss')
plt.title('Distribution of Hair Loss Cases')
plt.show()
# Age Distribution by Hair Loss
plt.figure(figsize=(10, 6))
sns.boxplot(data=df_clean, x='Hair Loss', y='Age')
plt.title('Age Distribution by Hair Loss Status')
plt.show()
# only the top correlations with 'Hair Loss'
correlation_matrix = df_clean.corr()
top_features = correlation_matrix['Hair Loss'].abs().sort_values(ascending=False).head(15).index
top_corr_matrix = correlation_matrix.loc[top_features, top_features]
# Plotting the heatmap
plt.figure(figsize=(14, 10))
sns.heatmap(top_corr_matrix, annot=True, cmap='coolwarm', center=0, fmt=".2f", cbar_kws={"shrink": .8})
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.title('Top 15 Feature Correlations with Hair Loss')
plt.tight_layout()
plt.show()
Summary Statistics
Overall Insights
The dataset appears well-balanced for the target variable, Hair Loss, which is beneficial for training a classification model without needing additional balancing techniques. The age range and binary features (e.g., genetics, hormonal changes) are fairly diverse, suggesting that the dataset covers a range of factors relevant to hair loss. The presence of several one-hot encoded columns from Medical Conditions, Medications & Treatments, and Nutritional Deficiencies shows that the data has been enriched with detailed categorical information, which may help the model identify more complex relationships.
These summary statistics provide a foundational understanding of the dataset and highlight areas where further analysis could reveal patterns, such as in exploring the distribution of stress, age, and medical conditions in relation to hair loss.
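One way to start the follow-up exploration suggested here is to compare the hair loss rate across the levels of a single feature. The sketch below is illustrative only: `rate_by_group` is a hypothetical helper, the toy values are invented for the example, and `Stress` is assumed to be already encoded as 1/2/3 as in the cleaning step.

```python
import pandas as pd


def rate_by_group(df: pd.DataFrame, group_col: str, target: str = "Hair Loss") -> pd.DataFrame:
    """Share of positive cases (mean of a 0/1 target) and group size per level."""
    return df.groupby(group_col)[target].agg(rate="mean", n="count")


# Toy data for illustration only; not taken from the survey.
toy = pd.DataFrame({
    "Stress":    [1, 1, 2, 2, 3, 3, 3],
    "Hair Loss": [0, 0, 0, 1, 1, 1, 0],
})
stress_rates = rate_by_group(toy, "Stress")
print(stress_rates)
```

Reporting the group size next to each rate makes it easier to avoid over-reading differences that rest on only a few rows.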
Distribution of Hair Loss Cases
The count plot for the distribution of hair loss cases shows the frequency of each category within the 'Hair Loss' variable. This helps us understand the prevalence of hair loss in the dataset. If the bars are uneven, it indicates an imbalance in the dataset, which might need to be addressed in further analysis or modeling.
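The visual impression from the count plot can be backed by numbers. The following is a minimal sketch under the assumption that the target is a 0/1 column; `class_balance` and `imbalance_ratio` are hypothetical helper names, and the toy series is invented for the example.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Count and share of each class in the target column."""
    counts = target.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


def imbalance_ratio(target: pd.Series) -> float:
    """Size of the largest class divided by the size of the smallest."""
    counts = target.value_counts()
    return counts.max() / counts.min()


# Toy target for illustration only.
toy_target = pd.Series([1, 0, 1, 1, 0, 1, 0, 0, 1, 1], name="Hair Loss")
balance = class_balance(toy_target)
ratio = imbalance_ratio(toy_target)
print(balance)
print(f"Imbalance ratio: {ratio:.2f}")
```

A ratio close to 1 supports the reading that the classes are balanced; how large a ratio has to be before rebalancing is worthwhile is a judgment call that depends on the model and the metric.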
Age Distribution by Hair Loss Status
The box plot for age distribution by hair loss status provides insights into the age range and median age for individuals with and without hair loss. The spread of the data (interquartile range) and the presence of any outliers can also be observed. This helps in judging whether age is likely to be a relevant factor in hair loss, although a box plot alone does not establish statistical significance.
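The quantities a box plot displays (quartiles, median, interquartile range) can also be tabulated per group, which makes the comparison easier to quote. This is a sketch rather than part of the original analysis; `age_summary` is a hypothetical helper and the ages are invented for the example.

```python
import pandas as pd


def age_summary(df: pd.DataFrame, group_col: str = "Hair Loss", value_col: str = "Age") -> pd.DataFrame:
    """Quartiles, median, and interquartile range of value_col per group."""
    quantiles = df.groupby(group_col)[value_col].quantile([0.25, 0.5, 0.75]).unstack()
    quantiles.columns = ["q1", "median", "q3"]
    quantiles["iqr"] = quantiles["q3"] - quantiles["q1"]
    return quantiles


# Toy data for illustration only.
toy = pd.DataFrame({
    "Hair Loss": [0, 0, 0, 1, 1, 1],
    "Age":       [20, 30, 40, 25, 35, 45],
})
summary = age_summary(toy)
print(summary)
```

If the medians and interquartile ranges of the two groups overlap heavily, age on its own is unlikely to separate the classes well.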
Correlation Matrix
Key Observations:
- Age and Smoking: Age and Smoking have weak correlations with Hair Loss, with correlation values close to zero. This suggests that these variables may not be significant standalone predictors of hair loss, though they could still contribute in combination with other factors.
- Medical Conditions and Hair Loss: Several medical conditions, including Seborrheic Dermatitis, Androgenetic Alopecia, and Thyroid Problems, show some minor correlation with Hair Loss. While individually weak, these conditions could cumulatively affect hair loss and might benefit from being part of an interaction term or used in clustering to see if they form patterns in hair loss cases. Medical Conditions_No Data has a mild correlation with age, which may indicate that missing data on medical conditions could be related to certain age groups.
- Genetics: The Genetics feature has a slightly positive correlation with Hair Loss, though still weak. This suggests that while genetics may play a role, it might not be a strong independent predictor in this dataset.
- Nutritional Deficiencies: Magnesium Deficiency and Omega-3 Fatty Acids Deficiency show very weak correlations with Hair Loss. This could imply that these specific deficiencies might not strongly influence hair loss but could have subtle effects in combination with other nutritional factors.
- Weight Loss: Weight Loss has a very weak positive correlation with Hair Loss. This could indicate that while weight loss alone might not be a major factor, it might contribute to hair loss when combined with other stressors or health conditions.
- Inter-feature Correlations: There are moderate correlations between certain medical conditions, such as Seborrheic Dermatitis and Thyroid Problems, or Steroids with Immunomodulators. These inter-feature correlations suggest that some features may frequently occur together, which might warrant creating interaction terms or conducting cluster analysis to explore groups of individuals with similar medical conditions.
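The observations above are read off a heatmap. A ranked list of correlations with the target, excluding the target itself, gives the same information in a form that is easier to cite. The sketch below is an assumption-laden illustration: `rank_target_correlations` is a hypothetical helper, the toy columns `A`, `B`, and `C` are invented, and Pearson correlation (the pandas default) is used.

```python
import pandas as pd


def rank_target_correlations(df: pd.DataFrame, target: str = "Hair Loss", top_n: int = 10) -> pd.Series:
    """Signed correlations with the target, ordered by absolute value."""
    corr = df.corr(numeric_only=True)[target].drop(labels=[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order].head(top_n)


# Toy data for illustration only.
toy = pd.DataFrame({
    "Hair Loss": [0, 0, 1, 1],
    "A":         [0, 0, 1, 1],   # moves with the target
    "B":         [1, 0, 1, 0],   # unrelated to the target
    "C":         [1, 1, 0, 0],   # moves against the target
})
ranked = rank_target_correlations(toy)
print(ranked)
```

Keeping the sign while sorting by magnitude shows both how strong and in which direction each association runs; with correlations as weak as those described above, the ranking is best treated as a screening aid rather than evidence of effect.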
print(df_clean.columns)
Clustering Analysis
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
def analyze_clusters(df_clean):
    """Perform clustering analysis with dynamically selected features."""
    # Define basic features and dynamically select one-hot encoded features
    basic_features = ['Age', 'Stress', 'Genetics', 'Hormonal Changes']
    medical_conditions_features = [col for col in df_clean.columns if col.startswith('Medical Conditions')]
    medications_features = [col for col in df_clean.columns if col.startswith('Medications & Treatments')]
    nutritional_deficiencies_features = [col for col in df_clean.columns if col.startswith('Nutritional Deficiencies')]
    # Combine all selected features for clustering
    cluster_features = basic_features + medical_conditions_features + medications_features + nutritional_deficiencies_features
    print(f"Clustering on features: {cluster_features}")
    X_cluster = df_clean[cluster_features]
    # Standardize the data
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X_cluster)
    # Perform PCA for dimensionality reduction
    pca = PCA(n_components=3)  # Adjust the number of components as needed
    X_pca = pca.fit_transform(X_scaled)
    print("\nExplained Variance Ratio by Principal Components:", pca.explained_variance_ratio_)
    # Determine the optimal number of clusters using silhouette score
    silhouette_scores = []
    K = range(2, 8)
    for k in K:
        kmeans = KMeans(n_clusters=k, random_state=42)
        kmeans.fit(X_pca)
        score = silhouette_score(X_pca, kmeans.labels_)
        silhouette_scores.append(score)
        print(f"Silhouette score for k={k}: {score:.3f}")
    # Plot silhouette scores to visualize optimal k
    plt.figure(figsize=(10, 6))
    plt.plot(K, silhouette_scores, 'bx-')
    plt.xlabel('Number of Clusters (k)')
    plt.ylabel('Silhouette Score')
    plt.title('Optimal Number of Clusters')
    plt.show()
    # Apply KMeans with the optimal number of clusters
    optimal_k = K[np.argmax(silhouette_scores)]
    kmeans = KMeans(n_clusters=optimal_k, random_state=42)
    df_clean['Cluster'] = kmeans.fit_predict(X_pca)
    # Print cluster statistics for analysis
    print(f"\nOptimal number of clusters: {optimal_k}")
    print("\nCluster Statistics:")
    cluster_stats = df_clean.groupby('Cluster').agg({
        'Hair Loss': ['mean', 'count'],
        'Age': 'mean',
        'Stress': 'mean',
        'Genetics': 'mean'
    }).round(3)
    print(cluster_stats)
    return df_clean, optimal_k
# Perform clustering
df_clustered, optimal_k = analyze_clusters(df_clean)
# Visualize clusters in terms of Age and Stress
plt.figure(figsize=(12, 6))
sns.scatterplot(x=df_clustered['Age'], y=df_clustered['Stress'], hue=df_clustered['Cluster'], style=df_clustered['Hair Loss'])
plt.title('Clusters by Age and Stress Level')
plt.xlabel('Age')
plt.ylabel('Stress Level')
plt.legend(title='Cluster', loc='upper right')
plt.show()