Hair Loss Prediction
Table of Contents
- 1 Introduction
- 2 Libraries & Configurations
  - 2.1 Libraries
  - 2.2 Functions
  - 2.3 Configurations
- 3 Data Wrangling
  - 3.1 Data Validation
  - 3.2 Data Cleaning
- 4 Exploratory Data Analysis
  - 4.1 Target Variable - Hair Loss
  - 4.2 Numeric and Categorical Features
- 5 Model Fitting and Evaluation
  - 5.1 Data Pre-processing
  - 5.2 Logistic Regression Model
  - 5.3 Random Forest Model
  - 5.4 Decision Tree Model
  - 5.5 Model Summary
- 6 Segmentation Analysis
  - 6.1 Scaling
  - 6.2 Principal Component Analysis
  - 6.3 Clustering Using K-Means
  - 6.4 Hair Loss Clusters
- 7 Summary
1. Introduction
As people age, hair loss becomes a common health concern impacting both appearance and overall well-being. Hair density is not only a cosmetic feature but also a potential indicator of health status. In my hair loss prediction project, I investigate a wide range of factors that may contribute to hair loss, including genetics, hormonal fluctuations, medical conditions, medications, nutritional deficiencies, psychological stress, and more. By performing extensive data exploration and analysis, I aim to uncover meaningful correlations between these factors and hair loss. The insights gained from this project could support individualized health management strategies, inform medical interventions, and benefit related industries focused on hair care and wellness.
Objective:
The objective of this project is to develop a predictive model for hair loss risk by analyzing various contributing factors, including genetics, hormonal changes, medical conditions, medications, nutritional status, and psychological stress. Through data-driven insights, this project aims to identify significant correlations and build a reliable tool that can support personalized health management, guide medical interventions, and provide actionable recommendations.
Methodology:
Here are the main steps of the project:
- Exploratory Data Analysis (EDA): Performing both univariate and bivariate analysis to understand the data distribution and relationships between variables.
- Predictive Modeling: Using logistic regression, random forest, and decision tree algorithms to build predictive models.
- Feature Importance: Determining the importance of each feature in the models.
- Model Evaluation: Evaluating the models using metrics such as AUC (Area Under the ROC Curve) and accuracy.
- Cluster Analysis: Grouping similar data points using k-means clustering techniques.
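The modeling and evaluation steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual pipeline: it uses synthetic data from `make_classification` as a stand-in for the survey data, and the hyperparameters (`max_iter`, `n_estimators`, `max_depth`) and the 80/20 split are assumptions chosen only to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 42

# Synthetic stand-in for the survey data (assumption: not the real dataset).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=SEED
)

# Hold out 20% of the rows for evaluation, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)

# The three model families named in the methodology; settings are illustrative.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=SEED),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=SEED),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predicted = model.predict(X_test)
    # AUC needs a score per row, so use the predicted probability of class 1.
    probability = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "accuracy": accuracy_score(y_test, predicted),
        "auc": roc_auc_score(y_test, probability),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, AUC={scores['auc']:.3f}")
```

Reporting AUC alongside accuracy is useful because AUC is computed from the predicted probabilities rather than from a single 0.5 cut-off.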
Data:
The data come from a survey and are stored in Predict Hair Fall.csv in the data folder. Each row represents one person who took part in the survey.
- Id - A unique identifier for each person.
- Genetics - Whether the person has a family history of baldness.
- Hormonal Changes - Indicates whether the individual has experienced hormonal changes (Yes/No).
- Medical Conditions - Medical history that may lead to baldness, such as alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.
- Medications & Treatments - History of medications that may cause hair loss, such as chemotherapy, heart medications, antidepressants, steroids, etc.
- Nutritional Deficiencies - Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega-3 fatty acid deficiency, etc.
- Stress - Indicates the stress level of the individual (Low/Moderate/High).
- Age - Represents the age of the individual.
- Poor Hair Care Habits - Indicates whether the individual practices poor hair care habits (Yes/No).
- Environmental Factors - Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).
- Smoking - Indicates whether the individual smokes (Yes/No).
- Weight Loss - Indicates whether the individual has experienced significant weight loss (Yes/No).
- Hair Loss - Binary variable indicating the presence (1) or absence (0) of baldness in the individual.
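The data dictionary above can be turned into a few basic checks. The sketch below is a hypothetical helper (`validate_schema` is not part of the notebook), the exact header spelling in the CSV file is an assumption, and the two rows are made up for illustration.

```python
import pandas as pd

# Column names taken from the data dictionary; the exact header
# spelling in the CSV file is an assumption.
EXPECTED_COLUMNS = [
    "Id", "Genetics", "Hormonal Changes", "Medical Conditions",
    "Medications & Treatments", "Nutritional Deficiencies", "Stress", "Age",
    "Poor Hair Care Habits", "Environmental Factors", "Smoking",
    "Weight Loss", "Hair Loss",
]


def validate_schema(frame):
    """Return a list of problems found; an empty list means the checks passed."""
    missing = [c for c in EXPECTED_COLUMNS if c not in frame.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if frame["Id"].duplicated().any():
        problems.append("Id values are not unique")
    if not frame["Hair Loss"].isin([0, 1]).all():
        problems.append("Hair Loss has values other than 0 and 1")
    if not frame["Stress"].isin(["Low", "Moderate", "High"]).all():
        problems.append("Stress has values other than Low/Moderate/High")
    return problems


# Two made-up rows for illustration only (not taken from the survey).
toy = pd.DataFrame({
    "Id": [1, 2],
    "Genetics": ["Yes", "No"],
    "Hormonal Changes": ["No", "Yes"],
    "Medical Conditions": ["Psoriasis", "Dermatitis"],
    "Medications & Treatments": ["Steroids", "Antidepressants"],
    "Nutritional Deficiencies": ["Iron deficiency", "Biotin deficiency"],
    "Stress": ["Low", "High"],
    "Age": [25, 41],
    "Poor Hair Care Habits": ["Yes", "No"],
    "Environmental Factors": ["No", "Yes"],
    "Smoking": ["No", "Yes"],
    "Weight Loss": ["No", "No"],
    "Hair Loss": [0, 1],
})

print(validate_schema(toy))

# The same rows with an invalid target value, to show a failing check.
bad = toy.assign(**{"Hair Loss": [0, 2]})
print(validate_schema(bad))
```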
Summary:
Exploratory Data Analysis (EDA) uncovered critical insights into the relationships between various factors and hair loss. It revealed that genetic predisposition and specific medical conditions, such as alopecia, are strongly correlated with hair loss. Nutritional deficiencies and high stress levels also have a moderate association with hair loss, while factors like smoking and poor hair care habits showed weaker associations.
I developed logistic regression, random forest, and decision tree models, with the decision tree model performing the best, though its accuracy remained below 60%.
Key features in the predictive models were age, medical conditions, and nutritional deficiencies.
K-means clustering segmented individuals into three distinct groups based on age and hair loss patterns, highlighting significant demographic trends across the clusters.
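The segmentation step summarized above (scaling, principal component analysis, then k-means) might look roughly like the following. This is only a sketch under stated assumptions: the data come from `make_blobs` rather than the survey, and the choice of two principal components and `n_init=10` is illustrative. Only the three-cluster setting is taken from the summary.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

SEED = 42

# Synthetic stand-in for the encoded survey features (assumption).
X, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=SEED)

# 1. Scale every feature to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

# 2. Project onto the first two principal components.
components = PCA(n_components=2, random_state=SEED).fit_transform(X_scaled)

# 3. Group the projected points into three clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=SEED)
labels = kmeans.fit_predict(components)

print(components.shape)
print(sorted(set(labels.tolist())))
```

Scaling before PCA matters because PCA is driven by variance, so an unscaled feature with a large range (such as age in years) would otherwise dominate the components.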
2. Libraries & Configurations
2.1 Libraries
Loading the relevant libraries and user-defined functions
# importing relevant libraries
import pandas as pd  # for data manipulation
import numpy as np  # for numerical computation
import matplotlib.pyplot as plt  # for 2D data visualization
import seaborn as sns  # for statistical data visualization
import altair as alt  # for declarative statistical visualization
from scipy import stats  # for statistical tests
from sklearn.preprocessing import StandardScaler, LabelEncoder  # standardization and label encoding
from sklearn.model_selection import train_test_split, GridSearchCV  # data splitting and hyperparameter search
from sklearn.linear_model import LogisticRegression  # logistic regression (linear classifier)
from sklearn.ensemble import RandomForestClassifier  # ensemble of decision trees
from sklearn.tree import DecisionTreeClassifier  # single tree-based model
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score, roc_curve  # evaluation metrics
from sklearn.decomposition import PCA  # dimensionality reduction
from sklearn.cluster import KMeans  # clustering
from sklearn.feature_selection import RFE  # recursive feature elimination (feature selection)
%matplotlib inline
2.2 Functions
Defining the functions for the analysis
from IPython.display import Markdown, display  # required by printmd

# utility function to render a Markdown string in the notebook
def printmd(string):
    display(Markdown(string))

def chi2_test(col):
    """
    Perform a Chi-Square Test of Independence for a specified column against Hair Loss.

    Reads from the notebook-level DataFrame `df`, so `df` must be loaded first.

    Parameters:
    col (str): The name of the column to test.

    Returns:
    str: The result of the Chi-Square test interpretation.
    """
    # Creating the contingency table
    contingency_table = pd.crosstab(df[col], df['Hair Loss'])
    # Performing the Chi-Square test
    chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
    print(f"Chi-Square Statistic: {round(chi2, 4)}")
    print(f"P-Value: {round(p, 4)}")
    # Interpretation at the 5% significance level
    alpha = 0.05
    if p < alpha:
        result = f"There is a significant association between {col} and hair loss."
    else:
        result = f"There is no significant association between {col} and hair loss."
    return result
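To show how a test like this behaves, here is a small self-contained sketch. `chi2_association` is a hypothetical variant that takes the DataFrame as an argument instead of reading the notebook-level `df`, and the 100 rows are made up so that hair loss is far more common when Genetics is "Yes". Note that for a 2x2 table `stats.chi2_contingency` applies Yates' continuity correction by default.

```python
import pandas as pd
from scipy import stats


def chi2_association(frame, col, target="Hair Loss", alpha=0.05):
    """Hypothetical variant of chi2_test that takes the DataFrame explicitly."""
    table = pd.crosstab(frame[col], frame[target])
    # For 2x2 tables SciPy applies Yates' continuity correction by default.
    chi2, p, dof, expected = stats.chi2_contingency(table)
    return {"chi2": chi2, "p": p, "dof": dof, "significant": bool(p < alpha)}


# Made-up data: 50 people with a family history (40 with hair loss)
# and 50 without (10 with hair loss).
toy = pd.DataFrame({
    "Genetics": ["Yes"] * 50 + ["No"] * 50,
    "Hair Loss": [1] * 40 + [0] * 10 + [1] * 10 + [0] * 40,
})

outcome = chi2_association(toy, "Genetics")
print(outcome)
```

Returning the statistic and p-value in a dictionary, rather than only printing them, makes it easier to collect the results for several columns into one table.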
2.3 Configurations
Setting the configurations to be used for our analysis.
# seed value
SEED = 42
#set seaborn theme
sns.set_theme(style="darkgrid", palette="colorblind")
#displaying all columns
pd.set_option('display.max_columns', None)
3. Data Wrangling
Loading and wrangling the data
#loading the dataframe
df = pd.read_csv(r'data/Predict Hair Fall.csv')
#viewing the dataframe
df.head()
#checking the number of rows and columns in the dataframe
df.shape
The dataset has 999 rows and 13 columns, with both numeric and categorical features.
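One way to confirm which columns are numeric and which are categorical is `select_dtypes`. The sketch below runs on a few made-up rows rather than the real file, so the column subset and values are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the survey data.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "Genetics": ["Yes", "No", "Yes"],
    "Stress": ["Low", "High", "Moderate"],
    "Age": [25, 41, 33],
    "Hair Loss": [0, 1, 1],
})

n_rows, n_cols = toy.shape

# Split the column names by dtype: numbers versus everything else.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"{n_rows} rows, {n_cols} columns")
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

A column such as Id or Hair Loss is stored as a number but is not a continuous measurement, so the dtype split is only a starting point for deciding how each feature should be treated.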