The Bald Truth: A Data-Driven Guide and Classifier to Hair Loss
Executive summary
This study analyzed survey data from 999 individuals to investigate the factors influencing hair loss. Variables such as genetics, hormonal changes, medical conditions, lifestyle choices, and environmental factors were examined. The analysis was conducted across three levels: descriptive statistics, visualization, and machine learning.
Key findings revealed that no single factor strongly predicts hair loss, highlighting the need for multi-factorial approaches. Age, genetics, and lifestyle factors (e.g., smoking, poor hair care habits) emerged as the most significant predictors, while medical history and nutritional deficiencies showed limited impact. Among the machine learning models tested, XGBoost achieved the best performance, though classification accuracy ranged between 45% and 65% due to overlapping predictor characteristics.
To improve accuracy, future work should explore advanced modeling techniques such as Bayesian hyperparameter optimization and adopt robust validation methods such as nested cross-validation. Improving dataset quality by balancing age group representation, incorporating gender as a predictor, and possibly expanding the feature set will also be critical.
This project underscores that while genetics play a role in hair loss, lifestyle improvements and better hair care routines can significantly reduce risk. This conclusion provides a foundation for refining predictive models and advancing our understanding of hair loss prevention.
Keywords:
- Data analysis and visualization
- Cramér's V (measure of association between categorical variables)
- Feature selection and engineering (PCA and Random Forest variable importance)
- Classification model training and testing
- Naive Bayes
- Logistic Regression
- SVM (Radial Basis Function)
- Random Forest
- XGBoost
- Neural Networks (mlp)
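Cramér's V, listed above, is derived from the chi-squared statistic of a contingency table: V = sqrt(chi2 / (n * (min(r, c) - 1))), where n is the number of observations and r and c are the numbers of rows and columns. The following is a minimal base-R sketch of that formula, not necessarily the implementation used in this work. The helper name `cramers_v` and the toy vectors are assumptions made for illustration only.

```r
# Cramér's V computed from a contingency table, base R only.
# Hypothetical helper for illustration; it does not handle the
# degenerate case where a variable has a single level.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # correct = FALSE disables Yates' continuity correction for 2x2 tables;
  # warnings about small expected counts are silenced for the toy example
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (n * (k - 1))))
}

# Toy vectors (invented values, not the survey data)
smoking   <- c("Yes", "No", "Yes", "No", "Yes", "No", "Yes", "No")
hair_loss <- c("Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No")

v <- cramers_v(smoking, hair_loss)
print(v)
```

Values close to 0 indicate little association and values close to 1 indicate a strong one.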
 
Background
As people age, hair loss becomes a common health concern. The fullness of a person's hair not only affects appearance but is also closely related to their health.
A survey brings together a variety of factors that may contribute to hair loss, including genetic factors, hormonal changes, medical conditions, medications, nutritional deficiencies, psychological stress, and more. Through data exploration and analysis, the potential correlations between these factors and hair loss can be examined in depth, providing a useful reference for individual health management, medical intervention, and related industries.
Packages' library
One of the strengths of R is how easy it is to work with packages developed by other programmers. Several packages were used in this work, for example: tidyverse (data manipulation, plotting, etc.), caret (general machine learning modelling), and fastDummies (one-hot encoding preprocessing). All the packages used in this study and its development are listed in the code below.
# Description:
# Add below your list of packages required for this project.
# List of required libraries
required_libraries <- c(
	"tidyverse",
	"readr",
	"gridExtra",
	"fastDummies",
	"FactoMineR",
	"factoextra",
	"caret",
	"naivebayes",
	"xgboost",
	"rcompanion",
	"data.table",
	"kernlab",
	"scorecard",
	"pROC",
	"MLeval",
	"MLmetrics",
	"RSNNS"
) 
# Install missing libraries automatically
install_if_missing <- function(packages) {
  missing_packages <- packages[!(packages %in% installed.packages()[, "Package"])]
  if (length(missing_packages)) {
    message("Installing missing packages: ", paste(missing_packages, collapse = ", "))
    install.packages(missing_packages)
  }
}
# Call the function to install missing packages
install_if_missing(required_libraries)
# Load the libraries with error handling.
# require() returns FALSE instead of stopping when a package fails to load,
# so every failing package can be reported in a single error message.
loaded_libraries <- sapply(required_libraries, require, character.only = TRUE)
if (any(!loaded_libraries)) {
  stop("Error: Some required libraries failed to load: ",
       paste(names(loaded_libraries[!loaded_libraries]), collapse = ", "))
} else {
  message("All libraries loaded successfully.")
}
# Ensure that the system language is set to English.
# Set all locale settings to English.
Sys.setenv(LANG = "en")
Sys.setlocale("LC_ALL", "en_US.UTF-8")
The data
The survey data is provided in the file Predict Hair Fall.csv in the data folder.
The data contains information on the people who took part in the survey. Each row represents one person.
- Id - A unique identifier for each person.
- Genetics - Whether the person has a family history of baldness (Yes/No).
- Hormonal Changes - Indicates whether the individual has experienced hormonal changes (Yes/No).
- Medical Conditions - Medical history that may lead to baldness; alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.
- Medications & Treatments - History of medications that may cause hair loss; chemotherapy, heart medications, antidepressants, steroids, etc.
- Nutritional Deficiencies - Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega-3 fatty acid deficiency, etc.
- Stress - Indicates the stress level of the individual (Low/Moderate/High).
- Age - Represents the age of the individual.
- Poor Hair Care Habits - Indicates whether the individual practices poor hair care habits (Yes/No).
- Environmental Factors - Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).
- Smoking - Indicates whether the individual smokes (Yes/No).
- Weight Loss - Indicates whether the individual has experienced significant weight loss (Yes/No).
- Hair Loss - Binary variable indicating the presence (1) or absence (0) of baldness in the individual.
The dataset has 999 individuals. These individuals provided 12 types of information (e.g., medical conditions, hair loss condition, stress levels, etc.).
Load dataset
The table below shows the first 10 rows of the dataset described above.
# Code main settings 
# Set custom colors to be used in the plots
custom_colors <- c("Yes" = "#03ef62",   # Yes (1)
                   "No" = "#05192d")    # No (0)
# Load csv file
data <- read_csv('/work/files/workspace/data/Predict Hair Fall.csv', show_col_types = FALSE)
head(data, 10)
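Before transforming the columns, it can be worth confirming that the loaded table matches the description above. The following is a minimal base-R sketch on a toy data frame; `check_survey` is a hypothetical helper that is not part of the original code, and the toy values are invented.

```r
# Hypothetical helper: basic structural checks on a survey table.
check_survey <- function(d, id_col = "Id") {
  list(
    n_rows        = nrow(d),                        # number of respondents
    n_cols        = ncol(d),                        # number of variables
    duplicate_ids = sum(duplicated(d[[id_col]])),   # repeated identifiers
    missing_cells = colSums(is.na(d))               # NA count per column
  )
}

# Toy stand-in for the survey data (invented values)
toy <- data.frame(
  Id      = c(101, 102, 102, 104),
  Age     = c(19, 43, 43, NA),
  Smoking = c("Yes", "No", "No", "Yes")
)

report <- check_survey(toy)
str(report)
```

Applied to the real table as `check_survey(data)`, the description above suggests `n_rows` should be 999. Whether any identifiers repeat or any cells are missing is not stated in the source, which is what such a check would reveal.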
# Prepare dataset for analysis
# Rename columns to remove spaces and convert categorical variables to factors
df_data <- data %>% 
  transmute(id = factor(Id),
         genetics = factor(Genetics, levels = c("No", "Yes")),
         hormonal_changes = factor(`Hormonal Changes`, levels = c("No", "Yes")),
         medical_conditions = as.factor(`Medical Conditions`),
         medications_and_treatments = as.factor(`Medications & Treatments`),
         nutritional_deficiencies = as.factor(`Nutritional Deficiencies`),
         stress = factor(Stress, levels = c("Low", "Moderate", "High")),
         age = as.numeric(Age),
         poor_hair_care_habits = factor(`Poor Hair Care Habits`, levels = c("No", "Yes")),
         environmental_factors = factor(`Environmental Factors`, levels = c("No", "Yes")),
         smoking = factor(Smoking, levels = c("No", "Yes")),
         weight_loss = factor(`Weight Loss`, levels = c("No", "Yes")),
         hair_loss = factor(`Hair Loss`, levels = c(0, 1), labels = c("No", "Yes")))
Level 1: Descriptive statistics
This section approaches the problem by describing and visualizing the initial dataset.
The competition description proposes three questions, although other observations are analyzed as well.
- What is the average age? What is the age distribution?
- Which medical conditions are the most common? How often do they occur?
- What types of nutritional deficiencies are there and how often do they occur?
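The second and third questions come down to frequency tables of a categorical column. The following is a minimal base-R sketch using a toy vector; the values are invented for illustration and are not the survey results.

```r
# Toy vector standing in for a categorical survey column
# (invented values, not the survey data).
conditions <- c("Psoriasis", "Dermatitis", "Psoriasis", "Thyroid Problems",
                "Alopecia Areata", "Psoriasis", "Dermatitis")

# Absolute counts, most common category first
counts <- sort(table(conditions), decreasing = TRUE)

# Relative frequencies in percent, rounded to one decimal place
shares <- round(100 * prop.table(counts), 1)

freq <- data.frame(category = names(counts),
                   n        = as.vector(counts),
                   percent  = as.vector(shares))
print(freq)
```

The same pattern would apply to the `medical_conditions` and `nutritional_deficiencies` columns of `df_data`.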
The dataset does not record the gender of the individuals, but it does record their age. For a sound statistical study, the different ages should be represented in similar proportions in the sample. The average age of the population in the study is:
Age of the population
paste("The average age:", round(mean(df_data$age), digits = 2), "years")
The distribution of ages is shown in the following plot:
ggplot(df_data, aes(x = age)) +
  geom_histogram(binwidth = 1, fill = "#05192d", color = "#e9ecef", alpha = 0.9) +
  # Vertical line marking the mean age
  geom_vline(xintercept = mean(df_data$age), color = "#03ef62", linewidth = 1.2) +
  # Label for the mean, placed just above the tallest bar
  annotate("text", x = mean(df_data$age) + 1, y = max(table(df_data$age)) + 1,
           label = paste("Mean:", round(mean(df_data$age), 2), "years"),
           color = "#029c45",
           size = 4.5,
           hjust = 0) +
  theme_bw() +
  labs(title = "Fig.1: Age distribution",
       x = "Age",
       y = "Number of people")
Figure 1 shows that the ages of the people in the study range from 18 to 50 years. The age distribution seems fairly uniform, with slightly fewer individuals at the younger ages (<25). The average number of people per age is given by: