📖 Background

As we age, hair loss becomes one of the health concerns of many people. The fullness of hair not only affects appearance, but is also closely related to an individual's health.

A survey brings together a variety of factors that may contribute to hair loss, including genetic factors, hormonal changes, medical conditions, medications, nutritional deficiencies, psychological stress, and more. Through data exploration and analysis, the potential correlation between these factors and hair loss can be deeply explored, thereby providing useful reference for the development of individual health management, medical intervention and related industries.

💾 The data

The survey provides the information you need in the Predict Hair Fall.csv in the data folder.

Data contains information on persons in this survey. Each row represents one person.

  • "Id" - A unique identifier for each person.
  • "Genetics" - Whether the person has a family history of baldness.
  • "Hormonal Changes" - Indicates whether the individual has experienced hormonal changes (Yes/No).
  • "Medical Conditions" - Medical history that may lead to baldness; alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.
  • "Medications & Treatments" - History of medications that may cause hair loss; chemotherapy, heart medications, antidepressants, steroids, etc.
  • "Nutritional Deficiencies" - Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega-3 fatty acid deficiency, etc.
  • "Stress" - Indicates the stress level of the individual (Low/Moderate/High).
  • "Age" - Represents the age of the individual.
  • "Poor Hair Care Habits" - Indicates whether the individual practices poor hair care habits (Yes/No).
  • "Environmental Factors" - Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).
  • "Smoking" - Indicates whether the individual smokes (Yes/No).
  • "Weight Loss" - Indicates whether the individual has experienced significant weight loss (Yes/No).
  • "Hair Loss" - Binary variable indicating the presence (1) or absence (0) of baldness in the individual.
import pandas as pd

# Load the survey data; each row is one respondent
data = pd.read_csv('data/Predict Hair Fall.csv')
data.head(10)

My Work

Please Note: Report View Does Not Show All Printed Results

Executive Summary

Intro

Hello! The following project is about hair loss. Like many people, I have firsthand experience with hair loss. I experienced some light balding in my early twenties. I currently take prescription medication to prevent further hair loss. I am also familiar with the only real way to undo hair loss: a hair transplant. As part of that, I got to talk at length with a doctor who specializes in hair.

The goal of the project was to gain insights into hair loss. The main goals were to create a model that predicts whether someone will experience hair loss, to determine which features are most important for predicting hair loss, and finally to use clustering to see whether distinct groups of people experience hair loss.

Sadly, the data provided by this survey fell short of being ideal for achieving the project's goals. A critical yet often undervalued skill for a data scientist is the ability to communicate clearly about the limitations of their data and how those limitations affect the desired outcomes. Data challenges can stem from various factors, such as insufficient quantity or poor quality. In severe cases, these issues may render the data unsuitable for meaningful analysis. In this instance, while the dataset's shortcomings are noteworthy, it is still possible to extract some insights and meet the goals, unfortunately to a more limited extent than hoped at the outset.

Recommendations

My main recommendations are as follows:

  1. Collect more appropriate data to do further analysis.
  2. To predict hair loss using the survey, I recommend using the model named finalModel from this report. It is a logistic model trained on a subset of the data features and includes a number of interaction variables. It uses elastic net.
  3. If possible, avoid stress. Of the key factors found in the analysis, it is the only one a person can act on to reduce the risk of hair loss.
  4. Consult someone with subject matter expertise to improve clustering.
  5. Talk to a doctor! Preventing hair loss is easier than treating it!

Going into more detail on each of the recommendations:

As stated, the current data is not well suited to the desired goals. I explain some of the data's issues as they come up in this report. There is a larger issue beyond those. If we step back, the goal of this project is to learn what contributes to hair loss and to predict whether someone is likely to experience it. The hope is to help people prevent hair loss.
We only have a single survey to analyze.
Surveys can collect valuable data, but they are poorly suited to establishing causation. On the bright side, the survey can inform future data collection. Ideally, future data would come from controlled experiments on some of the traits this survey suggests may be important. This is expensive and may not be feasible.
Another possible approach would be to collect panel data, that is, data following the same subjects over time. Panel data would only require surveying the same subjects multiple times. This would be more expensive and complicated than a single survey, but less so than conducting a large number of controlled studies.
Panel data would also allow the use of time series and survival analysis methods that may be more powerful for this problem.
It must also be said that the data does not seem to be representative of the population, which can lead to various issues. The most egregious example is how the presence of hair loss is distributed across ages. I will cover such issues as they come up in the report. The data makes me very skeptical of the survey and makes me wish we had been given more information about its methodology.
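
One way to check how hair loss is distributed across ages is to group respondents into age bands and compute the share with hair loss in each band. The following is a minimal sketch, not the code used in this report. The helper name `hair_loss_rate_by_age`, the five-year band width, and the small `demo` frame are assumptions for illustration; only the "Age" and "Hair Loss" column names come from the data description.

```python
import pandas as pd


def hair_loss_rate_by_age(df, age_col="Age", target_col="Hair Loss", bin_width=5):
    """Return the share of respondents with hair loss in each age band."""
    # Build band edges that cover the full observed age range.
    lo = int(df[age_col].min() // bin_width * bin_width)
    hi = int(df[age_col].max() // bin_width * bin_width + bin_width)
    edges = list(range(lo, hi + 1, bin_width))

    # Left-closed bands such as [15, 20), [20, 25), ...
    bands = pd.cut(df[age_col], bins=edges, right=False)

    # Mean of a 0/1 target is the hair loss rate; size is the band's sample count.
    summary = df.groupby(bands, observed=True)[target_col].agg(["mean", "size"])
    return summary.rename(columns={"mean": "hair_loss_rate", "size": "n"})


# Hypothetical stand-in rows; replace with the survey DataFrame.
demo = pd.DataFrame(
    {
        "Age": [18, 19, 22, 24, 31, 33, 47, 49],
        "Hair Loss": [1, 1, 0, 1, 0, 0, 1, 0],
    }
)
age_summary = hair_loss_rate_by_age(demo)
print(age_summary)
```

Reporting the count `n` next to each rate helps separate a real pattern from a band that simply has very few respondents.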

The best model I trained for predicting hair loss was saved as finalModel. Again, it is a simple logistic regression model. Details about its performance can be found in the model results in the next section, as well as right after it is trained in this notebook. In my experience, data scientists always kind of hope a fancy model will be the best for a given task. Sometimes simple models produce better results.
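
For readers who want to see the general shape of such a model, here is a minimal sketch of a logistic regression with pairwise interaction terms and an elastic net penalty in scikit-learn. It is not the actual finalModel: the synthetic data, the name `sketch_model`, and every hyperparameter (`l1_ratio=0.5`, `C=1.0`, `max_iter=5000`) are assumptions chosen only to make the example run.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for four encoded survey features (not the real data).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 3, size=(400, 4)).astype(float)

# The label depends on the interaction of the first two features.
logits = 0.8 * (X_demo[:, 0] - 1) * (X_demo[:, 1] - 1)
y_demo = (rng.random(400) < 1 / (1 + np.exp(-logits))).astype(int)

sketch_model = make_pipeline(
    StandardScaler(),
    # interaction_only=True adds pairwise products but no squared terms.
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    # The saga solver is required for the elastic net penalty.
    LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=0.5, C=1.0, max_iter=5000
    ),
)
sketch_model.fit(X_demo, y_demo)

# 4 original features + 6 pairwise interactions = 10 coefficients.
n_interaction_features = sketch_model.named_steps["logisticregression"].coef_.shape[1]
print(n_interaction_features)
```

Scaling before the penalized fit matters because L1 and L2 penalties shrink all coefficients on the same scale.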

Of the key factors identified towards the end of this report, many are outside the subjects' control. Stress, though it has a low correlation with hair loss on its own, was a very important factor for the decision trees. Stress's interaction terms also improved model performance when used to train our final logistic model.
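
To illustrate how a feature's correlation with hair loss can be checked alongside one of its interaction terms, here is a small sketch. The ordinal encoding of "Stress" (Low=0, Moderate=1, High=2), the use of absolute Pearson correlation, the helper name `select_by_correlation`, and the toy rows are all assumptions; the 0.02 threshold mirrors the "Correlation > 0.02" subset named in the results tables.

```python
import pandas as pd


def select_by_correlation(df, target_col="Hair Loss", threshold=0.02):
    """Return feature names whose absolute correlation with the target exceeds threshold."""
    corr = df.corr(numeric_only=True)[target_col].drop(target_col)
    return corr[corr.abs() > threshold].index.tolist()


# Hypothetical stand-in rows using the survey's column names and value labels.
raw = pd.DataFrame(
    {
        "Stress": ["Low", "High", "Moderate", "High", "Low", "Moderate", "High", "Low"],
        "Smoking": ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
        "Hair Loss": [0, 1, 0, 1, 0, 1, 1, 0],
    }
)

encoded = pd.DataFrame(
    {
        "Stress": raw["Stress"].map({"Low": 0, "Moderate": 1, "High": 2}),
        "Smoking": raw["Smoking"].map({"No": 0, "Yes": 1}),
    }
)
# An interaction term is the product of two encoded features.
encoded["Stress x Smoking"] = encoded["Stress"] * encoded["Smoking"]
encoded["Hair Loss"] = raw["Hair Loss"]

selected = select_by_correlation(encoded)
print(selected)
```

In this toy frame "Smoking" alone is uncorrelated with the target and is dropped, while its product with "Stress" is kept, which is the kind of effect an interaction term can expose.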

Subject matter expertise is critical to consider when working with any machine learning application, and this is especially true of clustering. Clustering is in most cases a type of unsupervised learning. That is the case for us, as we do not know whether there are distinct groups of people who experience hair loss, and we definitely do not have labeled data for different hair loss groups. I was able to create some clusters. Without subject matter expertise, however, we cannot really know which metric is best for evaluating them, as different metrics make different assumptions about what makes good clusters.

On a related note, if we step outside the data and competition framework for a moment, it's worth noting that anyone concerned about hair loss should consult a specialist. A doctor focusing on hair can provide guidance and help prevent hair loss. Prevention is easier than treatment: regaining lost hair is often difficult and costly. While our work aims to identify contributing factors and prevention strategies, personal medical advice remains invaluable.
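
As an illustration of how different metrics can be compared side by side, the sketch below scores k-means clusterings with three internal metrics available in scikit-learn. The synthetic blobs, the range of `k`, and the choice of k-means are assumptions for illustration and do not reproduce the clustering in this report.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the encoded survey features (not the real data).
X_blobs, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)

cluster_scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_blobs)
    cluster_scores[k] = {
        # Per-point cohesion versus separation, in [-1, 1]; higher is better.
        "silhouette": silhouette_score(X_blobs, labels),
        # Within-cluster scatter relative to centroid distance; lower is better.
        "davies_bouldin": davies_bouldin_score(X_blobs, labels),
        # Between-cluster versus within-cluster dispersion; higher is better.
        "calinski_harabasz": calinski_harabasz_score(X_blobs, labels),
    }

for k, scores in cluster_scores.items():
    print(k, {name: round(value, 3) for name, value in scores.items()})
```

The three metrics need not agree on the best `k`, and that disagreement is exactly where domain knowledge about hair loss would be needed to pick one.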

Model Results & Important Findings

I trained 20 models to predict whether a person will experience hair loss: twelve logistic regression models, three single decision trees, three random forests, and two support vector machines.

Summarizing the results from all 20 models would be excessive, but we can compare the best of each model type. As you read through my report, you will also see that I have included a visualization of the confusion matrix for each of the models in the following table.

| Model Type | Data Subset | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | Correlation > 0.02 w/ Added Interaction Terms | 0.565 | 0.56 | 0.58 | 0.57 |
| Decision Tree | Without Age Feature | 0.53 | 0.52 | 0.55 | 0.53 |
| Random Forest | Without Age Feature | 0.52 | 0.51 | 0.46 | 0.49 |
| Support Vector Machine | Without Age Feature | 0.55 | 0.56 | 0.44 | 0.49 |
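
For reference, the four metrics in the table can all be derived from the cells of a binary confusion matrix. The sketch below shows the arithmetic. The function name `classification_metrics` and the counts passed to it are hypothetical and are not taken from any confusion matrix in this report.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Precision: of the predicted positives, how many were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the actual positives, how many were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts chosen only to illustrate the formulas.
example_metrics = classification_metrics(tp=6, fp=4, fn=2, tn=8)
print({name: round(value, 3) for name, value in example_metrics.items()})
```

With a roughly balanced binary target, an accuracy near 0.5 is close to what random guessing would give, which is useful context when reading the table.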

I found the logistic regression models were the most promising, so I will go into a little more detail about them next. For the logistic regression models, I trained three models per data subset, using L1, L2, and elastic net penalties. I have included the following table to summarize my findings from the logistic models. Again, to save time and space, I will only include the best model for each data subset in the table.

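| Data Subset | Penalty Used | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| All Features | Elasticnet | 0.56 | 0.55 | 0.58 | 0.56 |
| Without Age Feature | L1 | 0.525 | 0.52 | 0.52 | 0.52 |
| Correlation > 0.02 | L2 | 0.555 | 0.55 | 0.58 | 0.56 |
| Correlation > 0.02 w/ Added Interaction Terms | L2 | 0.565 | 0.56 | 0.58 | 0.57 |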
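
The per-subset comparison of penalties can be sketched as a simple loop. The example below is an illustration under assumptions, not the notebook's training code: it uses synthetic data from `make_classification`, a single 75/25 split, the `saga` solver (which supports all three penalties in scikit-learn), and an assumed `l1_ratio` of 0.5 for the elastic net.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for one data subset (not the survey data).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, n_informative=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0, stratify=y_syn
)

penalty_results = {}
for penalty in ("l1", "l2", "elasticnet"):
    # l1_ratio is only meaningful for the elastic net penalty.
    extra = {"l1_ratio": 0.5} if penalty == "elasticnet" else {}
    model = LogisticRegression(penalty=penalty, solver="saga", max_iter=5000, **extra)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    penalty_results[penalty] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for penalty, scores in penalty_results.items():
    print(penalty, {name: round(value, 3) for name, value in scores.items()})
```

A single split can make small differences between penalties look larger than they are, so cross-validated scores would be a more cautious basis for choosing between them.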