Competition: Multifactorial analysis of hair loss trends

Decoding hair loss: a multifactorial analysis

Executive summary

Hair loss is influenced by a combination of factors, including age, medical conditions, lifestyle habits, and stress levels. This analysis finds no single dominant cause, though rates differ modestly across factors. By age, the highest proportion of hair loss (57%) is observed in individuals aged 27–29. Among individual factors, medical conditions such as alopecia areata and seborrheic dermatitis show the strongest associations. Stress levels showed only slight differences: moderate stress was associated with a marginally higher likelihood of hair loss than low or high stress. These findings point to actionable strategies targeting health management, education, and innovations in hair care.

Key findings and insights

Hair loss by age group

The analysis reveals variability in hair loss likelihood across different age groups:

Age group 27–29: This group shows the highest proportion of hair loss (57%), potentially linked to hormonal or lifestyle changes during early adulthood.
Age group 48–50: Hair loss is least common here (42.5%), suggesting that health factors may stabilize with age.
Overall trend: Most other age groups hover around the 50% threshold; those slightly above it may mark additional periods of elevated hair loss likelihood.

These patterns indicate that hair loss is not directly tied to age but is shaped by a combination of physiological changes and environmental influences at various life stages.
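
How much weight the 57% figure can carry depends on how many people fall into each 3-year bin, which this summary does not state. As a rough illustration only, the sketch below computes a 95% Wilson score interval for a proportion. The bin size (90 people, 51 of them with hair loss) is a hypothetical assumption chosen to give roughly 57%, and `wilson_interval` is an illustrative helper, not part of the original analysis.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Return the (low, high) Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

# Hypothetical bin: 51 of 90 people (about 57%) report hair loss
low, high = wilson_interval(51, 90)
print(f"95% interval: {low:.3f} to {high:.3f}")
```

With a bin of this assumed size, the interval still spans 50%, which is consistent with reading the age differences as modest rather than decisive. Substituting the real counts per bin would show how firm each percentage actually is.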

Factors associated with hair loss

A deep dive into the factors associated with hair loss reveals the following:

Top 5 contributors:

Alopecia areata (57%) – A medical condition leading the list, reflecting its association with hair loss.
Seborrheic dermatitis (55.8%) – Another scalp-related medical condition, with a rate close to the top of the list.
Androgenetic alopecia (55.7%) – A hereditary medical condition, indicating the role of genetics in hair loss.
Steroids (55.1%) – Among medications and treatments, this shows the most notable link to hair loss.
Magnesium deficiency (54.8%) – A nutritional deficiency, emphasizing the impact of diet on hair health.

General observations:

Factors slightly above or below the 50% threshold suggest a broad and diffuse set of influences, rather than any single overwhelming cause.
Medical conditions dominate the top three spots, highlighting their more significant role in hair loss.

Hair loss and stress levels

Contrary to expectations, stress levels show minimal differentiation in hair loss likelihood:

Moderate stress shows the highest proportion (51.4%), though it is still not clearly distinct from the other levels.
Low stress (49.1%) and high stress (48.6%) exhibit similar impacts, with high stress unexpectedly showing the lowest likelihood.

These findings challenge the intuitive notion that stress directly drives hair loss, suggesting a complex interplay where stress interacts with other factors like nutrition or medical conditions.
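
Whether gaps of this size are distinguishable from sampling noise can be checked with a two-proportion z-test. The sketch below is a minimal version using only the standard library. The group sizes (330 people per stress level) and the counts are hypothetical assumptions picked to approximate the reported percentages, and `two_proportion_z_test` is an illustrative helper rather than the method used in this analysis.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts: moderate stress 170/330 (~51.5%), high stress 160/330 (~48.5%)
z_stat, p_value = two_proportion_z_test(170, 330, 160, 330)
print(f"z = {z_stat:.2f}, p = {p_value:.2f}")
```

Under these assumed counts the p-value is far above the usual 0.05 cutoff, so a gap of about three percentage points would not count as evidence of a real difference. The actual group counts from the dataset would be needed to make this statement about the reported figures.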

Recommendations

Address medical conditions

Prioritize early diagnosis and treatment of higher-impact conditions such as alopecia areata and seborrheic dermatitis.
Encourage healthcare providers to integrate stress assessment and management into routine consultations for holistic hair loss prevention.

Tailor campaigns to key age groups

Design campaigns addressing the peak hair loss rate in the younger adult (27–29) age group.
Promote early preventive strategies, such as routine health checkups and lifestyle modifications.

Enhance nutrition and medication awareness for hair health

Highlight the importance of magnesium in maintaining hair health through educational campaigns.
Educate individuals using steroid-based medications on the potential side effects, including hair loss.

Conclusion

This analysis underscores the multifaceted nature of hair loss, showing that no single factor overwhelmingly dictates its occurrence. Instead, hair loss stems from the interplay of medical conditions, nutritional deficiencies, age-related changes, and lifestyle influences. Although most factors hover around a 50% likelihood threshold, key insights highlight actionable opportunities for targeted intervention and innovation:

Medical conditions, such as alopecia areata and seborrheic dermatitis, emerge as the most significant contributors, underscoring the need for early diagnosis and specialized treatment.
Younger adults (27–29) are disproportionately affected, with this age group showing the strongest associations with hair loss.
Moderate stress shows a marginally higher rate (51.4%), but differences across stress levels are small; any effect of stress likely depends on its interaction with other factors.
Nutritional deficiencies, particularly magnesium, and certain medications like steroids highlight actionable opportunities for dietary improvements, education, and supplement-based interventions.

These findings emphasize the importance of a collaborative approach, involving healthcare providers, individuals, and industry leaders, to address hair loss holistically. By focusing on early medical intervention, tailored treatments, and comprehensive health strategies, meaningful improvements can be achieved for those affected by hair loss.

Technical overview of the analysis

Data cleaning and preparation

The dataset was cleaned before analysis. Rows whose ID appeared more than once were removed entirely, and text fields were standardized to correct formatting inconsistencies, such as trimming whitespace and harmonizing capitalization. Binary variables, like “Hair Loss,” were converted into clear “Yes/No” categories for better interpretability. Inconsistent labels, such as the “Omega-3 fatty acids” deficiency category, were aligned with the other categories, and “No data” entries were excluded later in the factor analysis. These steps gave the analysis a consistent foundation.

# Import libraries and load dataset
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns

data = pd.read_csv('data/Predict Hair Fall.csv')

# Remove leading and trailing whitespace from column names and from all string column values
data.columns = data.columns.str.strip()
data = data.apply(lambda x: x.str.strip() if x.dtype == "object" else x)

# Identify the Ids that are duplicated and filter out rows with any of the duplicate Ids
duplicate_ids = data['Id'][data.duplicated(subset='Id', keep=False)]
data = data[~data['Id'].isin(duplicate_ids)].reset_index(drop=True)

# Apply sentence case, ensuring 'A', 'D', and 'E' remain in uppercase
def sentence_case_with_exceptions(value):
    if isinstance(value, str):
        value = value.capitalize()
        words = value.split()
        corrected_words = ["A" if word == "a" else
                           "D" if word == "d" else
                           "E" if word == "e" else 
                           word for word in words]
        return " ".join(corrected_words)
    return value  # Return non-string values as-is
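# DataFrame.applymap was renamed to DataFrame.map in pandas 2.1; use whichever is available
apply_elementwise = data.map if hasattr(data, "map") else data.applymap
data = apply_elementwise(sentence_case_with_exceptions)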

# Add "deficiency" to values that contain "Omega-3 fatty acids" if not already present
data['Nutritional Deficiencies'] = data['Nutritional Deficiencies'].apply(lambda x: x + ' deficiency' if 'Omega-3 fatty acids' in x and 'deficiency' not in x else x)

# Replace 0 with 'No' and 1 with 'Yes' in the 'Hair Loss' column
data['Hair Loss'] = data['Hair Loss'].replace({0: 'No', 1: 'Yes'})

# Display the cleaned dataset
data.head(10)
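
The duplicate handling above drops every row whose Id occurs more than once, rather than keeping the first occurrence. The plain-Python sketch below illustrates that rule on a few made-up rows (the Ids and ages are invented for illustration and do not come from the dataset).

```python
from collections import Counter

# Made-up rows; Id 101 appears twice with conflicting ages
rows = [
    {"Id": 101, "Age": 25},
    {"Id": 102, "Age": 31},
    {"Id": 101, "Age": 44},
    {"Id": 103, "Age": 19},
]

# Count how often each Id occurs, then keep only rows with a unique Id
id_counts = Counter(row["Id"] for row in rows)
kept = [row for row in rows if id_counts[row["Id"]] == 1]
kept_ids = [row["Id"] for row in kept]
print(kept_ids)
```

Dropping both copies is the conservative choice when there is no way to tell which of the conflicting records is correct, at the cost of losing a few rows.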

What is the proportion of patients with hair loss in different age groups?

To explore trends in hair loss across different age ranges, the data was grouped into 3-year bins, creating manageable and comparable age segments. For each group, the percentage of individuals experiencing hair loss was calculated, enabling a clear view of how hair loss likelihood varied across the population. A stacked bar chart was chosen to visualize these results, offering a clear representation of hair loss distribution within each interval. Annotations were added to display percentages directly on the chart, improving readability and facilitating easy interpretation of the trends.

# Check the age range in the dataset; the bins below assume ages from 18 to 50
min_age = data['Age'].min()
max_age = data['Age'].max()

# Define 3-year age bins with left-closed edges: [18, 21), [21, 24), ..., [48, 51)
age_bins = list(range(18, 52, 3))
age_labels = [f"{age_bins[i]}-{age_bins[i+1]-1}" for i in range(len(age_bins)-1)]

# Create a new column for age groups with 3-year intervals
data['Age Group'] = pd.cut(data['Age'], bins=age_bins, labels=age_labels, right=False, include_lowest=True)

# Calculate the count of people with and without hair loss and the total count in each age group
# (observed=False keeps every age bin, including empty ones, so the two series align)
hair_loss_counts = data[data['Hair Loss'] == 'Yes'].groupby('Age Group', observed=False).size()
no_hair_loss_counts = data[data['Hair Loss'] == 'No'].groupby('Age Group', observed=False).size()
total_counts = hair_loss_counts + no_hair_loss_counts

# Calculate proportions for each segment
hair_loss_percent = (hair_loss_counts / total_counts * 100)
no_hair_loss_percent = (no_hair_loss_counts / total_counts * 100)

# Visualization
plt.figure(figsize=(12, 6))
bars_hair_loss = plt.bar(hair_loss_counts.index, hair_loss_counts, bottom=no_hair_loss_counts, label='Hair loss', color='darkblue')
bars_no_hair_loss = plt.bar(hair_loss_counts.index, no_hair_loss_counts, label='No hair loss', color='lightblue')

# Add percentage annotations on the bars (.iloc makes the positional lookup explicit)
for i, (no_hair_loss, hair_loss) in enumerate(zip(no_hair_loss_percent, hair_loss_percent)):
    plt.text(i, no_hair_loss_counts.iloc[i] / 2, f"{no_hair_loss:.1f}%", ha='center', va='center', color='black')
    plt.text(i, no_hair_loss_counts.iloc[i] + hair_loss_counts.iloc[i] / 2, f"{hair_loss:.1f}%", ha='center', va='center', color='white')

# Final plot adjustments
plt.title('Proportion of people with hair loss in different age groups', fontsize=15, fontweight='bold', pad=15)
plt.xlabel('Age group', fontsize=12, fontweight='bold')
plt.ylabel('Count of people', fontsize=12, fontweight='bold')
plt.xticks(rotation=45)
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
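
The 3-year binning used above is left-closed, so an age of 21 belongs to "21-23" and not to "18-20". The sketch below reproduces that labeling rule in plain Python for a handful of hand-picked ages. It assumes integer ages, and `age_group_label` is a hypothetical helper written for this illustration, not a pandas function.

```python
def age_group_label(age, start=18, width=3, stop=51):
    """Return the left-closed bin label for an integer age, or None if out of range."""
    if not start <= age < stop:
        return None
    lower = start + ((age - start) // width) * width
    return f"{lower}-{lower + width - 1}"

# Boundary ages are the ones most likely to be mislabeled
labels = {age: age_group_label(age) for age in (18, 20, 21, 29, 50, 51)}
print(labels)
```

Ages outside the covered range get no label here, mirroring how out-of-range values end up without an age group in the binned column.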

What factors are associated with hair loss?

The analysis of contributing factors required a dual approach to handle both binary and multi-level variables. Binary variables, such as “Smoking” or “Genetics,” were analyzed by calculating the proportion of individuals with each characteristic who also experienced hair loss. For multi-level factors, such as medical conditions or nutritional deficiencies, the analysis broke down each category to determine its specific association with hair loss likelihood, excluding “No data” entries. This allowed for a detailed comparison across all contributing factors. The results were then consolidated into a single ranking, providing a clear view of which factors had the strongest association. To present this effectively, a horizontal bar chart was created, sorting factors by their hair loss rate and including annotations to emphasize key findings.

# Function to calculate binary factors
def calculate_yes_percentages(data, binary_factors, target):
    binary_results = {
        factor: data[data[factor] == 'Yes'][target].value_counts(normalize=True) * 100
        for factor in binary_factors
    }
    return pd.DataFrame(binary_results).T

# Function to calculate multi-level factors
def calculate_category_percentages(data, multi_factors, target):
    multi_results = {}
    for factor in multi_factors:
        # Filter out "No data" rows for the current factor
        valid_data = data[data[factor] != "No data"]
        # Calculate percentages for each valid category
        factor_results = {
            category: valid_data[valid_data[factor] == category][target].value_counts(normalize=True) * 100
            for category in valid_data[factor].unique()
        }
        multi_results[factor] = pd.DataFrame(factor_results).T
    return multi_results

# Function to consolidate binary and multi-category percentages
def consolidate_results(binary_yes_percentages, multi_category_percentages):
    # Process binary_yes_percentages
    binary_data = binary_yes_percentages[['Yes']].copy()
    binary_data['Group of factors'] = 'Other'
    binary_data = binary_data.rename(columns={'Yes': 'Impact on hair loss (%)'}).reset_index(names='Factor')
    # Process multi_category_percentages
    multi_data_list = []
    for factor, df in multi_category_percentages.items():
        filtered_df = df[['Yes']].copy()
        filtered_df['Group of factors'] = factor
        filtered_df = filtered_df.rename(columns={'Yes': 'Impact on hair loss (%)'}).reset_index(names='Factor')
        multi_data_list.append(filtered_df)
    # Combine binary and multi-category data
    consolidated_df = pd.concat([binary_data] + multi_data_list, ignore_index=True)
    return consolidated_df[['Group of factors', 'Factor', 'Impact on hair loss (%)']]

# Define the factor groups
binary_factors = ['Genetics', 'Hormonal Changes', 'Poor Hair Care Habits', 'Environmental Factors', 'Smoking', 'Weight Loss']
multi_factors = ['Medical Conditions', 'Medications & Treatments', 'Nutritional Deficiencies']

# Perform calculations
binary_yes_percentages = calculate_yes_percentages(data, binary_factors, 'Hair Loss')
multi_category_percentages = calculate_category_percentages(data, multi_factors, 'Hair Loss')
consolidated_dataframe = consolidate_results(binary_yes_percentages, multi_category_percentages)

# Visualization
sorted_data = consolidated_dataframe.sort_values(by="Impact on hair loss (%)", ascending=False)
fig, ax = plt.subplots(figsize=(10, 10))

sns.barplot(
    data=sorted_data,
    x="Impact on hair loss (%)",
    y="Factor",
    hue="Group of factors",
    dodge=False,
    ax=ax
)

# Add descriptive impact line and legend
ax.axvline(x=50, color='red', linestyle='--', linewidth=1.5, label="50% Impact Threshold")
legend = plt.legend(title="Group of Factors", bbox_to_anchor=(1.05, 1), loc='upper left', borderaxespad=0)
legend.get_title().set_fontweight('bold')

# Customize x-ticks and y-ticks labels
ax.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"{x:.0f}%")) # Format x-ticks as percentages
yticks = [f"{factor}, {impact:.1f}%" for factor, impact in zip(sorted_data["Factor"], sorted_data["Impact on hair loss (%)"])]
ax.set_yticklabels(yticks)
for tick_label in ax.get_yticklabels()[:5]:
    tick_label.set_fontweight("bold") # Make top 5 y-tick labels bold

# Set title and labels with formatting
ax.set_title("Impact of various factors on hair loss probability", fontsize=15, loc='left', fontweight='bold', pad=15)
ax.set_xlabel("Likelihood of hair loss", fontsize=12, fontweight='bold')
ax.set_ylabel("Factor", fontsize=12, fontweight='bold')

plt.tight_layout()
plt.show()
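
The percentages ranked above are conditional rates: among people who have a given factor, the share who also have hair loss. The reverse quantity (among people with hair loss, the share who have the factor) can differ a lot, so the direction matters when reading the chart. The sketch below shows both on a small made-up sample. The records are invented for illustration, and `share` is a hypothetical helper, not one of the functions defined above.

```python
# Made-up (smoking, hair_loss) pairs; not taken from the dataset
records = (
    [("Yes", "Yes")] * 3
    + [("Yes", "No")] * 2
    + [("No", "Yes")] * 1
    + [("No", "No")] * 4
)

def share(records, condition, outcome):
    """Percentage of records meeting `condition` that also meet `outcome`."""
    selected = [r for r in records if condition(r)]
    if not selected:
        return float("nan")
    return 100 * sum(outcome(r) for r in selected) / len(selected)

# Share of smokers with hair loss (the direction used in the ranking)
hair_loss_given_smoking = share(records, lambda r: r[0] == "Yes", lambda r: r[1] == "Yes")

# Share of people with hair loss who smoke (the reverse direction)
smoking_given_hair_loss = share(records, lambda r: r[1] == "Yes", lambda r: r[0] == "Yes")

print(f"{hair_loss_given_smoking:.1f}% vs {smoking_given_hair_loss:.1f}%")
```

In this toy sample the two directions give 60% and 75%, which is why the ranking should be read as "hair loss rate among people with this factor" and not as "how common this factor is among people with hair loss".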

What does hair loss look like under different stress levels?

Stress levels, categorized as Low, Moderate, and High, were analyzed to determine their relationship with hair loss. The percentage of individuals experiencing hair loss was calculated within each stress category, offering a comparative view of their impact. The results were visualized in a bar chart with a threshold line added at the 50% mark to provide context and highlight variations. This approach clarified whether stress levels had a measurable effect on hair loss, while the clear visualization ensured the results were easy to interpret.

# Calculate percentage of 'Yes' Hair Loss for each Stress level
hair_loss_percentages = (
    data[data["Hair Loss"] == "Yes"]
    .groupby("Stress")["Hair Loss"]
    .count()
    .div(data.groupby("Stress")["Hair Loss"].count())  # Divide by total count per stress level
    .mul(100)
    .reset_index()
)

# Rename columns for clarity and define stress order
hair_loss_percentages.columns = ["Stress", "Percentage"]
hair_loss_percentages["Stress"] = pd.Categorical(hair_loss_percentages["Stress"], categories=["Low", "Moderate", "High"], ordered=True)
hair_loss_percentages = hair_loss_percentages.sort_values("Stress")

# Visualization
plt.figure(figsize=(6, 6))
ax = sns.barplot(
    data=hair_loss_percentages, 
    x="Stress", 
    y="Percentage", 
    palette="coolwarm"
)

# Add percentage labels to each bar
for bar, percentage in zip(ax.patches, hair_loss_percentages["Percentage"]):
    height = bar.get_height()
    ax.text(
        bar.get_x() + bar.get_width() / 2, 
        height - 5,
        f"{percentage:.1f}%", 
        ha="center", 
        va="center", 
        color="black", 
        fontsize=10
    )

# Add reference line, titles and labels
plt.axhline(y=50, color="red", linestyle="--", linewidth=1.5, label="50% Impact Threshold")
plt.title("Hair loss by stress level", fontsize=15, fontweight='bold', pad=15)
plt.xlabel("Stress level", fontsize=12, fontweight='bold')
plt.ylabel("Hair loss likelihood", fontsize=12, fontweight='bold')

# Format y-ticks as percentages
plt.gca().yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f"{x:.0f}%"))

# Add legend
plt.legend()
plt.ylim(0, 100)
plt.tight_layout()
plt.show()