
Hair Loss Patterns and Predictions: Age as the Leading Factor

Executive Summary

Business Description

Hair loss is a common concern that increases with age, impacting not only appearance but also overall health. Understanding the factors contributing to hair loss can guide health management strategies, medical interventions, and industry developments.

This analysis aims to explore potential correlations between hair loss and factors such as genetics, hormonal changes, medical conditions, medications, nutritional deficiencies, stress, and lifestyle habits. Insights gained will provide a reference for personalized health solutions and related applications.

Main Questions & Findings

Level 1: Descriptive statistics

  • What is the average age? What is the age distribution?
    The age of the respondents in the survey is approximately normally distributed, with an average age of 34 years. Each individual age has at least 21 respondents, with an average of 30 respondents per age.

  • Which medical conditions are the most common? How often do they occur?
    The most common medical condition reported is No Data, accounting for 11% of responses, followed by Alopecia Areata (10.71% of responses) and Psoriasis (10.01%), while each of the remaining categories has less than 10% of responses.

  • What types of nutritional deficiencies are there and how often do they occur?
    The top deficiency identified is Zinc Deficiency, which accounts for 10.81% of responses, followed by Vitamin D (10.41%), while each of the remaining categories has less than 10% of responses.
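
The category shares and per-age counts quoted above are the kind of figures a normalized frequency count produces. The following is a minimal sketch on a small made-up frame, not the notebook's hidden code; the helper name `category_shares` and the sample values are assumptions for illustration, and the column name `Medical Conditions` is taken from the dataset description.

```python
import pandas as pd

def category_shares(df: pd.DataFrame, column: str) -> pd.Series:
    """Return the percentage of rows per category, most common first."""
    return df[column].value_counts(normalize=True).mul(100).round(2)

# Hypothetical sample; the real survey has 999 rows.
sample = pd.DataFrame({
    "Medical Conditions": ["No Data", "Psoriasis", "No Data", "Alopecia Areata"],
    "Age": [25, 31, 40, 31],
})

shares = category_shares(sample, "Medical Conditions")
age_counts = sample["Age"].value_counts()  # respondents per individual age

print(shares)
print("min per age:", age_counts.min(), "mean per age:", round(age_counts.mean(), 2))
```

Applied to the full dataset, the same two calls would give the share of each condition or deficiency and the minimum and average number of respondents per age.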

Level 2: Visualization

  • What is the proportion of respondents with hair loss in different age groups?
    The average hair loss rate across individual ages is 49.69% (SD = 9.55%), indicating that, on average, about half of respondents experience hair loss across all ages. The hair loss rates across quartile-based age groups are: 18-26 (49.60%), 27-34 (54.72%, highest), 35-42 (51.41%) and 43-50 (43.03%, lowest). Notably, the hair loss rate peaks at ages 21 (66.67%), 28 (64.71%), and 29 (64.52%).

  • What does hair loss look like under different stress levels?
    Respondents reporting moderate stress show slightly higher hair loss rates, and distribution patterns across ages suggest that while stress may influence hair loss, its impact is subtle.

  • What factors are associated with hair loss?
    Age, and certain health factors, such as specific conditions and deficiencies, show potential associations with slightly higher hair loss rates. However, no significant statistical results were found, only weak correlations, suggesting these factors alone may not determine hair loss.
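
The quartile-based age grouping described above can be sketched as follows. This is only an illustration under stated assumptions: it uses `pd.qcut` for the binning (the notebook's exact binning code is hidden), a made-up sample, and a hypothetical helper name `hair_loss_rate_by_age_quartile`. The column name `Hair_Loss` follows the helper functions in this notebook.

```python
import pandas as pd

def hair_loss_rate_by_age_quartile(df, age_col="Age", target_col="Hair_Loss"):
    """Bin ages into four equal-frequency groups and return the hair loss rate (%) per group."""
    groups = pd.qcut(df[age_col], q=4)
    return df.groupby(groups, observed=True)[target_col].mean().mul(100).round(2)

# Hypothetical sample with 12 respondents (3 per quartile).
sample = pd.DataFrame({
    "Age":       [18, 20, 24, 27, 30, 34, 36, 40, 43, 45, 48, 50],
    "Hair_Loss": [1,  0,  1,  1,  1,  0,  0,  1,  0,  0,  1,  0],
})

rates = hair_loss_rate_by_age_quartile(sample)
print(rates)
```

Because the target is coded 0/1, the group mean multiplied by 100 is the hair loss rate of that group.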

Level 3: Machine learning modeling

  • Feature Importance Analysis: to identify the key factors that best predict hair loss.
    The top predictors of hair loss, as identified by the Random Forest model, are Age (17.93% feature importance), followed by Hormonal Changes (4.11%), Environmental Factors (4.08%), Weight Loss (4.02%), and Genetics (3.89%). Remaining features contributed less than 3.26%.

  • Cluster Analysis: to explore whether there are different types of hair loss groups in the data set.
    Clusters revealed weak and subtle patterns, with no strong statistical significance in explaining hair loss trends.

    • Cluster 1 (Younger Respondents): Higher hair loss rates linked to environmental and weight loss factors; diverse patterns with greater variability in nutritional deficiencies and hair care habits.

    • Cluster 2 (Older Respondents): More uniform patterns with higher rates linked to genetics, hormonal changes, and poor hair care habits. Environmental factors had less impact compared to Cluster 1.

    • Between Clusters: Environmental factors showed higher impact on hair loss in Cluster 1 than Cluster 2, indicating age-related differences.

  • Binary Classification Model: to predict whether an individual will suffer from hair loss based on given factors.
    XGB Classifier marginally outperformed Logistic Regression in terms of macro and weighted averages, with scores of 50.6% versus 48.7%, respectively.
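
As a rough illustration of the feature importance step, the sketch below fits a `RandomForestClassifier` on synthetic data and ranks features by `feature_importances_`. In scikit-learn these values are impurity-based and normalized to sum to 1, so they express each feature's relative contribution within the model. The data, the feature subset, and the rule generating the target are assumptions made up for this example; they are not the survey data or the notebook's actual model settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 400

# Synthetic features (hypothetical): one numeric, two binary.
X = pd.DataFrame({
    "Age": rng.integers(18, 51, size=n),
    "Genetics": rng.integers(0, 2, size=n),
    "Smoking": rng.integers(0, 2, size=n),
})
# Synthetic target that depends on Age only, so Age should dominate the ranking.
y = (X["Age"] > 34).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Rank features by their normalized importance.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.mul(100).round(2))
```

On real survey data with weak signal, importances tend to be spread thinly across many features, which is consistent with the small percentages reported above.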

Recommendations

  • Expand Dataset: Collect more diverse and detailed respondent data.
  • Investigate Confounding Factors: Identify and address variables influencing results.
  • Cluster Analysis: For exploratory purposes or further detailed analysis, a three-cluster solution may also be considered, as it provides good data representation without excessively increasing complexity.
  • Enhance Models: Explore advanced classification methods and feature engineering to improve predictive performance.

# Main imports
from itertools import combinations

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import statsmodels.api as sm
from scipy.stats import (
    skew, pearsonr, spearmanr, pointbiserialr, chi2_contingency, fisher_exact,
    shapiro, levene, mannwhitneyu, kruskal, ttest_ind, f_oneway,
)
from statsmodels.stats.multitest import multipletests
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier

SEED = 42
n_top = 10

# Threshold line parameters
threshold_line_kwargs1 = {
    'linewidth': 1,
    'linestyle': '--',
    'color': 'red'
}
threshold_line_kwargs2 = {
    'linewidth': 1,
    'linestyle': '--',
    'color': 'black'
}

# Colorblind-friendly palette with 21 colors
tol_palette = [
    "#332288", "#88CCEE", "#44AA99", "#117733", "#999933", "#DDCC77", 
    "#CC6677", "#882255", "#AA4499", "#661100", "#6699CC", "#AA4466", 
    "#4477AA", "#DDDDDD", "#000000", "#F0E442", "#D55E00", "#009E73", 
    "#E69F00", "#56B4E9", "#0072B2"
]
palette = tol_palette
sns.set_palette(palette)
sns.set_style("darkgrid")
#sns.palplot(palette)
custom_palette = {0: palette[2], 1: palette[6]}

def calc_upper_lower_whiskers(df, column):
    # Calculate the quartiles
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    # Determine the theoretical whisker limits
    lower_whisker_limit = Q1 - 1.5 * IQR
    upper_whisker_limit = Q3 + 1.5 * IQR

    # Find the actual data points within the whisker limits
    lower_whisker = df[column][df[column] >= lower_whisker_limit].min()
    upper_whisker = df[column][df[column] <= upper_whisker_limit].max()

    #print(f"\nLower whisker of {column} {lower_whisker}.")
    #print(f"Upper whisker of {column} {upper_whisker}.")    
    return (lower_whisker, upper_whisker)

def phi_coefficient(x, y):
    # Calculate the Phi coefficient (for binary vs binary correlation)
    table = pd.crosstab(x, y)  # local name avoids shadowing sklearn's confusion_matrix
    if table.shape != (2, 2):
        return None  # Not a 2x2 contingency table (which shouldn't happen for binary variables)
    a, b = table.iloc[0, 0], table.iloc[0, 1]
    c, d = table.iloc[1, 0], table.iloc[1, 1]
    # phi = (ad - bc) / sqrt(product of the two row totals and the two column totals)
    return (a * d - b * c) / np.sqrt((a + b) * (a + c) * (b + d) * (c + d))

def get_correlation_stats(data, col1, col2, significance_level=0.05):
    col1_dtype = data[col1].dtype
    col2_dtype = data[col2].dtype

    # Check if either column is binary
    def is_binary_col(col):
        return pd.api.types.is_numeric_dtype(col) and col.nunique() == 2

    # Handle binary vs binary columns with Phi coefficient
    if is_binary_col(data[col1]) and is_binary_col(data[col2]):
        stat = phi_coefficient(data[col1], data[col2])
        p_value = None 
        test_name = "Phi Coefficient"
        
    # Handle numeric vs. binary categorical columns with Point-Biserial correlation
    elif is_binary_col(data[col2]) and pd.api.types.is_numeric_dtype(col1_dtype):
        stat, p_value = pointbiserialr(data[col2], data[col1].astype(int))
        test_name = "Point-Biserial Correlation"
    
    # Handle binary categorical vs. numeric columns with Point-Biserial correlation
    elif is_binary_col(data[col1]) and pd.api.types.is_numeric_dtype(col2_dtype):
        stat, p_value = pointbiserialr(data[col1], data[col2].astype(int))
        test_name = "Point-Biserial Correlation"

    # Handle categorical vs. categorical columns with Chi-Square test
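    # (isinstance check replaces pd.api.types.is_categorical_dtype, which is deprecated in recent pandas)
    elif isinstance(data[col1].dtype, pd.CategoricalDtype) or isinstance(data[col2].dtype, pd.CategoricalDtype):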
        contingency_table = pd.crosstab(data[col1], data[col2])
        stat, p_value, _, _ = chi2_contingency(contingency_table)
        test_name = "Chi-Square Test"
        
    # Handle numeric vs. numeric columns with Pearson correlation
    elif pd.api.types.is_numeric_dtype(col1_dtype) and pd.api.types.is_numeric_dtype(col2_dtype) and not is_binary_col(data[col1]) and not is_binary_col(data[col2]):
        stat, p_value = pearsonr(data[col1], data[col2])
        test_name = "Pearson Correlation"
        
    else:
        # skip correlation test, a pairwise test is more suitable
        #raise ValueError("Column types not compatible for correlation analysis.")
        stat, p_value, test_name = (None, None, None)
        
    if p_value is None:
        # No p-value was computed (Phi coefficient or skipped test), so significance cannot be claimed
        is_significant = False
    else:
        is_significant = p_value < significance_level
    
    return stat, p_value, test_name, is_significant

def calculate_hair_loss_proportion(data):
    return data.groupby(['Age', 'Age_Group'])['Hair_Loss'].mean().mul(100).round(2).reset_index(name='hair_loss_proportion')

def get_hair_loss_corr(data, column1, column2):
    stat, p_value, test_name, is_significant = get_correlation_stats(data, column1, column2, significance_level=0.05)
    results = []
    results.append({'Feature': column1, 'Target': column2, 'Type': test_name, 'is_siginificant':is_significant, 'Test Value': stat, 'P-Value': p_value})
    results_df = pd.DataFrame(results)
    return results_df
    
def get_chi_square_test(data, col1, col2):
    """
    Calculates the Chi-Square test statistic between two categorical variables.

    Parameters:
    - data: pd.DataFrame, the dataframe containing the data
    - col1: str, the name of the first categorical column
    - col2: str, the name of the second categorical column

    Returns:
    - chi2: float, the Chi-Square test statistic
    - p_value: float, the p-value indicating statistical significance
    - test_name: str, name of the test performed
    """
    # Create a contingency table
    contingency_table = pd.crosstab(data[col1], data[col2])
    
    # Perform the Chi-Square test
    chi2, p_value, _, _ = chi2_contingency(contingency_table)
    
    # Define the test name
    test_name = "Chi-Square Test"
    
    return chi2, p_value, test_name

# Function to calculate hair loss proportion by group
def calculate_hair_loss_proportion(data, groupby_column, across_col, fixed_col="Age"):
    return data.groupby([fixed_col, groupby_column])[across_col].mean().mul(100).round(2).reset_index(name='hair_loss_rate').dropna()    

# Function to test normality for each group
def test_normality(data, group_col, min_group_size=3):
    results = {}
    for group in data[group_col].unique():
        group_data = data[data[group_col] == group]['hair_loss_rate']
        
        # Check if the group has at least the minimum required size
        if len(group_data) < min_group_size:
            results[group] = 10 #"Insufficient data (< 3 observations)"
        else:
            # Perform Shapiro-Wilk test
            p_value = shapiro(group_data).pvalue
            results[group] = p_value
    
    return results

# Function to test homogeneity of variance
def test_variance_homogeneity(data, group_col, min_group_size=2, significance_level=0.05):
    # Extract groups
    groups = [data[data[group_col] == group]['hair_loss_rate'] for group in data[group_col].unique()]
    
    # Check if all groups have at least the minimum size
    small_groups = [group for group in groups if len(group) < min_group_size]
    
    if small_groups:
        print(f"Warning: {len(small_groups)} group(s) have fewer than {min_group_size} observations. Assuming heterogeneity.")
        return False  
    
    # Perform Levene's test
    p_value = levene(*groups).pvalue
    #print(f"Levene's test p-value: {p_value}")
    return p_value > significance_level  


# Function to perform pairwise comparison based on assumptions
def perform_pairwise_test(data, group_col, group1, group2, normality_p_values, levene_p_value, significance_level=0.05):
    grp1_data = data[data[group_col] == group1]['hair_loss_rate']
    grp2_data = data[data[group_col] == group2]['hair_loss_rate']
    
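    # normality_p_values holds one Shapiro-Wilk p-value per group.
    # levene_p_value may be a p-value or the boolean returned by test_variance_homogeneity
    # (True compares as 1 and False as 0, so the threshold check works for both).
    if all(p > significance_level for p in normality_p_values.values()) and levene_p_value > significance_level: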
        # Parametric test
        stat, p_value = ttest_ind(grp1_data, grp2_data)
        test_name = "t-test"
    else:
        # Non-parametric test
        stat, p_value = mannwhitneyu(grp1_data, grp2_data)
        test_name = "Mann-Whitney U test"
    
    return group1, group2, stat, p_value, test_name, p_value<significance_level


def pairwise_two_groups(data, binary_column, target, correct_multiple_comparisons=False, significance_level=0.05):
    hair_loss_proportion = calculate_hair_loss_proportion(data, binary_column, target)    
    #data.groupby(['Age', column])[target].mean().mul(100).round(2).reset_index(name='hair_loss_proportion')
   
    # Test for normality within each group in the current category
    normality_results = test_normality(hair_loss_proportion, group_col=binary_column)

    # Test for homogeneity of variances across groups in the current category
    variance_homogeneity_result = test_variance_homogeneity(hair_loss_proportion, group_col=binary_column)

    # Perform Pairwise Tests
    #group1 = hair_loss_proportion[hair_loss_proportion[binary_column] == 0]['hair_loss_rate']
    #group2 = hair_loss_proportion[hair_loss_proportion[binary_column] == 1]['hair_loss_rate']
    group_values = list(data[binary_column].unique())

    # Perform the pairwise test between group1 and group2
    _, _, stat, p_value, test_name, is_significant = perform_pairwise_test(
        hair_loss_proportion,
        group_col=binary_column,
        group1=group_values[0],
        group2=group_values[1],
        normality_p_values=normality_results,
        levene_p_value=variance_homogeneity_result,
        significance_level=significance_level
    )

    #print(result_text)
    return stat, p_value, test_name, is_significant


def single_test_two_groups (data, groupby_column='Cluster', target_column='Hair_Loss', extra_text='', correct_multiple_comparisons=False, significance_level=0.05):
    results = []
    
    # Analyze hair loss for the current column
    stat, p_value, test_name, is_significant = pairwise_two_groups(data, groupby_column, 'Hair_Loss', correct_multiple_comparisons, significance_level=significance_level)
        
    results.append({'Feature': f"{extra_text}", 'Target': target_column, 'Type': test_name, 'is_siginificant':is_significant, 'Test Value': stat, 'P-Value': p_value})

    results_df = pd.DataFrame(results)
    return results_df
    
def singe_test_binary_groups(data, binary_column, filterby='Cluster', filter_values=[1, 2], factor_presence=None, target_column='Hair_Loss', correlation=False, extra_text='', correct_multiple_comparisons=False, significance_level=0.05):
    results = []       
    if filterby != target_column:            
        #data[data['Genetics'] ==1].groupby(['Age', 'Cluster'])['Hair_Loss'].mean().mul(100).round(2).reset_index(name='hair_loss_rate').dropna() 
        if factor_presence is not None:
            hair_loss_proportion = calculate_hair_loss_proportion(data[data[binary_column] == factor_presence], filterby, target_column)  
        else:
            hair_loss_proportion = calculate_hair_loss_proportion(data, filterby, target_column)

        #print(hair_loss_proportion.head()) #debug
        
        # Test for normality within each group in the current category
        normality_results = test_normality(hair_loss_proportion, group_col=filterby)

        # Test for homogeneity of variances across groups in the current category
        variance_homogeneity_result = test_variance_homogeneity(hair_loss_proportion, group_col=filterby)           

        #print(f"Groups {filterby}[{filter_values[0]}] and {filterby}[{filter_values[1]}]") #debug

        # Perform the pairwise test between group1 and group2
        _, _, stat, p_value, test_name, is_significant = perform_pairwise_test(
            hair_loss_proportion,
            group_col=filterby,
            group1=filter_values[0],
            group2=filter_values[1],
            normality_p_values=normality_results,
            levene_p_value=variance_homogeneity_result,
            significance_level=significance_level
        )           

        results.append({'Feature': f"{binary_column} ({extra_text})", 'Target': target_column, 'Type': test_name, 'is_siginificant':is_significant, 'Test Value': stat, 'P-Value': p_value})

    # Build the frame outside the branch so the function also returns when filterby == target_column
    results_df = pd.DataFrame(results)
    return results_df

def analyze_single_binary_feature_groups(data, binary_columns, target_column='Hair_Loss', correlation=False, extra_text='', correct_multiple_comparisons=False, significance_level=0.05):
    results = []

    for column in binary_columns:        
        if column != 'Hair_Loss':
            if (correlation):
                # Calculate the correlation
                corr, p_value, test_name, is_significant = get_correlation_stats(data, column, 'Hair_Loss')
                results.append({'Feature': f"{column} ({extra_text})", 'Target': target_column, 'Type': test_name, 'is_siginificant':is_significant, 'Test Value':  round(corr, 4), 'P-Value': p_value})
            
            # Analyze hair loss for the current column
            stat, p_value, test_name, is_significant = pairwise_two_groups(data, column, 'Hair_Loss', correct_multiple_comparisons, significance_level=significance_level)
            results.append({'Feature': f"{column} ({extra_text})", 'Target': target_column, 'Type': test_name, 'is_siginificant':is_significant, 'Test Value': stat, 'P-Value': p_value})

    # Create a DataFrame from the results list
    results_df = pd.DataFrame(results)
    #pivoted_results_df = results_df.pivot(index='Feature', columns='Type', values=['Value', 'P-Value'])

    # Flatten the MultiIndex columns
    #pivoted_results_df.columns = [f'{type_} {stat}' for type_, stat in pivoted_results_df.columns]
    #pivoted_results_df.reset_index(inplace=True)
    return results_df

def categorical_hair_loss_test_pairs(data, category_columns, target_column='Hair_Loss', correct_pairwise=True, significance_level=0.05):
    results = []
    for column in category_columns:
        corr, p_value, test_name, is_significant = get_correlation_stats(data, column, target_column)
        if (test_name != None):
            results.append({'Feature': column, 'Target': target_column, 'Type': test_name, 'is_siginificant':is_significant, 'Test Value': corr, 'P-Value': p_value})

        if isinstance(data[column].dtype, pd.CategoricalDtype): #and p_value < 0.05:
            pairwise_p_values = []
            unique_values = data[column].unique()
            for val1, val2 in combinations(unique_values, 2):
                subset_data = data[data[column].isin([val1, val2])]
                pairwise_table = pd.crosstab(subset_data[column], subset_data[target_column])

                if pairwise_table.shape == (2, 2):
                    odds_ratio, pairwise_p_value = fisher_exact(pairwise_table)
                    test_name = "Pairwise Fisher's Exact"
                    statistic = odds_ratio  # Use odds ratio as the statistic
                else:
                    chi2_stat, pairwise_p_value= chi2_contingency(pairwise_table)[:2]
                    test_name = "Pairwise Chi-Square"
                    statistic = chi2_stat  # Use chi-square stat as the statistic

                pairwise_p_values.append(pairwise_p_value)
                results.append({
                    'Feature': column,
                    'Target': target_column,
                    'Type': f'{test_name} {val1} vs {val2}',
                    'is_siginificant': pairwise_p_value < significance_level,  # flag the pairwise test, not the overall one
                    'Test Value': statistic,
                    'P-Value': pairwise_p_value
                })

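            if correct_pairwise and pairwise_p_values:
                corrected_p_values = multipletests(pairwise_p_values, method='bonferroni')[1]
                for i, corrected_p in enumerate(corrected_p_values):
                    row = results[-len(pairwise_p_values) + i]
                    row['P-Value'] = corrected_p
                    # Keep the significance flag consistent with the corrected p-value
                    row['is_siginificant'] = corrected_p < significance_level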

    results_df = pd.DataFrame(results)
    return results_df
            

def categorical_pairwise_multiples_test(data, category_columns, target_column='Hair_Loss', correct_pairwise=True, significance_level=0.05):
    results = []

    # Iterate over each category column to perform pairwise analysis
    for category_column in category_columns:
        # Calculate hair loss proportion for each level in the category
        hair_loss_proportion = calculate_hair_loss_proportion(data, category_column, target_column)
        
        # Test for normality within each group in the current category
        normality_results = test_normality(hair_loss_proportion, group_col=category_column)
        
        # Test for homogeneity of variances across groups in the current category
        variance_homogeneity_result = test_variance_homogeneity(hair_loss_proportion, group_col=category_column)
        
        # Initialize lists to store p-values for multiple comparison correction
        pairwise_results = []
        p_values = []
        
        # Perform pairwise tests between unique pairs of groups within the category
        unique_groups = hair_loss_proportion[category_column].unique()
        for i, group1 in enumerate(unique_groups):
            for group2 in unique_groups[i+1:]:
                # Perform the pairwise test between group1 and group2
                _, _, stat, p_value, test_name, is_significant = perform_pairwise_test(
                    hair_loss_proportion,
                    group_col=category_column,
                    group1=group1,
                    group2=group2,
                    normality_p_values=normality_results,
                    levene_p_value=variance_homogeneity_result,
                    significance_level=significance_level
                )
                
                # Append the raw p-value for correction
                p_values.append(p_value)
                
                # Collect initial results for this pairwise comparison
                pairwise_results.append({
                    "Category": category_column,
                    "Group1": group1,
                    "Group2": group2,
                    "Type": test_name,
                    "Statistic": stat,
                    "P_Value": p_value,
                    "Is_significant": is_significant
                })
        
        # Apply multiple comparison correction if specified
        if correct_pairwise:
            # Perform multiple comparisons correction on p-values
            corrected_results = multipletests(p_values, alpha=significance_level, method='bonferroni')
            corrected_p_values = corrected_results[1]
            
            # Update pairwise_results with corrected p-values and significance flag
            for i, result in enumerate(pairwise_results):
                result["Corrected_P_Value"] = corrected_p_values[i]
                result["Significant"] = corrected_p_values[i] < significance_level
        else:
            # If no correction, simply mark significance based on raw p-value
            for result in pairwise_results:
                result["Corrected_P_Value"] = result["P_Value"]
                result["Significant"] = result["P_Value"] < significance_level
        
        # Add all pairwise results for this category to the results list
        results.extend(pairwise_results)

    # Convert the results list to a DataFrame
    results_df = pd.DataFrame(results)
    return results_df


def evaluate_column_skewness(dataframe, column, threshold=0.5):
    # Drop missing values and cast to float before computing skewness
    column_data = dataframe[column].dropna().astype('float64')
    column_skewness = skew(column_data)
    
    if column_skewness > threshold:
        return ( f"right-skewed (positively skewed) {round(column_skewness, 2)}.")
    elif column_skewness < -threshold:
        return( f"left-skewed (negatively skewed) {round(column_skewness, 2)}.")
    else:
        return( f"symmetric (not skewed) {round(column_skewness, 2)}.")

def calculate_iqr(data):
    """
    Calculate the Interquartile Range (IQR) of a given dataset or column.
    
    Parameters:
    data (array-like): The dataset or column to calculate the IQR for.
    
    Returns:
    float: The IQR of the dataset.
    """
    q1 = np.percentile(data, 25)  # 25th percentile
    q3 = np.percentile(data, 75)  # 75th percentile
    iqr = q3 - q1
    return iqr

def cast_columns (df, type_columns_dict):  
    for key in type_columns_dict.keys():
        for column in type_columns_dict[key]:
            if ((key == "str")):
                df[column] = df[column].astype(str)
            elif ((key == "string") or (key == "category") ): 
                df[column] = df[column].astype(key)
            elif ((key == "int") or (key == "integer") or (key == "int64") ):
                # pd.to_numeric only accepts 'integer', 'signed', 'unsigned' or 'float' for downcast
                df[column] = pd.to_numeric(df[column], downcast='integer', errors='coerce')
                #df[column] = df[column].astype(int)
            elif ((key == "float") or (key == "float64") ):
                df[column] = pd.to_numeric(df[column], downcast='float', errors='coerce')
            elif (key == "bool"): 
                df[column] = df[column].astype(bool)
            elif (key == "boolean"):
                df[column] = df[column].astype('boolean')
            elif (key == 'date'):
                df[column] = pd.to_datetime(df[column], errors='coerce')
    return df

def create_report_df(report, classifier_name):
    # Extract metrics for each class
    metrics = {'Precision': {}, 'Recall': {}, 'F1-score': {}}
    for label in ['0', '1', 'macro avg', 'weighted avg']:
        if label in report:
            metrics['Precision'][label] = report[label].get('precision', None)
            metrics['Recall'][label] = report[label].get('recall', None)
            metrics['F1-score'][label] = report[label].get('f1-score', None)
    
    df = pd.DataFrame(metrics)
    df.index.name = 'Class'
    df['Classifier'] = classifier_name
    return df

# Function to annotate heatmap with counts, percentages by class, and custom labels
def annotate_heatmap(matrix, ax, title, cell_labels=None):
    # Normalize matrix by row to calculate the percentage per class (row)
    row_sums = matrix.sum(axis=1, keepdims=True)
    matrix_percentage = matrix / row_sums * 100
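    if cell_labels is None:
        # Default labels for a binary confusion matrix; caller-supplied labels are kept
        cell_labels = [["True Negatives", "False Positives"], ["False Negatives", "True Positives"]]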

    # Create heatmap
    heatmap = sns.heatmap(matrix, annot=False, fmt='g', cmap='Blues', 
                          xticklabels=['No Hair Loss', 'Hair Loss'], 
                          yticklabels=['No Hair Loss', 'Hair Loss'], ax=ax)
    
    cmap = plt.get_cmap('Blues')  # plt.cm.get_cmap is deprecated in recent Matplotlib releases
    norm = plt.Normalize(vmin=matrix.min(), vmax=matrix.max())
    
    # Add counts, percentages, and custom labels
    for i in range(matrix.shape[0]):
        for j in range(matrix.shape[1]):
            count = matrix[i, j]
            percent = matrix_percentage[i, j]
            label_text = f"{count}\n({percent:.1f}%)"
            
            # If custom labels are provided, use them
            if cell_labels is not None:
                label_text = f"{cell_labels[i][j]}\n{label_text}"

            # Get the background color of the cell
            color = cmap(norm(matrix[i, j]))
            text_color = "white" if color[0]*0.299 + color[1]*0.587 + color[2]*0.114 < 0.5 else "black"
            
            # Add text with appropriate color
            ax.text(j + 0.5, i + 0.5, label_text, ha='center', va='center', color=text_color, fontsize=12)
    
    ax.set_title(title.upper(), fontsize=16)

data = pd.read_csv('data/Predict Hair Fall.csv').sort_values('Id')

The data

This analysis is based on survey data available in the "Predict Hair Fall.csv" file (accessible via Kaggle). The dataset comprises 999 observations and 13 columns, capturing individual-level information related to potential contributors to hair loss. Each row represents one respondent.

However, the dataset lacks crucial contextual information regarding the data collection process, including details about how the data were obtained and the sampling methodology used. Without insights into whether the survey participants were selected randomly, stratified, or through convenience sampling, it is difficult to assess the representativeness and generalizability of the findings. The absence of such metadata raises concerns about potential biases in the sample, such as overrepresentation or underrepresentation of certain age groups, demographics, or health conditions, which may impact the validity and applicability of the analysis.

Column | Description
Id | A unique identifier for each person.
Genetics | Whether the person has a family history of baldness.
Hormonal Changes | Indicates whether the individual has experienced hormonal changes (Yes/No).
Medical Conditions | Medical history that may lead to baldness; alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.
Medications & Treatments | History of medications that may cause hair loss; chemotherapy, heart medications, antidepressants, steroids, etc.
Nutritional Deficiencies | Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega
Stress | Indicates the stress level of the individual (Low/Moderate/High).
Age | Represents the age of the individual.
Poor Hair Care Habits | Indicates whether the individual practices poor hair care habits (Yes/No).
Environmental Factors | Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).
Smoking | Indicates whether the individual smokes (Yes/No).
Weight Loss | Indicates whether the individual has experienced significant weight loss (Yes/No).
Hair Loss | Binary variable indicating the presence (1) or absence (0) of baldness in the individual.

Analysis

This analysis aims to explore potential correlations between hair loss and factors such as genetics, hormonal changes, medical conditions, medications, nutritional deficiencies, stress, and lifestyle habits. The insights are intended to inform personalized health solutions.

The data consists mostly of binary and categorical features.

Definitions

  • Hair Loss Rate: Calculated as the percentage of individuals experiencing hair loss within specific groups (e.g., age or other features).
  • Binary Factor Rate: Percentage of individuals with a particular binary feature (e.g., presence of genetics or hormonal changes) within specific groups.
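The two definitions above can be sketched in pandas as follows. This is a minimal illustration, not the notebook's hidden code: the function names `hair_loss_rate` and `binary_factor_rate` are hypothetical, the toy rows are made up, and the "Yes"/"No" coding of `Genetics` is an assumption based on the other binary columns.

```python
import pandas as pd


def hair_loss_rate(df: pd.DataFrame, group_col: str, target_col: str = "Hair Loss") -> pd.Series:
    """Percentage of rows with target_col == 1 within each group of group_col."""
    return df.groupby(group_col)[target_col].mean().mul(100).round(2)


def binary_factor_rate(df: pd.DataFrame, group_col: str, factor_col: str,
                       positive: str = "Yes") -> pd.Series:
    """Percentage of rows whose factor_col equals `positive` within each group."""
    is_positive = df[factor_col].eq(positive)
    return is_positive.groupby(df[group_col]).mean().mul(100).round(2)


# Toy rows for illustration only (not taken from the survey).
toy = pd.DataFrame({
    "Age": [20, 20, 30, 30],
    "Genetics": ["Yes", "No", "Yes", "Yes"],  # assumed Yes/No coding
    "Hair Loss": [1, 0, 1, 1],
})

print(hair_loss_rate(toy, "Age"))
print(binary_factor_rate(toy, "Age", "Genetics"))
```

Because `Hair Loss` is coded 0/1, the group mean is already the proportion of respondents with hair loss, so multiplying by 100 gives the rate directly.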

Data Preparation and Missing Value Handling

  • Missing Data: Entries labeled as "No Data" were retained as a distinct category, representing the absence of reported conditions or treatments. This preserved data diversity and accounted for responses where no relevant factors were reported.
  • Inconsistencies: Duplicate IDs from different age groups were identified but retained, as they reflected unique observations across age categories.
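The checks described above might look like the sketch below. The helper names `no_data_counts` and `duplicate_id_rows` are hypothetical and the toy rows are invented for illustration; only the column names and the "No Data" label come from the text.

```python
import pandas as pd


def no_data_counts(df: pd.DataFrame, label: str = "No Data") -> pd.Series:
    """Count, per column, the cells that hold the placeholder label."""
    return df.isin([label]).sum()


def duplicate_id_rows(df: pd.DataFrame, id_col: str = "Id") -> pd.DataFrame:
    """Return every row whose Id occurs more than once, ordered by Id."""
    repeated = df.duplicated(id_col, keep=False)
    return df[repeated].sort_values(id_col, kind="stable")


# Toy rows for illustration only (not taken from the survey).
toy = pd.DataFrame({
    "Id": [101, 102, 102, 103],
    "Age": [25, 31, 44, 38],
    "Medical Conditions": ["No Data", "Psoriasis", "No Data", "Dermatitis"],
})

print(no_data_counts(toy))     # "No Data" is kept as its own category, only counted here
print(duplicate_id_rows(toy))  # repeated Ids with different ages are inspected, not dropped
```

Listing the repeated Ids side by side makes it easy to confirm that they differ in age before deciding to keep them.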

Level 1: Descriptive statistics

  • The age of the respondents in the survey is approximately normally distributed, with an average age of 34 years. Each individual age has at least 21 respondents, with an average of 30 respondents per age.
  • The most common medical condition reported is No Data, accounting for 11% of responses, followed by Alopecia Areata (10.71% of responses) and Psoriasis (10.01%), while the remaining categories have less than 10% of responses.
  • The top deficiency identified is Zinc Deficiency, which accounts for 10.81% of responses, followed by Vitamin D (10.41%), while the remaining categories have less than 10% of responses.

What is the average age? What is the age distribution?

The age of the respondents in the survey is approximately normally distributed, with an average age of 34 years. The distribution is nearly symmetric (skewness -0.03), and ages range from 18 to 50 years. Each individual age has at least 21 respondents, with an average of 30 respondents per age. Half of the respondents fall within the interquartile range of 26 to 42 years. The most frequently reported age is 32 years, with 38 responses.
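The figures in this paragraph are standard descriptive statistics. A minimal sketch of how they could be computed is shown below; `summarize_age` is a hypothetical helper, the toy ages are invented, and pandas' `Series.skew()` (a bias-corrected estimator) is an assumption, since the original skewness method is not stated.

```python
import pandas as pd


def summarize_age(age: pd.Series) -> dict:
    """Descriptive statistics for an age column, plus respondent counts per age."""
    counts = age.value_counts()  # number of respondents at each distinct age
    return {
        "mean": age.mean(),
        "min": age.min(),
        "max": age.max(),
        "q1": age.quantile(0.25),
        "q3": age.quantile(0.75),
        "skewness": age.skew(),           # bias-corrected sample skewness
        "mode": counts.idxmax(),          # most frequently reported age
        "mode_count": counts.max(),
        "min_per_age": counts.min(),      # smallest group size across ages
        "mean_per_age": counts.mean(),    # average group size across ages
    }


# Toy ages for illustration only (not taken from the survey).
toy_age = pd.Series([18, 20, 22, 22, 22, 24, 26], name="Age")
summary = summarize_age(toy_age)
print(summary)
```

On the real data this would be called as `summarize_age(data["Age"])`, assuming `data` is the frame loaded from the CSV.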

Which medical conditions are the most common? How often do they occur?

The most frequent entry is No Data: approximately 11% of respondents reported no medical condition or chose not to disclose one. The next most prevalent entries are Alopecia Areata (10.71% of responses) and Psoriasis (10.01%). Each of the remaining medical conditions accounts for less than 10% of responses. Although Alopecia Areata is the most frequently reported medical condition, the other conditions are comparably common, reflecting a balanced distribution across the group. The proportion of responses per condition is roughly symmetric (skewness -0.37), with a range from 6.91% to 11.01%, an average of 9.09%, and an interquartile range of 1.6 percentage points.

What types of nutritional deficiencies are there and how often do they occur?

For nutritional deficiencies, the most common is Zinc Deficiency (10.81%), followed by Vitamin D (10.41%). Each of the remaining deficiencies accounts for less than 10% of responses. As with medical conditions, the other deficiencies closely follow the most common one, suggesting no single dominant deficiency. The response distribution is roughly symmetric (skewness 0.34), with values ranging from 7.81% to 10.81%, an average of about 9%, and an interquartile range of 1.65 percentage points.
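The category shares and their spread, as reported for both medical conditions and nutritional deficiencies, can be derived from a frequency table. The sketch below is illustrative only: `category_shares` and `share_spread` are hypothetical helpers, the toy labels are invented, and the skewness estimator is pandas' default rather than a confirmed choice.

```python
import pandas as pd


def category_shares(df: pd.DataFrame, col: str) -> pd.Series:
    """Percentage of responses per category, most frequent first."""
    return df[col].value_counts(normalize=True).mul(100).round(2)


def share_spread(shares: pd.Series) -> dict:
    """Summarize how evenly the responses are spread across categories."""
    return {
        "min": shares.min(),
        "max": shares.max(),
        "mean": shares.mean(),
        "iqr": shares.quantile(0.75) - shares.quantile(0.25),  # in percentage points
        "skewness": shares.skew(),
    }


# Toy rows for illustration only (not taken from the survey).
toy = pd.DataFrame({
    "Nutritional Deficiencies": ["Zinc Deficiency", "Zinc Deficiency", "Vitamin D", "No Data"],
})

shares = category_shares(toy, "Nutritional Deficiencies")
spread = share_spread(shares)
print(shares)
print(spread)
```

The same two calls apply unchanged to the `Medical Conditions` column. A small interquartile range relative to the mean share is what indicates that no single category dominates.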


Level 2: Visualization

  • The average hair loss rate across individual ages is 49.69% (SD = 9.55%), indicating that, on average, about half of respondents experience hair loss across all ages. The hair loss rates of survey respondents across quartile-based age groupings are: 18-26 (49.60%), 27-34 (54.72%, highest), 35-42 (51.41%) and 43-50 (43.03%, lowest).
  • Respondents reporting moderate stress show a slightly higher hair loss rate, and distribution patterns across ages suggest that while stress may influence hair loss, its impact is subtle.
  • In summary, despite some variations in medians and IQRs across the dataset features, statistical testing shows no significant differences and only weak correlations, implying that other factors or interactions might contribute to hair loss outcomes.

What is the proportion of respondents with hair loss in different age groups?

The distribution of hair loss rates across individual ages is approximately symmetric (skewness = -0.45), with a range from 23.33% to 66.67% and notable peaks in proportions at ages 21, 28, and 29 years. The average hair loss rate is 49.69% (SD = 9.55%), indicating that, on average, about half of respondents experience hair loss across all ages.

When grouping ages into four ranges based on quartiles (18-26, 27-34, 35-42, 43-50), the variability in hair loss rate is reduced, resulting in a narrower range of proportions (43.03% to 54.72%) and a slight left-skew (skewness = -0.53). This quartile-based grouping suggests higher-than-average hair loss rates in certain groups, especially the 27-34 range (54.72%, highest) compared to the 43-50 group (43.03%, lowest). This aggregated view provides a more stable central tendency but may mask variability seen in specific ages.
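The per-age and quartile-band rates discussed above can be reproduced along these lines. This is a sketch under stated assumptions: the bin edges are taken from the four ranges quoted in the text, the helper names `rate_by_age` and `rate_by_age_band` are hypothetical, and the toy rows are invented.

```python
import pandas as pd

# Right-closed bins matching the ranges in the text: 18-26, 27-34, 35-42, 43-50.
AGE_EDGES = [17, 26, 34, 42, 50]
AGE_LABELS = ["18-26", "27-34", "35-42", "43-50"]


def rate_by_age(df: pd.DataFrame) -> pd.Series:
    """Hair loss rate (%) for each individual age."""
    return df.groupby("Age")["Hair Loss"].mean().mul(100)


def rate_by_age_band(df: pd.DataFrame) -> pd.Series:
    """Hair loss rate (%) for each quartile-based age band."""
    bands = pd.cut(df["Age"], bins=AGE_EDGES, labels=AGE_LABELS)
    return df.groupby(bands, observed=False)["Hair Loss"].mean().mul(100)


# Toy rows for illustration only (not taken from the survey).
toy = pd.DataFrame({
    "Age":       [18, 26, 27, 34, 35, 42, 43, 50],
    "Hair Loss": [1,  0,  1,  1,  0,  1,  0,  0],
})

per_age = rate_by_age(toy)
per_band = rate_by_age_band(toy)
print(per_age.mean(), per_age.std(), per_age.skew())  # unweighted summary across ages
print(per_band)
```

Note that a mean taken over per-age rates weights every age equally, regardless of how many respondents it has, so it can differ slightly from the overall share of respondents with hair loss.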
