The Anatomy of Baldness: Statistical & Machine Learning Perspectives on Key Factors of Hair Loss

0. Background

0.1. Introduction

Hair loss is a growing health concern that affects both appearance and overall well-being for many individuals. Gaining insights into the factors that contribute to its occurrence is essential for the development of personal health management, medical interventions, and related industries.

This report analyzes survey data on potential contributors to hair loss, with a focus on three specific objectives:


Explore the characteristics of surveyed individuals.
Examine the relationships between hair loss and its factors.
Determine key predictors of hair loss through machine learning models.

0.2. Data Used

The dataset contains a total of 999 rows of observations, each of which represents a single respondent. The table below provides a list of variables and their corresponding descriptions:

Variable	Description
Id	A unique identifier for each person.
Genetics	Whether the person has a family history of baldness.
Hormonal Changes	Indicates whether the individual has experienced hormonal changes (Yes/No).
Medical Conditions	Medical history that may lead to baldness; alopecia areata, thyroid problems, scalp infections, psoriasis, dermatitis, etc.
Medications & Treatments	History of medications that may cause hair loss; chemotherapy, heart medications, antidepressants, steroids, etc.
Nutritional Deficiencies	Lists nutritional deficiencies that may contribute to hair loss, such as iron deficiency, vitamin D deficiency, biotin deficiency, omega-3 fatty acid deficiency, etc.
Stress	Indicates the stress level of the individual (Low/Moderate/High).
Age	Represents the age of the individual.
Poor Hair Care Habits	Indicates whether the individual practices poor hair care habits (Yes/No).
Environmental Factors	Indicates whether the individual is exposed to environmental factors that may contribute to hair loss (Yes/No).
Smoking	Indicates whether the individual smokes (Yes/No).
Weight Loss	Indicates whether the individual has experienced significant weight loss (Yes/No).
Hair Loss	Binary variable indicating the presence (1) or absence (0) of baldness in the individual.

# Load pre-installed, required packages
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(scales))
# suppressPackageStartupMessages(library(readr))
suppressPackageStartupMessages(library(DT))

# Install and load the 'ggthemes' package
suppressWarnings(suppressMessages(install.packages("ggthemes", quite=TRUE)))
suppressPackageStartupMessages(library(ggthemes))

# Read the 'Predict Hair Fall' dataset
data <- read_csv("data/Predict Hair Fall.csv", show_col_types=FALSE)

# Encode data types and factors
data <- data %>%
	mutate(Id = as.integer(Id),
		   Age = as.integer(Age),
		   across(where(is.character), as.factor),
		   Stress = factor(Stress, levels=c("Low", "Moderate", "High"), ordered=TRUE),
		   `Hair Loss` = as.factor(`Hair Loss`)
		   )

# Backup original dataset
data1 <- read_csv("data/Predict Hair Fall.csv", show_col_types=FALSE)

# Function to install missing packages
install_if_missing <- function(p) {
  if (!require(p, character.only = TRUE)) {
    suppressWarnings(suppressMessages(install.packages(p, dependencies = TRUE)))
    suppressPackageStartupMessages(library(p, character.only = TRUE))
  }
}

# Function to tabularize summary statistics of a given variable in the dataset
summarize_variable <- function(data, variable=NULL, caption=NULL, dom="t", searching=FALSE, rownames=FALSE, pageLength=10) {
  if (is.null(variable)) {
    datatable(data, 
                  class="cell-border stripe", 
                  options = list(dom=dom, searching=searching, pageLength=pageLength),
                  rownames=rownames,
                  caption=if (!is.null(caption)) htmltools::tags$caption(
                    style = 'caption-side: top; text-align: center;',
                    caption
                  ) else NULL)
  } else {
	summarized_data <- data %>%
      filter(!!sym(variable) != "No Data") %>%
      group_by(!!sym(variable)) %>%
      summarise(Count = n(), .groups="drop") %>%
      {
        if (is.ordered(data[[variable]])) {
          arrange(., desc(!!sym(variable)))
        } else {
          arrange(., desc(Count))
        }
      } %>%
      mutate(Percentage = label_percent(accuracy=0.01)(Count/sum(Count)))
    
    datatable(summarized_data %>% rename(!!variable := !!sym(variable)),
                  class="cell-border stripe", 
                  options = list(dom=dom, searching=searching, pageLength=pageLength),
			      rownames=rownames,
                  caption=if (!is.null(caption)) htmltools::tags$caption(
                    style = 'caption-side: top; text-align: center;',
                    caption
                  ) else NULL)
  }
}
			  
# Function to check if a package is installed, and install it if not
install_if_missing <- function(package) {
  if (!require(package, character.only = TRUE)) {
    install.packages(package, dependencies = TRUE)
    library(package, character.only = TRUE)
  }
}
			  
# Function to create a propotional stack bar plot (used for subplots of multiple variables)
create_plotly_stackbar <- function(data = data, var, showlegend=FALSE, yaxis_text=NULL, yaxis_titleSize=11.5, yaxis_tickSize=10, tickangle=0) {
  yaxis_title <- ifelse(is.null(yaxis_text), var, yaxis_text)
  
  plot_data <- as.data.frame(table(data %>% filter(!!sym(var) != "No Data") %>%
								   		select(`Hair Loss`, all_of(var)))) %>%
    rename_with(~ gsub("\\.", " ", .)) %>%
	group_by(!!sym(var)) %>%
    mutate(Percentage = 100*Freq/sum(Freq),
           `Hair Loss` = ifelse(`Hair Loss` == 1, "With baldness", "Without baldness")
          ) %>%
    rename(Count = Freq)
  
  plot <- plot_ly(
    data = plot_data,
    x = ~Percentage,
    y = as.formula(paste0("~`", var, "`")),
    type = "bar",
    color = ~`Hair Loss`,
    colors = c("#FF6347", "#8c8c8c"),
    showlegend = showlegend,
    text = ~paste0("<b>Percentage:</b> ", label_percent(accuracy=0.01)(Percentage/100), "\n<b>Count:</b> ", Count),
    hoverinfo = "text"
  ) %>%
    layout(
      xaxis = list(title = list(text="Percentage (%)", font = list(family="ITC Officina Sans", size=18))),
      yaxis = list(title = list(text = paste("\n<b>", yaxis_title, "</b>"),
								font = list(family="ITC Officina Sans", size = yaxis_titleSize), 
								standoff=43),
                   tickfont = list(family="ITC Officina Sans", size = yaxis_tickSize), tickangle=tickangle
				  ),
      barmode="stack",
      legend = list(orientation="h", xanchor="center", x=0.438, y=1.11, traceorder="normal",
					font = list(family="ITC Officina Sans", size=15)),
	  font = list(size = 14),
	  hoverlabel = list(font = list(family="ITC Officina Sans", size=15))
    )
  
  return(plot)
}			  
			  
# Function to create a data frame for contingency table of a variable			  
create_contingency_data <- function(data, var) {
	
  # Select relevant columns and filter out "No Data" values
  selected_data <- data %>% 
    select(`Hair Loss`, all_of(var)) %>%
    filter(!!sym(var) != "No Data")
  
  # Create contingency table
  contingency_data <- data.frame(matrix(
    with(selected_data, table(selected_data[[var]], `Hair Loss`)),
    nrow = dim(with(selected_data, table(selected_data[[var]], `Hair Loss`)))[1], 
    dimnames = dimnames(with(selected_data, table(selected_data[[var]], `Hair Loss`)))
  ))
  
  return(contingency_data)
}
			  
# Function to create stack bar plot for a variable
create_stackbar_plot1 <- function(data, var, yaxis_title, yaxis_text=NULL, plot_title=NULL, tickangle=45) {
  stackbar_data <- data %>% filter(!!sym(var) != "No Data") %>%
    mutate(!!sym(var) := factor(!!sym(var),
								levels = rownames(create_contingency_data(data, var) %>% 
                                                    mutate(X01 = X0+X1, X0 = X0/X01, X1 = X1/X01) %>%
													arrange(X1, desc(rownames(.)))
												 )
    ))
  
  plot <- create_plotly_stackbar(data = stackbar_data, 
								 var = var, 
								 showlegend=TRUE,
								 yaxis_text = yaxis_text,
								 yaxis_titleSize=20,
								 yaxis_tickSize=15,
								 tickangle = tickangle)
  
  plot %>%
    layout(
      plot_bgcolor = "#D5E4EB",
      paper_bgcolor = "#D5E4EB",
      title = list(text = plot_title, 
                   font = list(family = "ITC Officina Sans", size = 18), 
                   y = 0.93),
      margin = list(t = 95),
      xaxis = list(tickfont = list(family = "ITC Officina Sans", size = 15))
    )
}			  

# Function to create a contingency table of a variable				  
create_contingency_table <- function(data, var, first_colname=var) {
  # Append the proportion to each value in the dataframe
  data <- as.data.frame(
    apply(data %>% arrange(desc(X1)), c(1, 2), function(x) {
      percentage <- label_percent(accuracy=0.01)(x / sum(data))
      sprintf("%d (%s)", x, percentage)
    })
  ) %>%
    rownames_to_column(var=first_colname) 
  
  # Rename rows and columns
  colnames(data) <- c(first_colname, "Without baldness", "With baldness")
  
  # Reorder columns
  data <- data[,c(first_colname, "With baldness", "Without baldness")]								
  
  return(data)
}
			  
# Function to compute Point-Biserial Correlation
compute_point_biserial <- function(data, x, y, conf=0.95) {
  # Ensure the x and y are in the data
  if (!(x %in% colnames(data)) | !(y %in% colnames(data))) {
    stop("The variable x or y not found in the data.")
  }
  
  # Ensure the y contains only the specified values
  unique_y <- unique(data[[y]])
  if (length(unique_y) != 2) {
    stop("The variable y is not binary.")
  }
  
  # Extract values for individuals with and without the binary outcome
  with_group <- data[[x]][data[[y]] == 1]
  without_group <- data[[x]][data[[y]] == 0]
  
  # Calculate means
  X1 <- mean(with_group)
  X0 <- mean(without_group)
  
  # Calculate standard deviation of the entire x data
  sn <- sd(data[[x]])
  
  # Calculate sample sizes
  n1 <- length(with_group)
  n0 <- length(without_group)
  n <- n1 + n0
  
  # Compute the point biserial correlation coefficient
  rpb <- (X1 - X0) / sn * sqrt((n1 * n0) / (n^2))
  
  # Compute confidence intervals using cor.test
  rpb_cortest <- cor.test(data[[x]], data[[y]], conf.level=conf)
  lower_ci <- rpb_cortest$conf.int[1]
  upper_ci <- rpb_cortest$conf.int[2]
  
  rpb_df <- data.frame(x=x, y=y, X1=X1, X0=X0, rpb=rpb, lower_ci=lower_ci, upper_ci=upper_ci
                      ) %>%
    mutate(across(where(is.numeric) & !where(is.integer), ~ ifelse(abs(.) >= 0.01, 
                                          label_number(accuracy = 0.01)(.), 
                                          label_scientific(digits = 3)(.)))
    ) %>%
    rename(`Correlation coefficient` = rpb) %>%
	mutate(!!paste0(conf * 100, "% CI") := paste0("[", lower_ci, ", ", upper_ci, "]"),
		   Method="Point-biserial") %>%
	select(-c("lower_ci", "upper_ci"))
  
  return(rpb_df)
}			

# Function to compute Rank-Biserial Correlation
compute_rank_biserial <- function(data, x, y, o=0, l=1, conf=0.95) {
  # Install and load the 'rcompanion' package if not already installed
  if (!requireNamespace("rcompanion", quietly = TRUE)) {
    suppressWarnings(suppressMessages(install.packages("rcompanion", quiet=TRUE)))
  }
  suppressPackageStartupMessages(library(rcompanion))
  
  # Ensure the x and y are in the data
  if (!(x %in% colnames(data)) | !(y %in% colnames(data))) {
    stop("The variable x or y not found in the data.")
  }
  
  # Ensure the y contains only the specified values
  unique_y <- unique(data[[y]])
  if (length(unique_y) != 2) {
    stop("The variable y is not binary.")
  }
    
  # Rank the predictor variable
  data$ranked_factor <- rank(data[[x]], ties.method = "average") 
  with_group <- data$ranked_factor[data[[y]] == 1]
  without_group <- data$ranked_factor[data[[y]] == 0]	
	
  # Calculate mean ranks for individuals with and without the binary outcome
  R1 <- mean(with_group)
  R0 <- mean(without_group)
 
  # Compute outputs using the wilcoxonRG function
  group <- factor(c(rep("With group", length(with_group)), rep("Without group", length(without_group))))	
  rrb_wilcoxonRG <- wilcoxonRG(x = c(with_group, without_group), g = group, ci=TRUE, conf=conf)
	
  # Extract coefficient and confidence intervals
  rrb <- rrb_wilcoxonRG[[1]]
  lower_ci <- rrb_wilcoxonRG[[2]]
  upper_ci <- rrb_wilcoxonRG[[3]]
  
  rrb_df <- data.frame(x=x, y=y, R1=R1, R0=R0, rrb=rrb, lower_ci=lower_ci, upper_ci=upper_ci
					  ) %>%
    mutate(across(where(is.numeric) & !where(is.integer), ~ ifelse(abs(.) >= 0.01, 
                                          label_number(accuracy = 0.01)(.), 
                                          label_scientific(digits = 3)(.)))
    ) %>%
    rename(`Correlation coefficient` = rrb) %>%
	mutate(!!paste0(conf * 100, "% CI") := paste0("[", lower_ci, ", ", upper_ci, "]"),
		   Method="Rank-biserial") %>%
	select(-c("lower_ci", "upper_ci")) 
  
  return(rrb_df)
}

# Function to compute Phi Coefficient
compute_phi <- function(data, x, y, o=0, l=1, u, t, conf=0.95) {
  # Install and load the 'statpsych' package if not already installed
  if (!requireNamespace("statpsych", quietly = TRUE)) {
    suppressWarnings(suppressMessages(install.packages("statpsych", quiet=TRUE)))
  }
  suppressPackageStartupMessages(library(statpsych))
  
  # Ensure the x and y are in the data
  if (!(x %in% colnames(data)) | !(y %in% colnames(data))) {
    stop("The variable x or y not found in the data.")
  }
  
  # Ensure the x and y contain only binary values
  unique_x <- unique(data[[x]])
  unique_y <- unique(data[[y]])
  if (length(unique_x) != 2 | length(unique_y) != 2) {
    stop("The variables x and y must be binary.")
  }
  
  # Create contingency table
  contingency_table <- table(data[[x]], data[[y]])
  
  # Convert o, l, u, t to character
  o <- as.character(o)
  l <- as.character(l)
  u <- as.character(u)
  t <- as.character(t)
  
  # Extract counts from the contingency table using provided levels
  a <- contingency_table[t, l]
  b <- contingency_table[t, o]
  c <- contingency_table[u, l]
  d <- contingency_table[u, o]
  
  # Compute outputs using the function ci.phi
  ci_phi <- ci.phi(alpha = 1-conf, d, c, b, a)
  
  # Compute confidence interval limits
  phi <- ci_phi[[1]]
  ci_lower <- ci_phi[[3]]
  ci_upper <- ci_phi[[4]]
  
  # Create a dataframe to return the results
  phi_df <- data.frame(x = x, y = y, counts=paste(a,b,c,d, sep=", "), phi = phi, 
                       ci = paste0("[", round(ci_lower, 2), ", ", round(ci_upper, 2), "]")) %>%
    mutate(across(where(is.numeric) & !where(is.integer), ~ ifelse(abs(.) > 0.01, 
                                                                  label_number(accuracy = 0.01)(.), 
                                                                  label_scientific(digits = 3)(.)))
    ) %>%
    rename(`a, b, c, d` = counts,
           `Phi coefficient` = phi,
		   !!paste0(conf * 100, "% CI") := ci)
  
  return(phi_df)
}		

# Function to compute Cramer's V
compute_cramersV <- function(data, x, y, conf=0.95) {
  # Ensure the x and y are in the data
  if (!(x %in% colnames(data)) | !(y %in% colnames(data))) {
    stop("The variable x or y not found in the data.")
  }
  
  # Create contingency table
  contingency_table <- table(data[[x]], data[[y]])
  
  # Compute chi-square statistic
  chi2 <- chisq.test(contingency_table, simulate.p.value = TRUE)$statistic
  
  # Get the number of rows and columns in the contingency table
  n <- sum(contingency_table)
  r <- nrow(contingency_table)
  c <- ncol(contingency_table)
  rc <- paste(r, c, sep=", ")	
  
  # Compute outputs using the cramerV function
  cramersV <- cramerV(contingency_table, ci=TRUE, conf=conf)
  
  # Extract coefficient and confidence intervals
  cramers_v <- cramersV[[1]]
  lower_ci <- cramersV[[2]]
  upper_ci <- cramersV[[3]]
  
  # Create a dataframe to return the results
  cramers_v_df <- data.frame(
    n = n, 
    rc = rc, 
    x = x, 
    y = y, 
    chi2 = chi2, 
    cramers_v = cramers_v, 
    lower_ci = lower_ci, 
    upper_ci = upper_ci
  ) %>%
    mutate(across(
      where(is.numeric) & !where(is.integer), 
      ~ ifelse(abs(.) > 0.01, label_number(accuracy = 0.01)(.), label_scientific(digits = 3)(.))
    )) %>%
    rename(
      `Chi-square statistic` = chi2,
      `r, c` = rc,
      `Cramer's V` = cramers_v
    ) %>%
    mutate(!!paste0(conf * 100, "% CI") := paste0("[", lower_ci, ", ", upper_ci, "]")) %>%
    select(-c("lower_ci", "upper_ci"))
  
  rownames(cramers_v_df) <- c(1:nrow(cramers_v_df))
  
  return(cramers_v_df)
}
			
# Function for evaluating model performance
evaluate_model <- function(model, test, y, model_name) {
  
  # Check if pROC package is installed, if not, install it
  if (!requireNamespace("pROC", quietly = TRUE)) {
    install.packages("pROC")
  }
  suppressPackageStartupMessages(library(pROC))
  
  # Perform predictions using the model and test data
  prob <- predict(model, test, type = "prob")[, 2]
  predicted <- predict(model, test)
  
  # Create a confusion matrix to evaluate the model's performance
  conf_matrix <- confusionMatrix(predicted, test[[y]])
  
  # Compute model metrics
  roc <- roc(test[[y]], prob)
  auc <- auc(roc)
  accuracy <- conf_matrix$overall['Accuracy']
  precision <- conf_matrix$byClass['Pos Pred Value']
  recall <- conf_matrix$byClass['Sensitivity']
  f1 <- 2 * (precision * recall) / (precision + recall)
  
  # Return performance metrics as a dataframe
  results <- data.frame(
    AUC = round(auc, 4),
    Accuracy = round(accuracy, 4),
    Precision = round(precision, 4),
    Recall = round(recall, 4),
    f1 = round(f1, 4)
  ) %>%
  rename(`F1 Score` = f1)
  
  rownames(results) <- model_name
  
  return(results)
}			  
			  		  
# Function for cluster statistics
cstats.table <- function(dist, tree, k) {
clust.assess <- c("cluster.number","n","within.cluster.ss","average.within","average.between",
                  "wb.ratio","dunn2","avg.silwidth")
clust.size <- c("cluster.size")
stats.names <- c()
row.clust <- c()
output.stats <- matrix(ncol = k, nrow = length(clust.assess))
cluster.sizes <- matrix(ncol = k, nrow = k)
for(i in c(1:k)){
  row.clust[i] <- paste("Cluster-", i, " size")
}
for(i in c(2:k)){
  stats.names[i] <- paste("Test", i-1)
  
  for(j in seq_along(clust.assess)){
    output.stats[j, i] <- unlist(cluster.stats(d = dist, clustering = cutree(tree, k = i))[clust.assess])[j]
    
  }
  
  for(d in 1:k) {
    cluster.sizes[d, i] <- unlist(cluster.stats(d = dist, clustering = cutree(tree, k = i))[clust.size])[d]
    dim(cluster.sizes[d, i]) <- c(length(cluster.sizes[i]), 1)
    cluster.sizes[d, i]
    
  }
}
output.stats.df <- data.frame(output.stats)
cluster.sizes <- data.frame(cluster.sizes)
cluster.sizes[is.na(cluster.sizes)] <- 0
rows.all <- c(clust.assess, row.clust)
output <- rbind(output.stats.df, cluster.sizes)[ ,-1]
colnames(output) <- stats.names[2:k]
rownames(output) <- rows.all
is.num <- sapply(output, is.numeric)
output[is.num] <- lapply(output[is.num], round, 2)
output
}			  		  
			  		  
# Tabularize first five rows of the dataset			  
summarize_variable(slice(data, 1:5), dom="t")

0.3. Executive Summary

The surveyed individuals are, on average, 34 years old, with their ages evenly distributed between 18 to 50 years.
Alopecia areata emerges as the most common medical condition associated with baldness, consistent with its established medical background. Other notable conditions include psoriasis, thyroid problems, androgenetic alopecia, and dermatitis, each reported by approximately one-third of the 889 individuals with documented medical conditions.
Among the 919 respondents with data on nutritional deficiencies, zinc and vitamin D deficiencies are the most prevalent. Biotin, omega-3 fatty acids, and vitamin A deficiencies also rank in the top five, each reported by at least 10% of these respondents.
Middle-aged individuals, aged 31–45 years, constitute the largest proportion of those who have experienced baldness.
Hair loss exhibits near-zero associations with all factors, indicating that it results from a combination of multiple factors rather than a single one.
Stress levels—categorized as low, moderate, and high—are found to be evenly distributed among individuals with and without baldness.
Among the three classification models evaluated, a random forest model, which demonstrates an initial AUC score of ~0.58 and an accuracy of ~0.59, is recommended for further tuning and optimization.
Cluster analysis identifies two groups of individuals with shared key underlying features related to hair loss, including scalp infection, psoriasis, androgenetic alopecia, hormonal changes, and the use of immunomodulators.
Through the importance analysis of the model, stress, age, and smoking are identified as major predictors of hair loss, while several features related to lifestyle and medical history are found to have considerable influence on the model's predictive performance.

1. Descriptive Analysis

This section provides an overview of the characteristics of the surveyed individuals using summary statistics, focusing on age, medical history, stress, and lifestyle factors.

1.1. Age

Range: The youngest individual in the survey is 18 years old, and the oldest is 50 years old.
1st Quartile (Q1): Twenty-five percent of respondents are younger than 26 years.
Mean: The average age of the respondents is 34.19 years.
Median (Q2): Half of the individuals are younger (or older) than 34 years. This value is nearly the same as the mean, indicating a relatively symmetrical distribution of ages among the respondents.
3rd Quartile (Q3): Seventy-five percent of the surveyed individuals are younger than 42 years.
Standard Deviation: 9.38 years, indicating a moderate spread of ages around the mean.
Skewness: -0.03, indicating a nearly symmetrical age distribution with no significant skew toward younger or older groups.

# Install and load the 'moments' package for calculating skewness
suppressWarnings(suppressMessages(install.packages("moments", quiet=TRUE)))
suppressPackageStartupMessages(library(moments))

# Summary statistics of the age
age_stats <- data.frame(Value = unclass(summary(data$Age))) %>%
  bind_rows(data.frame(Value = c(sd(data$Age), skewness(data$Age)), row.names = c("Std. Dev.", "Sk\u209a"))) %>%
  mutate_if(is.numeric, round, 2)

# Age distribution plot
age_dist_plot <- ggplot(data, aes(x = Age, text = paste0("<b>Count:</b> ", after_stat(count)))) +
  geom_histogram(breaks = hist(data$Age, breaks="FD", plot = FALSE)$breaks) +
  geom_vline(aes(xintercept = mean(Age), text = paste0("<b>Mean: </b>", round(mean(Age), 2))), col="#FFA53E", linewidth=0.5) +
  geom_vline(aes(xintercept = median(Age), text = paste0("<b>Median: </b>", round(median(Age), 2))), col="#4C6DB4", linewidth=0.5) +
  theme_economist() + 
  scale_color_economist() +
  scale_x_continuous(breaks = seq(15, 50, by=5)) +
  ggtitle("\nFIG. 1﹕HISTOGRAM OF THE AGE DISTRIBUTION OF SURVEYED INDIVIDUALS") +
  theme(
    plot.title = element_text(size=13, hjust=0.5),
    axis.title.x = element_text(family="ITC Officina Sans", size=13),
    axis.title.y = element_text(family="ITC Officina Sans", size=13)
  ) +
  labs(y="Count")

# Convert ggplot to plotly
age_dist_plotly <- ggplotly(age_dist_plot, tooltip=c("text")) %>%
  layout(
    hoverlabel = list(font = list(family="ITC Officina Sans", size=15)),
    xaxis = list(tickfont = list(family="ITC Officina Sans", size=13)),
    yaxis = list(tickfont = list(family="ITC Officina Sans", size=13))
  ) %>%
  add_annotations(
    text = paste0(
      "<b>Std. Deviation (<i>\u03c3</i>): </b>", round(sd(data$Age), 2), 
      ";  <b>Skewness (<i>Sk\u209a</i>): </b>", round(skewness(data$Age), 2)
    ),
    x = 0.037, y = 0.97, xref = "paper", yref = "paper",
    showarrow = FALSE, font = list(family="ITC Officina Sans", size=13.5), align="left"
  )

Hidden code

1.2. Medical History

This report defines an individual’s medical history as their medical background and experiences, particularly those related to hair loss. The following statistics highlight key characteristics of the respondents' medical histories:

Genetics: A total of 52.25% of respondents reported having a family history of baldness.
Hormonal Changes: Approximately 51% of respondents have experienced hormonal changes.
Medical Conditions: Among the 889 individuals with recorded medical conditions potentially leading to baldness, alopecia areata is the most common, affecting 12.04% of the group. It is followed by psoriasis (11.25%), thyroid problems (11.14%), and androgenetic alopecia (11.02%). Dermatitis (10.35%) also ranks highly, rounding out the top five conditions.

Hidden code

# Tabularize summary statistics of Hormonal Changes
summarize_variable(
  data %>% rename(`Has experienced hormonal changes` = `Hormonal Changes`),
  variable = "Has experienced hormonal changes",
  rownames = FALSE
)

# Tabularize summary statistics of Medical Conditions
summarize_variable(
  data = data %>% rename(`Medical condition that may lead to baldness` = `Medical Conditions`),
  variable = "Medical condition that may lead to baldness",
  dom = "tip",
  pageLength = 5
)

Medications & Treatments: Rogaine is the most commonly used medication or treatment associated with hair loss among the respondents, with 11.63% reporting its use. It is followed by antidepressants (11.03%), steroids (10.73%), heart medication (10.43%), and Accutane (10.23%).

Hidden code

Nutritional Deficiencies: Most of the documented 919 individuals with nutritional deficiencies that may contribute to hair loss are zinc deficient (11.75%). Vitamin D deficiency follows closely at 11.32%, while both Biotin and vitamin A deficiencies impact 10.77% each. The deficiency of omega-3 fatty acids completes the top five, affecting 10.01% of the group.

Hidden code

1.3. Stress

Stress levels among respondents are found to be almost evenly distributed, with 32.73% reporting low stress, 35.14% experiencing moderate stress, and 32.13% dealing with high stress.

1.4. Lifestyle

In this report, lifestyle factors are defined as various habits and behaviors that may influence hair loss. The statistics below provide an overview of key lifestyle traits among the respondents:

Poor Hair Care Habits: Nearly half (49.25%) of the respondents reported engaging in poor hair care practices.
Environmental Factors: A total of 50.85% indicated exposure to environmental factors associated with hair loss.
Smoking: Approximately 52% of the respondents are identified as smokers.
Weight Loss: Significant weight loss is reported by 47.25% of the respondents.

2. Measures of Association

2 hidden cells

‌
‌
‌

The Anatomy of Baldness: Statistical & Machine Learning Perspectives on Key Factors of Hair Loss

.mfe-app-workspace-kj242g{position:absolute;top:-8px;}.mfe-app-workspace-11ezf91{display:inline-block;}.mfe-app-workspace-11ezf91:hover .Anchor__copyLink{visibility:visible;}The Anatomy of Baldness: Statistical & Machine Learning Perspectives on Key Factors of Hair Loss

0. Background

0.1. Introduction

0.2. Data Used

0.3. Executive Summary

1. Descriptive Analysis

1.1. Age

1.2. Medical History

1.3. Stress

1.4. Lifestyle

2. Measures of Association

The Anatomy of Baldness: Statistical & Machine Learning Perspectives on Key Factors of Hair Loss