
Summary

This project studies the causes of COVID-19 vaccine hesitancy among the U.S. population at the county level. It specifically investigates the factors that influence people's decision to avoid the vaccine. The data used for this project were retrieved from several sources, including the United States Census Bureau, the Centers for Disease Control and Prevention (CDC), the United States Department of Agriculture (USDA), Harvard Dataverse, the Household Pulse Survey (HPS), Surgo Ventures, and data repository websites such as livingatlas.com. The project examined the significance of variables such as race, geographic location of counties, the Social Vulnerability Index (SVI), the COVID-19 Vaccine Coverage Index (CVAC, which measures concern for a difficult vaccine rollout), political affiliation, socioeconomic status, and obesity in explaining the level of and variance in COVID-19 Vaccine Hesitancy (VH).

The analysis consists of both descriptive and predictive methods. To identify the relationship between VH and the selected variables, correlation analysis and K-means clustering were employed. This process identified clusters of counties that are most similar in terms of VH and the other variables of interest. Furthermore, a multiple regression model and a regression tree were created to assess the direction and strength of the relationship between the dependent and independent variables more conclusively.
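As a minimal sketch of the clustering step, the example below runs K-means on a toy data frame. The column names (`vh`, `svi`, `cvac`), the simulated values, and the choice of three clusters are all assumptions made for illustration; they are not the project's dataset or its final settings.

```r
set.seed(42)  # make the toy data and the random cluster starts reproducible

# Hypothetical county-level data: three groups of 50 counties each
toy <- data.frame(
  vh   = c(rnorm(50, 0.10, 0.02), rnorm(50, 0.20, 0.02), rnorm(50, 0.30, 0.02)),
  svi  = c(rnorm(50, 0.20, 0.05), rnorm(50, 0.50, 0.05), rnorm(50, 0.80, 0.05)),
  cvac = c(rnorm(50, 0.30, 0.05), rnorm(50, 0.50, 0.05), rnorm(50, 0.70, 0.05))
)

# Standardize so that no single variable dominates the Euclidean distance
toy_scaled <- scale(toy)

# K-means with 3 clusters; nstart reruns the algorithm from several random starts
km <- kmeans(toy_scaled, centers = 3, nstart = 25)

# Cluster sizes and per-cluster means on the original scale
table(km$cluster)
aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
```

Standardizing first matters here because K-means relies on distances, so a variable measured on a wider scale would otherwise drive the cluster assignment.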

The regression equation used in this report had the lowest Root Mean Squared Error (RMSE) and the highest Multiple R-Squared value among the regression models that were tested. All the variables used in the final regression equation were statistically significant. The variables with a strong positive relationship with VH are Concern for Vaccine Rollout, Race, and Obesity. On the other hand, Unemployment Rate has a negative relationship with Vaccine Hesitancy. Moreover, counties with a Democratic majority tend to have lower Vaccine Hesitancy. Counties in the West tend to have lower vaccine hesitancy than counties in the Midwest, Northeast, and South.
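The model comparison described above can be sketched as follows. This is an illustrative example on simulated data, not the project's final model: the variable names, the coefficients used to generate the data, and the 75/25 train/test split are assumptions.

```r
set.seed(1)

# Hypothetical data: hesitancy rises with obesity and falls with unemployment
n <- 200
toy <- data.frame(
  obesity      = runif(n, 20, 45),
  unemployment = runif(n, 2, 12),
  region       = factor(sample(c("Midwest", "Northeast", "South", "West"),
                               n, replace = TRUE))
)
toy$vh <- 0.05 + 0.004 * toy$obesity - 0.003 * toy$unemployment +
  rnorm(n, sd = 0.02)

# Hold out 25% of the rows to evaluate out-of-sample error (an assumption)
test_idx <- sample(n, size = n / 4)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Multiple regression; lm() expands the factor `region` into dummy variables
fit <- lm(vh ~ obesity + unemployment + region, data = train)

# RMSE on the held-out rows and multiple R-squared on the training rows
rmse <- sqrt(mean((test$vh - predict(fit, newdata = test))^2))
r2   <- summary(fit)$r.squared

rmse
r2
coef(fit)
```

Repeating this for each candidate formula and keeping the one with the lowest RMSE and highest R-squared mirrors the selection criterion stated in the text.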

The regression tree used in this report also had the lowest RMSE among the regression tree models that were tested. It showed that counties with a CVAC value above 60% tend to have higher levels of VH.
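A regression tree of this kind can be sketched with the `rpart` package. The data below are simulated so that hesitancy jumps once a toy `cvac` column exceeds 0.6; the column names, the threshold, and the use of `rpart` itself are assumptions for illustration, since the report does not say which package was used.

```r
library(rpart)  # recommended package distributed with R

set.seed(7)

# Hypothetical data: hesitancy jumps when cvac exceeds 0.6
n <- 300
toy <- data.frame(cvac = runif(n), svi = runif(n))
toy$vh <- ifelse(toy$cvac > 0.6, 0.25, 0.12) + rnorm(n, sd = 0.02)

# Regression tree (method = "anova" fits a tree for a numeric response)
tree <- rpart(vh ~ cvac + svi, data = toy, method = "anova")

# In-sample RMSE of the fitted tree
tree_rmse <- sqrt(mean((toy$vh - predict(tree, toy))^2))

# Split point of the root node; it should land near the 0.6 threshold
# that was built into the toy data
first_split <- tree$splits[1, "index"]

print(tree)
tree_rmse
first_split
```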

The Question of Interest:

As four international students who have noticed a dearth of data on COVID vaccine hesitancy in our respective countries, we are curious about the factors affecting this phenomenon in the United States. The questions we would like to investigate are: Is COVID vaccine hesitancy an outcome of social vulnerability? Does it vary by race? Does vaccine coverage have any impact on it? Do age, income, or geographical location play a role in COVID vaccine hesitancy?

Data Sources:

1 https://aspe.hhs.gov/sites/default/files/migrated_legacy_files//200816/aspe-ib-vaccine-hesitancy.pdf

2 https://healthdata.gov/dataset/Vaccine-Hesitancy-for-COVID-19-County-and-local-es/pu4g-5454

3 https://www.census.gov/programs-surveys/household-pulse-survey/data.html#phase3.2

Description of the Variables

While the final dataset contained 27 variables, the most significant ones are described below:

1. Vaccine Hesitancy (VH): This is an index created by the CDC using the Household Pulse Survey (HPS). The HPS is a collaborative effort of federal agencies to collect timely data about the impact of the coronavirus pandemic on American households. To assess the socioeconomic impact of the pandemic on households, responses to a set of standardized questions are aggregated to form an index. Of the three distinct categories derived from the survey question about getting the vaccine, the category selected for this variable is “Strongly Hesitant”, which covers respondents who indicated that they would “definitely not” receive a COVID-19 vaccine when available. The index ranges from 0 to 1 and estimates the share of people who are vaccine hesitant as a continuous variable that can be read as a percentage (CDC, 2021) (Household Pulse Survey, 2021).

2. Social Vulnerability Index (SVI): This index is also created by the CDC. It summarizes the extent to which a community is socially vulnerable to disasters and is an important metric for assessing the support required in the wake of a public health emergency. SVI is an aggregate of 14 social factors grouped into four themes, including Socioeconomic Status, Minority Status, and Housing Type. SVI takes values from 0 (lowest vulnerability) to 1 (highest vulnerability) (CDC, 2021).

3. COVID-19 Vaccine Coverage (CVAC): This is an index created by Surgo Ventures. It captures supply- and demand-related challenges that may prevent large-scale COVID-19 vaccine coverage in U.S. counties through five specific themes: historic under-vaccination, sociodemographic barriers, resource-constrained healthcare system, healthcare accessibility barriers, and irregular care-seeking behaviors. The CVAC measures the level of concern for a difficult rollout on a range from 0 (lowest concern) to 1 (highest concern) (Surgo Ventures, 2021).

4. Race: CDC also provides county level data for the percentage of people belonging to a particular race. The race variables include Hispanic, non-Hispanic White, non-Hispanic Asian, non-Hispanic Black, non-Hispanic American Indian/Alaska Native, and non-Hispanic Native Hawaiian/Pacific Islander (CDC, 2021).

5. Political Affiliation: This data is collected from the Harvard Dataverse. It identifies the majority party in a county, i.e. Democrat, Republican, or Other. For the analysis, the values of this variable have been converted into dummy variables (Harvard Dataverse, 2020).

6. Region: This variable identifies the region in which a particular county is situated. The regions are South, West, Midwest, and Northeast (US Census, 2021).

7. Unemployment Rate 2020 and Median Household Income 2019: These variables are collected from a dataset by the U.S. Department of Agriculture. They represent the county-level unemployment rate for 2020 and the county-level median household income for 2019 (US Department of Agriculture, 2019, 2020).

8. Diabetes and Obesity: These two variables were collected from 2018 data available at the data repository site livingatlas.com. They represent county-level estimates of the percentage of people with diabetes and obesity (Berry, 2018).
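The conversion of Political Affiliation into dummy variables (item 5) can be sketched as follows. The toy data frame and the `party` column are hypothetical; the indicator names `party_DEMOCRAT` and `party_REPUBLICAN` mirror the column names that appear in the plotting code later in this report, but how the original dataset built them is an assumption.

```r
# Hypothetical majority-party labels for six counties
toy <- data.frame(
  county = paste0("county_", 1:6),
  party  = c("DEMOCRAT", "REPUBLICAN", "REPUBLICAN",
             "OTHER", "DEMOCRAT", "REPUBLICAN")
)

# One 0/1 indicator column per major party; "OTHER" is the case where both are 0
toy$party_DEMOCRAT   <- as.integer(toy$party == "DEMOCRAT")
toy$party_REPUBLICAN <- as.integer(toy$party == "REPUBLICAN")

toy
```

Leaving one category ("OTHER" here) without its own column avoids perfect collinearity when the indicators enter a regression alongside an intercept.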

Exploratory Data Analysis

Let us explore the data to get a sense of what it shows.

# Load the required packages
library(tidyverse)

# Read the prepared county-level dataset
df3 <- read.csv("df3.csv")

# 1. Mean estimated hesitancy by region
df3 %>%
  group_by(Region) %>%
  # Add the "hesitant" and "strongly hesitant" estimates, then average per region
  summarise(Average_Estimated.hesitant = mean(Estimated.hesitant + Estimated.strongly.hesitant)) %>%
  ggplot(aes(x = Region, y = Average_Estimated.hesitant)) +
  geom_col(aes(fill = Region)) +
  theme_bw() +
  ggtitle("Mean Estimated Hesitant Rate as per Region") +
  theme(plot.title = element_text(hjust = 0.5))

All U.S. counties are categorized into four regions: Midwest, Northeast, South, and West. In examining this categorical variable, we aimed to determine which region has the highest vaccine hesitancy rate.

To do this, we grouped the counties by region and calculated the average estimated hesitancy rate for each group. The findings indicate that, on average, the South exhibits the highest vaccine hesitancy rate, followed by the Midwest, West, and Northeast.

# 2. SVI category vs. estimated hesitancy (boxplot)
ggplot(df3, aes(x = SVI.Category, y = Estimated.hesitant, fill = SVI.Category)) +
  geom_boxplot() +
  theme_bw() +
  ggtitle("SVI Category vs Estimated.hesitant") +
  theme(plot.title = element_text(hjust = 0.5))

# Region could be added to this plot, but the result is harder to interpret.

As the second step of the exploration, we investigated whether the Social Vulnerability Index (SVI) category is related to vaccine hesitancy.

The boxplot reveals a clear trend: as the level of social vulnerability increases, vaccine hesitancy also tends to rise.

# 3. Estimated hesitancy vs. obesity percentage (quantitative vs. quantitative)

ggplot(df3, aes(x = Obesity_Percent, 
                y = Estimated.hesitant, color = Obesity_Percent)) + 
  geom_point() + 
  geom_smooth(fill = NA, color = "red") + 
  expand_limits(y=0) + 
  theme_bw() +
  ggtitle("Obesity Percent vs Estimated.hesitant") +
  theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position = "none") # Remove legend

This relationship is particularly interesting because, among the variables examined, obesity has the highest correlation with the dependent variable, vaccine hesitancy, at 0.46. The scatter plot shows a clear trend: counties with higher obesity rates tend to have higher vaccine hesitancy rates.
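A correlation of this kind can be computed with `cor()`. The sketch below uses simulated vectors as stand-ins for the `Obesity_Percent` and `Estimated.hesitant` columns, so the resulting coefficient is illustrative and will not reproduce the 0.46 reported above.

```r
set.seed(3)

# Hypothetical stand-ins for the Obesity_Percent and Estimated.hesitant columns
obesity  <- runif(500, 20, 45)
hesitant <- 0.05 + 0.004 * obesity + rnorm(500, sd = 0.05)

# Pearson correlation; use = "complete.obs" drops rows with missing values
r <- cor(obesity, hesitant, use = "complete.obs")
r

# cor.test() adds a confidence interval and a p-value for the same coefficient
cor.test(obesity, hesitant)
```

On the real data, the same call would take `df3$Obesity_Percent` and `df3$Estimated.hesitant` as its two arguments.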

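# 4. Estimated hesitancy by majority party

# Reshape the two party dummy columns into long format (one row per county and party)
df9 <- df3 %>%
  rename(Republican = "party_REPUBLICAN",
         Democratic = "party_DEMOCRAT") %>%
  pivot_longer(c(Democratic, Republican), names_to = "Party", values_to = "Majority")

# Stacked columns of the 0/1 majority indicator count the counties per party;
# facet_wrap() draws one panel per distinct value of Estimated.hesitant
ggplot(df9, aes(x = Party, y = Majority, color = Party)) +
  geom_col() +
  facet_wrap(~ Estimated.hesitant) +
  theme_bw() +
  ggtitle("Hesitancy rate wrt Party in Rule") +
  theme(plot.title = element_text(hjust = 0.5))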

The x-axis shows the political party and the y-axis shows the number of counties; red denotes Democratic and blue denotes Republican majorities. The correlation between vaccine hesitancy and a Republican majority is +0.27, whereas the correlation with a Democratic majority is -0.27.

As we explored the dataset, we aimed to examine the relationship between vaccine hesitancy rates and political affiliation. The findings indicate that estimated vaccine hesitancy is positively correlated with counties where the Republican Party holds a majority, and negatively correlated with those where the Democratic Party is predominant.
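The two correlations have the same magnitude and opposite signs because the party indicators are close to complementary: when almost every county has either a Democratic or a Republican majority, one dummy is approximately one minus the other, and that transformation flips the sign of a correlation without changing its size. The sketch below illustrates this with simulated data that contains only two parties; the variable names and values are assumptions, and the small "Other" category in the real data is ignored.

```r
set.seed(5)

# Hypothetical data with only two parties, so the dummies are exact complements
n <- 400
republican <- rbinom(n, 1, 0.7)
democrat   <- 1 - republican
hesitant   <- 0.15 + 0.03 * republican + rnorm(n, sd = 0.05)

r_rep <- cor(hesitant, republican)
r_dem <- cor(hesitant, democrat)

# The two coefficients have the same magnitude and opposite signs
c(republican = r_rep, democrat = r_dem)
all.equal(r_rep, -r_dem)
```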

# 5. Estimated hesitancy based on Race

# Renaming columns for better readability
df3 %>% rename(Hispanic = "Percent.Hispanic", 
               American_Indian_Alaska_Native = "Percent.non.Hispanic.American.Indian.Alaska.Native",      
               Asian = "Percent.non.Hispanic.Asian",
               Black = "Percent.non.Hispanic.Black",
               Hawaiian_Pacific_Islander = "Percent.non.Hispanic.Native.Hawaiian.Pacific.Islander", 
               White = "Percent.non.Hispanic.White") %>%
pivot_longer(c(Hispanic, American_Indian_Alaska_Native, Asian, Black, Hawaiian_Pacific_Islander, White), 
             names_to = "Race",
             values_to = "Percent") -> df7

# Plotting the data
ggplot(df7, aes(x = Percent, y = Estimated.hesitant)) +
  geom_point(aes(color = Race)) + 
  facet_wrap(~ Race) + 
  theme_bw() + 
  ggtitle("Hesitancy rate wrt. Percent of people in the Race") +
  theme(plot.title = element_text(hjust = 0.5))