COVID-19 Project
This project analyses COVID-19 data from the World Health Organisation (WHO) website and was part of my Postgraduate in Bioinformatics & Biostatistics, provided through the Centre of European Master's Programmes.
The dataset, sourced from the WHO, covers European countries and reports death rates per 100,000 people (Deaths_100000) and vaccination doses per 100 people (Doses_100).
# Load data from txt file
data <- read.delim("covid_q_data.txt", stringsAsFactors = FALSE, header = TRUE)
# Check the data
data
Calculation of the overall sample mean and standard deviation for both the Deaths_100000 and Doses_100 variables:
# Mean for Deaths_100000
death_rate_mean <- mean(data$Deaths_100000)
cat("The mean deaths per 100,000 population is: ", death_rate_mean, "\n")
# Mean for Doses_100
vacc_rate_mean <- mean(data$Doses_100)
cat("The mean doses per 100 population is: ", vacc_rate_mean, "\n")
# Standard deviation for Deaths_100000
death_rate_sd <- sd(data$Deaths_100000)
cat("The standard deviation of deaths per 100,000 population is: ", death_rate_sd, "\n")
# Standard deviation for Doses_100
vacc_rate_sd <- sd(data$Doses_100)
cat("The standard deviation of doses per 100 population: ", vacc_rate_sd, "\n")
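The four calculations above can also be written in one pass. The following is a minimal sketch, assuming the same column names as the dataset; the `toy` data frame holds made-up values for illustration only and is not the WHO data.

```r
# Hypothetical toy values, NOT the WHO data; only the column names match.
toy <- data.frame(
  Deaths_100000 = c(250, 310, 180, 420, 290),
  Doses_100     = c(210, 190, 240, 150, 205)
)

# Compute the mean and standard deviation of each column in one call.
# The result is a matrix with rows "mean" and "sd" and one column per variable.
summary_stats <- sapply(
  toy[, c("Deaths_100000", "Doses_100")],
  function(x) c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))
)
print(summary_stats)
```

Setting `na.rm = TRUE` is a defensive choice in case some countries have missing values; it has no effect when the columns are complete.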
Calculation of the 95% confidence interval for the mean of Deaths_100000 and Doses_100:
# Use the z-score for the 95% confidence level (approximately 1.96)
z <- qnorm(0.975)
# Apply the formula to calculate the confidence interval for the mean
# Confidence interval for mean of the death rate:
n <- length(data$Deaths_100000)
margin_of_error_dr <- z * death_rate_sd / sqrt(n)
lower_ci <- death_rate_mean - margin_of_error_dr
upper_ci <- death_rate_mean + margin_of_error_dr
cat("The confidence interval for the mean death rate is: ", lower_ci, "-", upper_ci, "\n")
# Confidence interval for mean of vaccination rate:
margin_of_error_vr <- z * vacc_rate_sd / sqrt(n)
lower_ci <- vacc_rate_mean - margin_of_error_vr
upper_ci <- vacc_rate_mean + margin_of_error_vr
cat("The confidence interval for the mean vaccination rate is: ", lower_ci, "-", upper_ci, "\n")
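The intervals above use the normal quantile. With a small number of countries, a t-based interval is a common alternative and is slightly wider. The sketch below compares a manual t interval with the one returned by `t.test()`; the vector `x` contains made-up values for illustration, not the WHO data.

```r
# Hypothetical toy sample, NOT the WHO data.
x <- c(250, 310, 180, 420, 290, 330, 270)
n_x <- length(x)

# Critical value from the t distribution with n - 1 degrees of freedom.
t_crit <- qt(0.975, df = n_x - 1)

# Manual t-based 95% confidence interval for the mean.
moe <- t_crit * sd(x) / sqrt(n_x)
ci_manual <- c(mean(x) - moe, mean(x) + moe)

# The same interval as computed by the built-in one-sample t-test.
ci_builtin <- t.test(x, conf.level = 0.95)$conf.int

cat("Manual t interval:  ", ci_manual, "\n")
cat("t.test() interval:  ", as.numeric(ci_builtin), "\n")
```

The two results should agree, which gives a quick way to check the manual formula.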
Calculation of the 95% confidence interval for the standard deviation of Deaths_100000 and Doses_100:
# Determine the degrees of freedom (n - 1)
n <- length(data$Deaths_100000)
dof <- n - 1
# Use the chi-square values for the 95% confidence interval
alpha <- 0.05
chi_1 <- qchisq(alpha/2, dof)
chi_2 <- qchisq(1 - alpha / 2, dof)
# Apply the formula to calculate the confidence interval for the standard deviation
# Confidence interval for the standard deviation of the Death Rate:
numerator <- dof * death_rate_sd^2
denom_lower <- chi_2
denom_upper <- chi_1
ci_lower <- sqrt(numerator / denom_lower)
ci_upper <- sqrt(numerator / denom_upper)
cat("The confidence interval for the standard deviation of the sample death rate is: ", ci_lower, "-", ci_upper, "\n")
# CI for the SD of the Vaccination Rate:
numerator2 <- dof * vacc_rate_sd^2
denom_lower2 <- chi_2
denom_upper2 <- chi_1
ci_lower2 <- sqrt(numerator2 / denom_lower2)
ci_upper2 <- sqrt(numerator2 / denom_upper2)
cat("The confidence interval for the standard deviation of the sample vaccination rate is: ", ci_lower2, "-", ci_upper2, "\n")
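Since the same chi-square formula is applied to both variables, it can be wrapped in a small helper. This is a sketch only: `sd_ci` is a hypothetical function name, not part of base R, and `toy_deaths` holds made-up values. The interval assumes the data are approximately normally distributed.

```r
# Hypothetical helper: chi-square confidence interval for a standard deviation.
sd_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]            # drop missing values
  dof <- length(x) - 1         # degrees of freedom
  alpha <- 1 - conf_level
  s2 <- var(x)                 # sample variance
  # The upper chi-square quantile gives the lower bound, and vice versa.
  lower <- sqrt(dof * s2 / qchisq(1 - alpha / 2, dof))
  upper <- sqrt(dof * s2 / qchisq(alpha / 2, dof))
  c(lower = lower, upper = upper)
}

# Hypothetical toy sample, NOT the WHO data.
toy_deaths <- c(250, 310, 180, 420, 290, 330, 270)
ci <- sd_ci(toy_deaths)
cat("95% CI for the SD:", ci["lower"], "-", ci["upper"], "\n")
```

With the real data the call would be, for example, `sd_ci(data$Deaths_100000)`.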
A linear regression model was fitted to assess the relationship between Deaths_100000 and Doses_100 across European Union countries:
# Convert to data frame
data <- data.frame(data)
# Apply a regression model
data_model <- lm(Deaths_100000 ~ Doses_100, data = data)
data_model
# Gain more information about the model
summary(data_model)
# Draw a scatter plot of the two variables with the fitted regression line and its confidence band
library(ggplot2)
ggplot(data, aes(x = Doses_100, y = Deaths_100000)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Vaccine doses per 100 population", y = "Deaths per 100,000 population", title = "Linear Regression: Deaths ~ Vaccine Doses")
The estimated intercept is 603.49, which is the expected number of deaths per 100,000 population when vaccine doses per 100 population equal 0. The associated p-value is very small, so the intercept is statistically different from zero.
The slope estimate is -1.74, meaning that each additional vaccine dose per 100 population is associated with 1.74 fewer deaths per 100,000 population on average. The associated p-value is very small, so the slope is statistically significant and likely non-zero. This is an association in observational data and does not by itself establish a causal effect.
The estimate of the standard deviation of the model error is 84.08 so the estimate of the variance of the error is 84.08^2 = 7069.45.
The R-squared value gives us an indication of the goodness of fit of the model. This R-squared value of 0.52 suggests that the model explains 52% of the variability in the dependent variable.
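The quantities discussed above (intercept, slope, residual standard deviation and R-squared) can be read from the fitted model programmatically rather than copied from the printed summary. The sketch below uses a made-up `toy` data frame with the same column names; its numbers are illustrative and unrelated to the results reported here.

```r
# Hypothetical toy values, NOT the WHO data; only the column names match.
toy <- data.frame(
  Doses_100     = c(150, 170, 190, 205, 220, 240, 260),
  Deaths_100000 = c(420, 390, 330, 300, 260, 240, 180)
)

toy_model   <- lm(Deaths_100000 ~ Doses_100, data = toy)
toy_summary <- summary(toy_model)

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- coef(toy_summary)
intercept  <- coef_table["(Intercept)", "Estimate"]
slope      <- coef_table["Doses_100", "Estimate"]

# Residual standard error and the corresponding error variance estimate.
resid_sd  <- sigma(toy_model)
resid_var <- resid_sd^2

# Proportion of variance in the response explained by the model.
r_squared <- toy_summary$r.squared

# 95% confidence intervals for the intercept and slope.
print(confint(toy_model, level = 0.95))
cat("Intercept:", intercept, " Slope:", slope, "\n")
cat("Residual SD:", resid_sd, " Variance:", resid_var, " R-squared:", r_squared, "\n")
```

With the real data, the same calls would be applied to `data_model`.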
We can apply a Shapiro-Wilk test to the residuals of the regression to check whether they are consistent with a normal distribution:
shapiro.test(data_model$residuals)
The p-value is 0.45, which is above 0.05, so the null hypothesis that the residuals are normally distributed is not rejected.
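The decision rule for the test can be made explicit in code. This is a sketch on simulated values: `toy_resid` is a hypothetical vector drawn from a normal distribution, standing in for `data_model$residuals`, and the 0.05 threshold is the significance level assumed here.

```r
# Simulated stand-in for the model residuals (hypothetical, NOT from the WHO data).
set.seed(1)
toy_resid <- rnorm(30)

# Shapiro-Wilk test: the null hypothesis is that the values come from a normal distribution.
sw <- shapiro.test(toy_resid)

# Compare the p-value with the assumed significance level.
alpha <- 0.05
decision <- if (sw$p.value < alpha) {
  "reject normality"
} else {
  "do not reject normality"
}
cat("W =", sw$statistic, " p-value =", sw$p.value, " ->", decision, "\n")
```

A non-significant result means the test found no evidence against normality; it does not prove that the residuals are normal.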
Next, the Deaths_100000 variable was split into two groups, countries with low and countries with high vaccination rates:
Low vaccination rate (< 202 doses per 100): Bulgaria, Croatia, Czechia, Estonia, Greece, Hungary, Latvia, Lithuania, Poland, Romania, Slovakia, Slovenia
High vaccination rate (>= 202 doses per 100): Austria, Belgium, Cyprus, Denmark, Finland, France, Germany, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, Sweden
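The split at 202 doses per 100 can be expressed directly in R. The following is a minimal sketch: the `Country` column name and the `Vacc_group` variable are assumptions for illustration, and the `toy` data frame contains made-up values rather than the WHO data.

```r
# Hypothetical toy values, NOT the WHO data. "Country" is an assumed column name.
toy <- data.frame(
  Country       = c("A", "B", "C", "D"),
  Doses_100     = c(180, 202, 250, 201.9),
  Deaths_100000 = c(400, 300, 200, 350)
)

# Threshold taken from the group definitions above.
threshold <- 202

# Label each country as "Low" (< 202) or "High" (>= 202).
toy$Vacc_group <- factor(
  ifelse(toy$Doses_100 < threshold, "Low", "High"),
  levels = c("Low", "High")
)

# Split the death rates into one vector per group.
deaths_by_group <- split(toy$Deaths_100000, toy$Vacc_group)
print(deaths_by_group)
```

A value of exactly 202 falls in the high group, matching the `>=` in the definition.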