COVID-19 Project

This project involves the analysis of COVID-19 data from the World Health Organisation website and was part of my Postgraduate in Bioinformatics & Biostatistics, provided through the Centre of European Master's Programmes.

The dataset, sourced from the World Health Organisation (WHO), contains COVID-19 data for European countries: death rates per 100,000 people (Deaths_100000) and vaccination doses per 100 people (Doses_100).

# Load data from txt file
data <- read.delim("covid_q_data.txt", stringsAsFactors = FALSE, header = TRUE)

# Check the data
data

Calculation of the overall sample mean and standard deviation for the Deaths_100000 and Doses_100 variables:

# Mean for Deaths_100000
death_rate_mean <- mean(data$Deaths_100000)     
cat("The mean deaths per 100,000 population is: ", death_rate_mean, "\n")

# Mean for Doses_100
vacc_rate_mean <- mean(data$Doses_100)         
cat("The mean doses per 100 population is: ", vacc_rate_mean, "\n")


# Standard deviation for Deaths_100000
death_rate_sd <- sd(data$Deaths_100000)       
cat("The standard deviation of deaths per 100,000 population is: ", death_rate_sd, "\n")

# Standard deviation for Doses_100
vacc_rate_sd <- sd(data$Doses_100)           
cat("The standard deviation of doses per 100 population: ", vacc_rate_sd, "\n")
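R's sd() returns the sample standard deviation, which divides by n - 1 rather than n. A minimal sketch on made-up numbers (the vector x is hypothetical, not the WHO data) checks this against the formula written out by hand:

```r
# Made-up values for illustration only; not the WHO data
x <- c(310, 250, 420, 180, 275)
n <- length(x)

# Sample standard deviation written out by hand (n - 1 denominator)
manual_sd <- sqrt(sum((x - mean(x))^2) / (n - 1))

# Should agree with the built-in sd() up to floating-point error
all.equal(manual_sd, sd(x))
```

If a column contains missing values, mean() and sd() return NA unless na.rm = TRUE is passed.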

Calculation of the 95% confidence interval for the sample mean of Deaths_100000 and Doses_100:

# Use the z-score for the 95% confidence level (approximately 1.96)

z <- qnorm(0.975) 

# Apply the formula to calculate the confidence interval for the mean

# Confidence interval for mean of the death rate:

n <- length(data$Deaths_100000)   

margin_of_error_dr <- z * death_rate_sd / sqrt(n)   

lower_ci <- death_rate_mean - margin_of_error_dr   
upper_ci <- death_rate_mean + margin_of_error_dr    

cat("The confidence interval for the mean death rate is: ", lower_ci, "-", upper_ci, "\n")

# Confidence interval for mean of vaccination rate:

margin_of_error_vr <- z * vacc_rate_sd / sqrt(n)   

lower_ci <- vacc_rate_mean - margin_of_error_vr   
upper_ci <- vacc_rate_mean + margin_of_error_vr  

cat("The confidence interval for the mean vaccination rate is: ", lower_ci, "-", upper_ci, "\n")
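The two interval calculations above follow the same pattern, mean ± quantile × sd / sqrt(n), so they could be wrapped in one function. The sketch below is an illustration rather than part of the original analysis: mean_ci and its use_t argument are hypothetical names, and the input vector is made up. The use_t option swaps the normal quantile for a t quantile with n - 1 degrees of freedom, which is often preferred for small samples when the population standard deviation is unknown.

```r
# Hypothetical helper (not part of the original analysis): confidence
# interval for a mean, using either the normal or the t quantile.
mean_ci <- function(x, level = 0.95, use_t = FALSE) {
  n <- length(x)
  se <- sd(x) / sqrt(n)        # standard error of the mean
  p <- 1 - (1 - level) / 2     # e.g. 0.975 for a 95% interval
  q <- if (use_t) qt(p, df = n - 1) else qnorm(p)
  c(lower = mean(x) - q * se, upper = mean(x) + q * se)
}

# Made-up values for illustration only; not the WHO data
x <- c(310, 250, 420, 180, 275, 390, 205, 330)

z_ci <- mean_ci(x)                # z-based, as in the code above
t_ci <- mean_ci(x, use_t = TRUE)  # t-based, wider for small n
z_ci
t_ci
```

With the real data, the calls would be mean_ci(data$Deaths_100000) and mean_ci(data$Doses_100), which also avoids overwriting lower_ci and upper_ci between the two calculations.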

Calculation of the 95% confidence interval for the sample standard deviation of Deaths_100000 and Doses_100:

# Determine the degrees of freedom (n - 1)

n <- length(data$Deaths_100000) 

dof <- n - 1    

# Use the chi-square values for the 95% confidence interval

alpha <- 0.05

chi_1 <- qchisq(alpha/2, dof)   

chi_2 <- qchisq(1 - alpha / 2, dof)   

# Apply the formula to calculate the confidence interval for the standard deviation

# Confidence interval for the standard deviation of the Death Rate:

numerator <- dof * death_rate_sd^2
denom_lower <- chi_2
denom_upper <- chi_1

ci_lower <- sqrt(numerator / denom_lower) 

ci_upper <- sqrt(numerator / denom_upper) 

cat("The confidence interval for the standard deviation of the sample death rate is: ", ci_lower, "-", ci_upper, "\n")
  
# CI for the SD of the Vaccination Rate:

numerator2 <- dof * vacc_rate_sd^2
denom_lower2 <- chi_2
denom_upper2 <- chi_1

ci_lower2 <- sqrt(numerator2 / denom_lower2) 

ci_upper2 <- sqrt(numerator2 / denom_upper2)  

cat("The confidence interval for the standard deviation of the sample vaccination rate is: ", ci_lower2, "-", ci_upper2, "\n")
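The chi-square interval above divides (n - 1) × sd^2 by the upper quantile to get the lower bound and by the lower quantile to get the upper bound, then takes square roots. A compact sketch of the same calculation follows; sd_ci is a hypothetical helper name, the input vector is made up, and the method assumes the data are approximately normally distributed.

```r
# Hypothetical helper (not part of the original analysis): chi-square
# confidence interval for a standard deviation, assuming normal data.
sd_ci <- function(x, level = 0.95) {
  dof <- length(x) - 1
  alpha <- 1 - level
  ss <- dof * var(x)   # (n - 1) * sample variance
  c(lower = sqrt(ss / qchisq(1 - alpha / 2, dof)),
    upper = sqrt(ss / qchisq(alpha / 2, dof)))
}

# Made-up values for illustration only; not the WHO data
x <- c(310, 250, 420, 180, 275, 390, 205, 330)
ci <- sd_ci(x)
ci
```

The interval is not symmetric around the sample standard deviation, because the chi-square distribution is skewed.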

A linear regression model was fitted to estimate the relationship between Deaths_100000 and Doses_100 for European Union countries:

# read.delim() already returns a data frame, so this conversion is only a safeguard
data <- data.frame(data)

# Apply a regression model
data_model <- lm(Deaths_100000 ~ Doses_100, data = data)     
data_model

# Gain more information about the model
summary(data_model)

# Draw a scatter plot of the two variables with the fitted regression line and its 95% confidence band
library(ggplot2)
ggplot(data, aes(x = Doses_100, y = Deaths_100000)) + 
  geom_point() + 
  geom_smooth(method = "lm", se = TRUE) + 
  labs(x = "Vaccine doses per 100 population", y = "Deaths per 100,000 population", title = "Linear Regression: Deaths ~ Vaccine Doses")

The intercept estimate is 603.49 (the expected deaths per 100,000 population when vaccine doses per 100 population are 0). The associated p-value is very small, indicating that the intercept is significantly different from zero.

The slope estimate is -1.74, meaning that each additional vaccine dose per 100 population is associated with a decrease of 1.74 deaths per 100,000 population. The associated p-value is very small, so the slope is statistically significant, i.e. significantly different from zero.

The estimate of the standard deviation of the model error (the residual standard error) is 84.08, so the estimate of the error variance is 84.08^2 = 7069.45.

The R-squared value gives us an indication of the goodness of fit of the model. This R-squared value of 0.52 suggests that the model explains 52% of the variability in the dependent variable.
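The quantities discussed above (intercept, slope, p-values, residual standard error and R-squared) can be read programmatically from the summary object instead of from the printed output. The sketch below uses a small made-up data frame, toy, so the numbers will not match the values reported here.

```r
# Made-up data for illustration only; not the WHO dataset
toy <- data.frame(
  Doses_100     = c(150, 170, 190, 210, 230, 250),
  Deaths_100000 = c(400, 330, 310, 250, 240, 170)
)

toy_model <- lm(Deaths_100000 ~ Doses_100, data = toy)
s <- summary(toy_model)

coef(s)             # estimates, standard errors, t values and p-values
s$sigma             # residual standard error
s$sigma^2           # estimated error variance
s$r.squared         # proportion of variance explained
confint(toy_model)  # 95% confidence intervals for intercept and slope
```

Applied to the real model, the same calls would be made on data_model and summary(data_model).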

We can apply a Shapiro-Wilk test to the residuals of the regression to check whether they are normally distributed:

shapiro.test(data_model$residuals)

The p-value is 0.45, so the null hypothesis that the residuals are normally distributed is not rejected at the 5% significance level.
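The decision rule behind this conclusion can be written out explicitly. The sketch below runs the test on simulated values rather than on the residuals of data_model, so its p-value is unrelated to the 0.45 reported here; sim_residuals and decision are illustrative names.

```r
# Simulated residuals for illustration only; not the residuals of data_model
set.seed(1)
sim_residuals <- rnorm(30)

sw <- shapiro.test(sim_residuals)
alpha <- 0.05

# A p-value above alpha means normality is not rejected; it does not prove it
decision <- if (sw$p.value > alpha) {
  "normality not rejected"
} else {
  "normality rejected"
}
cat("W =", sw$statistic, ", p-value =", sw$p.value, ":", decision, "\n")
```

A normal Q-Q plot of the residuals, for example with qqnorm() followed by qqline(), is a common visual complement to the test.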

Next, the Deaths_100000 variable was split into two groups according to vaccination rate (low and high):

Low vaccination rate (< 202 doses per 100): Bulgaria, Croatia, Czechia, Estonia, Greece, Hungary, Latvia, Lithuania, Poland, Romania, Slovakia, Slovenia

High vaccination rate (>= 202 doses per 100): Austria, Belgium, Cyprus, Denmark, Finland, France, Germany, Ireland, Italy, Luxembourg, Netherlands, Portugal, Spain, Sweden
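A split like this can be done in R by comparing Doses_100 against the 202 threshold. The sketch below is one possible way, not necessarily how the grouping was produced originally; the data frame toy, its placeholder country labels and the Vacc_group column are made up for illustration.

```r
# Made-up data for illustration only; not the WHO dataset
toy <- data.frame(
  Country       = c("A", "B", "C", "D"),
  Doses_100     = c(150, 201.9, 202, 240),
  Deaths_100000 = c(400, 320, 260, 180)
)

threshold <- 202  # cut-off quoted in the text

# Below the threshold is "Low"; at or above the threshold is "High"
toy$Vacc_group <- ifelse(toy$Doses_100 < threshold, "Low", "High")

# Death rates separated by vaccination group
groups <- split(toy$Deaths_100000, toy$Vacc_group)
groups
```

With the real data, the same two steps would be applied to data$Doses_100 and data$Deaths_100000.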