The tasks in this exercise are:
- What are the average reading scores for students with/without the test preparation course?
- What are the average scores for the different parental education levels?
- Create plots to visualize findings for questions 1 and 2.
- [Optional] Look at the effects within subgroups. Compare the average scores for students with/without the test preparation course for different parental education levels (e.g., faceted plots).
- [Optional 2] The principal wants to know if kids who perform well on one subject also score well on the others. Look at the correlations between scores.
- Summarize your findings.
Task: What are the average reading scores for students with/without the test preparation course?
#load the required packages
library(tidyverse)
#load the dataset
dt <- read_csv('./data/exams.csv')
#check the first few rows of the data
head(dt)
Now that the data is loaded, let us look at the values or categories that each categorical variable takes.
unique(dt$gender)
unique(dt$`race/ethnicity`)
unique(dt$parent_education_level)
unique(dt$test_prep_course)
#finding the average scores grouped by preparation course for all the subjects tested
summary_prep <- dt %>% group_by(test_prep_course) %>%
summarise(av_math = mean(math), av_reading = mean(reading), av_writing = mean(writing))
#view the tibble
summary_prep
We now have a tibble that contains the average scores in math, reading and writing, grouped by test preparation course. The results are stored in a tibble because they will be used later to prepare the graphs. On average, the students who completed the preparatory course scored higher in all three subjects than the students who did not complete it.
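As an aside, the same kind of grouped summary can be written more compactly with `across()`. The sketch below runs on a small made-up tibble (`toy` and its values are assumptions for illustration, not the exam data) and should be read as one possible alternative, not as the approach used in this notebook.

```r
library(dplyr)

# Toy data standing in for the exam file; all values are invented.
toy <- tibble(
  test_prep_course = c("completed", "completed", "none", "none"),
  math    = c(70, 80, 60, 64),
  reading = c(75, 85, 65, 67),
  writing = c(74, 82, 62, 66)
)

# across() applies mean() to each listed column;
# .names adds the "av_" prefix to the resulting column names.
toy_summary <- toy %>%
  group_by(test_prep_course) %>%
  summarise(across(c(math, reading, writing), mean, .names = "av_{.col}"))

toy_summary
```

This yields one row per group with columns `av_math`, `av_reading` and `av_writing`, matching the layout of the summary tibble above.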
Visualizing the data
The summary tibble is in wide form rather than 'tidy' (long) form. Thus, we need to pivot it first and then make graphs.
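To make the wide-to-long step concrete, here is a minimal sketch on a two-row tibble. The names `wide` and `long` and the numbers in them are assumptions chosen for illustration only.

```r
library(dplyr)
library(tidyr)

# A wide summary: one column per subject (made-up values).
wide <- tibble(
  test_prep_course = c("completed", "none"),
  av_math    = c(75, 62),
  av_reading = c(80, 66)
)

# pivot_longer() stacks the av_* columns into name/value pairs,
# giving one row per group and subject.
long <- wide %>%
  pivot_longer(cols = starts_with("av_"), names_to = "Subject", values_to = "scores")

long
```

In the long form each row holds a single score, which is what lets ggplot2 map the subject to the x-axis and the group to the fill colour.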
#graphing task 1
theme_set(theme_classic())
summary_prep %>% pivot_longer(cols = c(av_math, av_reading, av_writing), names_to = "Subject", values_to = "scores") %>% ggplot(aes(x = Subject, y = scores, fill = test_prep_course)) +
geom_col(position = "dodge", alpha = 0.7, width = 0.7) +
scale_x_discrete(labels=c("av_math" = "Math", "av_reading" = "Reading", "av_writing" = "Writing"), name = "Subjects tested") +
scale_y_continuous(name = "Average scores", expand = c(0,0), limits = c(0, 90)) +
scale_fill_discrete(name = "Prep Course Status", labels = c("Completed", "Incomplete or \n not taken")) +
theme(legend.position = c(0.5, 0.9), legend.box.background = element_rect(colour = "black"), legend.direction = "horizontal", axis.text = element_text(colour= "black")) +
ggtitle("Fig. 1: Effect of preparatory course on the average scores") + theme(plot.title =element_text(size = 18), axis.title = element_text(size = 16), axis.text = element_text(size = 16))
Task 1 specifically asks about the effect of the preparatory course on reading scores.
We prepared the previous graph from the summary tibble. Taking the difficulty up a notch, I am going to plot the reading scores directly from the raw data. This will require a bit of wrangling and adding some functions within the ggplot call. I will also add some annotations for clarity.
#reshape to long form and keep only the reading scores
long_dt <- dt %>%
pivot_longer(cols = math:writing, names_to = "Exam", values_to = "score") %>%
filter(Exam == "reading")
#graphing the data
ggplot(long_dt, aes(y = score, x = Exam, colour = test_prep_course)) +
geom_point(position = position_jitter(width = 0.15), alpha = 0.5) +
theme(legend.position = c(0.15, 0.6), legend.box.background = element_rect(colour = "black"), axis.text = element_text(colour= "black"), axis.text.x = element_blank()) +
ggtitle("Fig. 2: Effect of preparatory course on the average scores") +
labs(colour = "Prep Course Status", x = "Reading", y = "Average Score") +
scale_colour_discrete(labels = c("Completed", "Incomplete or \n not taken")) +
stat_summary(fun = "mean", geom = "point", size = 5, alpha = 1, shape = 21, colour = "black", aes(fill = test_prep_course)) + scale_fill_discrete(guide = "none") + #add annotations
annotate(
geom = "curve", x = 1.2, y = 80, xend = 1, yend = 74,
curvature = -0.2, arrow = arrow(length = unit(2, "mm"))
) +
annotate(
geom = "curve", x = 1.2, y = 60, xend = 1, yend = 66.5,
curvature = 0.2, arrow = arrow(length = unit(2, "mm"))
) + theme(plot.title =element_text(size = 17), axis.title = element_text(size = 16), axis.text = element_text(size = 16)) +
annotate(geom = "text", x = 1.2, y = 80, label = "mean scores of students \n who completed \n the prep course", hjust = "left") +
annotate(geom = "text", x = 1.2, y = 60, label = "mean scores of students \n who did not complete \n the prep course", hjust = "left")
The average reading score of students who completed the preparatory course is 73.9, compared with 66.5 for the students who did not complete it.
Let's run a t-test to check if there is a statistically significant difference between the two groups (completed and none).
#t-test (Welch's two-sample test, the t.test() default)
reading_ttest <- t.test(reading ~ test_prep_course, data = dt)
reading_ttest
Thus, we can say with a high degree of confidence that completing the preparatory course was associated with higher reading scores (95% CI for the difference in means = 5.55 to 9.16, p-value < 0.001).
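The object returned by `t.test()` is a list, so the quantities quoted in the text can be pulled out by name instead of being copied from the printed output. The sketch below uses simulated scores (`sim` is an assumption; its means and spread are invented and are not the exam data).

```r
# Simulated reading scores for two groups; parameters are made up.
set.seed(1)
sim <- data.frame(
  test_prep_course = rep(c("completed", "none"), each = 50),
  reading = c(rnorm(50, mean = 74, sd = 10), rnorm(50, mean = 66, sd = 10))
)

res <- t.test(reading ~ test_prep_course, data = sim)

res$estimate  # the two group means
res$conf.int  # 95% CI for the difference in means (first group minus second)
res$p.value   # p-value of the two-sided test
```

Groups are ordered alphabetically here, so the interval refers to "completed" minus "none".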
Moving on to task 2.
Task: What are the average scores for the different parental education levels?
This will involve grouping the data by parental education levels.
summary_parental_edu <- dt %>% group_by(parent_education_level) %>%
summarise(av_math = mean(math), av_reading = mean(reading), av_writing = mean(writing))
summary_parental_edu %>% arrange(desc(av_math))
The results show that students whose parents have a master's or a bachelor's degree have higher average scores in all the tested subjects. Thus, parental education matters.
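Note that `arrange()` changes the row order of a tibble but not the order in which ggplot2 draws a categorical axis. One option, sketched here on a toy tibble, is to reorder the factor levels by a score with `fct_reorder()` from forcats. The labels "level A" to "level C" and the values are hypothetical placeholders, not the categories in the dataset.

```r
library(dplyr)
library(forcats)

# Toy summary with hypothetical education labels and made-up means.
toy_edu <- tibble(
  parent_education_level = c("level A", "level B", "level C"),
  av_math = c(64, 71, 68)
)

# fct_reorder() sorts the factor levels by av_math (ascending),
# so a plot would show the categories from lowest to highest mean.
ordered_edu <- toy_edu %>%
  mutate(parent_education_level = fct_reorder(parent_education_level, av_math))

levels(ordered_edu$parent_education_level)
```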
Let's get this graphed.