Investigating the effectiveness of prep courses on exam scores
Analyzing exam scores
📖 Background
Your best friend is an administrator at a large school. The school makes every student take year-end math, reading, and writing exams.
Since you have recently learned data manipulation and visualization, you suggest helping your friend analyze the score results. The school's principal wants to know if test preparation courses are helpful. She also wants to explore the effect of parental education level on test scores.
💾 The data
The file has the following fields (source):
- "gender" - male / female
- "race/ethnicity" - one of 5 combinations of race/ethnicity
- "parent_education_level" - highest education level of either parent
- "lunch" - whether the student receives free/reduced or standard lunch
- "test_prep_course" - whether the student took the test preparation course
- "math" - exam score in math
- "reading" - exam score in reading
- "writing" - exam score in writing
💪 Challenge
Create a report to answer the principal's questions. Include:
- What are the average reading scores for students with/without the test preparation course?
- What are the average scores for the different parental education levels?
- Create plots to visualize findings for questions 1 and 2.
- [Optional] Look at the effects within subgroups. Compare the average scores for students with/without the test preparation course for different parental education levels (e.g., faceted plots).
- [Optional 2] The principal wants to know if kids who perform well on one subject also score well on the others. Look at the correlations between scores.
- Summarize your findings.
# import modules
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pingouin as pg
%matplotlib inline
# load data
df = pd.read_csv("data/exams.csv")
# set parent education level as an ordered categorical variable
df["parent_education_level"] = pd.Categorical(
    df["parent_education_level"],
    categories=["some high school",
                "high school",
                "some college",
                "associate's degree",
                "bachelor's degree",
                "master's degree"],
    ordered=True)
# print info and some rows to get a sense of the data
df.info()  # info() prints its summary itself and returns None, so it needs no display()
display(df.head())
# Q1 - What are the average reading scores for students with/without the test preparation course?
q1 = df.groupby("test_prep_course").agg(
    sample_size=("gender","count"),
    mean_reading_score=("reading","mean"),
    std_reading_score=("reading","std"))
display(q1)
# Related Q3 - Create plot showing the results
sns.catplot(kind="point",data=df,x="test_prep_course",y="reading")
plt.show()
# Result (Q6): completed 73.89, none 66.53
# Q1 deep dive: is the mean reading score of students in the "completed" group significantly higher?
# reading scores are bounded between 0 and 100, so they cannot be exactly normally distributed;
# we therefore prefer a non-parametric test (Mann-Whitney U, one-sided)
display(pg.mwu(x=df[df["test_prep_course"] == "completed"]["reading"],
               y=df[df["test_prep_course"] == "none"]["reading"],
               alternative="greater"))
# Result (Q6): the p-value is effectively zero, so students who completed the prep course score significantly higher in reading.
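The test result says the difference is unlikely to be due to chance, but not how large it is. The sketch below shows one way to pair a one-sided Mann-Whitney U test with a simple effect size: the probability that a randomly chosen "completed" student outscores a randomly chosen "none" student. It uses SciPy rather than pingouin and runs on synthetic scores; the group sizes, means, and spreads are made-up assumptions for illustration, not values taken from the exam file.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Synthetic reading scores (assumed parameters, NOT the school's data),
# clipped to the 0-100 range that exam scores live in.
completed = np.clip(rng.normal(74, 14, size=350), 0, 100).round()
no_prep = np.clip(rng.normal(66, 14, size=650), 0, 100).round()

# One-sided test: do "completed" scores tend to be greater than "none" scores?
u_stat, p_value = mannwhitneyu(completed, no_prep, alternative="greater")

# Effect size: share of (completed, none) pairs where the "completed" score
# is higher, counting ties as half. 0.5 means no difference between groups.
greater = (completed[:, None] > no_prep[None, :]).mean()
ties = (completed[:, None] == no_prep[None, :]).mean()
cles = greater + 0.5 * ties

print(f"U = {u_stat:.0f}, p-value = {p_value:.3g}, P(completed > none) = {cles:.2f}")
```

On the real data, the two arrays would be replaced by the "reading" column filtered on each value of "test_prep_course".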
# Q2 - What are the average scores for the different parental education levels?
df = df.sort_values("parent_education_level")
q2 = df.groupby("parent_education_level").agg(
    mean_math_score=("math","mean"),
    mean_reading_score=("reading","mean"),
    mean_writing_score=("writing","mean"))
display(q2)
# Related Q3 - Create plot showing the results
tdf = df.melt(id_vars=["parent_education_level"],
              value_vars=["math","reading","writing"],
              var_name="subject",
              value_name="score")
g = sns.catplot(kind="point",data=tdf,x="parent_education_level",y="score",col="subject")
g.set_xticklabels(rotation=90)
plt.show()
# Result (Q6): the higher the parental education level, the higher the average scores appear to be
# Q2 deep dive: are the differences in means significant?
# as before, we use non-parametric tests
# note: recent pingouin releases rename pairwise_ttests to pairwise_tests
for score in ["math","reading","writing"]:
    print(f"\n---Tests for {score} scores---")
    display(pg.pairwise_ttests(data=df,
                               dv=score,
                               between="parent_education_level",
                               parametric=False,
                               padjust="bonf"))
"""
Results (Q6):
- generally speaking, there seems to be a significant difference in average scores between students who have parents with at least some college education and the students whose parents never went to college.
- moreover, when it comes to reading and writing scores, it seems that students whose parents got a bachelor's or a master's degree have, in most cases, higher average scores than students with parents who just had some college or got an associate's degree.
(95% confidence level)
"""
print("")
# Q4 - Look at the effects within subgroups. Compare the average scores for students with/without the test preparation course for different parental education levels (e.g., faceted plots).
tdf = df.melt(id_vars=["test_prep_course","parent_education_level"],
              value_vars=["math","reading","writing"],
              var_name="subject",
              value_name="score")
g = sns.catplot(kind="point",data=tdf,x="parent_education_level",y="score",col="subject",row="test_prep_course")
g.set_xticklabels(rotation=90)
plt.show()
# Result (Q6): prep courses boost students' performance regardless of the parental education level
# Q5 - The principal wants to know if kids who perform well on one subject also score well on the others. Look at the correlations between scores.
# make scatterplots to show the relation between each pair of scores
sns.pairplot(df[["math","reading","writing"]])
plt.show()
# compute correlations
display(pg.pairwise_corr(data=df,
                         columns=["math","reading","writing"],
                         padjust="bonf"))
# Result (Q6): each pair of scores is strongly correlated (Pearson coefficient never below 0.8). The strongest correlation is between writing and reading scores (Pearson coefficient of 0.95).
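Since the earlier questions relied on rank-based tests, it can be useful to check whether a rank-based correlation tells the same story as Pearson's. The sketch below computes both matrices with pandas on synthetic scores; the shared "ability" factor, the noise levels, and the sample size are assumptions chosen only to produce correlated columns, not values from the exam file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500

# Synthetic scores (NOT the school's data): one shared latent factor
# plus subject-specific noise gives positively correlated subjects.
ability = rng.normal(0, 1, size=n)
scores = pd.DataFrame({
    "math": 66 + 12 * ability + rng.normal(0, 8, size=n),
    "reading": 69 + 13 * ability + rng.normal(0, 5, size=n),
    "writing": 68 + 13 * ability + rng.normal(0, 5, size=n),
}).clip(0, 100)

# Pearson measures linear association; Spearman works on ranks and is
# less sensitive to outliers and to scores piling up at the 0/100 bounds.
pearson = scores.corr(method="pearson")
spearman = scores.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

On the real data, `scores` would simply be `df[["math","reading","writing"]]`. If the two matrices agree closely, the Pearson coefficients reported above are unlikely to be driven by a few extreme students.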