Analyzing Exam Scores
📖 Background
My best friend is an administrator at a large school. The school makes every student take year-end math, reading, and writing exams.
Since you have recently learned data manipulation and visualization, we're going to help my friend analyze the score results. We have a couple of key questions to investigate:

The school's principal wants to know if test preparation courses are helpful.

She also wants to explore the effect of parental education level on test scores.
Specifically, to investigate these questions, we're going to do the following:
 Find the average reading scores for students with/without the test preparation course.
 Find the average test scores for the different parental education levels.
 Create plots to visualize findings for the above analyses.
 We'll look at some effects within subgroups. In particular, we'll compare the average scores for students with/without the test preparation course for different parental education levels.
 The principal wants to know if kids who perform well in one subject also score well in the others. We'll look at the correlations between scores.
 We'll summarize our findings at the end of the report.
Let's review the data set variables, load some typical modules, and take a glance at the dataset first:
💾 The data
The file has the following fields (source):
 "gender" - male / female
 "race/ethnicity" - one of 5 combinations of race/ethnicity
 "parent_education_level" - highest education level of either parent
 "lunch" - whether the student receives free/reduced or standard lunch
 "test_prep_course" - whether the student took the test preparation course
 "math" - exam score in math
 "reading" - exam score in reading
 "writing" - exam score in writing
# Import pandas for data manipulation, plus matplotlib and seaborn for plotting
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Set the style to "whitegrid"
sns.set_style("whitegrid")
# Change the context to "notebook"
sns.set_context("notebook")
# Reading in the data
df = pd.read_csv('data/exams.csv')
Let's start by looking at the data frame overall and checking whether there are any missing values.
print('Data frame dimensions:', df.shape)
print(df.isnull().sum())
Great, we don't have any missing values to worry about. Let's take a glance at the first several rows of the data frame:
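The preview cell itself is not shown here. As a rough illustration of what such a preview looks like, the sketch below builds a small stand-in data frame and calls head() on it. The column names are taken from the field list above, but every row is made up for illustration and is not the school's data.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook reads data/exams.csv instead.
sample = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female", "male"],
    "test_prep_course": ["none", "completed", "none", "completed", "none", "none"],
    "math": [72, 69, 90, 47, 76, 71],
    "reading": [72, 90, 95, 57, 78, 83],
    "writing": [74, 88, 93, 44, 75, 78],
})

# head() returns the first five rows by default
preview = sample.head()
print(preview)
```

On the real data, the same call would be made on df.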
1. What are the average reading scores for students with/without the test preparation course?
We'll begin our analysis by investigating whether students' average reading scores differ depending on whether they took a test preparation course. We'll create barplots and boxplots to look for differences between groups and distributions within each group. We'll also conduct a two-sample t-test.
The principal is interested in the reading scores, so let's look at the mean reading scores by test preparation completion:
# Calculate the mean reading score across test_prep_course
df.groupby('test_prep_course')[['reading']].mean()
There appears to be a difference between the two groups. Let's look at the difference in mean scores for each test type between students who completed the test preparation course and those who did not:
# Make a couple variables for easy subsetting here and in future calculations
# Group variables
prep = df["test_prep_course"] == 'completed'
noprep = df["test_prep_course"] == 'none'
# Calculate the difference in mean score across test_prep_course
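# numeric_only=True skips the text columns (gender, lunch, etc.)
df[prep].mean(numeric_only=True) - df[noprep].mean(numeric_only=True)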
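The same difference can also be computed from a grouped table of means. This is a minimal sketch of that alternative, run on a tiny stand-in data frame whose scores are hypothetical and chosen so the arithmetic is easy to check by hand; only the column names and the 'completed' / 'none' labels come from the notebook.

```python
import pandas as pd

# Hypothetical stand-in scores, not the school's data
sample = pd.DataFrame({
    "test_prep_course": ["completed", "completed", "none", "none"],
    "math": [70, 80, 60, 70],
    "reading": [80, 90, 70, 80],
    "writing": [85, 95, 70, 80],
})

score_cols = ["math", "reading", "writing"]

# One row of mean scores per test_prep_course value
group_means = sample.groupby("test_prep_course")[score_cols].mean()

# Completed minus none, per exam
mean_diff = group_means.loc["completed"] - group_means.loc["none"]
print(mean_diff)
```

Selecting the score columns explicitly avoids averaging over text columns and keeps the group means available for later plots.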
Observation: for every test type, the mean score of students who completed the test preparation course is higher than that of students who did not.
In particular, the largest difference is in writing scores, while the mean reading score is about 7.4 points higher for test-prep students than for non-test-prep students. A natural question now is: are these differences significant? We'll conduct a two-sample t-test and estimate the differences with confidence intervals soon.
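As a preview of that test, here is a minimal sketch of Welch's two-sample t statistic (the variant that does not assume equal variances) with an approximate 95% confidence interval for the difference in means. The helper name welch_summary and the scores are hypothetical, and the interval uses the normal quantile 1.96, which is an assumption that is only reasonable for large groups; the notebook's own test may be set up differently.

```python
import math
import pandas as pd

def welch_summary(a: pd.Series, b: pd.Series, z: float = 1.96) -> dict:
    """Welch's t statistic, degrees of freedom, and an approximate CI for mean(a) - mean(b)."""
    n_a, n_b = len(a), len(b)
    # Series.var() is the sample variance (ddof=1) by default
    v_a, v_b = a.var() / n_a, b.var() / n_b
    diff = a.mean() - b.mean()
    se = math.sqrt(v_a + v_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    dof = (v_a + v_b) ** 2 / (v_a ** 2 / (n_a - 1) + v_b ** 2 / (n_b - 1))
    return {"diff": diff, "t": diff / se, "dof": dof, "ci": (diff - z * se, diff + z * se)}

# Hypothetical reading scores, not the school's data
prep_scores = pd.Series([78, 85, 90, 74, 88, 81])
noprep_scores = pd.Series([70, 65, 80, 72, 68, 75])

result = welch_summary(prep_scores, noprep_scores)
print(result)
```

With the real data, the two inputs would be df[prep]['reading'] and df[noprep]['reading']. If SciPy is available, scipy.stats.ttest_ind with equal_var=False runs the same Welch test and also returns a p-value.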