Analyzing exam scores
Now let's move on to the competition and challenge.
📖 Background
Your best friend is an administrator at a large school. The school makes every student take year-end math, reading, and writing exams.
Since you have recently learned data manipulation and visualization, you suggest helping your friend analyze the score results. The school's principal wants to know if test preparation courses are helpful. She also wants to explore the effect of parental education level on test scores.
💪 Challenge
Create a report to answer the principal's questions. Include:
- What are the average reading scores for students with/without the test preparation course?
- What are the average scores for the different parental education levels?
- Create plots to visualize findings for questions 1 and 2.
- [Optional] Look at the effects within subgroups. Compare the average scores for students with/without the test preparation course for different parental education levels (e.g., faceted plots).
- [Optional 2] The principal wants to know if kids who perform well on one subject also score well on the others. Look at the correlations between scores.
- Summarize your findings.
💾 The data
The file has the following fields (source):
- "gender" - male / female
- "race/ethnicity" - one of 5 combinations of race/ethnicity
- "parent_education_level" - highest education level of either parent
- "lunch" - whether the student receives free/reduced or standard lunch
- "test_prep_course" - whether the student took the test preparation course
- "math" - exam score in math
- "reading" - exam score in reading
- "writing" - exam score in writing
Importing
We will use pandas and numpy for our analysis.
# Importing the pandas and numpy modules
import pandas as pd
import numpy as np
# Reading in the data
df = pd.read_csv('data/exams.csv')
Preprocessing
We first check the data for cleanliness. The test scores are continuous features, so we inspect their summary statistics for anything that stands out. The remaining features are categorical, so we check the unique values of each column.
# Unique values of each categorical column
print(
    " Gender:", df["gender"].unique(),
    "\n Race/Ethnicity:", df["race/ethnicity"].unique(),
    "\n Parent Education Level:", df["parent_education_level"].unique(),
    "\n Lunch:", df["lunch"].unique(),
    "\n Test Prep Course:", df["test_prep_course"].unique()
)
# Summary statistics of the continuous (score) columns
print(df.describe())
The categorical features contain only relevant values. The continuous features are all above 0 and have a maximum of 100, which is what we expect for test scores.
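The same checks can also be written as small functions, so they fail loudly if the data ever changes. The following is only a minimal sketch: the rows in `sample` are hypothetical stand-ins rather than values from `data/exams.csv`, and the helper names `check_scores` and `unexpected_values` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, `df` comes from data/exams.csv.
sample = pd.DataFrame({
    "test_prep_course": ["none", "completed", "none"],
    "math": [72, 69, 90],
    "reading": [72, 90, 95],
    "writing": [74, 88, 93],
})

SCORE_COLUMNS = ["math", "reading", "writing"]


def check_scores(frame, low=0, high=100):
    """Return True if every score lies within [low, high] (missing values fail)."""
    scores = frame[SCORE_COLUMNS]
    # Comparisons against NaN evaluate to False, so missing scores are caught too.
    in_range = scores.ge(low) & scores.le(high)
    return bool(in_range.all().all())


def unexpected_values(frame, column, allowed):
    """Return the set of values in `column` that are not in `allowed`."""
    return set(frame[column].unique()) - set(allowed)


scores_ok = check_scores(sample)
bad_prep = unexpected_values(sample, "test_prep_course", {"none", "completed"})
print(scores_ok, bad_prep)
```

With the real data, `check_scores(df)` and `unexpected_values(df, "test_prep_course", {"none", "completed"})` could be run the same way; the allowed sets for the other categorical columns would need to be filled in from the unique values printed above.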
1. Reading Score Averages Separated By Test Prep Course Status
# 1: mean reading score for each test prep course status
reading_avg_prep = df.groupby("test_prep_course")["reading"].mean()
print("Reading average of students that took the test prep course:", reading_avg_prep["completed"])
print("Reading average of students that DID NOT take the test prep course:", reading_avg_prep["none"])
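2. All Test Scores Separated By Parent's Education Level
# 2: mean score per subject for each parental education level
# Order the levels from lowest to highest education instead of alphabetically
order = ["some high school", "high school", "some college", "associate's degree", "bachelor's degree", "master's degree"]
# Select the score columns explicitly so the categorical columns are not averaged
scores_split_parent = df.groupby("parent_education_level")[["math", "reading", "writing"]].mean().loc[order]
print("Test Score Averages:")
print(scores_split_parent)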
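An alternative to reindexing with `.loc[order]` is to store the education level as an ordered categorical, so that group-bys (and later plots) follow the same order automatically. This is a minimal sketch of that idea, not the approach used in this notebook; the rows in `sample` are hypothetical stand-ins rather than values from `data/exams.csv`.

```python
import pandas as pd

order = ["some high school", "high school", "some college",
         "associate's degree", "bachelor's degree", "master's degree"]

# Hypothetical stand-in rows; in the notebook, `df` comes from data/exams.csv.
sample = pd.DataFrame({
    "parent_education_level": ["master's degree", "high school",
                               "high school", "some college"],
    "math": [80, 60, 70, 65],
    "reading": [85, 62, 68, 70],
    "writing": [90, 58, 72, 66],
})

# An ordered categorical makes groupby sort by education level, not alphabetically.
sample["parent_education_level"] = pd.Categorical(
    sample["parent_education_level"], categories=order, ordered=True
)

# observed=True keeps only the levels that actually occur in the data.
means = (
    sample.groupby("parent_education_level", observed=True)[["math", "reading", "writing"]]
    .mean()
)
print(means)
```

The trade-off is that any value missing from `order` becomes `NaN` when the column is converted, so the list should be checked against the unique values first.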
3. Visualizations
#3
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D
# sns.set() resets the style, so the style, figure size, and font scale are set in one call
sns.set(style="darkgrid", rc={'figure.figsize': (13, 10)}, font_scale=1.75)