## Does extra tutoring erase student background disadvantage?

#### Executive summary

To help schools run an efficient extra tutoring policy, the data analysis below leads to the following conclusions:

- **Preparation class does improve the test scores** of students.
- Extra tutoring is efficient **not only on average but also in terms of flattening discrepancies and reducing gaps** between students.
- **It is most effective in improving writing skills**, and least effective in mathematics.
- **Students who benefit from lunch assistance should clearly benefit from preparation class** as well, in order to improve the homogeneity of the scores within a class.
- Another efficient option would be to specifically **target ethnic group A**, whose scores tend to be the lowest while the benefits from preparation class are by far the highest.
- Unexpectedly, **parental level of education and gender might not be the best targeting criteria**, as targeting on them would improve some students' performance to the detriment of others, generating an imbalance within the class.

#### Introduction

A student's background can have a decisive influence on their education. Factors such as household income, the parents' educational background or ethnicity still shape children's studies and future careers. This leads educational institutions to look for ways to improve student exam scores, so that the grades within a class depend less on where the students come from.

Extra tutoring has often been seen as an answer to this problem, based on the idea that some students might benefit from additional support to strengthen their acquisition of knowledge. Many reasons can explain why they need more time: they may experience difficulties in specific disciplines, lack support at home to do their homework, start a new year with gaps that need to be filled... or maybe they just want better grades to get a scholarship!

**Some schools provide extra tutoring in the form of preparation classes. But does it work, and does it benefit every student equally?**

In this demonstration, we will use Exploratory Data Analysis (EDA) and hacker statistics techniques to find out. We'll often use random sampling with replacement (also called *bootstrapping*) to draw statistically stronger conclusions about the efficiency of preparation class. To that end, the test statistic (or metric) that we'll use most regularly will be one of our own, namely the *Difference of Mean Average Score (DoMAS)*, which measures the points gained thanks to preparation class within a group of students. Better to give it an acronym right away, as we'll use it a lot throughout the demonstration to compare performance across subgroups!
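
To make the idea concrete, here is a minimal sketch of a bootstrapped DoMAS. It assumes that DoMAS is the mean average score of the prep group minus that of the no-prep group. The scores are synthetic, and the names `domas` and `bootstrap_domas` are hypothetical helpers used only for illustration, not the functions defined later in this essay.

```python
import numpy as np

rng = np.random.default_rng(42)

# synthetic average scores, made up for illustration only (not the real dataset)
prep_scores = rng.normal(loc=72, scale=12, size=300)
noprep_scores = rng.normal(loc=65, scale=14, size=600)

def domas(prep, noprep):
    """Difference of Mean Average Score: mean(prep) - mean(no prep)."""
    return np.mean(prep) - np.mean(noprep)

def bootstrap_domas(prep, noprep, n_reps=2000):
    """Resample each group with replacement and recompute the DoMAS each time."""
    reps = np.empty(n_reps)
    for i in range(n_reps):
        bs_prep = rng.choice(prep, size=len(prep), replace=True)
        bs_noprep = rng.choice(noprep, size=len(noprep), replace=True)
        reps[i] = domas(bs_prep, bs_noprep)
    return reps

observed = domas(prep_scores, noprep_scores)
reps = bootstrap_domas(prep_scores, noprep_scores)

# 95% percentile confidence interval of the bootstrapped DoMAS
ci_low, ci_high = np.percentile(reps, [2.5, 97.5])
print(f"observed DoMAS: {observed:.2f}, 95% CI: [{ci_low:.2f}, {ci_high:.2f}]")
```

If the whole confidence interval sits above zero, the gain is unlikely to be a sampling accident, at least under the assumptions of this sketch.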

We will successively answer the following questions:

- I - Do students who took the preparation class score better on average?
- II - Does preparation class favor everyone equally?
- III - Does preparation class help more in some disciplines than in others?
- IV - Which characteristics of the students' background have the most influence on the benefits of preparation class?

As an introduction, let's set up all the objects (packages, global variables and functions) that we will need throughout the essay, and explore our dataset a bit.

```
# import necessary modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import itertools
from IPython.display import display
```

```
# read file
df = pd.read_csv('data/exams.csv')
# create new column with the overall average of each student
df['average'] = df[['math', 'reading','writing']].mean(axis=1)
# split the data in two subgroups, to compare data with/without preparation course
prep_df = df[df['test_prep_course']=='completed']
noprep_df = df[df['test_prep_course']=='none']
```

```
# check the size and datatypes of the whole group's DataFrame
print(df.shape)
print('\n')
print(df.info())
```

```
# check the first rows of the DataFrame
display(df.head())
```

```
# for all categorical data columns, print categories and the number of times they appear in the dataset
for col in df.iloc[:,:5]:
    print(df[col].value_counts())
    print('\n')
```

Now, let's define the global variables that we will use regularly throughout the analysis. By convention, these global variables are in uppercase and are all defined at the beginning of the script, before getting into more detail. This avoids scattering variables that we may use more than once across the code.

```
# list all categories of parental education and rearrange the order from lowest to highest level of education
PARENT_EDUC_CAT = list(df['parent_education_level'].unique())
print(PARENT_EDUC_CAT)
PARENT_EDUC_CAT = ['some high school', 'high school', 'some college', "associate's degree", "bachelor's degree", "master's degree"]
print(PARENT_EDUC_CAT)
# list all ethnic groups and order them alphabetically
ETH_GROUPS = list(df['race/ethnicity'].unique())
ETH_GROUPS.sort()
print(ETH_GROUPS)
```
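
As a side note, an ordered list like `PARENT_EDUC_CAT` can be passed to pandas so that sorting follows the education ranking rather than alphabetical order. The sketch below uses a small made-up Series and the hypothetical names `levels` and `toy`. It is one possible way to use the list, not necessarily how the rest of this analysis does it.

```python
import pandas as pd

# same ordering as PARENT_EDUC_CAT, repeated here so the snippet runs on its own
levels = ['some high school', 'high school', 'some college',
          "associate's degree", "bachelor's degree", "master's degree"]

# toy values, made up for illustration only
toy = pd.Series(["master's degree", 'high school', 'some college', 'some high school'])

# an ordered categorical sorts by rank in `levels`, not alphabetically
ordered = pd.Series(pd.Categorical(toy, categories=levels, ordered=True))
sorted_levels = ordered.sort_values().tolist()
print(sorted_levels)
```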

```
# calculate the mean of the average test score
# for the whole group
GROUP_MEAN = np.mean(df['average'])
# and for the two subgroups
PREP_MEAN = np.mean(prep_df['average'])
NOPREP_MEAN = np.mean(noprep_df['average'])
# compute the mean score in each discipline for both subgroups
PREP_MATH_MEAN = np.mean(prep_df['math'])
NOPREP_MATH_MEAN = np.mean(noprep_df['math'])
PREP_READING_MEAN = np.mean(prep_df['reading'])
NOPREP_READING_MEAN = np.mean(noprep_df['reading'])
PREP_WRITING_MEAN = np.mean(prep_df['writing'])
NOPREP_WRITING_MEAN = np.mean(noprep_df['writing'])
# same for Standard Deviation (STD) and Interquartile Range (IQR) of each subgroup's average test score
PREP_STD = np.std(prep_df['average'])
NOPREP_STD = np.std(noprep_df['average'])
PREP_IQR = scipy.stats.iqr(prep_df['average'])
NOPREP_IQR = scipy.stats.iqr(noprep_df['average'])
# compute the difference of the test statistics (mean, STD and IQR) between the two subgroups
OBS_DIFF_MEAN = PREP_MEAN - NOPREP_MEAN
OBS_DIFF_STD = PREP_STD - NOPREP_STD
OBS_DIFF_IQR = PREP_IQR - NOPREP_IQR
```
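
For readers who prefer a more compact form, the same kind of per-group statistics can be obtained with a pandas `groupby`. The sketch below runs on a tiny made-up DataFrame (`toy_df` is hypothetical) and is only an alternative illustration. One detail worth knowing is that `np.std` defaults to the population standard deviation (`ddof=0`), while the pandas `std` defaults to the sample one (`ddof=1`).

```python
import numpy as np
import pandas as pd

# toy data standing in for the real dataset (values are made up)
toy_df = pd.DataFrame({
    'test_prep_course': ['completed', 'none', 'completed', 'none', 'none'],
    'average': [80.0, 60.0, 70.0, 65.0, 55.0],
})

grouped = toy_df.groupby('test_prep_course')['average']
summary = pd.DataFrame({
    'mean': grouped.mean(),
    'std_population': grouped.std(ddof=0),  # same convention as np.std's default
    'std_sample': grouped.std(ddof=1),      # pandas' default convention
})
print(summary)

# difference of means between the two subgroups
obs_diff_mean = summary.loc['completed', 'mean'] - summary.loc['none', 'mean']
print(obs_diff_mean)
```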

```
# function to compute the Cumulative Distribution Function (CDF) of an array
def compute_cdf(data):
    '''returns the sorted data and its respective cumulative probability'''
    x = np.sort(data)
    y = np.arange(1, len(x)+1) / len(x)
    return x, y

# function to return the part of an array after or before a given percentile
def compute_perc(data, perc=10, dir='sup'):
    '''returns the part of an array after or before a given percentile
    "perc" defines the range of percentiles wanted.
    Set by default to 10 (deciles), but can be set to 20 (quintiles), 25 (quartiles), etc.
    "dir" input must be:
    "sup" for percentiles starting from the highest value of the data
    "inf" for percentiles starting from the lowest value of the data'''
    # note: as written, 'sup' keeps the values >= the perc-th percentile
    # and 'inf' keeps the values <= the perc-th percentile
    if dir == 'sup':
        res = data[data>=np.percentile(data, perc)]
    elif dir == 'inf':
        res = data[data<=np.percentile(data, perc)]
    else:
        return None
    return res

# function to calculate x replicates of the test statistic of a bootstrap sample
def create_bs_rep(data, foo, size=1):
    '''creates bootstrap replicates of the test statistic from a bootstrap sample'''
    bs_rep_array = np.empty(size)
    for i in range(size):
        # resample with replacement (np.random.choice default), same size as the data
        bs_sample = np.random.choice(data, size=len(data))
        bs_rep_i = foo(bs_sample)
        bs_rep_array[i] = bs_rep_i
    return bs_rep_array

# function to find the index of a bin containing a value
def hist_idx_finder(value, bins):
    '''iterates over the bins of a histogram to find the index of the one containing a value'''
    for i,v in enumerate(bins):
        if i == len(bins)-1:
            # then the value is in the last bin (upper limit is infinity)
            return i
        # a value sitting exactly on a lower edge belongs to the bin that starts there
        elif bins[i] <= value < bins[i+1]:
            return i
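
Before relying on these helpers, it may help to check what the underlying NumPy masks return on a small, predictable array. The sketch below reproduces the core expressions of `compute_cdf` and `compute_perc` directly, so that it runs on its own. The array `data` is made up for illustration.

```python
import numpy as np

# ECDF of a tiny array: sorted values and their cumulative proportions
x = np.sort(np.array([3, 1, 2]))
y = np.arange(1, len(x) + 1) / len(x)
print(x, y)

# integers 1..100, so percentiles are easy to reason about
data = np.arange(1, 101)

p10 = np.percentile(data, 10)
at_or_above_p10 = data[data >= p10]   # same mask as dir='sup' with perc=10
at_or_below_p10 = data[data <= p10]   # same mask as dir='inf' with perc=10

# keeping only the top 10% requires the 90th percentile as the cut-off
top_decile = data[data >= np.percentile(data, 90)]

print(len(at_or_above_p10), len(at_or_below_p10), len(top_decile))
```

In other words, with these masks a cut-off at the 10th percentile keeps roughly the upper 90% of the values in the "sup" direction, so the value passed as `perc` should be chosen with that in mind.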

## I - Do students who took the preparation class score better on average?

Now that we have defined all we will need throughout our exploration, let's start with the most fundamental question:

**Does extra tutoring really help students get better exam scores?**

Let's first see how many students took the prep class and how the averages within each subgroup (with/without preparation) are distributed, using a simple histogram that approximates the Probability Density Function (PDF).

```
# print out the number of students in each group
nb_total_studs = len(df)
nb_prep_studs = len(prep_df)
prop_prep_studs = nb_prep_studs/nb_total_studs*100
print(f"{nb_prep_studs} out of {nb_total_studs} students ({prop_prep_studs:.1f}%) have taken the preparation class.")
```

```
# set visualization style to ggplot and size of the figure to 10,7
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=[10,7])
# plot histogram comparing the density distribution of the averages in each group
_ = plt.hist(prep_df['average'],
             histtype='step',
             hatch='/',
             density=True,
             color='blue',
             bins=10,
             linestyle='--',
             linewidth=2,
             )
_ = plt.hist(noprep_df['average'],
             density=True,
             color='orange',
             bins=10,
             )
# put a title, label the axes, define the legend and display
_ = plt.suptitle("PDF of the students' average scores",
                 fontsize=18,
                 color='black',
                 x=0.52,
                 y=0.92,
                 )
_ = plt.xlabel('Average scores')
_ = plt.ylabel('Probability Density Function (PDF)')
_ = plt.legend(['Prep group', 'No-prep group'])
plt.show()
```