## **Overview**

Education plays a vital role in shaping the future of our society, and it is crucial to evaluate the factors that contribute to student success. In this research report, we use data manipulation and visualization techniques to analyze a large school's year-end exam scores in math, reading, and writing. Specifically, we aim to help the school's administration make data-driven decisions about the efficacy of test preparation courses, and to explore the relationship between parental education and student performance.

The research objectives are thus the following:

- To determine the average reading scores for students who've taken the test preparation course compared to those who haven't.
- To assess the average scores across different parental education levels, in order to identify any notable trends or disparities.
- To examine within-subgroup effects, by comparing the average scores of students with and without test preparation courses for different parental education levels.
- To explore the correlations between scores in math, reading, and writing, in addition to investigating whether high exam performance in one subject translates to a high performance in others.

The dataset used for this analysis contains the following fields:

- "**gender**" - male / female
- "**race/ethnicity**" - one of 5 combinations of race/ethnicity
- "**parent_education_level**" - highest education level of either parent
- "**lunch**" - whether the student receives free/reduced or standard lunch
- "**test_prep_course**" - whether the student took the test preparation course
- "**math**" - exam score in math
- "**reading**" - exam score in reading
- "**writing**" - exam score in writing

The report is structured into three distinct sections: an exploratory data analysis section, a main analysis section, and a final section for conclusions and recommendations.

## **Exploratory Data Analysis**

The purpose of this section is to gain familiarity with the dataset and acquire a preliminary understanding of its characteristics. Below we can see the first five rows of the dataset, with the headers adapted for greater readability:
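A preview like this could be produced with a few lines of pandas. The following is only a minimal sketch: the raw header names on the left of `RENAMES`, the helper `load_exams`, and the two toy rows are assumptions made for illustration, not the report's actual file or code.

```python
import io

import pandas as pd

# Assumed raw headers (left) mapped to the readable names used in this report.
# The left-hand names are a guess at the source file's headers, not confirmed.
RENAMES = {
    "parental level of education": "parent_education_level",
    "test preparation course": "test_prep_course",
    "math score": "math",
    "reading score": "reading",
    "writing score": "writing",
}


def load_exams(source):
    """Read the exam CSV and rename its headers for readability."""
    return pd.read_csv(source).rename(columns=RENAMES)


# Toy two-row stand-in for the real file; the values are illustrative only.
toy_csv = io.StringIO(
    "gender,race/ethnicity,parental level of education,lunch,"
    "test preparation course,math score,reading score,writing score\n"
    "female,group B,bachelor's degree,standard,not completed,70,75,72\n"
    "male,group C,some college,free/reduced,completed,60,58,55\n"
)

exams = load_exams(toy_csv)
print(exams.head())  # head() shows up to five rows; the toy data has two
```

With the real file, `load_exams` would be called with its path instead of the in-memory buffer.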

Now we can explore the dataset's variables, in order to identify any missing or erroneous data. From the table below, it is possible to see that the dataset doesn't contain any missing values and the data types are assigned correctly. There are 1000 entries, divided into eight columns: five categorical variables (gender, race/ethnicity, parent_education_level, lunch, and test_prep_course) and three numerical variables (math, reading, and writing).
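The checks described here (entry count, missing values, data types) map onto standard pandas calls. The sketch below runs them on a small toy frame; the rows are invented for illustration and the variable names are hypothetical.

```python
import pandas as pd

# Toy stand-in for the exams data; values are illustrative only.
toy = pd.DataFrame(
    {
        "gender": ["female", "male", "male"],
        "test_prep_course": ["completed", "not completed", "completed"],
        "math": [70, 60, 85],
        "reading": [75, 58, 80],
    }
)

# Prints the number of entries plus each column's non-null count and dtype.
toy.info()

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = toy.isna().sum()

# Split the columns into numerical and categorical by dtype.
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
categorical_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("numerical:", numeric_columns)
print("categorical:", categorical_columns)
```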

### Analysis of Categorical Variables

Analyzing categorical variables in an EDA is crucial for understanding the composition and distribution of different groups or categories within the dataset. This can be achieved using descriptive statistics:
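As a rough illustration of what such a descriptive table contains, calling `describe()` on text columns reports the count, the number of unique categories, the most frequent category, and its frequency. The toy rows below are invented for the example and are not taken from the dataset.

```python
import pandas as pd

# Toy categorical columns; values are illustrative only.
toy = pd.DataFrame(
    {
        "gender": ["male", "female", "male", "male", "female"],
        "lunch": ["standard", "standard", "free/reduced", "standard", "free/reduced"],
    }
)

# For non-numeric columns, describe() returns count, unique, top, and freq.
summary = toy.describe()
print(summary)

# Share of the most frequent category in each column, in percent.
top_share = (summary.loc["freq"] / summary.loc["count"] * 100).astype(float)
print(top_share)
```

Dividing `freq` by `count` gives the percentage held by the dominant category, which is the kind of figure quoted for two-category variables.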

At first glance, gender, lunch, and test preparation course are variables with only two unique categories, so they do not need to be visualized in a countplot in order to be analyzed. From the table we can conclude that the school's student population is well-balanced in terms of gender (the percentage of males is only 4 percentage points higher than that of females), that most students receive the standard lunch (approx. 65%), and that a large proportion (64%) hadn't completed the test preparation course before the year-end examinations.

For the variables race/ethnicity and parent education level, we can tell that "group C" and "some college" are the most recurring categories, respectively. By visualizing them, however, we can see the frequency distribution of the other categories.

We can make a number of observations using the countplots. A good 44% of students have parents who either started college without finishing it or hold an associate's degree. A further 32% have parents who either haven't finished high school or hold a high school diploma as their highest qualification. Only 18% of students have parents who are highly educated, having achieved a bachelor's or a master's degree. In terms of race/ethnicity, a third of the students are part of group C.
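Percentages of this kind, including shares that combine two categories, can be read off normalized value counts. The sketch below assumes a column of education labels; the ten toy values and their category spellings are illustrative and do not reproduce the dataset's proportions.

```python
import pandas as pd

# Toy education levels; the labels and their frequencies are illustrative only.
levels = pd.Series(
    [
        "some college", "associate's degree", "high school", "some high school",
        "bachelor's degree", "some college", "master's degree", "high school",
        "some college", "associate's degree",
    ],
    name="parent_education_level",
)

# Share of each category in percent, most frequent first.
shares = levels.value_counts(normalize=True).mul(100).round(1)
print(shares)

# Combined share of two categories, as done when grouping related levels.
combined = shares.loc[["some college", "associate's degree"]].sum()
print(f"some college + associate's degree: {combined:.1f}%")
```

A countplot displays the same counts graphically; the normalized table is simply easier to quote from.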

### Analysis of Numerical Variables

By analyzing the numerical variables we can obtain some quantitative insights about the exams dataset. Namely, by using measures of central tendency and dispersion, we can get an idea of the distribution and main characteristics of the data. This also helps us uncover any outliers, which may significantly impact the analysis and interpretation of the results. The summary statistics table provides us with valuable information regarding how students performed in the year-end math, reading, and writing exams:

The distributions of exam scores for reading, writing, and math present some similarities but aren't identical: math has the lowest mean exam score (66), reading the highest (69), and writing is in between (68). This suggests that, on average, students performed slightly better in reading compared to the other two subjects. This disparity, however, doesn't apply to the variability in exam performance across subjects: all test scores tend to fluctuate by around 15 points, regardless of whether they relate to reading, writing, or math. Finally, the percentiles are also relatively close across the board: roughly 25% of students scored 58 or below, and 75% scored 78 or below.
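The mean, standard deviation, and percentiles discussed above are the quantities that pandas' `describe()` reports for numeric columns. Here is a minimal sketch on toy scores; the numbers are invented for illustration and deliberately do not match the report's figures.

```python
import pandas as pd

# Toy scores for five students; values are illustrative only.
scores = pd.DataFrame(
    {
        "math": [50, 60, 70, 80, 90],
        "reading": [55, 65, 75, 85, 95],
        "writing": [52, 62, 72, 82, 92],
    }
)

# Keep the central tendency, dispersion, and quartile rows of the summary.
stats = scores.describe().loc[["mean", "std", "25%", "50%", "75%"]]
print(stats.round(1))
```

Note that `std` here is the sample standard deviation (dividing by n - 1), which is the pandas default.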

In light of these observations, it is important to consider these consistencies when interpreting individual student performance. Firstly, doing so ensures a fair and balanced evaluation of students' inter-subject exam results, giving a comprehensive view of their overall capabilities rather than attributing differences to effort or ability in a specific subject. Secondly, it enables the school administration to identify those students who may require targeted support or interventions.

For example, a student who performs poorly in all subjects may need tutoring or additional instructional support to bridge gaps in foundational knowledge, rather than help with subject-specific challenges alone. Conversely, students who excel should be provided with enrichment programs that challenge and extend their learning, fostering continued academic progress and engagement. Both of these cases, at opposite ends of the spectrum, can be identified in the outlier analysis that follows.

### Outlier Analysis

Taking the overall average grade as a benchmark, it is evident that there is substantial room for improvement in the year-end exam performance of most students at this school. In this outlier analysis, we identify students whose scores deviate significantly from that benchmark, both positively and negatively. The boxplots below already show outlying students with exam scores well below a "fail" grade, potentially in more than one subject.
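Boxplots conventionally mark as outliers the points lying more than 1.5 times the interquartile range beyond the quartiles. Assuming the report's boxplots follow that convention (it is not stated), the same students can be flagged numerically. The helper `iqr_fences` and the toy scores below are hypothetical and serve only to illustrate the rule.

```python
import pandas as pd


def iqr_fences(scores, k=1.5):
    """Return per-column (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, q3 = scores.quantile(0.25), scores.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy scores for nine students; values are illustrative only.
toy = pd.DataFrame(
    {
        "math": [10, 58, 62, 66, 70, 74, 78, 82, 95],
        "reading": [15, 60, 64, 68, 72, 76, 80, 84, 90],
        "writing": [55, 57, 61, 65, 69, 73, 77, 81, 92],
    }
)

lower, upper = iqr_fences(toy)

# Boolean masks: True where a score falls outside the fence for its subject.
is_low = toy.lt(lower, axis="columns")
is_high = toy.gt(upper, axis="columns")

# Number of subjects in which each student is a low outlier.
low_subject_count = is_low.sum(axis=1)
print(toy[low_subject_count > 0])
```

Counting flags per row makes it easy to separate students who struggle in one subject from those who are low outliers in several.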
