Skip to content

Photo by Jannis Lucas on Unsplash.

Every year, American high school students take SATs, which are standardized tests intended to measure literacy, numeracy, and writing skills. There are three sections - reading, math, and writing, each with a maximum score of 800 points. These tests are extremely important for students and colleges, as they play a pivotal role in the admissions process.

Analyzing the performance of schools is important for a variety of stakeholders, including policy and education professionals, researchers, government, and even parents considering which school their children should attend.

You have been provided with a dataset called schools.csv, which is previewed below.

You have been tasked with answering three key questions about New York City (NYC) public school SAT performance.

# Re-run this cell 
import pandas as pd

# Read in the data
schools = pd.read_csv("schools.csv")


#First, we should define the result needed to be classified as a best math score
best_math = 800 * 0.8
#Once we have the best math results, we must create the best_math_schools dataframe as a subset of the original schools dataframe. We must also specify the columns that this dataframe should include.
best_math_schools = pd.DataFrame(schools[schools["average_math"] > best_math], columns = ["school_name", "average_math"])
#Since the dataframe needs to be sorted by descending order, we must ensure we sort it first before proceeding with the next steps.
best_math_schools = best_math_schools.sort_values(by = "average_math", ascending = False)


#To tackle the second part of the problem, we must first establish each school's total SAT score, by adding the three parts that compromise the total score. We may then order the dataset by descending order of the total SAT score.
schools["total_SAT"] = schools["average_math"] + schools["average_reading"] + schools["average_writing"]
total_SAT_ordered = schools.sort_values(by = "total_SAT", ascending = False)
#We then may narrow it down to the specific dataframe by isolating the two columns required and then slice the dataframe to include just the top 10 rows.
top_10_schools = pd.DataFrame(total_SAT_ordered, columns = ["school_name", "total_SAT"])
top_10_schools = top_10_schools[:10]


#To tackle the third part of the problem discussing borough statistics, we may use the groupby function to group the dataframe by the borough column. The agg function may be used in conjunction with this to allow us to perform multiple tasks on the grouped data.
borough_stats = total_SAT_ordered.groupby("borough").agg(
    num_schools = ("total_SAT", "count"),
    average_SAT = ("total_SAT", "mean"),
    std_SAT = ("total_SAT", "std")
).reset_index()
#We may now round off all the values to two decimal places
borough_stats[["num_schools", "average_SAT", "std_SAT"]] = (
    borough_stats[["num_schools", "average_SAT", "std_SAT"]].round(2)
)
#Now we can derive our final desired result by creating a dataframe with the borough with the largest std_SAT from the nlargest function, and then including the desired columsn we want in this one row dataframe.
largest_std_dev = borough_stats.nlargest(
    1, "std_SAT"
)[["borough", "num_schools", "average_SAT", "std_SAT"]]