Skip to content

Photo by Jannis Lucas on Unsplash.

Every year, American high school students take SATs, which are standardized tests intended to measure literacy, numeracy, and writing skills. There are three sections - reading, math, and writing, each with a maximum score of 800 points. These tests are extremely important for students and colleges, as they play a pivotal role in the admissions process.

Analyzing the performance of schools is important for a variety of stakeholders, including policy and education professionals, researchers, government, and even parents considering which school their children should attend.

You have been provided with a dataset called schools.csv, which is previewed below.

You have been tasked with answering three key questions about New York City (NYC) public school SAT performance.

# Re-run this cell 
import pandas as pd
import matplotlib.pyplot as plt

# Read in the data
schools = pd.read_csv("schools.csv")

# Preview the data
schools.head()

Question : Which NYC schools have the best math results?

Solution :

  • filter on rows where "average_math" is at least 80% of 800 (i.e. "average_math" >= 0.8*800)
  • select columns "school_name" and "average_math"
  • sort by "average_math" descending
  • print result
best_math_schools = schools[schools["average_math"] >= 0.8*800][["school_name","average_math"]].sort_values("average_math", ascending = False)
print(best_math_schools)

Question : What are the top 10 performing schools based on the combined SAT scores?

Solution :

  • create new columns
    • "total_SAT" : sum of math, reading, and writing scores
    • "school_name_full" : school_name prefixed with borough
  • create "top_10_schools" :
    • sort by "total_SAT" descending (i.e. ascending = False)
    • select columns "school_name" and "total_SAT"
    • select 10 first rows using index : [0:10]
  • visualize result with a horizontal bar chart (requires "import matplotlib.pyplot as plt" in main cell)
    • create plot (x = school_name, y = total_SAT, without legend)
    • limit x-axis to add more contrast
    • add title
    • show plot
schools["total_SAT"] = schools["average_math"]+schools["average_writing"]+schools["average_reading"]
schools["school_name_full"] = "[" + schools["borough"] + "] " + schools["school_name"]
top_10_schools = schools.sort_values("total_SAT", ascending = True)[["school_name_full","total_SAT"]][0:10]

top_10_schools.plot(x='school_name_full', y='total_SAT', kind='barh', legend = False, ylabel="")
plt.xlim([min(top_10_schools["total_SAT"]) - 50, max(top_10_schools["total_SAT"])+10])
plt.title("Top 10 NYC Schools by Total SAT Score")
plt.show()

Question : Which single borough has the largest standard deviation in the combined SAT score?

Solution :

  • create new column "total_SAT" (same as before, could have been in a single cell)
  • create variable "column_names" (just to make code more readable)
  • group by "borough"
  • calculate number of school (count), avegrage SAT (mean) and standard deviation (std)
  • sort by descending (i.e. ascending = False) standard deviation (std)
  • round to two decimal places : .round(2)
  • select first row using index : [0:1]
  • move borough from index to columns : .reset_index()
  • print result
schools["total_SAT"] = schools["average_math"] + schools["average_writing"] + schools["average_reading"]
column_names = {"count":"num_schools","mean":"average_SAT","std":"std_SAT"}

largest_std_dev = schools.groupby("borough")["total_SAT"].agg(["count", "mean", "std"]).sort_values("std", ascending=False).rename(columns=column_names).round(2)[0:1]

largest_std_dev.reset_index(inplace=True)

print(largest_std_dev)