Photo by Jannis Lucas on Unsplash.
Every year, American high school students take SATs, which are standardized tests intended to measure literacy, numeracy, and writing skills. There are three sections - reading, math, and writing, each with a maximum score of 800 points. These tests are extremely important for students and colleges, as they play a pivotal role in the admissions process.
Analyzing the performance of schools is important for a variety of stakeholders, including policy and education professionals, researchers, government, and even parents considering which school their children should attend.
You have been provided with a dataset called schools.csv, which is previewed below.
You have been tasked with answering three key questions about New York City (NYC) public school SAT performance.
# Re-run this cell
import pandas as pd
import matplotlib.pyplot as plt
# Read in the data
schools = pd.read_csv("schools.csv")
# Preview the data
schools.head()Question : Which NYC schools have the best math results?
Solution :
- filter on rows where "average_math" is at least 80% of 800 (i.e. "average_math" >= 0.8*800)
- select columns "school_name" and "average_math"
- sort by "average_math" descending
- print result
best_math_schools = schools[schools["average_math"] >= 0.8*800][["school_name","average_math"]].sort_values("average_math", ascending = False)
print(best_math_schools)Question : What are the top 10 performing schools based on the combined SAT scores?
Solution :
- create new columns
- "total_SAT" : sum of math, reading, and writing scores
- "school_name_full" : school_name prefixed with borough
- create "top_10_schools" :
- sort by "total_SAT" descending (i.e. ascending = False)
- select columns "school_name" and "total_SAT"
- select 10 first rows using index : [0:10]
- visualize result with a horizontal bar chart (requires "import matplotlib.pyplot as plt" in main cell)
- create plot (x = school_name, y = total_SAT, without legend)
- limit x-axis to add more contrast
- add title
- show plot
schools["total_SAT"] = schools["average_math"]+schools["average_writing"]+schools["average_reading"]
schools["school_name_full"] = "[" + schools["borough"] + "] " + schools["school_name"]
top_10_schools = schools.sort_values("total_SAT", ascending = True)[["school_name_full","total_SAT"]][0:10]
top_10_schools.plot(x='school_name_full', y='total_SAT', kind='barh', legend = False, ylabel="")
plt.xlim([min(top_10_schools["total_SAT"]) - 50, max(top_10_schools["total_SAT"])+10])
plt.title("Top 10 NYC Schools by Total SAT Score")
plt.show()Question : Which single borough has the largest standard deviation in the combined SAT score?
Solution :
- create new column "total_SAT" (same as before, could have been in a single cell)
- create variable "column_names" (just to make code more readable)
- group by "borough"
- calculate number of school (count), avegrage SAT (mean) and standard deviation (std)
- sort by descending (i.e. ascending = False) standard deviation (std)
- round to two decimal places : .round(2)
- select first row using index : [0:1]
- move borough from index to columns : .reset_index()
- print result
schools["total_SAT"] = schools["average_math"] + schools["average_writing"] + schools["average_reading"]
column_names = {"count":"num_schools","mean":"average_SAT","std":"std_SAT"}
largest_std_dev = schools.groupby("borough")["total_SAT"].agg(["count", "mean", "std"]).sort_values("std", ascending=False).rename(columns=column_names).round(2)[0:1]
largest_std_dev.reset_index(inplace=True)
print(largest_std_dev)