Skip to content
Analyzing the SAT performance of schools is important for a variety of stakeholders, including policy and education professionals, researchers, government, and even parents considering which school their children should attend. I have been provided with a dataset called schools.csv, which is previewed below. I have been tasked with answering three key questions about New York City (NYC) public school SAT performance.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# read in & preview the data
schools = pd.read_csv("schools.csv")
schools.head()

Plot subject averages by borough

# unpivot tables for easier plotting
borough_breakdown = schools.melt(
    id_vars='borough',
    value_vars=['average_math', 'average_reading', 'average_writing'],
    var_name='subject',
    value_name='score'
)

# plot subject averages by borough
sns.catplot(
    data=borough_breakdown,
    kind='bar',
    x='borough',
    hue='subject',
    y='score',
    height=4,
    aspect=3
)

# annotate plot
plt.xticks(rotation=45)
plt.title("Average SAT Scores by Borough and Subject")
plt.show()

Which NYC schools have the best math results?

The best math results are at least 80% of the maximum possible score of 800 for math.

# to get started, I will store the minimum best score as a variable
min_best_mscore = 0.80 * 800 

# create a dataframe for the best math schools (subsetting school name and avg math)
math_average = schools[['school_name', 'average_math']].sort_values(by='average_math', ascending=False)

# select only schools with an average equal or greated to 80% of max score
best_math_schools = math_average[math_average['average_math'] >= min_best_mscore]

# review output
best_math_schools

What are the top 10 performing schools based on the combined SAT scores?

We need to add a new column account for the combined scores.

# shortening schools variable, combining column values by individual row
s = schools
total_SAT = s['total_score'] = s[['average_math', 'average_reading', 'average_writing']].sum(axis=1)

# adding the above as a column
s['total_SAT'] = total_SAT

#outputting the top 10 by descending value
top_10_schools = s[['school_name', 'total_SAT']].sort_values(by='total_SAT', ascending=False).head(10)

top_10_schools

Which single borough has the largest standard deviation in the combined SAT score?

We are looking at the central tendency for a single borough.

(Not sure what a 'borough' actually is, but it will be fine...)

# selecting columns and their appropriate values with aggregation 
largest_std_dev = s.groupby('borough')['total_SAT'].agg(
    num_schools='count',
    average_SAT='mean',
    std_SAT='std',
).sort_values(by='std_SAT',ascending=False).round(2).head(1) # rounding the values by two decimal places

largest_std_dev

Correlations between SAT subjects

# subset average columns
scores = schools[['average_math', 'average_reading', 'average_writing']]

# use .corr() to get Pearson's correlation coefficients between them
scores_corr = scores.corr()

scores_corr
# visualise the correlations (and distributions by borough)
sns.pairplot(
    data=schools,
    vars=['average_math', 'average_reading', 'average_writing'],
    hue='borough',
    plot_kws={'alpha': 0.8}
)
plt.suptitle("Pairwise Relationships Between SAT Subjects", y=1.02)
plt.show()