Skip to content
Exploring SAT Scores (basic pandas)
Analyzing the SAT performance of schools is important for a variety of stakeholders, including policy and education professionals, researchers, government, and even parents considering which school their children should attend. I have been provided with a dataset called schools.csv, which is previewed below. I have been tasked with answering three key questions about New York City (NYC) public school SAT performance. |
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# read in & preview the data
schools = pd.read_csv("schools.csv")
schools.head()Plot subject averages by borough
# unpivot tables for easier plotting
borough_breakdown = schools.melt(
id_vars='borough',
value_vars=['average_math', 'average_reading', 'average_writing'],
var_name='subject',
value_name='score'
)
# plot subject averages by borough
sns.catplot(
data=borough_breakdown,
kind='bar',
x='borough',
hue='subject',
y='score',
height=4,
aspect=3
)
# annotate plot
plt.xticks(rotation=45)
plt.title("Average SAT Scores by Borough and Subject")
plt.show()Which NYC schools have the best math results?
The best math results are at least 80% of the maximum possible score of 800 for math.
# to get started, I will store the minimum best score as a variable
min_best_mscore = 0.80 * 800
# create a dataframe for the best math schools (subsetting school name and avg math)
math_average = schools[['school_name', 'average_math']].sort_values(by='average_math', ascending=False)
# select only schools with an average equal or greated to 80% of max score
best_math_schools = math_average[math_average['average_math'] >= min_best_mscore]
# review output
best_math_schoolsWhat are the top 10 performing schools based on the combined SAT scores?
We need to add a new column account for the combined scores.
# shortening schools variable, combining column values by individual row
s = schools
total_SAT = s['total_score'] = s[['average_math', 'average_reading', 'average_writing']].sum(axis=1)
# adding the above as a column
s['total_SAT'] = total_SAT
#outputting the top 10 by descending value
top_10_schools = s[['school_name', 'total_SAT']].sort_values(by='total_SAT', ascending=False).head(10)
top_10_schoolsWhich single borough has the largest standard deviation in the combined SAT score?
We are looking at the central tendency for a single borough.
(Not sure what a 'borough' actually is, but it will be fine...)
# selecting columns and their appropriate values with aggregation
largest_std_dev = s.groupby('borough')['total_SAT'].agg(
num_schools='count',
average_SAT='mean',
std_SAT='std',
).sort_values(by='std_SAT',ascending=False).round(2).head(1) # rounding the values by two decimal places
largest_std_devCorrelations between SAT subjects
# subset average columns
scores = schools[['average_math', 'average_reading', 'average_writing']]
# use .corr() to get Pearson's correlation coefficients between them
scores_corr = scores.corr()
scores_corr# visualise the correlations (and distributions by borough)
sns.pairplot(
data=schools,
vars=['average_math', 'average_reading', 'average_writing'],
hue='borough',
plot_kws={'alpha': 0.8}
)
plt.suptitle("Pairwise Relationships Between SAT Subjects", y=1.02)
plt.show()