Photo by Jannis Lucas on Unsplash.
Every year, American high school students take SATs, which are standardized tests intended to measure literacy, numeracy, and writing skills. There are three sections - reading, math, and writing, each with a maximum score of 800 points. These tests are extremely important for students and colleges, as they play a pivotal role in the admissions process.
Analyzing the performance of schools is important for a variety of stakeholders, including policy and education professionals, researchers, government, and even parents considering which school their children should attend.
You have been provided with a dataset called schools.csv
, which is previewed below.
You have been tasked with answering three key questions about New York City (NYC) public school SAT performance.
# Re-run this cell
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Read in the data
schools = pd.read_csv('schools.csv')
# Start coding here...
# Add as many cells as you like...
# Top NYC Schools with the Best Math Results
best_math_schools = schools[schools['average_math'] >= 640][['school_name', 'average_math']].sort_values(by='average_math',ascending=False).reset_index(drop=True)
# Top 10 Performing Schools Based on Combined SAT Scores
schools['total_SAT'] = schools['average_math'] + schools['average_reading'] + schools['average_writing']
top_10_schools = schools[['school_name','total_SAT']].sort_values(by='total_SAT',ascending=False).head(10).reset_index(drop=True)
# Combined SAT Scores by Borough: Mean, Count, and Standard Deviation
borough_SAT = schools.groupby('borough')['total_SAT'].agg(['count' , 'mean' ,'std']).round(2).sort_values(by='std' , ascending=False)
# Largest Standard Deviation
large_std_dev = borough_SAT[borough_SAT['std'] == borough_SAT['std'].max()]
large_std_dev = large_std_dev.rename(columns={'count':'num_schools','mean':'average_SAT','std':'std_SAT'})
large_std_dev.reset_index(inplace=True)
# Scatter plot with regression line
sns.regplot(data=schools, x='percent_tested', y='average_math', scatter_kws={'s': 50}, line_kws={'color': 'red'})
plt.title('Math Average vs Percent Tested with Regression Line')
plt.xlabel('Percent Tested')
plt.ylabel('Average Math')
plt.figure()
# Histogram of Math average scores
sns.histplot(schools['average_math'], kde=True, bins=20)
plt.title('Distribution of Average Math Scores')
plt.xlabel('Average Math')
plt.ylabel('Frequency')
plt.figure()
# Histogram for Reading and Writing scores
sns.histplot(schools['average_reading'], kde=True, bins=20)
plt.title('Distribution of Average Reading Scores')
plt.xlabel('Average Reading')
plt.ylabel('Frequency')
plt.figure()
sns.histplot(schools['average_writing'], kde=True, bins=20)
plt.title('Distribution of Average Writing Scores')
plt.xlabel('Average Writing')
plt.ylabel('Frequency')
plt.figure()
# Boxplot for Math scores
sns.boxplot(x='borough', y='average_math', data=schools)
plt.title('Distribution of Math Scores by Borough')
plt.xlabel('Borough')
plt.ylabel('Average Math Scores')
plt.xticks(rotation=45, ha='right')
plt.figure()
plt.show()
print("# Best Math School")
print(best_math_schools)
print("\n# Top 10 Performing Schools Based on Total SAT")
print(top_10_schools)
print("\n# Borough with the Highest Standard Deviation in Total SAT Scores")
print(large_std_dev)