Photo by Jannis Lucas on Unsplash.
Every year, American high school students take SATs, which are standardized tests intended to measure literacy, numeracy, and writing skills. There are three sections - reading, math, and writing, each with a maximum score of 800 points. These tests are extremely important for students and colleges, as they play a pivotal role in the admissions process.
Analyzing the performance of schools is important for a variety of stakeholders, including policy and education professionals, researchers, government, and even parents considering which school their children should attend.
You have been provided with a dataset called schools.csv, which is previewed below.
You have been tasked with answering three key questions about New York City (NYC) public school SAT performance.
# Re-run this cell
import pandas as pd
# Read in the data
schools = pd.read_csv("schools.csv")
# Preview the data
schools.head()
# Filter and sort schools based on average math scores
best_math_schools = schools[schools["average_math"] >= 0.8 * 800].sort_values("average_math", ascending = False)[["school_name", "average_math"]] #[["school_name", "average_math"]] we add these because we want only these two columns n not all the columns
print(best_math_schools.head())
# #adding 3 columns n storing result in a new column
# Calculate total SAT scores by summing math, reading, and writing scores
schools["total_SAT"] = schools["average_math"] + schools["average_reading"] + schools["average_writing"]
print(schools.head())
# Find top 10 schools with highest total SAT scores
top_10_schools = schools[["school_name", "total_SAT"]].sort_values("total_SAT", ascending = False).head(10)
print(top_10_schools.head())
# Group schools by borough and calculate count, mean, and standard deviation of total SAT scores
borough_stats = schools.groupby('borough')['total_SAT'].agg(['count', 'mean', 'std']).reset_index()
# Find borough with largest standard deviation
largest_std_dev_row = borough_stats.loc[borough_stats['std'].idxmax()]
# Round numeric values to two decimal places
largest_std_dev_row['mean'] = round(largest_std_dev_row['mean'], 2)
largest_std_dev_row['std'] = round(largest_std_dev_row['std'], 2)
# Rename columns for clarity and create DataFrame with one row
largest_std_dev = pd.DataFrame({
'borough': [largest_std_dev_row['borough']],
'num_schools': [largest_std_dev_row['count']],
'average_SAT': [largest_std_dev_row['mean']],
'std_SAT': [largest_std_dev_row['std']]
}, index=[largest_std_dev_row["borough"]])
# Create DataFrame with only one row
#largest_std_dev = pd.DataFrame(largest_std_dev).transpose()
print("Borough with the largest standard deviation of total SAT scores:")
print(largest_std_dev)
Here's a breakdown of the code line by line:
-
Reading and Previewing Data:
import pandas as pd schools = pd.read_csv("schools.csv") schools.head()This part reads a CSV file named "schools.csv" into a Pandas DataFrame called
schoolsand displays the first few rows of the DataFrame to inspect its structure. -
Filtering and Sorting for Best Math Schools:
best_math_schools = schools[schools["average_math"] >= 0.8 * 800].sort_values("average_math", ascending=False)[["school_name", "average_math"]] print(best_math_schools.head())This code filters schools where the average math score is at least 80% of 800, sorts them in descending order based on the math score, and selects only the columns "school_name" and "average_math" for display.
-
Calculating Total SAT Scores:
schools["total_SAT"] = schools["average_math"] + schools["average_reading"] + schools["average_writing"] print(schools.head())Here, a new column "total_SAT" is added to the
schoolsDataFrame, which sums up the average math, reading, and writing scores for each school. This new column is then displayed. -
Finding Top 10 Schools by Total SAT Scores:
top_10_schools = schools[["school_name", "total_SAT"]].sort_values("total_SAT", ascending=False).head(10) print(top_10_schools.head())This code extracts the "school_name" and "total_SAT" columns, sorts the schools by total SAT scores in descending order, and selects the top 10 schools based on these scores.
-
Calculating Borough Statistics:
borough_stats = schools.groupby('borough')['total_SAT'].agg(['count', 'mean', 'std']).reset_index()Here, the DataFrame
schoolsis grouped by the "borough" column. For each borough, it calculates the count of schools, the mean total SAT score, and the standard deviation of the total SAT scores. -
Finding Borough with Largest Standard Deviation:
largest_std_dev_row = borough_stats.loc[borough_stats['std'].idxmax()]This line identifies the row in
borough_statswhere the standard deviation (std) of total SAT scores (total_SAT) is maximum. -
Formatting Results:
largest_std_dev_row['mean'] = round(largest_std_dev_row['mean'], 2) largest_std_dev_row['std'] = round(largest_std_dev_row['std'], 2)It rounds the mean and standard deviation values to two decimal places for clarity.
-
Creating Summary DataFrame for Largest Standard Deviation:
largest_std_dev = pd.DataFrame({ 'borough': [largest_std_dev_row['borough']], 'num_schools': [largest_std_dev_row['count']], 'average_SAT': [largest_std_dev_row['mean']], 'std_SAT': [largest_std_dev_row['std']] }, index=[largest_std_dev_row["borough"]])This constructs a new DataFrame
largest_std_devwith one row, containing the borough with the largest standard deviation of total SAT scores along with the number of schools, average SAT score, and standard deviation. -
Printing Results:
print("Borough with the largest standard deviation of total SAT scores:") print(largest_std_dev)Finally, it prints out the borough with the largest standard deviation and its corresponding statistics in a formatted manner.