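We'll need mosaic for basic descriptive analysis and ggplot2 for visualization. Loading mosaic also attaches dplyr, which provides the %>% pipe and the filter() function used below.
# Install mosaic only if it is not already available
if (!requireNamespace("mosaic", quietly = TRUE)) install.packages("mosaic")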
library(mosaic)
library(ggplot2)
colleges <- read.csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRKiYHNVKs8sWMo1FpG_whoNhiMGR5NQ36hiBqJbOtKnvpzStY9g-dLjAyPCDywnHVH_zOFoyWQPpyD/pub?gid=1276561522&single=true&output=csv")
head(colleges)
count(colleges)
The dataset contains 2,199 schools. The variables (columns) we'll be looking at are:
- name <- name of the school
- tier <- school tier (1-12); a lower number means a more prestigious tier
- type <- school type (private non-profit, private for-profit, public)
- median_family_income <- median family income of students at the school
- percent_from_bottom_20 <- percent of students at the school who come from the bottom 20% income households
- percent_from_bottom_20_and_reached_top_20 <- percent of all students at the school who came from bottom 20% income households and who reached the top 20% of incomes as adults
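Before going further, it can help to confirm that the columns listed above are actually present in the loaded data. The following is a minimal sketch in base R; the names analysis_columns and select_analysis_columns are hypothetical helpers introduced here for illustration, not part of the original analysis.

```r
# Columns described in the list above.
analysis_columns <- c(
  "name", "tier", "type", "median_family_income",
  "percent_from_bottom_20", "percent_from_bottom_20_and_reached_top_20"
)

# Hypothetical helper: keep only the analysis columns,
# stopping with an error if any of them is missing.
select_analysis_columns <- function(df) {
  missing_cols <- setdiff(analysis_columns, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  df[, analysis_columns, drop = FALSE]
}

# Inspect column types if the data has already been loaded.
if (exists("colleges")) str(select_analysis_columns(colleges))
```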
What is the Median Family Income of Students at Different School Tiers?
As expected, a random sample of 10 tier 1 schools (again, a lower number means more prestigious) includes many Ivy League schools. The median family income at these schools is high, at around $132k-$218k. For context, the national median household income around the time this data was collected was about $57,500 (HUD, 2004). The share of students at these schools who come from bottom 20% income households is also relatively low, at around 2-5%, compared with what we'll see shortly in tier 12 schools.
# Draw 10 random tier 1 schools (the rows change on each run unless a seed is set)
colleges %>%
  filter(tier == 1) %>%
  slice_sample(n = 10)
A random sample of 10 tier 12 schools (again, a higher number means less prestigious) shows a much lower median family income than the tier 1 schools, at around $25k-$77k. The share of students who come from bottom 20% income households is around 14-49%, much higher than in the tier 1 schools.
# Draw 10 random tier 12 schools (the rows change on each run unless a seed is set)
colleges %>%
  filter(tier == 12) %>%
  slice_sample(n = 10)
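Because the two samples above are random, the ranges they show will shift from run to run. One way to check the same comparison across every tier at once is a grouped summary. This is a sketch, assuming the column names listed earlier; summarise_by_tier is a hypothetical helper name, and stats::median is called explicitly to avoid depending on mosaic's masked version.

```r
library(dplyr)

# Hypothetical helper: one row per tier with the number of schools,
# the median of median family income, and the median low-income share.
summarise_by_tier <- function(df) {
  df %>%
    group_by(tier) %>%
    summarise(
      n_schools = n(),
      median_income = stats::median(median_family_income, na.rm = TRUE),
      median_pct_bottom_20 = stats::median(percent_from_bottom_20, na.rm = TRUE)
    ) %>%
    arrange(tier)
}

# Print the summary if the data has already been loaded.
if (exists("colleges")) print(summarise_by_tier(colleges))
```

Unlike the 10-row samples, this table uses every school in each tier, so it gives the same answer on every run.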
We can visualize our insights using a boxplot. We do this by putting the median family income on the y-axis and the school tier on the x-axis. We can clearly see a relationship. The lower the tier (more prestigious), the higher the median family income of students. The higher the tier (less prestigious), the lower the median family income of students.
ggplot(colleges, aes(x = factor(tier), y = median_family_income)) +
  geom_boxplot() +
  labs(x = "School tier", y = "Median family income")
Here we visualize our second insight: the percentage of students who come from bottom 20% income households. Again we can see a relationship. The lower the tier (more prestigious), the lower the share of students from low-income households. The higher the tier (less prestigious), the higher that share.
ggplot(colleges, aes(x = factor(tier), y = percent_from_bottom_20)) +
  geom_boxplot() +
  labs(x = "School tier", y = "Percent of students from bottom 20%")
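The relationships seen in the two boxplots can also be put into a single number. The sketch below uses a Spearman rank correlation, which is an addition for illustration and not part of the original analysis; it assumes tier can be treated as an ordered variable, as the description above suggests. The name tier_rank_correlation is hypothetical.

```r
# Hypothetical helper: Spearman rank correlation between tier and another
# numeric column, ignoring rows where either value is missing.
tier_rank_correlation <- function(df, column) {
  cor(df$tier, df[[column]], method = "spearman", use = "complete.obs")
}

# Compute both correlations if the data has already been loaded.
if (exists("colleges")) {
  print(tier_rank_correlation(colleges, "median_family_income"))
  print(tier_rank_correlation(colleges, "percent_from_bottom_20"))
}
```

A value near -1 means the column falls as the tier number rises, and a value near +1 means it rises with the tier number.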
What is the Mobility Rate for Different School Tiers?