
Measles

This dataset contains the overall and the measles, mumps, and rubella (MMR) immunization rates for schools across the United States. Each row corresponds to one school and includes variables such as latitude, longitude, name, and vaccination rates.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

measles = pd.read_csv("data/measles.csv")
print(measles.shape)
measles.head(10)

Data Dictionary

| Column | Explanation |
|---|---|
| index | Index ID |
| state | School's state |
| year | School academic year |
| name | School name |
| type | Whether a school is public, private, charter |
| city | City |
| county | County |
| district | School district |
| enroll | Enrollment |
| mmr | School's Measles, Mumps, and Rubella (MMR) vaccination rate |
| overall | School's overall vaccination rate |
| xrel | Percentage of students exempted from vaccination for religious reasons |
| xmed | Percentage of students exempted from vaccination for medical reasons |
| xper | Percentage of students exempted from vaccination for personal reasons |

Source and license of the dataset.

Citation: This data was compiled by staff of The Wall Street Journal: Dylan Moriarty, Taylor Umlauf, & Brianna Abbott.

# Initial exploratory analysis

print(measles.describe())
print(measles.shape)
# -1 marks missing data in the overall and mmr vaccination rate columns
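
Because -1 is a placeholder rather than a real rate, summary statistics that include it are pulled downward. A minimal sketch of converting the placeholder to NaN before summarizing, using a small made-up frame (`toy`, with invented values) in place of `data/measles.csv`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; values are invented for illustration.
toy = pd.DataFrame({
    "state": ["Arizona", "Arizona", "Ohio", "Ohio"],
    "mmr": [95.0, -1.0, 98.0, -1.0],
    "overall": [-1.0, 90.0, -1.0, -1.0],
})

# Treat the -1 placeholder as missing in the two rate columns only.
rate_cols = ["mmr", "overall"]
toy_clean = toy.copy()
toy_clean[rate_cols] = toy_clean[rate_cols].replace(-1, np.nan)

# The mean now skips the placeholder instead of averaging it in.
print(toy["mmr"].mean())        # placeholder included
print(toy_clean["mmr"].mean())  # placeholder treated as missing
```

Converting to NaN keeps the rows in the frame, so the other columns remain available; filtering rows out is the alternative when only valid rates are needed.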

# .head() shows Arizona as the first state, although Alabama would come first alphabetically
print(measles[measles['state'] == 'Alabama']) # empty: there is no data for Alabama
# how many schools from each state?
print(measles['state'].value_counts()) # worth keeping in mind relative to state population
# how many states are included?
print(measles['state'].nunique(), 'unique states are included in this dataset.') # 32 states only

# missing data
measles.isna().sum().plot(kind = 'bar', title = 'Missing Data, by Column', rot = 45)
plt.xlabel('Column Name')
plt.ylabel('Observation Count')
plt.show()

# Q1: Which states have the highest average mmr vaccination rates?

## Missing Data & Cleaning

# Some states show an average mmr vax rate of -1, which is the missing-data code rather than a real rate
# let's drop those rows for now and analyze them separately later
measles_f = measles[measles['mmr']>=0] # subsetting only valid vax rates
print(measles_f[['state','name','mmr']].sort_values(by = 'mmr',ascending = True))
# now all mmr vaccination rates are non-negative, so we can proceed
print('The number of states with data for MMR vaccination rates is', measles_f['state'].nunique()) # check whether we still have 32 states

# only 21 states left: this means that 11 states had only missing data for mmr vaccination rates
# which states were excluded? (i.e. have no data available for mmr vax rates)
states_list = measles['state'].unique() # unique states in original dataset
states_f_list = measles_f['state'].unique() # unique states after removing missing data

# states that appear in the original data but not in the filtered data
states_m_list = [s for s in states_list if s not in states_f_list]
# print missing states
print('States without MMR vaccination rate data are', states_m_list)
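
The same question can also be answered with a single groupby. The sketch below uses a made-up frame (`toy`, with invented values) and assumes missing MMR rates are always coded as -1 rather than NaN; it flags a state when even its largest MMR value is the placeholder, then cross-checks the result against a set difference:

```python
import pandas as pd

# Toy stand-in; values are invented for illustration.
toy = pd.DataFrame({
    "state": ["Arizona", "Arizona", "Florida", "Florida", "Ohio"],
    "mmr": [95.0, -1.0, -1.0, -1.0, 98.0],
})

# A state has no usable MMR data if even its largest value is the -1 placeholder.
max_mmr = toy.groupby("state")["mmr"].max()
states_without_mmr = sorted(max_mmr[max_mmr < 0].index)

# Cross-check against the set-difference formulation.
all_states = set(toy["state"])
states_with_mmr = set(toy.loc[toy["mmr"] >= 0, "state"])
assert states_without_mmr == sorted(all_states - states_with_mmr)

print(states_without_mmr)
```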

# Grouping by state and taking mean mmr vaccination rate
measles_states = measles_f.groupby('state')['mmr'].mean()
measles_states = measles_states.to_frame()
# sorting in descending order
sort_measles_states = measles_states.sort_values(by = 'mmr',ascending = False)

# Visualizations
# plot the sorted frame so the states with the highest rates appear first
sort_measles_states.plot(kind = 'bar', title = 'Average MMR Vaccination Rate, by State', rot = 75)
plt.ylim(75,100)
plt.xlabel('State')
plt.ylabel('Vaccination Rate (%)')
plt.show()

## Trends in Types of Schools w/MMR vax data

# mean vaccination rate by school type (only mmr >= 0)
measles_type = measles_f.groupby('type')['mmr'].mean()
measles_type = measles_type.to_frame()

# visualization
measles_type.plot(kind = 'bar', title = 'Average MMR Vaccination Rate, by School Type', rot = 45)
plt.xlabel('School Type')
plt.ylabel('Vaccination Rate (%)')
plt.show()

# BOCES schools have the highest average MMR vaccination rate, likely because these schools exist only in NY,
# which has a high vaccination rate as a whole. Charter schools have slightly lower MMR vaccination rates than
# other school types, but the gap looks small in the plot and has not been formally tested.
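
One way to check whether a gap between two school types is larger than chance would produce is a permutation test on the difference in means. The sketch below is only an illustration: `permutation_pvalue` is a hypothetical helper, the two lists of rates are invented, and the test treats schools as independent, which ignores that schools cluster within states.

```python
import numpy as np

def permutation_pvalue(a, b, n_perm=5000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        # Reassign group labels at random and recompute the gap.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        hits += diff >= observed
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

# Invented rates for two school types, for illustration only.
charter = [91.0, 93.5, 94.0, 92.0, 95.0, 90.5]
public = [94.0, 95.5, 96.0, 93.0, 97.0, 95.0]
p_value = permutation_pvalue(charter, public)
print(round(p_value, 3))
```

With the real data, the two inputs would be the `mmr` values of `measles_f` for the two school types being compared.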

# Are certain types of schools more likely to have missing data on MMR vaccination rates?

# take all missing MMR data
measles_missing = measles[measles['mmr'] < 0]
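# percentage of schools of each type with missing MMR data (missing count / total count per type)
# note: assigning into measles_type aligns on its index, so only school types with at least one valid MMR rate are kept
measles_type['missing'] = measles_missing['type'].value_counts() / measles['type'].value_counts() * 100
# types with no missing MMR data come out as NaN; replace those with 0
measles_type_filled = measles_type.fillna(0)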
# plotting
measles_type_filled.plot(kind = 'bar', y = 'missing', title = 'Percentage (%) Missing Data on MMR Vaccination Rates, by School Type', rot = 45)
plt.ylabel('Percentage of MMR Data Missing')
plt.xlabel('School Type')
plt.show()

# In 85% of nonpublic schools, data on MMR vaccination rates is missing. The earlier finding that nonpublic
# schools have MMR vaccination rates similar to private and public schools therefore rests on a small share
# of those schools and may be inaccurate.
# By contrast, only ~20% of charter schools are missing data on MMR vax rates, so the earlier finding that
# charter schools have lower average vaccination rates is better supported.

# Nonpublic and private schools have the highest rates of missing data for MMR vaccination rates. One plausible
# explanation is that these schools are not bound to the same legal vaccination & reporting standards as public schools.
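
The percentage of missing MMR data per school type can also be computed directly from the full data as the mean of a boolean flag. A minimal sketch on a made-up frame (`toy`, with invented values); computed this way, a school type with no valid MMR data at all still appears in the result, and types with nothing missing come out as 0 without a fill step:

```python
import pandas as pd

# Toy stand-in; values are invented for illustration.
toy = pd.DataFrame({
    "type": ["Public", "Public", "Private", "Private", "Charter", "Nonpublic"],
    "mmr": [95.0, 97.0, -1.0, 92.0, 90.0, -1.0],
})

# Share of schools per type whose MMR rate is the -1 placeholder.
# The mean of a boolean column is the fraction of rows that are True.
pct_missing = (
    toy.assign(mmr_missing=toy["mmr"] < 0)
       .groupby("type")["mmr_missing"]
       .mean()
       .mul(100)
       .sort_values(ascending=False)
)
print(pct_missing)
```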
## Investigating States with Missing MMR Vaccination data
# Q3: Do states with missing MMR data share any common traits?

# subsetting for states without mmr data (i.e. every school has mmr = -1)
measles_m = measles[measles['state'].isin(states_m_list)].iloc[:,1:14] # columns 1-13, skipping the leading index column
print(measles_m.head())
# check what other data is missing (looks like xrel, xmed, xper - makes sense since based on mmr)
print(measles_m.isna().any()) # missing year, type, city, county, and district data too

# what percentage is missing?
for col, label in [('city', 'city'), ('type', 'school type'), ('year', 'year'), ('county', 'county'), ('district', 'district')]:
    pct = round(measles_m[col].isna().sum() / len(measles_m[col]) * 100, 2)
    print(pct, '% of schools are missing data on', label + '.')

# All schools in states that do not have any data on MMR vaccination rates are also missing data on the
# type of school (public, private, charter) and on which school district they belong to. Most (78%) are
# also missing data on their city, while fewer are missing data on county (17%) and year (21%).
# It would be interesting to investigate co-missingness trends!
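
A simple starting point for co-missingness is a matrix that counts, for every pair of columns, how many rows are missing both. The sketch below uses a made-up frame (`toy`, with invented values) standing in for `measles_m`:

```python
import pandas as pd

# Toy stand-in; values are invented for illustration.
toy = pd.DataFrame({
    "city": ["A", None, None, "D"],
    "county": ["X", "Y", None, "W"],
    "type": [None, None, None, None],
    "year": ["2018-19", None, None, "2018-19"],
})

# 1 where a value is missing, 0 otherwise.
is_missing = toy.isna().astype(int)

# Entry (i, j) counts rows where columns i and j are both missing;
# the diagonal holds each column's own missing count.
co_missing = is_missing.T.dot(is_missing)
print(co_missing)
```

Dividing each row of the matrix by its diagonal entry would turn the counts into conditional shares, such as the share of schools missing city that are also missing year.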
# Predicting Vaccination Rates

Don't know where to start?

Challenges are brief tasks designed to help you practice specific skills:

  • 🗺️ Explore: What types of schools have the highest overall and mmr vaccination rates?
  • 📊 Visualize: Create a plot that visualizes the overall and mmr vaccination rates for the ten states with the highest number of schools.
  • 🔎 Analyze: Does location affect the vaccination percentage of a school?

Scenarios are broader questions to help you develop an end-to-end project for your portfolio:

You are working for a public health organization. The organization has a problem: this year, the overall vaccination rate information for schools is not yet available. To gain an initial idea of the rates, your manager has asked you whether it is possible to use other data to predict the overall vaccination rate of a school. This includes such information as the mmr vaccination rate, the location, and the type of school. Your manager also wants to know how reliable your predictions are.

You will need to prepare a report that is accessible to a broad audience. It should outline your motivation, steps, findings, and conclusions.
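
As a rough first step for this scenario, the overall rate could be regressed on the MMR rate and the fit judged on schools held out from training. The sketch below is an assumption-laden illustration, not a finished model: the data are synthetic (generated so that `overall` is roughly linear in `mmr`), only one predictor is used, and location and school type are left out.

```python
import numpy as np

# Synthetic data for illustration only: overall is a noisy linear function of mmr.
rng = np.random.default_rng(42)
mmr = rng.uniform(80, 100, size=200)
overall = 0.9 * mmr + 5 + rng.normal(0, 1.5, size=200)

# Hold out the last 50 schools to estimate out-of-sample error.
train, test = slice(0, 150), slice(150, 200)

# Fit overall = slope * mmr + intercept by least squares.
slope, intercept = np.polyfit(mmr[train], overall[train], deg=1)
pred = slope * mmr[test] + intercept

# RMSE and R^2 on the held-out schools describe how reliable the predictions are.
ss_res = np.sum((overall[test] - pred) ** 2)
ss_tot = np.sum((overall[test] - overall[test].mean()) ** 2)
rmse = np.sqrt(ss_res / pred.size)
r2 = 1 - ss_res / ss_tot
print(f"slope={slope:.2f}, RMSE={rmse:.2f}, R^2={r2:.2f}")
```

With the real data, rows where either rate is -1 would need to be excluded first, since the placeholder would otherwise distort the fit.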

