Exploratory Data Analysis in Python

Run the hidden code cell below to import the data used in this course.

# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import scipy.interpolate
import statsmodels.formula.api as smfa

# Importing the course datasets
brfss = pd.read_hdf('datasets/brfss.hdf5', 'brfss') # Behavioral Risk Factor Surveillance System (BRFSS) 
gss = pd.read_hdf('datasets/gss.hdf5', 'gss') # General Social Survey (GSS) 
nsfg = pd.read_hdf('datasets/nsfg.hdf5', 'nsfg') # National Survey of Family Growth (NSFG)

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Add your notes here

# Add your code snippets here

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Begin by calculating the number of rows and columns and displaying the names of columns for each DataFrame. Change any column names for better readability.
  • Experiment and compute a correlation matrix for variables in nsfg.
  • Compute the simple linear regression of WTKG3 (weight) and HTM4 (height) in brfss (or any other variables you are interested in!). Then, compute the line of best fit and plot it. If the fit doesn't look good, try a non-linear model.
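The regression suggestion in the last item could be sketched roughly as follows. This is a minimal sketch, not the course's reference solution: it uses synthetic data in a hypothetical `demo` DataFrame with the same column names (`HTM4`, `WTKG3`) so that it runs without the course files, and it assumes `scipy.stats.linregress` is the intended tool. Swapping `demo` for `brfss` should give the real fit.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Hypothetical stand-in for brfss: synthetic heights (cm) and weights (kg),
# with a few missing values, so the sketch runs without the course files.
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=500)
weight = 0.9 * height - 75 + rng.normal(0, 12, size=500)
demo = pd.DataFrame({'HTM4': height, 'WTKG3': weight})
demo.loc[::50, 'WTKG3'] = np.nan

# linregress does not skip NaN, so drop incomplete rows first
subset = demo.dropna(subset=['HTM4', 'WTKG3'])
res = linregress(subset['HTM4'], subset['WTKG3'])
print(res.slope, res.intercept, res.rvalue)

# Line of best fit evaluated at the smallest and largest observed height
fx = np.array([subset['HTM4'].min(), subset['HTM4'].max()])
fy = res.intercept + res.slope * fx

# Scatter plot of the data with the fitted line on top
plt.plot(subset['HTM4'], subset['WTKG3'], 'o', alpha=0.2)
plt.plot(fx, fy, '-')
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')
plt.show()
```

If the straight line fits poorly on the real data, one option to try is regressing a transformed variable (for example, the logarithm of weight) on height and comparing the plots.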
nsfg.shape
nsfg.columns
nsfg.head()
ounces = nsfg['birthwgt_oz1']
print(ounces.head())
nsfg['outcome'].value_counts()
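The first two suggestions in the Explore Datasets list (dimensions, column names, renaming, and a correlation matrix) could be sketched as below. The `demo_nsfg` DataFrame is a hypothetical miniature built from random numbers, and the new column names are illustrative choices rather than official NSFG labels. With the course data loaded, the `frames` dictionary could instead hold `brfss`, `gss`, and `nsfg`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature stand-in for nsfg (random values, real column names)
rng = np.random.default_rng(1)
demo_nsfg = pd.DataFrame({
    'agecon': rng.integers(1500, 4500, size=200),
    'birthwgt_lb1': rng.integers(4, 11, size=200),
    'birthwgt_oz1': rng.integers(0, 16, size=200),
})
demo_nsfg['agepreg'] = demo_nsfg['agecon'] + rng.integers(50, 80, size=200)
frames = {'nsfg': demo_nsfg}

# Rows, columns, and column names for each DataFrame
for name, df in frames.items():
    n_rows, n_cols = df.shape
    print(f'{name}: {n_rows} rows, {n_cols} columns')
    print(list(df.columns))

# rename returns a new DataFrame; the new names are illustrative choices
readable = demo_nsfg.rename(columns={
    'agecon': 'age_at_conception',
    'agepreg': 'age_at_pregnancy_end',
    'birthwgt_lb1': 'birth_weight_lb',
    'birthwgt_oz1': 'birth_weight_oz',
})

# Pairwise Pearson correlations between the numeric columns
corr = readable.corr()
print(corr.round(2))
```

Because `rename` returns a new DataFrame, the original column names stay available for the cells that follow.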
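# Replace the value 8 with NaN (assign the result back instead of
# using inplace=True on a column selection, which may not update nsfg)
nsfg['nbrnaliv'] = nsfg['nbrnaliv'].replace(8, np.nan)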

# Print the values and their frequencies
print(nsfg['nbrnaliv'].value_counts())
# Select the columns and divide by 100
agecon = nsfg['agecon'] / 100
agepreg = nsfg['agepreg'] / 100

# Compute the difference
preg_length = agepreg - agecon

# Compute summary statistics
print(preg_length.describe())
# Plot the histogram
plt.hist(x = agecon, bins = 20, histtype='step')

# Label the axes
plt.xlabel('Age at conception')
plt.ylabel('Number of pregnancies')

# Show the figure
plt.show()
def resample_rows_weighted(df, column='wgt2013_2015'):
    """Resamples a DataFrame using probabilities proportional to given column.
    Args:
        df: DataFrame
        column: string column name to use as weights
    Returns:
        DataFrame
    """
    weights = df[column].copy()
    weights /= sum(weights)
    indices = np.random.choice(df.index, len(df), replace=True, p=weights)
    sample = df.loc[indices]
    return sample
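For comparison, pandas has a built-in that does roughly the same weighted resampling as the function above: `DataFrame.sample` with `replace=True` and a `weights` column. The sketch below uses a hypothetical four-row `demo` DataFrame and draws more rows than the original holds (the function above draws exactly `len(df)`) only to make the proportions visible.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row carries almost all of the sampling weight
demo = pd.DataFrame({
    'value': [1, 2, 3, 4],
    'wgt2013_2015': [1.0, 1.0, 1.0, 97.0],
})

# sample() normalizes the weights itself when they do not sum to 1;
# random_state is fixed here only to make the sketch repeatable
resampled = demo.sample(n=len(demo) * 250, replace=True,
                        weights='wgt2013_2015', random_state=0)

# Share of draws per original row; the heavily weighted row should dominate
shares = resampled['value'].value_counts(normalize=True).sort_index()
print(shares)
```

Rows with larger weights appear more often in the resampled data, which is what lets summary statistics computed on it reflect the survey's sampling design.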