## Exploratory Data Analysis in Python

Run the hidden code cell below to import the data used in this course.

```
# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import scipy.interpolate
import statsmodels.formula.api as smf
# Importing the course datasets
brfss = pd.read_hdf('datasets/brfss.hdf5', 'brfss') # Behavioral Risk Factor Surveillance System (BRFSS)
gss = pd.read_hdf('datasets/gss.hdf5', 'gss') # General Social Survey (GSS)
nsfg = pd.read_hdf('datasets/nsfg.hdf5', 'nsfg') # National Survey of Family Growth (NSFG)
```

### Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

*Add your notes here*

`# Add your code snippets here`

### Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

- Begin by calculating the number of rows and columns and displaying the names of columns for each DataFrame. Change any column names for better readability.
- Experiment and compute a correlation matrix for variables in `nsfg`.
- Compute the simple linear regression of `WTKG3` (weight) and `HTM4` (height) in `brfss` (or any other variables you are interested in!). Then, compute the line of best fit and plot it. If the fit doesn't look good, try a non-linear model.
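
One possible way to approach the steps above is sketched below. It is a minimal sketch, not the course's reference solution. So that it runs without the course files, it uses a small synthetic DataFrame named `demo` as a stand-in for `brfss`; the generated values, the helper functions `describe_shape` and `fit_line`, and the renamed columns `height` and `weight` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress


def describe_shape(df):
    """Return the number of rows, number of columns, and column names."""
    n_rows, n_cols = df.shape
    return n_rows, n_cols, list(df.columns)


def fit_line(df, x_col, y_col):
    """Fit y = intercept + slope * x and return the fit plus two line endpoints."""
    # linregress cannot handle missing values, so drop incomplete rows first.
    subset = df.dropna(subset=[x_col, y_col])
    res = linregress(subset[x_col], subset[y_col])
    # Two x values are enough to draw a straight line of best fit.
    xs = np.array([subset[x_col].min(), subset[x_col].max()])
    ys = res.intercept + res.slope * xs
    return res, xs, ys


# Synthetic stand-in for brfss (hypothetical values, not real survey data).
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=500)
weight = 0.9 * height - 75 + rng.normal(0, 12, size=500)
demo = pd.DataFrame({"HTM4": height, "WTKG3": weight})
demo.loc[::50, "WTKG3"] = np.nan  # simulate some missing responses

# Step 1: rows, columns, and column names, then more readable names.
n_rows, n_cols, cols = describe_shape(demo)
demo = demo.rename(columns={"HTM4": "height", "WTKG3": "weight"})

# Step 2: correlation matrix (pandas ignores missing values pairwise).
corr = demo.corr()

# Step 3: simple linear regression of weight on height.
res, xs, ys = fit_line(demo, "height", "weight")
print(n_rows, n_cols, cols)
print(corr)
print(res.slope, res.intercept)
```

To try this on the course data, replace `demo` with `brfss`, `gss`, or `nsfg`. With `matplotlib.pyplot` imported as `plt`, calling `plt.plot(xs, ys)` on top of a scatter plot of the two columns draws the fitted line.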