Exploratory Data Analysis in Python
Run the hidden code cell below to import the data used in this course.
# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import scipy.interpolate
import statsmodels.formula.api as smf
# Importing the course datasets
brfss = pd.read_hdf('datasets/brfss.hdf5', 'brfss') # Behavioral Risk Factor Surveillance System (BRFSS)
gss = pd.read_hdf('datasets/gss.hdf5', 'gss') # General Social Survey (GSS)
nsfg = pd.read_hdf('datasets/nsfg.hdf5', 'nsfg') # National Survey of Family Growth (NSFG)
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
Add your notes here
df.isnull().sum()  # number of missing values in each column of the DataFrame
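To see what the per-column count looks like, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its column names are hypothetical and are not part of the course datasets.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the call
toy = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "weight": [70.5, 82.0, np.nan, np.nan],
    "city": ["Paris", "Lyon", None, "Nice"],
})

missing_per_column = toy.isnull().sum()         # one count per column
missing_total = int(missing_per_column.sum())   # total over the whole frame
missing_share = toy.isnull().mean()             # fraction of missing values per column

print(missing_per_column)
```

Taking `.mean()` instead of `.sum()` gives the share of missing values, which is often easier to compare across columns.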
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Begin by calculating the number of rows and columns and displaying the names of columns for each DataFrame. Change any column names for better readability.
- Experiment and compute a correlation matrix for variables in `nsfg`.
- Compute the simple linear regression of `WTKG3` (weight) and `HTM4` (height) in `brfss` (or any other variables you are interested in!). Then, compute the line of best fit and plot it. If the fit doesn't look good, try a non-linear model.
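The regression exercise can be sketched as follows. This is only an illustration on synthetic data: the `height` and `weight` arrays are made up to stand in for `HTM4` and `WTKG3`, and treating height as the explanatory variable is an assumption.

```python
import numpy as np
import scipy.stats

# Synthetic stand-in for height (cm) and weight (kg); NOT the BRFSS data
rng = np.random.default_rng(0)
height = rng.uniform(150, 200, size=500)
weight = 0.9 * height - 80 + rng.normal(0, 5, size=500)

# linregress does not handle NaN, so with the real data drop rows where
# either column is missing first, e.g. brfss.dropna(subset=['WTKG3', 'HTM4'])
res = scipy.stats.linregress(height, weight)

# Line of best fit, evaluated at the smallest and largest observed height
fx = np.array([height.min(), height.max()])
fy = res.intercept + res.slope * fx

print(f"slope={res.slope:.3f}, intercept={res.intercept:.2f}, r={res.rvalue:.3f}")
```

To draw the fit, plot the raw points and then the line, for example `plt.plot(height, weight, 'o', alpha=0.1)` followed by `plt.plot(fx, fy, '-')`.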
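# Summary statistics
# df is assumed to be a bookings DataFrame (with an 'is_canceled' column) loaded elsewhere
df.describe()  # count, mean, std, min, quartiles (25%, 50%, 75%) and max of each numeric column
# When the mean and the median (the 50% row) are not close, the distribution is skewed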
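The link between skew and the gap between mean and median can be checked on simulated data. The two series below are hypothetical samples, chosen so that one is right-skewed and the other is symmetric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical samples: one right-skewed, one symmetric
skewed = pd.Series(rng.exponential(scale=10.0, size=10_000))
symmetric = pd.Series(rng.normal(loc=10.0, scale=2.0, size=10_000))

summary_skewed = skewed.describe()
summary_symmetric = symmetric.describe()

# The '50%' row of describe() is the median
gap_skewed = summary_skewed["mean"] - summary_skewed["50%"]
gap_symmetric = summary_symmetric["mean"] - summary_symmetric["50%"]

print(f"skewed:    mean - median = {gap_skewed:.2f}, skewness = {skewed.skew():.2f}")
print(f"symmetric: mean - median = {gap_symmetric:.2f}, skewness = {symmetric.skew():.2f}")
```

A mean well above the median points to a long right tail; a mean well below it points to a long left tail.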
n_canceled = df['is_canceled'].sum()      # number of cancelled bookings
pct_canceled = df['is_canceled'].mean()   # the mean of a 0/1 column is the share of 1s
print(f"{n_canceled} bookings were cancelled, which is {pct_canceled*100:.2f}% of all bookings")
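The trick used here is that the sum of a 0/1 column counts the 1s and its mean is their proportion. A small sketch with a made-up `flags` frame (hypothetical data) shows both values:

```python
import pandas as pd

# Hypothetical 0/1 flags: 3 cancelled bookings out of 8
flags = pd.DataFrame({"is_canceled": [1, 0, 0, 1, 0, 0, 0, 1]})

n_canceled = int(flags["is_canceled"].sum())   # counts the 1s
pct_canceled = flags["is_canceled"].mean()     # share of 1s

message = (
    f"{n_canceled} bookings were cancelled, "
    f"which is {pct_canceled * 100:.2f}% of all bookings"
)
print(message)
```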
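# Reconstructed from the merge below: cancelled bookings (sum) and all bookings (count) per month
cancellations = df\
    .filter(['arrival_date_month', 'is_canceled'])\
    .groupby(by='arrival_date_month', as_index=False)\
    .sum()
total_bookings = df\
    .filter(['arrival_date_month', 'is_canceled'])\
    .groupby(by='arrival_date_month', as_index=False)\
    .count()\
    .rename(columns={'is_canceled': 'total_bookings'})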
# Calculate the cancellation rate for every month
merged = pd.merge(cancellations, total_bookings, on='arrival_date_month')
merged['cancellation_rate'] = merged['is_canceled'] / merged['total_bookings']
merged
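The same monthly rate can also be computed in a single pass with named aggregation, which avoids the separate merge step. This is an alternative sketch, not the approach used in the notes; the `bookings` frame is hypothetical data with the same two column names.

```python
import pandas as pd

# Hypothetical bookings: 1 of 3 cancelled in January, 2 of 4 in July
bookings = pd.DataFrame({
    "arrival_date_month": ["January", "January", "January",
                           "July", "July", "July", "July"],
    "is_canceled": [1, 0, 0, 1, 1, 0, 0],
})

rates = (
    bookings
    .groupby("arrival_date_month", as_index=False)
    .agg(
        is_canceled=("is_canceled", "sum"),       # cancelled bookings per month
        total_bookings=("is_canceled", "count"),  # all bookings per month
    )
)
rates["cancellation_rate"] = rates["is_canceled"] / rates["total_bookings"]
print(rates)
```

Since the column only holds 0 and 1, `bookings.groupby("arrival_date_month")["is_canceled"].mean()` gives the same rates directly.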
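# Create a bar chart of the cancellation rate for every month
import plotly.express as px  # px is not among the course imports above
px.bar(merged, x='arrival_date_month', y='cancellation_rate')
# Conclusion: the month is not a very good variable for predicting cancellation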
# Build the correlation matrix
df.corr(numeric_only=True)  # numeric_only (pandas >= 1.5) skips text columns such as the month
# Heatmap of the correlation matrix: a good way to get hints about which variables to test; look at the cells with strong correlation (orange color)
px.imshow(df.corr(numeric_only=True), width=900, height=900)
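Besides reading the heatmap by eye, the strongest pairs can be listed directly from the correlation matrix. The sketch below uses a made-up `demo` frame (hypothetical columns `a`, `b`, `c` and `label`) and keeps only the upper triangle so that each pair appears once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000
a = rng.normal(size=n)

# Hypothetical data: b depends strongly on a, c is unrelated noise
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=n),
    "c": rng.normal(size=n),
    "label": ["x"] * n,   # non-numeric column, excluded below
})

# Correlation matrix of the numeric columns only
corr = demo.select_dtypes(include="number").corr()

# Boolean mask for the upper triangle, without the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per pair, sorted by absolute correlation, strongest first
pairs = (
    corr.where(mask)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```

Sorting by absolute value matters because a strong negative correlation is just as informative as a strong positive one.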