Exploratory Data Analysis in Python
Run the hidden code cell below to import the data used in this course.
# Importing the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats
import scipy.interpolate
import statsmodels.formula.api as smf
# Importing the course datasets
brfss = pd.read_hdf('datasets/brfss.hdf5', 'brfss') # Behavioral Risk Factor Surveillance System (BRFSS)
gss = pd.read_hdf('datasets/gss.hdf5', 'gss') # General Social Survey (GSS)
nsfg = pd.read_hdf('datasets/nsfg.hdf5', 'nsfg') # National Survey of Family Growth (NSFG)
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
# Count the number of missing values in each column
print(planes.isna().sum())
# Find the five percent threshold
threshold = len(planes) * 0.05
# Select the columns whose missing-value count is at or below the threshold.
# Despite its name, cols_to_drop lists the columns to check for missing values, not columns to remove.
cols_to_drop = planes.columns[planes.isna().sum() <= threshold]
# Drop the rows that have a missing value in any of those columns
planes.dropna(subset=cols_to_drop, inplace=True)
print(planes.isna().sum())
The code below computes the median price per airline, converts the result to a dictionary, and then maps those medians onto the missing values.
# Calculate median plane ticket prices by Airline
airline_prices = planes.groupby("Airline")["Price"].median()
print(airline_prices)
# Convert to a dictionary
prices_dict = airline_prices.to_dict()
# Map the dictionary to missing values of Price by Airline
planes["Price"] = planes["Price"].fillna(planes["Airline"].map(prices_dict))
# Check for missing values
print(planes.isna().sum())
# Mean Price by Destination
planes["price_destination_mean"] = planes.groupby("Destination")["Price"].transform("mean")
print(planes[["Destination","price_destination_mean"]].value_counts())
# Find the 75th and 25th percentiles
price_seventy_fifth = planes["Price"].quantile(0.75)
price_twenty_fifth = planes["Price"].quantile(0.25)
# Calculate iqr
prices_iqr = price_seventy_fifth - price_twenty_fifth
# Calculate the thresholds
upper = price_seventy_fifth + (1.5 * prices_iqr)
lower = price_twenty_fifth - (1.5 * prices_iqr)
# Subset the data
planes = planes[(planes["Price"] > lower) & (planes["Price"] < upper)]
print(planes["Price"].describe())
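The two cleaning steps in these notes (filling missing prices with a group median, then removing outliers with the 1.5 × IQR rule) can be checked on a small example. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and stand in for the course's `planes` data, which is not loaded in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the planes DataFrame
toy = pd.DataFrame({
    "Airline": ["A", "A", "A", "B", "B", "B", "B"],
    "Price": [100.0, np.nan, 120.0, 200.0, 210.0, np.nan, 5000.0],
})

# Median price per airline, converted to a dictionary as in the notes above
medians = toy.groupby("Airline")["Price"].median().to_dict()

# Fill each missing price with the median of the same airline
toy["Price"] = toy["Price"].fillna(toy["Airline"].map(medians))

# Quartiles and interquartile range of the imputed prices
q1 = toy["Price"].quantile(0.25)
q3 = toy["Price"].quantile(0.75)
iqr = q3 - q1

# Keep only the rows strictly inside the 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
filtered = toy[(toy["Price"] > lower) & (toy["Price"] < upper)]

print(toy)
print(filtered["Price"].describe())
```

In this toy frame the extreme price of 5000 falls above the upper fence and is dropped, while the imputed values stay in. Note that imputing first changes the quartiles slightly, so the order of the two steps matters.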
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Begin by calculating the number of rows and columns and displaying the names of columns for each DataFrame. Change any column names for better readability.
- Experiment and compute a correlation matrix for variables in nsfg.
- Compute the simple linear regression of WTKG3 (weight) and HTM4 (height) in brfss (or any other variables you are interested in!). Then, compute the line of best fit and plot it. If the fit doesn't look good, try a non-linear model.
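# Assign readable names by position (this assumes brfss has exactly these nine columns, in this order)
new_columns = ['Sex', 'Height', 'Weight', 'Income', '_LLCPWT', 'Age?', '_VEGESU1', '_HTMG10', 'Age']
brfss.columns = new_columns
# Switch the renamed columns back to their BRFSS codes, so that WTKG3 and HTM4 can be used as in the task description
brfss.rename(columns = {'Sex':'SEX', 'Height':'HTM4','Weight':'WTKG3', 'Income':'INCOME2', 'Age?':'_AGEG5YR', 'Age':'AGE'}, inplace = True)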
print(brfss.shape)
print(brfss.columns)
print(brfss.head())
print(type(brfss))
print(gss.shape)
print(gss.columns)
print(gss.head())
print(nsfg.shape)
print(nsfg.columns)
print(nsfg.head())
correlation = nsfg.corr()
correlation
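The third task in the list, a simple linear regression of WTKG3 on HTM4, is not attempted above. One possible sketch follows. It uses `np.polyfit` for the least-squares line, which is a swapped-in choice (the imported `scipy.stats` and `statsmodels` packages offer alternatives), and it runs on synthetic data, because the values of brfss are not shown here. The generated numbers, the `df` name, and the assumed slope are all illustrative, not results from the survey.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for brfss: only the column names HTM4 and WTKG3 come from the notebook
rng = np.random.default_rng(0)
height = rng.normal(170, 10, size=500)
weight = 0.9 * height - 75 + rng.normal(0, 5, size=500)
df = pd.DataFrame({"HTM4": height, "WTKG3": weight})

# Drop rows with a missing value in either column before fitting
subset = df.dropna(subset=["HTM4", "WTKG3"])
xs = subset["HTM4"].to_numpy()
ys = subset["WTKG3"].to_numpy()

# Degree-1 polynomial fit: WTKG3 is approximated by slope * HTM4 + intercept
slope, intercept = np.polyfit(xs, ys, deg=1)

# Two end points are enough to draw the fitted line
fx = np.array([xs.min(), xs.max()])
fy = intercept + slope * fx

print("slope:", slope)
print("intercept:", intercept)
```

To plot the result, the points could be drawn with `plt.plot(xs, ys, 'o', alpha=0.2)` and the line with `plt.plot(fx, fy, '-')`. With the real data, `df` would be replaced by `brfss`.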