# Import any packages you want to use here
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
A. GETTING TO KNOW YOUR DATA
1. Data summarization and aggregation notes
Data summarization or aggregation is the process of computing summary statistics of the data in order to become familiar with it.
# Add your code snippets here
# The .agg() method is useful whenever two or more summary statistics are needed at once.
df.groupby("column").agg(["mean", "std"])
# The snippet below uses groupby with agg to create named columns of summary statistics
continent_summary = df.groupby("column").agg(
    # Create the mean_rate_2021 column
    mean_rate_2021=("2021", "mean"),
    # Create the std_rate_2021 column
    std_rate_2021=("2021", "std"),
)
# Create a bar plot of continents and their average unemployment
sns.barplot(data=df, x="column", y="2021")
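A minimal runnable sketch of the named-aggregation pattern above, on a toy DataFrame (the continent/unemployment column names and values here are made up for illustration):

```python
import pandas as pd

# Toy data: unemployment rates by continent (values are illustrative)
df = pd.DataFrame({
    "continent": ["Africa", "Africa", "Europe", "Europe"],
    "2021": [10.0, 12.0, 6.0, 8.0],
})

# Named aggregations: each keyword argument becomes a column in the result
continent_summary = df.groupby("continent").agg(
    mean_rate_2021=("2021", "mean"),
    std_rate_2021=("2021", "std"),
)
print(continent_summary)
```

Each keyword argument is a (source column, aggregation function) pair, so the result has one row per group and one column per named statistic.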
B. DATA CLEANING & IMPUTATION
1. Why is missing data a problem?
Missing data affects the distribution of our data; a distorted distribution can lead to incorrect conclusions in the analysis.
One strategy for addressing missing data is to remove the missing observations if they make up 5% or less of the dataset.
If more than 5% of values are missing, we can impute a summary statistic (mean, median, or mode), depending on the distribution and the context of the data. Alternatively, we can impute by sub-group.
# Add your code snippets here
# Count the number of missing values in each column
df.isna().sum()
# Find the five percent threshold
threshold = len(df) * 0.05
# Identify columns whose missing-value count is at or below the threshold
cols_to_drop = df.columns[df.isna().sum() <= threshold]
# Drop missing values for columns below the threshold
df.dropna(subset=cols_to_drop, inplace=True)
# Calculate the median of the value column within each group
median = df.groupby("group_col")["value_col"].median()
# Convert to a dictionary
median_dict = median.to_dict()
# Fill missing values with the median of each row's group
df["value_col"] = df["value_col"].fillna(df["group_col"].map(median_dict))
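A runnable sketch of imputation by sub-group; the "job"/"salary" columns and their values are invented for illustration:

```python
import pandas as pd
import numpy as np

# Toy data with missing salaries, to be imputed per job group
df = pd.DataFrame({
    "job": ["A", "A", "A", "B", "B", "B"],
    "salary": [100.0, 120.0, np.nan, 50.0, 60.0, np.nan],
})

# Median salary per job, converted to a dictionary
median_dict = df.groupby("job")["salary"].median().to_dict()

# Map each row's group key to its group median, then fill the gaps with it
df["salary"] = df["salary"].fillna(df["job"].map(median_dict))
print(df)
```

Note that `map` is applied to the group-key column ("job"), not the value column: that is what looks up the right group median for each missing row.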
2. Converting & analysing categorical data
Categorical data are non-numeric. We can filter a DataFrame for categorical data by calling the "select_dtypes" method on the DataFrame and passing "object" as an argument.
We can search a column for one or more specific strings by using "pd.Series.str.contains()" and passing the keyword(s) we are looking for; it returns Boolean (True or False) values as output.
# Add your code snippets here
# Filter the DataFrame for object columns
non_numeric = df.select_dtypes("object")
# Loop through columns
for col in non_numeric.columns:
    # Print the number of unique values
    print(f"Number of unique values in {col} column: ", non_numeric[col].nunique())
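A self-contained sketch of the select_dtypes pattern above; the toy "city"/"mode"/"cost" columns are made up for illustration:

```python
import pandas as pd

# Toy DataFrame mixing numeric and object (string) columns
df = pd.DataFrame({
    "city": ["Accra", "Lagos", "Accra"],
    "mode": ["air", "road", "air"],
    "cost": [100, 50, 120],
})

# Keep only the object columns
non_numeric = df.select_dtypes("object")

# Count unique values per object column
unique_counts = {col: non_numeric[col].nunique() for col in non_numeric.columns}
print(unique_counts)
```

The numeric "cost" column is excluded, so only the string columns are profiled.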
# Create a list of categories
flight_categories = ["Short-haul", "Medium", "Long-haul"]
# Create short_flights (anchored with ^ so that e.g. "12h" is not matched by "2h")
short_flights = "^0h|^1h|^2h|^3h|^4h"
# Create medium_flights
medium_flights = "^5h|^6h|^7h|^8h|^9h"
# Create long_flights
long_flights = "10h|11h|12h|13h|14h|15h|16h"
# Create conditions for values in flight_categories to be created
conditions = [
    (planes["Duration"].str.contains(short_flights)),
    (planes["Duration"].str.contains(medium_flights)),
    (planes["Duration"].str.contains(long_flights))
]
# Apply the conditions list to the flight_categories
planes["Duration_Category"] = np.select(conditions, flight_categories,
                                        default="Extreme duration")
# Plot the counts of each category
sns.countplot(data=planes, x="Duration_Category")
plt.show()
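The np.select categorization above can be run end-to-end on a toy "planes" DataFrame; the duration strings below are invented for illustration:

```python
import pandas as pd
import numpy as np

# Toy flight durations (strings, as in the course dataset)
planes = pd.DataFrame({"Duration": ["2h 50m", "7h 25m", "12h 10m", "19h"]})

flight_categories = ["Short-haul", "Medium", "Long-haul"]
# ^ anchors prevent "12h" from matching the "2h" pattern
short_flights = "^0h|^1h|^2h|^3h|^4h"
medium_flights = "^5h|^6h|^7h|^8h|^9h"
long_flights = "10h|11h|12h|13h|14h|15h|16h"

conditions = [
    planes["Duration"].str.contains(short_flights),
    planes["Duration"].str.contains(medium_flights),
    planes["Duration"].str.contains(long_flights),
]

# The first matching condition wins; unmatched rows get the default label
planes["Duration_Category"] = np.select(conditions, flight_categories,
                                        default="Extreme duration")
print(planes)
```

np.select pairs each Boolean condition with the label at the same position, which is why conditions and flight_categories must be the same length and in the same order.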