Data Manipulation with pandas
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Print the highest weekly sales for each `department` in the `walmart` DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
- What was the total `nb_sold` of organic avocados in 2017 in the `avocado` DataFrame? If you're stuck, try reviewing this video.
- Create a bar plot of the total number of homeless people by region in the `homelessness` DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
- Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.
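As a starting point for the first exercise, here is a minimal sketch on a small made-up DataFrame instead of the course file. The column names `department` and `weekly_sales` are assumptions based on the exercise wording, and `toy_walmart` is a hypothetical stand-in for `walmart`.

```python
import pandas as pd

# Toy stand-in for the walmart data. The column names "department" and
# "weekly_sales" are assumed, not confirmed against the real CSV.
toy_walmart = pd.DataFrame({
    "department": [1, 1, 2, 2, 3, 3, 4, 5, 6],
    "weekly_sales": [100.0, 250.0, 80.0, 90.0, 400.0, 50.0, 10.0, 300.0, 20.0],
})

# Highest weekly sales per department, largest first, top five only
top_five = (
    toy_walmart.groupby("department")["weekly_sales"]
    .max()
    .sort_values(ascending=False)
    .head(5)
)
print(top_five)
```

If the real columns match, swapping `toy_walmart` for `walmart` should give the answer to the exercise.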
Inspecting a DataFrame
# Print the head of the homelessness data
print(homelessness.head())
# Print information about homelessness
# (info() prints its summary itself and returns None, so no print() is needed)
homelessness.info()
# Print the shape of homelessness
print(homelessness.shape)
# Print a description of homelessness
print(homelessness.describe())
# Print the values of homelessness
print(homelessness.values)
# Print the column index of homelessness
print(homelessness.columns)
# Print the row index of homelessness
print(homelessness.index)
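Two details about the inspection calls above are easy to miss, so here is a small sketch on a made-up DataFrame (`toy` is hypothetical, not course data). It shows that `info()` returns `None`, and that `to_numpy()`, which the pandas documentation recommends over `.values`, gives a NumPy array with the same shape as the DataFrame.

```python
import pandas as pd

# Hypothetical two-row frame used only for illustration
toy = pd.DataFrame({"state": ["A", "B"], "individuals": [10, 20]})

# info() writes its summary to stdout and returns None
result = toy.info()
print(result is None)

# to_numpy() is the documented alternative to the .values attribute
arr = toy.to_numpy()
print(arr.shape == toy.shape)
```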
Sorting rows
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values("individuals")
# Print the top few rows
print(homelessness_ind.head())
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values("family_members", ascending=False)
# Print the top few rows
print(homelessness_fam.head())
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])
# Print the top few rows
print(homelessness_reg_fam.head())
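The sorting calls above each assign to a new variable because `sort_values` returns a sorted copy by default and leaves the original row order alone. A minimal sketch on a made-up DataFrame (`toy` is hypothetical) makes that visible through the index labels:

```python
import pandas as pd

# Hypothetical data with the same column names as the homelessness examples
toy = pd.DataFrame({
    "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
    "family_members": [500, 300, 900, 700],
})

# Region ascending, then family_members descending within each region
sorted_toy = toy.sort_values(["region", "family_members"], ascending=[True, False])

print(sorted_toy)
print(toy.index.tolist())         # original order is unchanged
print(sorted_toy.index.tolist())  # sorted copy keeps the original labels
```

The sorted copy keeps the original index labels; chaining `.reset_index(drop=True)` renumbers them if that is preferred.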
Subsetting columns
# Select the individuals column
individuals = homelessness["individuals"]
# Print the head of the result
print(individuals.head())
# Select the state and family_members columns
state_fam = homelessness[["state", "family_members"]]
# Print the head of the result
print(state_fam.head())
# Select only the individuals and state columns, in that order
ind_state = homelessness[["individuals", "state"]]
# Print the head of the result
print(ind_state.head())
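The difference between single and double brackets in the selections above affects the type of the result: a single column name gives a Series, while a list of names gives a DataFrame, even when the list has one element. A short sketch on a made-up DataFrame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical frame with the same column names as the examples above
toy = pd.DataFrame({
    "state": ["A", "B"],
    "individuals": [10, 20],
    "family_members": [1, 2],
})

one_col_series = toy["individuals"]   # single brackets: Series
one_col_frame = toy[["individuals"]]  # list of names: DataFrame

print(type(one_col_series).__name__)
print(type(one_col_frame).__name__)
```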
Subsetting rows
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness["individuals"] > 10000]
# See the result
print(ind_gt_10k)
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness["region"] == "Mountain"]
# See the result
print(mountain_reg)
# Filter for rows where family_members is less than 1000
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]
# See the result
print(fam_lt_1k_pac)
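In the combined filter above, each condition is wrapped in parentheses because `&` binds more tightly than `<` and `==` in Python. As an alternative, `DataFrame.query` expresses the same filter as a string. This is a sketch on a made-up DataFrame (`toy` is hypothetical), not the approach the course itself uses:

```python
import pandas as pd

# Hypothetical data with the same column names as the homelessness examples
toy = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain"],
    "family_members": [500, 1500, 200],
})

# Boolean-mask version: each comparison needs its own parentheses
mask = (toy["family_members"] < 1000) & (toy["region"] == "Pacific")
by_mask = toy[mask]

# Equivalent filter written with DataFrame.query
by_query = toy.query("family_members < 1000 and region == 'Pacific'")

print(by_mask.equals(by_query))
```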
Subsetting rows by categorical variables
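One common way to filter on a categorical column is `Series.isin()`, which keeps rows whose value appears in a given list. A minimal sketch, assuming a small made-up DataFrame (`toy` and `wanted_regions` are hypothetical names, not course data):

```python
import pandas as pd

# Hypothetical data; the column names mirror the homelessness examples
toy = pd.DataFrame({
    "state": ["California", "Arizona", "Texas", "Nevada"],
    "region": ["Pacific", "Mountain", "West South Central", "Mountain"],
})

# Keep rows whose region is any of the listed categories
wanted_regions = ["Pacific", "Mountain"]
subset = toy[toy["region"].isin(wanted_regions)]
print(subset)
```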