Skip to content
Data Manipulation with pandas
Data Manipulation with pandas
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
print(avocado.head())
print('----------------------------------------------------------------------------------------------------------------------')
print(homelessness.head())
print('----------------------------------------------------------------------------------------------------------------------')
print(temperatures.head())
print('----------------------------------------------------------------------------------------------------------------------')
print(walmart.head())
print('----------------------------------------------------------------------------------------------------------------------')Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
Summarizing numerical data
dogs["height_cm"].mean()
Other methods
- .median()
- .min()
- .max()
- .var()
- .std()
- .sum()
- .quantile()
The .agg() method
def pct30(column): return column.quantile(0.3) dogs["weight_kg"].agg(pct30)
Summaries on multiple columns
dogs[["weight_kg", "height_cm"]].agg(pct30)
Multiple summaries
def pct40(column): return column.quantile(0.4) dogs["weight_kg"].agg([pct30, pct40])
Cumulative sum
dogs["weight_kg"].cumsum()
Cumulative statistics
- .cummax()
- .cummin()
- .cumprod()
Dropping duplicate names
vet_visits.drop_duplicates(subset="name")
Dropping duplicate pairs
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"]) print(unique_dogs)
Easy as 1, 2, 3
unique_dogs["breed"].value_counts() # Or unique_dogs["breed"].value_counts(sort=True) # print the list from the greatest to the smallest
Proportions
unique_dogs["breed"].value_counts(normalize=True) # Percentage
Grouped by
dogs.groupby("color")["weight_kg"].mean()
Multiple grouped summaries
dogs.groupby("color")["weight_kg"].agg([min, max, sum])
Grouping by multiple variables
dogs.groupby(["color", "breed"])["weight_kg"].mean()
Many groups, many summaries
dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()
Group by to pivot table
dogs.pivot_table(values="weight_kg", index="color")
Different statistics
import numpy as np dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)
Multiple statistics
dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])
Pivot on two variables
dogs.groupby(["color", "breed"])["weight_kg"].mean() dogs.pivot_table(values="weight_kg", index="color", columns="breed")
Filling missing values in pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)
Summing with pivot tables
dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
#----------------------------------------------------------------------------------------------------------------------------------
# Methods for numerical datas
print("The mean and the min :")
print("The mean :", avocado["avg_price"].mean())
print("The min :", temperatures['date'].min())
print('----------------------------------------------------------------------------------------------------------------------')
#----------------------------------------------------------------------------------------------------------------------------------
# The .agg() method
def pct30(column):
return column.quantile(0.3)
print("The quantile :")
print(avocado["avg_price"].agg(pct30))
print('----------------------------------------------------------------------------------------------------------------------')
# Add your code snippets hereExplicit indexes
Setting a column as the index
dogs_ind = dogs.set_index("name") print(dogs_ind)
Removing an index
dogs_ind.reset_index()
Dropping an index
dogs_ind.reset_index(drop=True)
Indexes make subsetting simpler
dogs[dogs["name"].isin(["Bella", "Stella"])] dogs_ind.loc[["Bella", "Stella"]]
Index values don't need to be unique
dogs_ind2 = dogs.set_index("breed") print(dogs_ind2)
Subsetting on duplicated index values
dogs_ind2.loc["Labrador"]
Multi-level indexes a.k.a. hierarchical indexes
dogs_ind3 = dogs.set_index(["breed", "color"]) print(dogs_ind3)
Subset the outer level with a list
dogs_ind3.loc[["Labrador", "Chihuahua"]]
Subset inner levels with a list of tuples
dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]
Sorting by index values
dogs_ind3.sort_index()
Controlling sort_index
dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
walmart_ind = walmart.set_index("type")
print(walmart_ind.head())Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Print the highest weekly sales for each
departmentin thewalmartDataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video. - What was the total
nb_soldof organic avocados in 2017 in theavocadoDataFrame? If you're stuck, try reviewing this video. - Create a bar plot of the total number of homeless people by region in the
homelessnessDataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video. - Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.