Skip to content

Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

print(avocado.head())
print('----------------------------------------------------------------------------------------------------------------------')
print(homelessness.head())
print('----------------------------------------------------------------------------------------------------------------------')
print(temperatures.head())
print('----------------------------------------------------------------------------------------------------------------------')
print(walmart.head())
print('----------------------------------------------------------------------------------------------------------------------')

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Summarizing numerical data

dogs["height_cm"].mean()

Other methods

  • .median()
  • .min()
  • .max()
  • .var()
  • .std()
  • .sum()
  • .quantile()

The .agg() method

def pct30(column): return column.quantile(0.3) dogs["weight_kg"].agg(pct30)

Summaries on multiple columns

dogs[["weight_kg", "height_cm"]].agg(pct30)

Multiple summaries

def pct40(column): return column.quantile(0.4) dogs["weight_kg"].agg([pct30, pct40])

Cumulative sum

dogs["weight_kg"].cumsum()

Cumulative statistics

  • .cummax()
  • .cummin()
  • .cumprod()

Dropping duplicate names

vet_visits.drop_duplicates(subset="name")

Dropping duplicate pairs

unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"]) print(unique_dogs)

Easy as 1, 2, 3

unique_dogs["breed"].value_counts() # Or unique_dogs["breed"].value_counts(sort=True) # print the list from the greatest to the smallest

Proportions

unique_dogs["breed"].value_counts(normalize=True) # Percentage

Grouped by

dogs.groupby("color")["weight_kg"].mean()

Multiple grouped summaries

dogs.groupby("color")["weight_kg"].agg([min, max, sum])

Grouping by multiple variables

dogs.groupby(["color", "breed"])["weight_kg"].mean()

Many groups, many summaries

dogs.groupby(["color", "breed"])[["weight_kg", "height_cm"]].mean()

Group by to pivot table

dogs.pivot_table(values="weight_kg", index="color")

Different statistics

import numpy as np dogs.pivot_table(values="weight_kg", index="color", aggfunc=np.median)

Multiple statistics

dogs.pivot_table(values="weight_kg", index="color", aggfunc=[np.mean, np.median])

Pivot on two variables

dogs.groupby(["color", "breed"])["weight_kg"].mean() dogs.pivot_table(values="weight_kg", index="color", columns="breed")

Filling missing values in pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0)

Summing with pivot tables

dogs.pivot_table(values="weight_kg", index="color", columns="breed", fill_value=0, margins=True)
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

#----------------------------------------------------------------------------------------------------------------------------------
# Methods for numerical datas
print("The mean and the min :")
print("The mean :", avocado["avg_price"].mean())
print("The min :", temperatures['date'].min())
print('----------------------------------------------------------------------------------------------------------------------')
#----------------------------------------------------------------------------------------------------------------------------------
# The .agg() method
def pct30(column):
    return column.quantile(0.3)

print("The quantile :")
print(avocado["avg_price"].agg(pct30))

print('----------------------------------------------------------------------------------------------------------------------')
# Add your code snippets here

Explicit indexes

Setting a column as the index

dogs_ind = dogs.set_index("name") print(dogs_ind)

Removing an index

dogs_ind.reset_index()

Dropping an index

dogs_ind.reset_index(drop=True)

Indexes make subsetting simpler

dogs[dogs["name"].isin(["Bella", "Stella"])] dogs_ind.loc[["Bella", "Stella"]]

Index values don't need to be unique

dogs_ind2 = dogs.set_index("breed") print(dogs_ind2)

Subsetting on duplicated index values

dogs_ind2.loc["Labrador"]

Multi-level indexes a.k.a. hierarchical indexes

dogs_ind3 = dogs.set_index(["breed", "color"]) print(dogs_ind3)

Subset the outer level with a list

dogs_ind3.loc[["Labrador", "Chihuahua"]]

Subset inner levels with a list of tuples

dogs_ind3.loc[[("Labrador", "Brown"), ("Chihuahua", "Tan")]]

Sorting by index values

dogs_ind3.sort_index()

Controlling sort_index

dogs_ind3.sort_index(level=["color", "breed"], ascending=[True, False])
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

walmart_ind = walmart.set_index("type")
print(walmart_ind.head())

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.