Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

**Summary Statistics**
1. mean, e.g. dogs["height_cm"].mean()
2. median, mode, min, max, var, standard deviation (std), and quantile
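For reference, a minimal sketch of the other summary statistics, assuming the illustrative dogs DataFrame used in these notes (it is not loaded in this notebook):

dogs["height_cm"].median()        # middle value
dogs["height_cm"].mode()          # most common value(s)
dogs["height_cm"].min()
dogs["height_cm"].max()
dogs["height_cm"].var()           # variance
dogs["height_cm"].std()           # standard deviation
dogs["height_cm"].quantile(0.25)  # first quartile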

**The .agg() Method and Cumulative Statistics**
- Cumulative sum: .cumsum()
- Other cumulative statistics: .cummax(), .cummin(), .cumprod()

**Counting**
- Avoid double counting by dropping duplicate rows first (.drop_duplicates()).
- The normalize argument can be used to turn the counts into proportions of the total: .value_counts(normalize=True)
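Only .cumsum() appears in the code cell below, so here is a short sketch of the other cumulative statistics, again assuming the illustrative dogs DataFrame:

dogs["weight_kg"].cummax()   # running maximum
dogs["weight_kg"].cummin()   # running minimum
dogs["weight_kg"].cumprod()  # running product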

Grouped Summary Statistics
We can group by one or more variables, select one or more columns, and calculate any summary statistic.

Pivot Tables
The "values" argument is the column that you want to summarize, and the "index" argument is the column that you want to group by. Example: dogs.pivot_table(values='weight_kg', index='color'). By default, pivot_table takes the mean value for each group.

We can use different summary statistics by passing different functions to the "aggfunc" argument.

We can also create a pivot table grouped by two variables, by passing the 'columns' argument.

Instead of having missing data we can adjust our pivot table to fill in the missing values with 0 by passing the argument "fill_value=0".

If we set the margins argument to True, the last row and last column of the pivot table contain the mean of all the values in the column or row, not including the missing values that were filled in with 0s.
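The fill_value and margins arguments are not shown in the code cell below, so here is a sketch using the same illustrative dogs pivot table:

# Fill missing group combinations with 0 and add overall row/column statistics
dogs.pivot_table(values='weight_kg', index='color', columns='breed',
                 fill_value=0, margins=True)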

Explicit Indexes
- .columns contains an index object of column names
- .index contains an index object of row labels (row numbers by default)
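A minimal sketch of inspecting both attributes, assuming the illustrative dogs DataFrame:

print(dogs.columns)  # Index of column names
print(dogs.index)    # RangeIndex of row numbers by default; row labels after set_index()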

Slicing and subsetting with .loc and .iloc

Example: breeds[:3] returns items from the beginning up to and including index 2 (the slice end is exclusive).
Example: breeds[:] returns the entire list.

Sort the index before you slice

Example: dogs_srt = dogs.set_index(["breed","color"]).sort_index()
To slice rows at the outer level of an index, call .loc. Example: dogs_srt.loc["Chow Chow":"Poodle"]
However, to slice at inner levels, pass the first and last positions as tuples. Example: dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]

Slicing columns

example: dogs_srt.loc[:,"name":"height_cm"]

Slicing by dates

See the code cell below.

Subsetting by row/column number using .iloc

print(dogs.iloc[2:5, 1:4])

Working with pivot tables

You can create a pivot table by calling .pivot_table(). The first argument is the column name containing the values to aggregate. The index argument lists the columns to group by and display in rows, and the columns argument lists the columns to group by and display in columns. We'll use the default aggregation function, which is mean.

dogs["height_cm"].mean()
#-------------------------------------------------------

#Single Summary
def pct30(column):
    return column.quantile(0.3)

dogs["weight_kg"].agg(pct30)
    
#Multiple Summaries

def pct40(column):
    return column.quantile(0.4)

dogs["weight_kg"].agg([pct30, pct40])
#-------------------------------------------------------
# Cumulative Sum
dogs["weight_kg"]
dogs["weight_kg"].cumsum()

#-------------------------------------------------------
#How to sort rows in a DataFrame (pass the column(s) to sort by)
df.sort_values("col_name")

#-----------------------------------------------------------
#How to avoid double counting
df.drop_duplicates(subset='col_name') 

#Dropping duplicate pairs
df.drop_duplicates(subset=['col_name','col_name1'])

#How to use the value_counts method
df["col_name"].value_counts() 

#We can also add the sort argument
df["col_name"].value_counts(sort=True)

#How to use the normalize argument
df["col_name"].value_counts(normalize=True)

#-------------------------------------------------------
#How to compute a grouped summary
df.groupby('column')['col_name'].mean()   # or any other summary statistic

#Multiple grouped summaries
df.groupby('column')['col_name'].agg([min, max, sum])

#Group by multiple columns
df.groupby(['column1','column2'])['col_name'].mean()

#Group by multiple columns and aggregate by multiple columns
df.groupby(['column1','column2'])[['col_name1','col_name2']].mean()

#------------------------------------------------------------------
#Pivot Table 
dogs.pivot_table(values='weight_kg',index='color')

#Pivot table with summary statistic
dogs.pivot_table(values='weight_kg',index='color', aggfunc=np.median)

#Pivot table with multiple summary statistics
dogs.pivot_table(values='weight_kg',index='color', aggfunc=[np.mean, np.median, np.min])

#Pivot on two variables
dogs.pivot_table(values='weight_kg',index='color', columns = 'breed')
#--------------------------------------------------------------------
#Setting a column as an index
dogs_ind = dogs.set_index('name')

#Multi-level indexes a.k.a hierarchical indexes
dogs_ind = dogs.set_index(['name', 'color'])

#Removing an index (moves the index back into a regular column)
dogs_ind.reset_index()

#Dropping an index entirely
dogs_ind.reset_index(drop=True)

#Subsetting with indexes
dogs_ind.loc[["Bella", "Stella"]]

#Subset inner levels with a list of tuples (here dogs_ind3 has a multi-level ["breed", "color"] index)
dogs_ind3.loc[[("Labrador","Brown"),("Chihuahua", "Tan")]]

#Sorting by index values
dogs_ind3.sort_index()

#Sorting the index before slicing
dogs_srt = dogs.set_index(["breed","color"]).sort_index()

#Slicing the outer index level
dogs_srt.loc["Chow Chow": "Poodle"]

#Slicing inner index level
dogs_srt.loc[("Labrador", "Brown"):("Schnauzer", "Grey")]

#Slicing columns
dogs_srt.loc[:,"name":"height_cm"]

#Slicing DataFrames by a range of dates (dates must be strings; partial dates also work)
dogs.loc["2014-05-12":"2015-04-08"]
dogs.loc["2014":"2015"]

#Slicing by row/column number
print(dogs.iloc[2:5, 1:4])

# ----------------------------------------------------------------------------------
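# temp_by_country_city_vs_year is not defined earlier in this notebook; one
# plausible construction (the column names date, country, city, and avg_temp_c
# are assumptions about temperatures.csv) is a pivot table of city vs. year:
temperatures["date"] = pd.to_datetime(temperatures["date"])
temperatures["year"] = temperatures["date"].dt.year
temp_by_country_city_vs_year = temperatures.pivot_table(
    values="avg_temp_c", index=["country", "city"], columns="year")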
# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean()

# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")

# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order (a starter sketch appears after this list). If you're stuck, try reviewing this video.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.
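For the first task, a minimal starter sketch (the column names department and weekly_sales are assumptions about walmart.csv and may need adjusting):

# Highest weekly sales per department, top five in descending order (assumed column names)
top_depts = (walmart.groupby("department")["weekly_sales"]
                    .max()
                    .sort_values(ascending=False)
                    .head(5))
print(top_depts)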