Skip to content

Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.


1 hidden cell

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Sorting rows

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

Sort on ... Syntax

one column df.sort_values("breed")

multiple columns df.sort_values(["breed", "weight_kg"])

By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where...?".

# Define the homelessness dataframe
homelessness = pd.read_csv("homelessness.csv")

# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values("individuals")

# Print the top few rows
print(homelessness_ind.head())

Subsetting columns

When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can be used to select only the columns that matter to you in an order that makes sense to you. To select only "col_a" of the DataFrame df, use

df["col_a"]

To select "col_a" and "col_b" of df, use

df[["col_a", col_b"]]

import pandas as pd

# Define the homelessness dataframe
homelessness = pd.read_csv("homelessness.csv")

# Select the individuals column
individuals = homelessness["individuals"]

# Print the head of the results
print(individuals.head())

Subsetting rows

A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame, perhaps the most common is to use relational operators to return Ture or False for each row, then pass that inside square brackets.

dogs[dogs["height_cm"] > 60]

dogs[dogs["color"] = = "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, &

dogs[(dogs["height_cm"] > 60) & (dogs["color"] = = "tan")]

Subsetting rows by categorical variables

subsetting data based on a categorical variable often involves using the 'or' operator (I) to select rows from multiple categories. This can get tedious when you want all states in 1 of 3 different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of 3 seperate ones.

colors = ["brown", "black", "tan"]

condition = dogs["color"].isin(colors)

dogs[condition]

Adding new columns

You aren't suck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as tranforming, mutating, and feature engineering

You can create new columns from scratch, but it is also common to derive from other columns, for example, by adding columns together or by changing their units.

# Add total col as sum of individuals and family_members
homelessness["total"] = homeless["individuals"] + homelessness["family_memebers"]

# Add p_indidviduals col as proportionof total that are individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]

# See the result
print(homelessness)

Combo-attack!

You've seen the 4 most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding columns. In a real-life data analysis, you can mix and match these 4 manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" Combine your new pandas skills to find out.

# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending = False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# See the result
print(result)

Mean and median

Summary statistics are exactly what they sound like - they summarize many numbers in one statistic. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics allows you to get a better sense of your data, even if there's a lot of it.

# Print the head of the sales DatFrame
print(sales.head())

# Print the info about the sales DataFrame
print(sales.info())

# Print the mean of weekly_sales
print(sales["weekly_sales"].mean())

# Print the median of weekly_sales
print(sales["weekly_sales"].median())