Skip to content

Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocados = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Add your notes here

# Print the head of the homelessness data
print(homelessness.head())

# Print information about homelessness
print(homelessness.info())

# Print the shape of homelessness
print(homelessness.shape)

# Print a description of homelessness
print(homelessness.describe())

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.
# DataFrames have three components: values, a column index, and a row index

# Import pandas using the alias pd
import pandas as pd

# Print the values of homelessness
print(homelessness.values)
print("Burasi bosluk")

# Print the column index of homelessness
print(homelessness.columns)
print("Burasi bosluk")

# Print the row index of homelessness
print(homelessness.index)
# Sorting rows
# Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().

#  In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

# Sort on …	Syntax
# one column	df.sort_values("breed")
# multiple columns	df.sort_values(["breed", "weight_kg"])

# By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".

# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values("individuals")

# Print the top few rows
print(homelessness_ind.head())
# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"]<1000) & (homelessness["region"] == "Pacific")]

# See the result
print(fam_lt_1k_pac)
# Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region","family_members"], ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())

Subsetting rows

A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame, perhaps the most common is to use relational operators to return True or False for each row, then pass that inside square brackets.

dogs[dogs["height_cm"] > 60]

dogs[dogs["color"] == "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, &.

dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

#Subsetting rows by categorical variables
#Subsetting data based on a categorical variable often involves using the "or" operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.


# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# See the result
print(mojave_homelessness)
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"] 

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# See the result
print(homelessness.head())
print(result)
# DATA STATISTICS

# Print the head of the sales DataFrame
print(walmart.head())

# Print the info about the sales DataFrame
print(walmart.info())

# Print the mean of weekly_sales
print(walmart["weekly_sales"].mean())

# Print the median of weekly_sales
print(walmart["weekly_sales"].median())
# Efficient summaries
# While pandas and NumPy have tons of functions, sometimes, you may need a different function to summarize your data.

# The .agg() method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

# df['column'].agg(function)

#In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.



# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the temperature_c column
print(walmart["temperature_c"].agg(iqr))


# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(walmart[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg(iqr))


#Update the aggregation functions called by .agg(): include iqr and np.median in that order.
# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(walmart[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))