
Data Manipulation with pandas

- .head() returns the first few rows (the “head” of the DataFrame).
- .info() shows information on each of the columns, such as the data type and number of missing values.
- .shape returns the number of rows and columns of the DataFrame.
- .describe() calculates a few summary statistics for each column.
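A quick sketch of all four, using a small made-up dogs DataFrame (the column names and values are illustrative):

```python
import pandas as pd

# Hypothetical example data
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "height_cm": [56, 43, 46, 49],
})

print(dogs.head(2))     # first 2 rows
dogs.info()             # dtypes and non-null counts per column
print(dogs.shape)       # (rows, columns) tuple
print(dogs.describe())  # summary statistics for numeric columns
```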

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

- .values: A two-dimensional NumPy array of values.
- .columns: An index of columns: the column names.
- .index: An index for the rows: either row numbers or row names.
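Inspecting the three attributes on a minimal, made-up DataFrame:

```python
import pandas as pd

dogs = pd.DataFrame({"name": ["Bella", "Charlie"], "height_cm": [56, 43]})

print(dogs.values)   # 2D NumPy array of the underlying data
print(dogs.columns)  # Index of column names
print(dogs.index)    # RangeIndex of row labels (0, 1, ...)
```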

#Sorting rows

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

Sort on          | Syntax
one column       | df.sort_values("breed")
multiple columns | df.sort_values(["breed", "weight_kg"])

By combining .sort_values() with .head(), you can answer questions of the form, "What are the top cases where…?".
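A runnable sketch on invented data, using the ascending argument to break ties in descending weight order:

```python
import pandas as pd

dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Labrador", "Beagle"],
    "weight_kg": [29, 24, 32, 10],
})

# Sort by breed A-Z, then break ties with the heaviest dog first
dogs_sorted = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])

# Combining with .head() answers "top cases where..." questions
print(dogs_sorted.head(2))
```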

#Subsetting columns

When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can be used to select only the columns that matter to you, in an order that makes sense to you. To select only "col_a" of the DataFrame df, use

df["col_a"]

To select "col_a" and "col_b" of df, use

df[["col_a", "col_b"]]

#Subsetting rows

This is sometimes known as filtering rows or selecting rows. There are many ways to subset a DataFrame; perhaps the most common is to use relational operators to return True or False for each row, then pass that Boolean Series inside square brackets.

dogs[dogs["height_cm"] > 60]
dogs[dogs["color"] == "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, &, with each condition wrapped in parentheses.

dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

#Subsetting rows by categorical variables

Subsetting data based on a categorical variable often involves using the or operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]

#Adding new columns

# Add total col as sum of individuals and family_members

homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]

# Add p_homeless col as the proportion of the state's population that is homeless

homelessness["p_homeless"] = homelessness["total"] / homelessness["state_pop"]
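The two assignments above can be run end-to-end on a tiny, made-up version of the homelessness DataFrame (the numbers here are invented):

```python
import pandas as pd

# Hypothetical mini version of the homelessness dataset
homelessness = pd.DataFrame({
    "state": ["Alaska", "Wyoming"],
    "individuals": [1434.0, 434.0],
    "family_members": [582.0, 205.0],
    "state_pop": [735139, 577601],
})

# New columns are created by assigning to a column name that doesn't exist yet
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]
homelessness["p_homeless"] = homelessness["total"] / homelessness["state_pop"]
print(homelessness)
```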



#Aggregating DataFrames

# Sort sales_1_1 by date

sales_1_1 = sales_1_1.sort_values("date")

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col

sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col

sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()

# See the columns you calculated

print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])
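The cumulative-statistics steps above can be reproduced on a tiny, invented sales_1_1 DataFrame:

```python
import pandas as pd

# Made-up weekly sales figures, deliberately out of date order
sales_1_1 = pd.DataFrame({
    "date": ["2010-02-05", "2010-03-05", "2010-02-19"],
    "weekly_sales": [24924.5, 21827.9, 26229.2],
})

# Sort first so the cumulative columns accumulate in date order
sales_1_1 = sales_1_1.sort_values("date")
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])
```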


# Drop duplicate store/type combinations

store_types = sales.drop_duplicates(subset=["store", "type"])
print(store_types.head())

# Drop duplicate store/department combinations

store_depts = sales.drop_duplicates(subset=["store", "department"])
print(store_depts.head())

# Subset the rows where is_holiday is True and drop duplicate dates

holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")

# Print date col of holiday_dates

print(holiday_dates["date"])
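A self-contained sketch of the three de-duplication steps, on a made-up sales DataFrame (note that subset accepts either a single column name or a list):

```python
import pandas as pd

sales = pd.DataFrame({
    "store": [1, 1, 1, 2],
    "type": ["A", "A", "A", "B"],
    "department": [1, 2, 1, 1],
    "date": ["2010-02-05", "2010-02-05", "2010-09-10", "2010-09-10"],
    "is_holiday": [False, False, True, True],
})

# Keep the first row for each unique combination of the subset columns
store_types = sales.drop_duplicates(subset=["store", "type"])
store_depts = sales.drop_duplicates(subset=["store", "department"])

# Filter first, then de-duplicate on a single column
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")
print(holiday_dates["date"])
```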


# Count the number of stores of each type

store_counts = store_types["type"].value_counts()
print(store_counts)

# Get the proportion of stores of each type

store_props = store_types["type"].value_counts(normalize=True)
print(store_props)

# Count the number of stores for each department and sort

dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)

# Get the proportion of stores in each department and sort

dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)
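Counts versus proportions, sketched on an invented store_types DataFrame:

```python
import pandas as pd

store_types = pd.DataFrame({"store": [1, 2, 3, 4], "type": ["A", "A", "A", "B"]})

# Counts of each category, most frequent first
store_counts = store_types["type"].value_counts()

# normalize=True converts counts to proportions that sum to 1
store_props = store_types["type"].value_counts(normalize=True)

print(store_counts)
print(store_props)
```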


# Import numpy with the alias np

import numpy as np

# For each store type, aggregate weekly_sales: get min, max, mean, and median

sales_stats = sales.groupby("type")["weekly_sales"].agg(["min", "max", "mean", "median"])

# Print sales_stats

print(sales_stats)

# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median

unemp_fuel_stats = sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg(["min", "max", "mean", "median"])

# Print unemp_fuel_stats

print(unemp_fuel_stats)
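The same aggregations, sketched on a tiny invented sales DataFrame. Recent pandas versions emit a FutureWarning when NumPy functions such as np.min are passed to .agg(), so the string names "min", "max", "mean", and "median" are the safer spelling; selecting multiple columns also requires a list inside the second pair of brackets:

```python
import pandas as pd

sales = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "weekly_sales": [10.0, 30.0, 20.0, 40.0],
    "unemployment": [8.1, 7.9, 9.0, 8.8],
})

# Aggregate one column: result has one row per group, one column per statistic
sales_stats = sales.groupby("type")["weekly_sales"].agg(["min", "max", "mean", "median"])

# Aggregate several columns (list of column names): result has MultiIndex columns
unemp_stats = sales.groupby("type")[["unemployment"]].agg(["min", "max", "mean", "median"])

print(sales_stats)
print(unemp_stats)
```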

#Slicing and indexing DataFrames

# Subset rows from India, Hyderabad to Iraq, Baghdad

print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")])

# Subset columns from date to avg_temp_c

print(temperatures_srt.loc[:, "date":"avg_temp_c"])

# Subset in both directions at once

print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"])
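A runnable sketch with a made-up temperatures DataFrame shows the setup the slices above depend on: the MultiIndex must be set and sorted first, since .loc raises UnsortedIndexError when you range-slice an unsorted MultiIndex:

```python
import pandas as pd

temperatures = pd.DataFrame({
    "country": ["India", "India", "Iraq", "Egypt"],
    "city": ["Delhi", "Hyderabad", "Baghdad", "Cairo"],
    "avg_temp_c": [25.8, 27.6, 22.9, 21.5],
})

# Set a (country, city) MultiIndex and sort it so .loc slicing works
temperatures_srt = temperatures.set_index(["country", "city"]).sort_index()

# Tuple endpoints; unlike positional slicing, .loc includes both endpoints
subset = temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")]
print(subset)
```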


# Use Boolean conditions to subset temperatures for rows in 2010 and 2011

temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") & (temperatures["date"] <= "2011-12-31")]
print(temperatures_bool)

# Set date as the index and sort the index

temperatures_ind = temperatures.set_index("date").sort_index()

# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011

print(temperatures_ind.loc["2010":"2011"])

# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011

print(temperatures_ind.loc["2010-08-01":"2011-02-01"])


# Get 23rd row, 2nd column (index 22, 1)

print(temperatures.iloc[22, 1])

# Use slicing to get the first 5 rows

print(temperatures.iloc[:5])

# Use slicing to get columns 3 to 4

print(temperatures.iloc[:, 2:4])

# Use slicing in both directions at once

print(temperatures.iloc[:5, 2:4])
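Because .iloc is purely positional, slice stop positions are excluded (unlike .loc, which includes both endpoints). A minimal sketch with invented data:

```python
import pandas as pd

temperatures = pd.DataFrame({
    "city": ["Abidjan", "Abu Dhabi", "Ahmedabad"],
    "country": ["Cote d'Ivoire", "UAE", "India"],
    "avg_temp_c": [27.3, 27.5, 32.4],
})

print(temperatures.iloc[0, 1])     # single cell: row 0, column 1
print(temperatures.iloc[:2])       # first 2 rows (stop index 2 excluded)
print(temperatures.iloc[:, 1:3])   # columns at positions 1 and 2
print(temperatures.iloc[:2, 1:3])  # both directions at once
```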


# Add a year column to temperatures

temperatures["year"] = temperatures["date"].dt.year

# Pivot avg_temp_c by country and city vs year

temp_by_country_city_vs_year = temperatures.pivot_table("avg_temp_c", index=["country", "city"], columns="year")

# See the result

print(temp_by_country_city_vs_year)
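The pivot step can be sketched end-to-end on a few invented rows; pivot_table averages the values (its default aggregation) for each (country, city) row and year column:

```python
import pandas as pd

temperatures = pd.DataFrame({
    "country": ["Egypt", "Egypt", "India", "India"],
    "city": ["Cairo", "Cairo", "Delhi", "Delhi"],
    "year": [2005, 2006, 2005, 2006],
    "avg_temp_c": [21.0, 22.0, 25.0, 26.0],
})

# Rows: (country, city) MultiIndex; columns: one per year; cells: mean avg_temp_c
temp_by_country_city_vs_year = temperatures.pivot_table(
    "avg_temp_c", index=["country", "city"], columns="year"
)
print(temp_by_country_city_vs_year)
```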


# Subset for Egypt to India

temp_by_country_city_vs_year.loc["Egypt":"India"]

# Subset for Egypt, Cairo to India, Delhi

temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India", "Delhi")]

# Subset for Egypt, Cairo to India, Delhi, and 2005 to 2010

temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India", "Delhi"), 2005:2010]


# Get the worldwide mean temp by year

mean_temp_by_year = temp_by_country_city_vs_year.mean(axis=0)

# Filter for the year that had the highest mean temp

print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])

# Get the mean temp by city

mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")

# Filter for the city that had the lowest mean temp

print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])
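Putting the axis arguments together on a tiny hand-built pivot table (invented values): axis=0 averages down the rows to give one value per year, while axis="columns" averages across the year columns to give one value per city.

```python
import pandas as pd

temp_by_country_city_vs_year = pd.DataFrame(
    {2005: [21.0, 25.0], 2006: [22.0, 26.0]},
    index=pd.MultiIndex.from_tuples(
        [("Egypt", "Cairo"), ("India", "Delhi")], names=["country", "city"]
    ),
)

mean_temp_by_year = temp_by_country_city_vs_year.mean(axis=0)          # per year
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")  # per city

# Comparing a Series to its own max/min keeps only the extreme entries
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])
```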