Data Manipulation with pandas
- .head() returns the first few rows (the "head" of the DataFrame).
- .info() shows information on each of the columns, such as the data type and number of missing values.
- .shape returns the number of rows and columns of the DataFrame.
- .describe() calculates a few summary statistics for each column.
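As a minimal sketch (assuming a dogs DataFrame like the one used in the subsetting examples below), these methods might be called like this:

# Quick inspection of an assumed dogs DataFrame
print(dogs.head())      # first 5 rows by default
dogs.info()             # column dtypes and non-null counts (prints directly)
print(dogs.shape)       # (n_rows, n_columns) tuple
print(dogs.describe())  # count, mean, std, min, quartiles, max for numeric columns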
To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:
- .values: A two-dimensional NumPy array of values.
- .columns: An index of columns: the column names.
- .index: An index for the rows: either row numbers or row names.
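The three components can be inspected directly (same assumed dogs DataFrame as above):

# Inspect the three components of a DataFrame
print(dogs.values)   # 2D NumPy array of the data
print(dogs.columns)  # Index of column names
print(dogs.index)    # Index of row labels (numbers or names)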
#Sorting rows
Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().
In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.
| Sort on | Syntax |
| --- | --- |
| one column | df.sort_values("breed") |
| multiple columns | df.sort_values(["breed", "weight_kg"]) |

By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".
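To control the direction of each sort key, pass a matching list to ascending; a sketch using the same hypothetical columns as the table above:

# Sort by breed ascending, then weight_kg descending
df.sort_values(["breed", "weight_kg"], ascending=[True, False])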
#Subsetting columns
When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can be used to select only the columns that matter to you, in an order that makes sense to you. To select only "col_a" of the DataFrame df, use

df["col_a"]

To select "col_a" and "col_b" of df, use

df[["col_a", "col_b"]]
#Subsetting rows
This is sometimes known as filtering rows or selecting rows. There are many ways to subset a DataFrame; perhaps the most common is to use relational operators to return True or False for each row, then pass that inside square brackets.

dogs[dogs["height_cm"] > 60]
dogs[dogs["color"] == "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, &.

dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
#Subsetting rows by categorical variables
Subsetting data based on a categorical variable often involves using the or operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.
colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]
#Adding new columns

# Add total col as sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]
# Add p_homeless col as proportion of total homeless population to the state population
homelessness["p_homeless"] = homelessness["total"] / homelessness["state_pop"]
#Aggregating DataFrames
# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values("date")
# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1["cum_weekly_sales"] = sales_1_1["weekly_sales"].cumsum()
# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1["cum_max_sales"] = sales_1_1["weekly_sales"].cummax()
# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])
print(store_types.head())
# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
print(store_depts.head())
# Subset the rows where is_holiday is True and drop duplicate dates
holiday_dates = sales[sales["is_holiday"]].drop_duplicates(subset="date")
# Print date col of holiday_dates
print(holiday_dates["date"])
# Count the number of stores of each type
store_counts = store_types["type"].value_counts()
print(store_counts)
# Get the proportion of stores of each type
store_props = store_types["type"].value_counts(normalize=True)
print(store_props)
# Count the number of stores for each department and sort
dept_counts_sorted = store_depts["department"].value_counts(sort=True)
print(dept_counts_sorted)
# Get the proportion of stores in each department and sort
dept_props_sorted = store_depts["department"].value_counts(sort=True, normalize=True)
print(dept_props_sorted)
# Import numpy with the alias np
import numpy as np

# For each store type, aggregate weekly_sales: get min, max, mean, and median
# (string names are the idiomatic spelling; passing np.min etc. is deprecated in recent pandas)
sales_stats = sales.groupby("type")["weekly_sales"].agg(["min", "max", "mean", "median"])
# Print sales_stats
print(sales_stats)
# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
# (note the double brackets: a list of column names, not a tuple)
unemp_fuel_stats = sales.groupby("type")[["unemployment", "fuel_price_usd_per_l"]].agg(["min", "max", "mean", "median"])
# Print unemp_fuel_stats
print(unemp_fuel_stats)
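Grouping also works with several keys at once; a sketch assuming the same sales DataFrame:

# Sketch: group by type and is_holiday, summing weekly_sales
sales_by_type_holiday = sales.groupby(["type", "is_holiday"])["weekly_sales"].sum()
print(sales_by_type_holiday)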
#Slicing and indexing DataFrames
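The examples below slice temperatures_srt, which these notes never define; presumably it is temperatures with a country/city MultiIndex, sorted so that .loc slicing works (slicing a MultiIndex requires a sorted index):

# Assumed setup (not shown in these notes): sorted country/city MultiIndex
temperatures_srt = temperatures.set_index(["country", "city"]).sort_index()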
# Subset rows from India, Hyderabad to Iraq, Baghdad
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad")])
# Subset columns from date to avg_temp_c
print(temperatures_srt.loc[:, "date":"avg_temp_c"])
# Subset in both directions at once
print(temperatures_srt.loc[("India", "Hyderabad"):("Iraq", "Baghdad"), "date":"avg_temp_c"])
# Use Boolean conditions to subset temperatures for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") & (temperatures["date"] <= "2011-12-31")]
print(temperatures_bool)
# Set date as the index and sort the index
temperatures_ind = temperatures.set_index("date").sort_index()
# Use .loc[] to subset temperatures_ind for rows in 2010 and 2011
print(temperatures_ind.loc["2010":"2011"])
# Use .loc[] to subset temperatures_ind for rows from Aug 2010 to Feb 2011
print(temperatures_ind.loc["2010-08-01":"2011-02-01"])
# Get 23rd row, 2nd column (index 22, 1)
print(temperatures.iloc[22, 1])
# Use slicing to get the first 5 rows
print(temperatures.iloc[:5])
# Use slicing to get columns 3 to 4
print(temperatures.iloc[:, 2:4])
# Use slicing in both directions at once
print(temperatures.iloc[:5, 2:4])
# Add a year column to temperatures
temperatures["year"] = temperatures["date"].dt.year
# Pivot avg_temp_c by country and city vs year
temp_by_country_city_vs_year = temperatures.pivot_table("avg_temp_c", index=["country", "city"], columns="year")
# See the result
print(temp_by_country_city_vs_year)
# Subset for Egypt to India
temp_by_country_city_vs_year.loc["Egypt":"India"]
# Subset for Egypt, Cairo to India, Delhi
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India","Delhi")]
# Subset for Egypt, Cairo to India, Delhi, and 2005 to 2010
temp_by_country_city_vs_year.loc[("Egypt", "Cairo"):("India","Delhi"), "2005":"2010" ]
# Get the worldwide mean temp by year
mean_temp_by_year = temp_by_country_city_vs_year.mean(axis=0)
# Filter for the year that had the highest mean temp
print(mean_temp_by_year[mean_temp_by_year == mean_temp_by_year.max()])
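An equivalent, more direct form is .idxmax(), which returns the index label of the maximum; the same pattern with .idxmin() applies to the city example below:

# Same question via idxmax: the year with the highest mean temperature
print(mean_temp_by_year.idxmax())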
# Get the mean temp by city
mean_temp_by_city = temp_by_country_city_vs_year.mean(axis="columns")
# Filter for the city that had the lowest mean temp
print(mean_temp_by_city[mean_temp_by_city == mean_temp_by_city.min()])