Skip to content

Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.


1 hidden cell

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Add your notes here

# Add your code snippets here

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.

Chapter 1: DataFrames pandas is built on NumPy and Matplotlib. Tabular data is the most common form of data. Rectangular data (tabular data) = DataFrame in pandas Every column has the same data type; different columns can contain different types (strings vs ints vs floats) df.head() = first row of DataFrame df.info() = names of columns, data types they contain, and whether they have any missing values --> df.shape = # of rows and # of columns df.describe() = count/mean/std/min/25%/75%/max for each column --> df.values = data values in a 2d NumPy array --> df.columns = column names --> df.index = # of rows

Python Philosophy: there should be one--and preferably only one--obvious way to do it.

pandas gives you many different ways to solve problems...pandas is like a swiss army knife. This course = the most important ways of doing things.

  • Sorting and subsetting df.sort_values("column name") df.sort_values("column name", ascending=False) df.sort_values(["column name", "column name"]) df.sort_values(["column name", "column name"], ascending=[True, False])

Zoom in on one column df["column name"] df[["column name", "column name"]]

df["column name"] > x

df[df["column name"]] > x

  • Creating new columns df["new column"] = df["column name"] / x

Chapter 2: Aggregating Data

  • Summary statistics df["column_name"].max()/min()/median()/mean()/sum()/cumsum()/cummax()/cummin()

  • Counting

df.drop_duplicates(subset=["name of column where you want to drop duplicates", "second name"]) df["column name"].value_counts(normalize=True)

  • Grouped summary statistics df[df["column"] == "variable 1"]["variable_2"].mean() This would be the average variable 2 of items in the named column that match variable 1

df.groupby("column")[variable_2"].mean() same dif, but without having to name all the different variable 1s

df.groupby("column")[variable_2"].agg([min, max, sum]) similar, and now the agg() let's us call multiple stats at once

df.groupby(["column 1", "column 2"])["column 3"].mean()/.agg([min, max, sum])

df.groupby(["column 1", "column 2"])[["column 3", "column 4"]].mean()/,agg([multiple fxns])

PIVOT TABLES

df.pivot_table(values="column 1", index="column 2")

Column 1 = column you want to summarize column 2 = column you want to sort by

df.pivot_table(values="column 1", index = "column 2", aggfunc=[np.mean, np.median])

df.pivot_table(values="column 1", index = column 2", columns = "column 3", fill_value=0, margins=True)

column 3 = additional columns to table to split out further margins=True adds mean to the rows and columns

Chapter 3: Slicing and Indexing Data

  • Indexes and subsetting using indexes df_ind = df.set_index("column you want as index") .reset_index(drop=True) removes all of the column used as an index

with .loc, and an indexed df, you can easily find specific variables within the index. Index variables don't need to be unique Multi-level indexes also exist Can also sort by index values with .sort_index() .sort_index(level="index name") .sort_index(level=["name 1", "name 2"], ascending = [True, False])

Index values are just data Indexes violate "tidy data" principles

  • Subsetting using slicing

df.loc["name":"through name"] (requires indexed df, and only works with the first index in that form) df.loc[("name 1", "name 2"):("name 3", "name 4")]

^^ slices rows

to slice columns and keep all rows df.loc[:, "column name":"column name"]

to do both at the same time: df.loc[("row name 1", row name 2"):("row name 3", "row name 4"), "column name":"column name"]

Subset by range

df = df.set_index("range to index").sort_index() df.loc["row name 1":"row name 2"] with dates, can slice by partial dates (IE just years, ex "2014":"2016")

df.iloc[row#:row#, column#:column#] <-- final values NOT INCLUDED

More work with Pivot Tables

Chapter 4: Creating and Visualizing Data

  • Plotting
  • Handling missing data
  • Reading data into a DataFrame