Data Manipulation with pandas
Run the hidden code cell below to import the data used in this course.
1 hidden cell
Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
Add your notes here
# Add your code snippets hereExplore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Print the highest weekly sales for each
departmentin thewalmartDataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video. - What was the total
nb_soldof organic avocados in 2017 in theavocadoDataFrame? If you're stuck, try reviewing this video. - Create a bar plot of the total number of homeless people by region in the
homelessnessDataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video. - Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.
Chapter 1: DataFrames pandas is built on NumPy and Matplotlib. Tabular data is the most common form of data. Rectangular data (tabular data) = DataFrame in pandas Every column has the same data type; different columns can contain different types (strings vs ints vs floats) df.head() = first row of DataFrame df.info() = names of columns, data types they contain, and whether they have any missing values --> df.shape = # of rows and # of columns df.describe() = count/mean/std/min/25%/75%/max for each column --> df.values = data values in a 2d NumPy array --> df.columns = column names --> df.index = # of rows
Python Philosophy: there should be one--and preferably only one--obvious way to do it.
pandas gives you many different ways to solve problems...pandas is like a swiss army knife. This course = the most important ways of doing things.
- Sorting and subsetting df.sort_values("column name") df.sort_values("column name", ascending=False) df.sort_values(["column name", "column name"]) df.sort_values(["column name", "column name"], ascending=[True, False])
Zoom in on one column df["column name"] df[["column name", "column name"]]
df["column name"] > x
df[df["column name"]] > x
- Creating new columns df["new column"] = df["column name"] / x
Chapter 2: Aggregating Data
-
Summary statistics df["column_name"].max()/min()/median()/mean()/sum()/cumsum()/cummax()/cummin()
-
Counting
df.drop_duplicates(subset=["name of column where you want to drop duplicates", "second name"]) df["column name"].value_counts(normalize=True)
- Grouped summary statistics df[df["column"] == "variable 1"]["variable_2"].mean() This would be the average variable 2 of items in the named column that match variable 1
df.groupby("column")[variable_2"].mean() same dif, but without having to name all the different variable 1s
df.groupby("column")[variable_2"].agg([min, max, sum]) similar, and now the agg() let's us call multiple stats at once
df.groupby(["column 1", "column 2"])["column 3"].mean()/.agg([min, max, sum])
df.groupby(["column 1", "column 2"])[["column 3", "column 4"]].mean()/,agg([multiple fxns])
PIVOT TABLES
df.pivot_table(values="column 1", index="column 2")
Column 1 = column you want to summarize column 2 = column you want to sort by
df.pivot_table(values="column 1", index = "column 2", aggfunc=[np.mean, np.median])
df.pivot_table(values="column 1", index = column 2", columns = "column 3", fill_value=0, margins=True)
column 3 = additional columns to table to split out further margins=True adds mean to the rows and columns
Chapter 3: Slicing and Indexing Data
- Indexes and subsetting using indexes df_ind = df.set_index("column you want as index") .reset_index(drop=True) removes all of the column used as an index
with .loc, and an indexed df, you can easily find specific variables within the index. Index variables don't need to be unique Multi-level indexes also exist Can also sort by index values with .sort_index() .sort_index(level="index name") .sort_index(level=["name 1", "name 2"], ascending = [True, False])
Index values are just data Indexes violate "tidy data" principles
- Subsetting using slicing
df.loc["name":"through name"] (requires indexed df, and only works with the first index in that form) df.loc[("name 1", "name 2"):("name 3", "name 4")]
^^ slices rows
to slice columns and keep all rows df.loc[:, "column name":"column name"]
to do both at the same time: df.loc[("row name 1", row name 2"):("row name 3", "row name 4"), "column name":"column name"]
Subset by range
df = df.set_index("range to index").sort_index() df.loc["row name 1":"row name 2"] with dates, can slice by partial dates (IE just years, ex "2014":"2016")
df.iloc[row#:row#, column#:column#] <-- final values NOT INCLUDED
More work with Pivot Tables
Chapter 4: Creating and Visualizing Data
- Plotting
- Handling missing data
- Reading data into a DataFrame