
Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

DataFrame Info

.head() - a method that shows the top N (default = 5) rows of the DataFrame

.info() - a method that displays the column names, their types, and whether they contain any missing values

.shape - an attribute giving the size of the DataFrame as (number of rows, number of columns)

.describe() - a method showing summary statistics for the numerical columns

.values - an attribute giving the content of the DataFrame as a 2D array

.columns - an attribute with the names of the columns

.index - an attribute which gives the row numbers or row labels
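A minimal sketch of these inspection tools on a toy DataFrame (hypothetical data, standing in for the course datasets):

```python
import pandas as pd

# Toy DataFrame with made-up values
df = pd.DataFrame({
    "region": ["East", "West", "East"],
    "homeless": [100, 250, 75],
})

print(df.head())       # first rows (default 5)
df.info()              # column names, dtypes, non-null counts
print(df.shape)        # (3, 2)
print(df.describe())   # summary stats for numeric columns
print(df.values)       # underlying 2D NumPy array
print(df.columns)      # column labels
print(df.index)        # row labels (a RangeIndex here)
```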

New columns

dataframe["new_column_name"] = dataframe["column_name"] / 2 - just like that!
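For instance, with a toy price column (hypothetical names and values):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# New column computed element-wise from an existing one
df["half_price"] = df["price"] / 2
```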

Sorting rows

.sort_values("column_name", ascending=True/False) - sort rows according to data in the corresponding column.

This can be done for several columns: rows are sorted by the first column first, then by the second, and so on:

.sort_values(["column_name1", "column_name2"], ascending=[True, False])
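A minimal sketch of multi-column sorting on a toy DataFrame (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Rome", "Oslo", "Rome"],
    "temp": [30, 10, 25],
})

# Sort by city ascending, then by temp descending within each city
out = df.sort_values(["city", "temp"], ascending=[True, False])
```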

Selecting several columns is done by:

dataframe[["column_name1","column_name2"]]

Subsetting the rows according to some rule:

dataframe[dataframe["column_name"] > X]

... and according to several rules:

ind1 = dataframe["column_name1"] == X

ind2 = dataframe["column_name2"] < Y

dataframe[ind1 & ind2]

or in one line:

dataframe[(dataframe["column_name1"]==X) & (dataframe["column_name2"]<Y)]

Another very useful method is .isin():

dataframe["column_name"].isin(["x1","x2"]) - flags rows whose value in that column is x1 or x2
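The subsetting patterns above, sketched on a toy DataFrame (hypothetical column names and values):

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["organic", "conventional", "organic"],
    "nb_sold": [5, 50, 8],
})

# Single condition
big = df[df["nb_sold"] > 6]

# Two conditions combined with & (each wrapped in parentheses)
organic_big = df[(df["type"] == "organic") & (df["nb_sold"] > 6)]

# Membership test with .isin()
wanted = df[df["type"].isin(["organic"])]
```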

Summary statistics

dataframe["column_name"].mean() - mean of the column

Other statistical metrics are found from: .median(), .mode(), .min(), .max(), .var(), .std(), .sum(), .quantile()

Some of them work not only on numerical columns, but also on dates...

.agg(function_name) - a method that applies a function (or a list of functions, passed in [] brackets) to the dataframe/column/multiple columns

Cumulative statistics: .cumsum(), .cummin(), .cummax(), .cumprod(), which return an entire column
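A small sketch of these statistics on a toy Series (made-up numbers):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

mean_val = s.mean()            # 2.5
q75 = s.quantile(0.75)         # 3.25 (linear interpolation)
both = s.agg(["min", "max"])   # several statistics at once
running = s.cumsum()           # cumulative sum: 1, 3, 6, 10
```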

Counting

dataframe["col_name"].value_counts(normalize=True, sort=True) - counts unique values in the column (normalize=True returns proportions instead of counts)

dataframe.drop_duplicates(subset=["col_name1","col_name2",...]) - drops duplicates in one column or in combinations of columns from the dataframe
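Both counting tools on a toy DataFrame (hypothetical store/department data):

```python
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B"], "dept": [1, 1, 2]})

counts = df["store"].value_counts()                  # A: 2, B: 1
shares = df["store"].value_counts(normalize=True)    # A: 2/3, B: 1/3
unique_pairs = df.drop_duplicates(subset=["store", "dept"])
```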

Grouped summary statistics

dataframe.groupby("col_name1")["col_name2"].mean() - finds the mean of the values in column 2 grouped by the values in column 1

Using dataframe.groupby("col_name1")["col_name2"].agg([func1,func2,...]) method we can get the grouped summary statistics from several functions

Groups can be defined from several columns: dataframe.groupby(["col_name1","col_name3"])["col_name2"].mean()

... and obtain statistics for several columns as well: dataframe.groupby(["col_name1","col_name3"])[["col_name2","col_name4"]].mean()
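The grouping variations above, sketched on a toy DataFrame (made-up avocado-style data):

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["organic", "organic", "conventional"],
    "year": [2017, 2018, 2017],
    "nb_sold": [10, 20, 100],
})

by_type = df.groupby("type")["nb_sold"].mean()            # one stat per group
stats = df.groupby("type")["nb_sold"].agg(["min", "max"]) # several stats
by_type_year = df.groupby(["type", "year"])["nb_sold"].sum()  # two grouping keys
```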

Pivot tables

dataframe.pivot_table(values="col_name1", index="col_name2", aggfunc=np.median) - makes a pivot table with the median (or another aggregation function) of column 1 grouped by the indexed column 2

or a bunch of functions using list of functions: aggfunc=[np.median,np.max]

Statistics for "col_name1" with the unique values of "col_name2" in the rows and the unique values of "col_name3" in the columns:

dataframe.pivot_table(values="col_name1", index="col_name2", columns="col_name3") (the values parameter is optional)

Use an argument fill_value=0 to replace NaN values with 0

Use an argument margins=True to get summary statistics for all columns and rows in an extra row/column (labelled "All", shown last)
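A sketch of these options on a toy DataFrame (hypothetical data; the string "median" is used here as the aggregation function, which is equivalent to passing np.median):

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["organic", "organic", "conventional", "conventional"],
    "year": [2017, 2018, 2017, 2018],
    "nb_sold": [10, 20, 100, 200],
})

# Years in the rows, types in the columns, medians in the cells;
# margins=True adds an "All" row and column
pt = df.pivot_table(values="nb_sold", index="year", columns="type",
                    aggfunc="median", fill_value=0, margins=True)
```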

Working with Pivot Tables:

Use .loc["row1":"row2"] to slice the table (select the rows of interest)

ptable.mean(axis="index") - finds the mean down each column (the default); axis="columns" - across each row
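A small sketch on a toy pivot-table-like DataFrame (made-up values):

```python
import pandas as pd

pt = pd.DataFrame(
    {"organic": [10, 20], "conventional": [100, 200]},
    index=[2017, 2018],
)

col_means = pt.mean(axis="index")    # aggregates down each column
row_means = pt.mean(axis="columns")  # aggregates across each row
sliced = pt.loc[2017:2018]           # label-based row slice, both ends included
```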

Explicit indices

.set_index("col_name") - sets the col_name column as the row index; it is possible to set more than one index column by using .set_index(["col_name1","col_name2"])

.reset_index() - resets the index back to the default (pass drop=True to discard the old index instead of keeping it as a column)

dataframe_new_ind.loc[["x","y"]] - select the rows with "x" and "y" indexes (note the list); in case of several index columns use a list of tuples: .loc[[("x1","x2"),("y1","y2")]]

Sorting is done by .sort_index(level="col_name1", ascending=True); for several index columns pass lists as the arguments

It is possible to combine methods: dataframe_new_ind = dataframe.set_index(["col_name1","col_name2"]).sort_index()
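The explicit-index workflow on a toy DataFrame (hypothetical country/city data):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Italy", "Canada", "Italy"],
    "city": ["Rome", "Toronto", "Milan"],
    "temp": [30, 10, 25],
})

# Set a two-level index and sort it, as above
indexed = df.set_index(["country", "city"]).sort_index()

rome = indexed.loc[("Italy", "Rome")]   # one row, selected by full tuple key
pair = indexed.loc[[("Italy", "Rome"), ("Canada", "Toronto")]]  # list of tuples
back = indexed.reset_index()            # index levels become columns again
```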

Slicing by rows:

dataframe.loc["x":"y"] - select from row "x" to row "y", including both; for several index columns use tuples: .loc[("x1","y1"):("x10","y8")]

Slicing also by columns:

dataframe.loc[:,"col_name1":"col_name2"] - all rows and two columns

If one uses dates ("yyyy-mm-dd") as indices, it is possible to slice by specifying only part of the date, e.g., whole years: .loc["yyyy1":"yyyy2"]

.iloc[2:5, 1:4] - positional slicing as in lists (note: unlike .loc, the end positions are not included)
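The label vs. positional distinction in one toy example (made-up data); note that the two slices below select exactly the same cells:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=["w", "x", "y", "z"],
)

by_label = df.loc["x":"y", "a":"b"]  # label slicing: both ends included
by_pos = df.iloc[1:3, 0:2]           # positional slicing: end excluded
```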

Visualizing the data

import matplotlib.pyplot as plt

dataframe["col_name"].hist() - builds a histogram of the values in the specified column (do not forget to call plt.show() to see the resulting figure)

.hist(bins=10, alpha=0.7) - sets the number of bins and the transparency

For the grouped statistics:

df_grouped_stat = dataframe.groupby("col_name1")["col_name3"].mean()

df_grouped_stat.plot(kind="bar", title="Mean by Column 1") - bar plot for the mean of each "col_name1" type

Line plots: dataframe.plot(x="col_name1", y="col_name2", kind="line", rot=45), where rot argument defines label rotation angle

Scatter plots: .plot(x="col_name1", y="col_name2", kind="scatter")

Several plots in one figure - just call several plot commands consecutively and then call plt.show()

To distinguish the datasets, use legend: plt.legend(["data1","data2",...])
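A sketch of two line plots sharing one figure, with a legend (toy data; the Agg backend is used only so the example runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up temperatures standing in for the course dataset
df = pd.DataFrame({
    "year": [2017, 2018, 2019],
    "toronto": [5, 7, 6],
    "rome": [16, 17, 15],
})

# Two line plots drawn onto the same axes, then labelled
ax = df.plot(x="year", y="toronto", kind="line", rot=45)
df.plot(x="year", y="rome", kind="line", ax=ax)
plt.legend(["Toronto", "Rome"])
plt.show()
```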

Missing values

dataframe.isna() - returns Boolean indicators of which values are NaN

dataframe.isna().any() - checks whether there are NaNs in each column

dataframe.isna().sum() - finds the number of NaNs in each column

dataframe.isna().sum().plot(kind="bar") - barplot of the number of NaNs in each column

dataframe.dropna() - drop rows with NaNs

dataframe.fillna(x) - replace NaNs with x
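The missing-value toolkit on a toy DataFrame with one NaN (made-up values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

has_na = df.isna().any()   # a: True, b: False
n_na = df.isna().sum()     # a: 1, b: 0
dropped = df.dropna()      # rows with any NaN removed
filled = df.fillna(0)      # NaN replaced with 0
```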

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame?
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines.
# Task 1: highest weekly sales per department, top five in descending order
# ("department" is assumed to be the name of the department column)
print(
    walmart.groupby("department")["weekly_sales"].max()
    .sort_values(ascending=False)
    .head()
)