
Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

DataFrame Info

.head() - a method that shows the top N (default = 5) rows of the DataFrame

.info() - a method that displays the column names, their types, and whether they contain any missing values

.shape - an attribute giving the size of the DataFrame as (number of rows, number of columns)

.describe() - a method showing summary statistics for the numerical columns

.values - an attribute giving the content of the DataFrame as a 2D array

.columns - an attribute with the names of the columns

.index - an attribute which gives the row numbers or row labels
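A minimal sketch of these inspection tools on a toy DataFrame (hypothetical data, standing in for the course datasets):

```python
import pandas as pd

# Toy DataFrame with made-up values
df = pd.DataFrame({
    "region": ["East", "West", "East"],
    "homeless": [100, 250, 75],
})

print(df.head())       # first rows (default 5)
df.info()              # column names, dtypes, non-null counts
print(df.shape)        # (3, 2)
print(df.describe())   # summary stats for numeric columns
print(df.values)       # underlying 2D NumPy array
print(df.columns)      # column labels
print(df.index)        # row labels (a RangeIndex here)
```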

New columns

dataframe["new_column_name"] = dataframe["column_name"] / 2 - just like that!
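For instance, with a toy price column (hypothetical names and values):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})

# New column computed element-wise from an existing one
df["half_price"] = df["price"] / 2
```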

Sorting rows

.sort_values("column_name", ascending=True/False) - sort rows according to data in the corresponding column.

This can be done for several columns: rows are sorted by the first column first, then by the second, and so on:

.sort_values(["column_name1", "column_name2"], ascending=[True, False])
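A minimal sketch of multi-column sorting on a toy DataFrame (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Rome", "Oslo", "Rome"],
    "temp": [30, 10, 25],
})

# Sort by city ascending, then by temp descending within each city
out = df.sort_values(["city", "temp"], ascending=[True, False])
```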

Selecting several columns is done by:

dataframe[["column_name1","column_name2"]]

Subsetting the rows according to some rule:

dataframe[dataframe["column_name"] > X]

... and according to several rules:

ind1 = dataframe["column_name1"] == X

ind2 = dataframe["column_name2"] < Y

dataframe[ind1 & ind2]

or in one line:

dataframe[(dataframe["column_name1"]==X) & (dataframe["column_name2"]<Y)]

Another very useful method is .isin():

dataframe["column_name"].isin(["x1","x2"]) - flags rows whose value in that column is x1 or x2
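The subsetting patterns above, sketched on a toy DataFrame (hypothetical column names and values):

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["organic", "conventional", "organic"],
    "nb_sold": [5, 50, 8],
})

# Single condition
big = df[df["nb_sold"] > 6]

# Two conditions combined with & (each wrapped in parentheses)
organic_big = df[(df["type"] == "organic") & (df["nb_sold"] > 6)]

# Membership test with .isin()
wanted = df[df["type"].isin(["organic"])]
```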

Summary statistics

dataframe["column_name"].mean() - mean of the column

Other statistical metrics are found from: .median(), .mode(), .min(), .max(), .var(), .std(), .sum(), .quantile()

Some of them work not only on numerical columns, but also on dates...

.agg(function_name) - a method that applies a function (or a list of functions, passed in [] brackets) to the dataframe/column/multiple columns

Cumulative statistics: .cumsum(), .cummin(), .cummax(), .cumprod(), which return an entire column
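A small sketch of these statistics on a toy Series (made-up numbers):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

mean_val = s.mean()            # 2.5
q75 = s.quantile(0.75)         # 3.25 (linear interpolation)
both = s.agg(["min", "max"])   # several statistics at once
running = s.cumsum()           # cumulative sum: 1, 3, 6, 10
```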

Counting

dataframe["col_name"].value_counts(normalize=True, sort=True) - counts unique values in the column (normalize=True returns proportions instead of counts)

dataframe.drop_duplicates(subset=["col_name1","col_name2",...]) - drops duplicates in one column or in combinations of columns from the dataframe
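Both counting tools on a toy DataFrame (hypothetical store/department data):

```python
import pandas as pd

df = pd.DataFrame({"store": ["A", "A", "B"], "dept": [1, 1, 2]})

counts = df["store"].value_counts()                  # A: 2, B: 1
shares = df["store"].value_counts(normalize=True)    # A: 2/3, B: 1/3
unique_pairs = df.drop_duplicates(subset=["store", "dept"])
```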

Grouped summary statistics

dataframe.groupby("col_name1")["col_name2"].mean() - finds the mean of the values in column 2 grouped by the values in column 1

Using dataframe.groupby("col_name1")["col_name2"].agg([func1,func2,...]) method we can get the grouped summary statistics from several functions

Groups can be defined from several columns: dataframe.groupby(["col_name1","col_name3"])["col_name2"].mean()

... and obtain statistics for several columns as well: dataframe.groupby(["col_name1","col_name3"])[["col_name2","col_name4"]].mean()
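The grouping variations above, sketched on a toy DataFrame (made-up avocado-style data):

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["organic", "organic", "conventional"],
    "year": [2017, 2018, 2017],
    "nb_sold": [10, 20, 100],
})

by_type = df.groupby("type")["nb_sold"].mean()            # one stat per group
stats = df.groupby("type")["nb_sold"].agg(["min", "max"]) # several stats
by_type_year = df.groupby(["type", "year"])["nb_sold"].sum()  # two grouping keys
```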

Pivot tables

dataframe.pivot_table(values="col_name1", index="col_name2", aggfunc=np.median) - makes a pivot table with the median (or another aggregation function) of column 1 grouped by the indexed column 2

or a bunch of functions using list of functions: aggfunc=[np.median,np.max]

Statistics for "col_name1" with the unique values of "col_name2" in the rows and the unique values of "col_name3" in the columns:

dataframe.pivot_table(values="col_name1", index="col_name2", columns="col_name3") (the values parameter is optional)

Use an argument fill_value=0 to replace NaN values with 0

Use an argument margins=True to get summary statistics for all columns and rows in an extra row/column (labelled "All", shown last)
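A sketch of these options on a toy DataFrame (hypothetical data; the string "median" is used here as the aggregation function, which is equivalent to passing np.median):

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["organic", "organic", "conventional", "conventional"],
    "year": [2017, 2018, 2017, 2018],
    "nb_sold": [10, 20, 100, 200],
})

# Years in the rows, types in the columns, medians in the cells;
# margins=True adds an "All" row and column
pt = df.pivot_table(values="nb_sold", index="year", columns="type",
                    aggfunc="median", fill_value=0, margins=True)
```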

Working with Pivot Tables:

Use .loc["row1":"row2"] to slice the table (select the rows of interest)

ptable.mean(axis="index") - finds the mean down each column (the default); axis="columns" - across each row
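A small sketch on a toy pivot-table-like DataFrame (made-up values):

```python
import pandas as pd

pt = pd.DataFrame(
    {"organic": [10, 20], "conventional": [100, 200]},
    index=[2017, 2018],
)

col_means = pt.mean(axis="index")    # aggregates down each column
row_means = pt.mean(axis="columns")  # aggregates across each row
sliced = pt.loc[2017:2018]           # label-based row slice, both ends included
```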

Explicit indices

.set_index("col_name") - sets the col_name column as the row index; it is possible to set more than one index column by using .set_index(["col_name1","col_name2"])

.reset_index() - resets the index back to the default (pass drop=True to discard the old index instead of keeping it as a column)

dataframe_new_ind.loc[["x","y"]] - select the rows with "x" and "y" indexes (note the list); in case of several index columns use a list of tuples: .loc[[("x1","x2"),("y1","y2")]]

Sorting is done by .sort_index(level="col_name1", ascending=True); for several index columns pass lists as the arguments

It is possible to combine methods: dataframe_new_ind = dataframe.set_index(["col_name1","col_name2"]).sort_index()
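The explicit-index workflow on a toy DataFrame (hypothetical country/city data):

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["Italy", "Canada", "Italy"],
    "city": ["Rome", "Toronto", "Milan"],
    "temp": [30, 10, 25],
})

# Set a two-level index and sort it, as above
indexed = df.set_index(["country", "city"]).sort_index()

rome = indexed.loc[("Italy", "Rome")]   # one row, selected by full tuple key
pair = indexed.loc[[("Italy", "Rome"), ("Canada", "Toronto")]]  # list of tuples
back = indexed.reset_index()            # index levels become columns again
```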

Slicing by rows:

dataframe.loc["x":"y"] - select from row "x" to row "y", including both; for several index columns use tuples: .loc[("x1","y1"):("x10","y8")]

Slicing also by columns:

dataframe.loc[:,"col_name1":"col_name2"] - all rows and two columns

If one uses dates ("yyyy-mm-dd") as indices, it is possible to slice by specifying only part of the date, e.g., whole years: .loc["yyyy1":"yyyy2"]

.iloc[2:5, 1:4] - positional slicing as in lists (note: unlike .loc, the end positions are not included)
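The label vs. positional distinction in one toy example (made-up data); note that the two slices below select exactly the same cells:

```python
import pandas as pd

df = pd.DataFrame(
    {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]},
    index=["w", "x", "y", "z"],
)

by_label = df.loc["x":"y", "a":"b"]  # label slicing: both ends included
by_pos = df.iloc[1:3, 0:2]           # positional slicing: end excluded
```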

Visualizing the data

import matplotlib.pyplot as plt

dataframe["col_name"].hist() - builds a histogram of the values in the specified column (do not forget to call plt.show() to see the resulting figure)

.hist(bins=10, alpha=0.7) - sets the number of bins and the transparency

For the grouped statistics:

df_grouped_stat = dataframe.groupby("col_name1")["col_name3"].mean()

df_grouped_stat.plot(kind="bar", title="Mean by Column 1") - bar plot for the mean of each "col_name1" type

Line plots: dataframe.plot(x="col_name1", y="col_name2", kind="line", rot=45), where rot argument defines label rotation angle

Scatter plots: .plot(x="col_name1", y="col_name2", kind="scatter")

Several plots in one figure - just call several plot commands consecutively and then call plt.show()

To distinguish the datasets, use legend: plt.legend(["data1","data2",...])
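A sketch of two line plots sharing one figure, with a legend (toy data; the Agg backend is used only so the example runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up temperatures standing in for the course dataset
df = pd.DataFrame({
    "year": [2017, 2018, 2019],
    "toronto": [5, 7, 6],
    "rome": [16, 17, 15],
})

# Two line plots drawn onto the same axes, then labelled
ax = df.plot(x="year", y="toronto", kind="line", rot=45)
df.plot(x="year", y="rome", kind="line", ax=ax)
plt.legend(["Toronto", "Rome"])
plt.show()
```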

Missing values

dataframe.isna() - returns Boolean indicators of which values are NaN

dataframe.isna().any() - checks whether there are NaNs in each column

dataframe.isna().sum() - finds the number of NaNs in each column

dataframe.isna().sum().plot(kind="bar") - barplot of the number of NaNs in each column

dataframe.dropna() - drop rows with NaNs

dataframe.fillna(x) - replace NaNs with x
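The missing-value toolkit on a toy DataFrame with one NaN (made-up values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

has_na = df.isna().any()   # a: True, b: False
n_na = df.isna().sum()     # a: 1, b: 0
dropped = df.dropna()      # rows with any NaN removed
filled = df.fillna(0)      # NaN replaced with 0
```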

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame?
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines.
# Task 1: highest weekly sales per department, top five in descending order
# ("department" is assumed to be the name of the department column)
print(
    walmart.groupby("department")["weekly_sales"].max()
    .sort_values(ascending=False)
    .head()
)