Data Manipulation with pandas
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
DataFrame Info
.head() - a method that shows the top N (default = 5) rows of the DataFrame
.info() - a method that displays the column names, their data types, and whether they contain any missing values
.shape - an attribute for the size of the DataFrame
.describe() - a method showing summary statistics for the numerical columns
.values - an attribute giving content of the DataFrame in a 2D array
.columns - an attribute with the name of the columns
.index - an attribute which gives the row numbers or their names
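A minimal sketch of these inspection tools on a toy DataFrame (not one of the course datasets):

```python
import pandas as pd

# A small illustrative DataFrame
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "value": [1, 2, 3, 4, 5, 6],
})

print(df.head())      # first 5 rows by default
df.info()             # column names, dtypes, non-null counts
print(df.shape)       # (6, 2)
print(df.describe())  # summary statistics for the numerical column "value"
print(df.columns)     # Index(['name', 'value'], dtype='object')
print(df.index)       # RangeIndex(start=0, stop=6, step=1)
```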
New columns
dataframe["new_column_name"] = dataframe["column_name"] / 2 - just like that!
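For example, on a toy DataFrame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0]})
df["half_price"] = df["price"] / 2  # new column computed from an existing one
print(df["half_price"].tolist())    # [5.0, 10.0, 15.0]
```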
Sorting rows
.sort_values("column_name", ascending=True/False) - sort rows according to data in the corresponding column.
This can be done for several columns: sorting is done by the first column first, then by the second, and so on:
.sort_values(["column_name1", "column_name2"], ascending=[True, False])
Selecting several columns is done with:
dataframe[["column_name1","column_name2"]]
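A short sketch of multi-column sorting and column selection on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": [2, 1, 2], "y": [5, 7, 3], "z": [0, 0, 0]})

# Sort by "x" ascending, then by "y" descending within ties
sorted_df = df.sort_values(["x", "y"], ascending=[True, False])
print(sorted_df["y"].tolist())  # [7, 5, 3]

# Select several columns with a list inside the brackets
two_cols = df[["x", "y"]]
```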
Subsetting the rows according to some rule:
dataframe[dataframe["column_name"] > X]
... and according to several rules:
ind1 = dataframe["column_name1"] == X
ind2 = dataframe["column_name2"] < Y
dataframe[ind1 & ind2]
or in one line:
dataframe[(dataframe["column_name1"]==X) & (dataframe["column_name2"]<Y)]
Another very useful method is .isin():
dataframe["column_name"].isin(["x1","x2"]) - finds rows whose value in the corresponding column is x1 or x2
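The subsetting patterns above can be sketched on a toy DataFrame (column names and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red"],
                   "size": [1, 4, 2, 5]})

big = df[df["size"] > 2]                                  # single condition
big_red = df[(df["color"] == "red") & (df["size"] > 2)]   # combined conditions
primary = df[df["color"].isin(["red", "blue"])]           # membership test
print(len(big), len(big_red), len(primary))               # 2 1 3
```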
Summary statistics
dataframe["column_name"].mean() - mean of the column
Other statistical metrics are found from: .median(), .mode(), .min(), .max(), .var(), .std(), .sum(), .quantile()
Some of them work not only on numerical columns, but also on dates...
.agg(function_name) - a method that applies a function (or a list of functions, passed in [] brackets) to a DataFrame, a column, or multiple columns
Cumulative statistics: .cumsum(), .cummin(), .cummax(), .cumprod(), which return an entire column
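A quick sketch of these statistics on a toy Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
print(s.mean())               # 2.5
print(s.agg(["min", "max"]))  # several functions at once via a list
print(s.cumsum().tolist())    # [1, 3, 6, 10] - running total, a full column
```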
Counting
dataframe["col_name"].value_counts(normalize=True, sort=True) - counts occurrences of each unique value in the column; normalize=True returns proportions instead of counts, sort=True orders the result by count
dataframe.drop_duplicates(subset=["col_name1","col_name2",...]) - drop duplicates in one column or in combinations of columns from the DataFrame
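For example, on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "apple", "pear", "apple"]})

counts = df["fruit"].value_counts()                 # apple: 3, pear: 1
props = df["fruit"].value_counts(normalize=True)    # apple: 0.75, pear: 0.25
unique_rows = df.drop_duplicates(subset=["fruit"])  # one row per fruit
print(counts)
print(props)
```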
Grouped summary statistics
dataframe.groupby("col_name1")["col_name2"].mean() - finds mean of the values in column 2 grouped by the values in the column 1
Using dataframe.groupby("col_name1")["col_name2"].agg([func1,func2,...]) method we can get the grouped summary statistics from several functions
Groups can be defined from several columns: dataframe.groupby(["col_name1","col_name3"])["col_name2"].mean()
... and obtain statistics for several columns as well: dataframe.groupby(["col_name1","col_name3"])[["col_name2","col_name4"]].mean()
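The grouped statistics above can be sketched on a toy DataFrame (group labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"group": ["a", "a", "b"],
                   "val": [1.0, 3.0, 10.0]})

means = df.groupby("group")["val"].mean()               # a: 2.0, b: 10.0
stats = df.groupby("group")["val"].agg(["min", "max"])  # several stats at once
print(means)
print(stats)
```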
Pivot tables
dataframe.pivot_table(values="col_name1", index="col_name2", aggfunc=np.median) - makes a pivot table with the median (or another aggregation function) of column 1, grouped by column 2 as the index
or several functions using a list: aggfunc=[np.median, np.max]
To aggregate "col_name1" with the unique values of "col_name2" in the rows and the unique values of "col_name3" in the columns:
dataframe.pivot_table(values="col_name1", index="col_name2", columns="col_name3") (the values parameter is optional)
Use an argument fill_value=0 to replace NaN values with 0
Use an argument margins=True to get summary statistics for all columns and rows in an extra column/row (shown last)
Working with Pivot Tables:
Use .loc["row1":"row2"] to slice the table (select the rows of interest)
ptable.mean(axis="index") - computes the mean of each column (aggregating across rows); axis="columns" computes the mean of each row
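A minimal pivot-table sketch on a toy DataFrame (city/year names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"city": ["x", "x", "y", "y"],
                   "year": [2020, 2021, 2020, 2021],
                   "temp": [10.0, 12.0, 20.0, 22.0]})

# Mean temperature per city (rows) and year (columns);
# fill_value replaces NaNs, margins=True adds an "All" summary row/column
pt = df.pivot_table(values="temp", index="city", columns="year",
                    aggfunc="mean", fill_value=0, margins=True)
print(pt)

col_means = pt.mean(axis="index")  # mean of each column, across rows
```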
Explicit indices
.set_index("col_name") - set new row indices from the col_name column; it is possible to set more than one index column by using .set_index(["col_name1","col_name2"])
.reset_index() - reset indexing
dataframe_new_ind.loc[["x","y"]] - select the rows with indexes "x" and "y"; in case of several index columns, use tuples: .loc[[("x1","x2"),("y1","y2")]]
Sorting is done by .sort_index(level="col_name1", ascending=True), for several index columns use lists in arguments
It is possible to combine methods: dataframe_new_ind = dataframe.set_index(["col_name1","col_name2"]).sort_index()
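A sketch of a multi-level index on a toy DataFrame (country/city names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"country": ["BR", "BR", "US"],
                   "city": ["Rio", "Sao Paulo", "NYC"],
                   "pop": [6.7, 12.3, 8.4]})

# Chain set_index and sort_index, as above
df_ind = df.set_index(["country", "city"]).sort_index()

row = df_ind.loc[("US", "NYC")]  # one row, selected by an index tuple
subset = df_ind.loc[["BR"]]      # all rows with country == "BR"
print(subset)
```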
Slicing by rows:
dataframe.loc["x":"y"] - select from row "x" to row "y", including both; for several index columns use tuples: .loc[("x1","y1"):("x10","y8")]
Slicing also by columns:
dataframe.loc[:,"col_name1":"col_name2"] - all rows and two columns
If dates ("yyyy-mm-dd") are used as indices, it is possible to slice by a partial date, e.g., by year: .loc["yyyy1":"yyyy2"]
.iloc[2:5, 1:4] - positional slicing as in lists (note: here the end of the slice is not included)
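The difference between label-based and positional slicing can be sketched on a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
                  index=["w", "x", "y", "z"])

print(df.loc["x":"y"])     # label slicing: both ends included -> 2 rows
print(df.loc[:, "a":"b"])  # all rows, columns "a" through "b"
print(df.iloc[1:3, 0:1])   # positional slicing: end excluded -> 2 rows, 1 col
```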
Visualizing the data
import matplotlib.pyplot as plt
dataframe["col_name"].hist() - build a histogram of the values in the specified column (do not forget to use plt.show() to see the resulting figure)
.hist(bins=10,alpha=0.7) - set the number of bins and transparency value
For the grouped statistics:
df_grouped_stat = dataframe.groupby("col_name1")["col_name3"].mean()
df_grouped_stat.plot(kind="bar", title="Mean by Column 1") - bar plot for the mean of each "col_name1" type
Line plots: dataframe.plot(x="col_name1", y="col_name2", kind="line", rot=45), where rot argument defines label rotation angle
Scatter plots: .plot(x="col_name1", y="col_name2", kind="scatter")
Several plots in one figure - just type several plots consecutively and then call plt.show()
To distinguish the datasets, use legend: plt.legend(["data1","data2",...])
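A minimal plotting sketch on a toy DataFrame (the Agg backend line is only there so the example runs headless; drop it in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, for running without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"day": [1, 2, 3], "sales": [10, 15, 12]})

ax = df["sales"].hist(bins=3, alpha=0.7)          # histogram of one column
df.plot(x="day", y="sales", kind="line", rot=45)  # line plot, rotated labels
plt.legend(["sales"])
plt.show()
```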
Missing values
dataframe.isna() - provides a Boolean DataFrame indicating which values are NaN
dataframe.isna().any() - check if there are NaNs in each column
dataframe.isna().sum() - find number of NaNs in each column
dataframe.isna().sum().plot(kind="bar") - barplot of the number of NaNs in each column
dataframe.dropna() - drop rows with NaNs
dataframe.fillna(x) - replace NaNs with x
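These missing-value tools can be sketched on a toy DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, 6.0]})

has_na = df.isna().any()     # a: True, b: True
na_counts = df.isna().sum()  # a: 1, b: 2
dropped = df.dropna()        # keeps only the last row (no NaNs in it)
filled = df.fillna(0)        # every NaN replaced by 0
print(na_counts)
```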
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order.
- What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame?
- Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart.
- Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines.
# Task 1: highest weekly sales per department, top five in descending order
# (assumes the department column is named "department")
print(walmart.groupby("department")["weekly_sales"].max()
      .sort_values(ascending=False).head(5))