Data Manipulation with pandas
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
Take Notes
df.pivot_table(values=, index=, columns=, fill_value=, margins=, aggfunc=)
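A minimal sketch of how the arguments fit together, using made-up sales data (the column names are illustrative, not from the course datasets):

```python
import pandas as pd

# Hypothetical sales data, not from the course datasets
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "type": ["food", "toys", "food", "toys", "food"],
    "amount": [10, 20, 30, 40, 50],
})

# Mean amount per store/type; fill_value replaces missing cells,
# margins=True adds an "All" row/column with overall aggregates
pivot = sales.pivot_table(values="amount", index="store",
                          columns="type", fill_value=0,
                          margins=True, aggfunc="mean")
```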
Index
.set_index()  .reset_index()  .loc[list of labels]  .sort_index(level=[], ascending=[])
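How these fit together, sketched on a small made-up table (city/temperature names are illustrative):

```python
import pandas as pd

# Hypothetical temperature readings (illustrative only)
df = pd.DataFrame({
    "country": ["Italy", "Canada", "Italy", "Canada"],
    "city": ["Rome", "Toronto", "Milan", "Ottawa"],
    "temp": [18, 5, 16, 4],
})

# Build a two-level index, then sort it (required before label slicing)
indexed = df.set_index(["country", "city"]).sort_index(
    level=["country", "city"], ascending=[True, True])

# .loc with a list of outer-index labels
canada = indexed.loc[["Canada"]]

# .reset_index() moves the index levels back into ordinary columns
flat = indexed.reset_index()
```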
Slicing DataFrames
Rows
First, sort the index with .sort_index(). Outer-level slicing: pass index labels instead of positions; unlike positional slicing, the last label is included. Inner-level slicing: pass the first and last values as (outer index, inner index) tuples: df.loc[(outer index, inner index):(outer index, inner index)]
Columns
df.loc[rows, columns] — use : to keep all rows (or x:y for a specific range); x:y on the column side gives the range of columns to keep
Slicing with iloc
df.iloc[x:y (rows), x:y (columns)] — unlike .loc, the last position is not included
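The label-inclusive vs. position-exclusive difference, sketched on a toy frame (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8],
                   "c": [9, 10, 11, 12]},
                  index=["w", "x", "y", "z"])

# .loc slices by label and INCLUDES the last label
by_label = df.loc["w":"y", "a":"b"]   # rows w, x, y; columns a, b

# .iloc slices by position and EXCLUDES the last position
by_position = df.iloc[0:2, 0:2]       # rows 0 and 1; columns a, b
```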
Use Boolean conditions to subset temperatures for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") & (temperatures["date"] <= "2011-12-31")]
Pivot Tables + Slicing
.mean(axis='index') aggregates down the rows (one value per column); axis='columns' aggregates across the columns (one value per row). To find the row with the minimum or maximum mean of a DataFrame, first compute the mean, then subset with a Boolean condition: df[df.mean(axis='columns') == df.mean(axis='columns').max()] (or .min()).
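A small sketch of the axis behavior and the Boolean max trick, with made-up monthly temperatures:

```python
import pandas as pd

# Hypothetical monthly temperatures per city (illustrative only)
temps = pd.DataFrame({"jan": [0, 10, 20], "jul": [20, 30, 40]},
                     index=["Oslo", "Paris", "Cairo"])

# axis="index" (the default): mean of each column, computed down the rows
col_means = temps.mean(axis="index")

# axis="columns": mean of each row, computed across the columns
row_means = temps.mean(axis="columns")

# Row(s) with the highest mean, selected with a Boolean mask
hottest = temps[row_means == row_means.max()]
```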
Visualizing your data
Bar plots
.plot(kind='type of plot', title='plot title', x='column for the x-axis', y='column for the y-axis', rot=int, rotates the tick labels for easier reading)
plt.legend(['labels to distinguish multiple overlaid plots'])
alpha=float between 0 and 1, regulates transparency; goes into .hist() or .plot()
plt.show() displays the plot
.scatterplot(data=df, x=, y=, hue='column used for the plot legend/colors') — seaborn function to get a scatter plot
Subset avocados for the conventional type, and the average price column. Create a histogram.
Create a histogram of avg_price for organic type avocados. Add a legend to your plot, with the names "conventional" and "organic". Show your plot.
Solution
# Subset conventional avocados, histogram of avg_price with 20 bins
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)
# Histogram of avg_price for organic avocados
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)
# Add a legend
plt.legend(["conventional", "organic"])
# Show the plot
plt.show()
Missing values
df.isna() or df.isnull() detects NaN values. Chaining .sum() gives the number of missing values per column (True = 1, False = 0). Chaining .any() gives specifics: it reports, for each column, whether any NaN is present. We can chain .plot(kind='bar') onto the .sum() result to show the amount of missing values per column.
Removing missing values
df.dropna() deletes the rows where data is missing; how='all' deletes only the rows where every value is NaN
Replace missing values
df.fillna(0) Changes the NaN values to 0
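The detect/drop/fill workflow above, sketched on a tiny frame with made-up gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, np.nan]})

missing_per_column = df.isna().sum()   # count of NaNs per column
any_missing = df.isna().any()          # True per column containing any NaN

# how="all" only drops rows where EVERY value is NaN (here, the middle row)
dropped = df.dropna(how="all")

# Replace every remaining NaN with 0
filled = df.fillna(0)
```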
Creating DataFrames
Methods
List of dictionaries: builds the DataFrame row by row.
list_of_dicts = [{key: value, key: value, key: value, key: value}, {key: value, key: value, key: value, key: value}]
df = pd.DataFrame(list_of_dicts)
Dictionary of lists: builds the DataFrame column by column.
dict_of_lists = {key: [value, value, value], key: [value, value, value], key: [value, value, value]}
df = pd.DataFrame(dict_of_lists)
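Both construction methods side by side, with made-up dog data (names are illustrative); they produce the same DataFrame:

```python
import pandas as pd

# Row by row: list of dictionaries
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59},
]
by_rows = pd.DataFrame(list_of_dicts)

# Column by column: dictionary of lists
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
}
by_columns = pd.DataFrame(dict_of_lists)
```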
Reading and Writing CSVs
df = pd.read_csv('data.csv', nrows=number of rows to read, index_col=['name of the column that we want as index; can be more than one']) to import data. df.to_csv('df.csv') to export a DataFrame.
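A self-contained sketch using an in-memory buffer instead of a real file (a path like 'data.csv' works the same way):

```python
import io
import pandas as pd

csv_text = "city,temp\nRome,18\nOslo,5\n"

# nrows limits how many rows are read; index_col sets the index column(s)
df = pd.read_csv(io.StringIO(csv_text), nrows=1, index_col=["city"])

# to_csv with no path returns the CSV as a string (with a path, it
# writes to disk, e.g. df.to_csv('df.csv')); index=False drops the index
exported = df.reset_index().to_csv(index=False)
```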
Joining Data with Pandas
Inner Join
Only returns rows that have matching values in both DataFrames.
new_df = dfx.merge(dfy, on='column to match' (pass a list ['x', 'y'] to merge on two columns), suffixes=('_x', '_y') (defines the suffixes for columns that are repeated; the first position is for the left table, the second for the right))
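An inner-join sketch with hypothetical tables (not the course data):

```python
import pandas as pd

wards = pd.DataFrame({"ward": [1, 2, 3], "alderman": ["A", "B", "C"]})
census = pd.DataFrame({"ward": [1, 2, 4], "pop": [100, 200, 400]})

# Inner join: only wards present in BOTH tables survive (1 and 2)
inner = wards.merge(census, on="ward")

# suffixes label the overlapping column names: left gets _w, right gets _c
licenses = pd.DataFrame({"ward": [1, 2], "alderman": ["A", "X"]})
suffixed = wards.merge(licenses, on="ward", suffixes=("_w", "_c"))
```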
One to many relationships
One-to-one: every row in the left table is related to only one row in the right table.
One-to-many: every row in the left table is related to one or more rows in the right table.
Merging multiple DataFrames
new_df = dfx.merge(dfy, on='...')
.merge(dfz, on='...', suffixes=('',''))
Left Join
Returns all the rows from the left table and only those rows from the right table with a match. new_df = dfx.merge(dfy, on='column', how='left'). In a one-to-one relationship, the resulting DataFrame has the same number of rows as the left DataFrame.
Other Joins
Right: returns all the rows from the right table and includes only rows from the left table with a match. If the matching column has a different name in each table, we write it like this: new_df = dfx.merge(dfy, how='right', left_on='name in the left table', right_on='name in the right table')
Outer: returns all of the rows from both tables, regardless of whether there is a match between them.
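The three join types compared on toy tables with differently named key columns (names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})

# Left join: all rows from left, matches from right (NaN where no match)
left_join = left.merge(right, how="left", left_on="id", right_on="key")

# Right join: all rows from right, matches from left
right_join = left.merge(right, how="right", left_on="id", right_on="key")

# Outer join: all rows from both tables
outer_join = left.merge(right, how="outer", left_on="id", right_on="key")
```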
Merging a table to itself (self-join)
new_df = dfx.merge(dfx, left_on='', right_on='', suffixes=('x', 'y'), how='')
Merging on Indexes
If the index has the same name in both tables, we use on= to give the name of the index: new_df = dfx.merge(dfy, on='index'...). If the index has different names in the two tables, we use left_on= and right_on=, and add left_index=True and right_index=True accordingly: new_df = dfx.merge(dfy, left_on='index_name', right_on='index_name', left_index=True, right_index=True)
Filtering joins
Filter observations from one table based on whether or not they match an observation in another table.
Semi join: returns the intersection, only the columns from the left table, and no duplicates.
new_df = dfx[dfx['column'].isin(dfy['column'])]
Anti join: returns the left table excluding the intersection; only the columns from the left table.
1. new_dfx = dfx.merge(dfy, on='column', how='left', indicator=True)
2. new_dfy = new_dfx.loc[new_dfx['_merge'] == 'left_only', 'column to return']
3. new_dfz = dfx[dfx['column'].isin(new_dfy)]
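Both filtering joins sketched on hypothetical genre/track tables:

```python
import pandas as pd

genres = pd.DataFrame({"gid": [1, 2, 3], "name": ["rock", "jazz", "pop"]})
tracks = pd.DataFrame({"gid": [1, 1, 3]})

# Semi join: genres that DO appear in tracks; left columns only, no duplicates
semi = genres[genres["gid"].isin(tracks["gid"])]

# Anti join: genres that do NOT appear in tracks
merged = genres.merge(tracks, on="gid", how="left", indicator=True)
only_left = merged.loc[merged["_merge"] == "left_only", "gid"]
anti = genres[genres["gid"].isin(only_left)]
```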
Concatenate DataFrames together vertically
pd.concat([dfx, dfy, ...]) takes the list of tables to concatenate. Other arguments:
ignore_index=True: resets the index, because the tables otherwise keep their own indexes after concat
keys=['x', 'y', 'z']: labels each source table; careful here, keys requires ignore_index=False
sort=True: for tables with a different number of columns, sorts the combined columns alphabetically
join='inner': only brings the matching columns; the default value is 'outer'; leaves sort without effect
axis=0: vertical, is the default
dfx.append([dfy, dfz], ignore_index=True, sort=True) (note: .append() is deprecated and was removed in pandas 2.0; use pd.concat() instead)
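A sketch of the concat options above, on two made-up monthly tables with mismatched columns:

```python
import pandas as pd

jan = pd.DataFrame({"id": [1, 2], "amount": [10, 20]})
feb = pd.DataFrame({"id": [3], "amount": [30], "extra": ["x"]})

# Vertical concat (axis=0 is the default); ignore_index renumbers the rows,
# sort=True orders the combined columns alphabetically
stacked = pd.concat([jan, feb], ignore_index=True, sort=True)

# keys labels each source table (requires ignore_index=False)
keyed = pd.concat([jan, feb], keys=["jan", "feb"])

# join='inner' keeps only the columns shared by every table
shared = pd.concat([jan, feb], join="inner", ignore_index=True)
```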
Verifying integrity
pd.merge(..., validate=None): pass 'one_to_one', 'one_to_many', 'many_to_one' or 'many_to_many' and the merge will check that the relationship holds; if it is not true, an error (MergeError) is raised.
pd.concat(..., verify_integrity=False): the default value; if set to True, it checks whether there are duplicate values in the index (and only in the index) and raises an error if there are.
Using merge_ordered()
Method for merging time series and other ordered data; the result is sorted automatically and the default join is 'outer'. pd.merge_ordered(dfx, dfy, fill_method='ffill'): to fill missing data we can use forward fill, which fills missing values with the previous value. If the missing value is in the first row, it stays missing because there is no previous value. merge_ordered() allows a right join during the merge, and when it cannot match the rows of the tables exactly, it can use forward fill to interpolate the missing data.
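A minimal sketch with made-up GDP/population series sampled in different years:

```python
import pandas as pd

gdp = pd.DataFrame({"year": [2015, 2017, 2019], "gdp": [100, 110, 120]})
pop = pd.DataFrame({"year": [2016, 2018], "pop": [50, 55]})

# Ordered outer merge (the default), sorted by the key; ffill copies the
# previous row's value into gaps. The first pop value stays missing
# because there is no previous value to copy.
ordered = pd.merge_ordered(gdp, pop, on="year", fill_method="ffill")
```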
Using merge_asof()
Similar to merge_ordered(), but performs a left join. Use it when you are working with data sampled from a process and the dates or times may not align exactly: it matches on the nearest value in the key column instead of requiring exact matches. By default it matches values that are equal or less; the argument direction='forward' reverses this behavior, selecting the first row in the right table whose key column is greater than or equal to the left's. The merged-on columns must be sorted. It can be used to do fuzzy matching of dates between tables, and after matching two tables, if there are missing values at the top of the right table's columns, this function can fill them in.
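The backward/forward matching behavior, sketched with made-up quote and trade timestamps:

```python
import pandas as pd

quotes = pd.DataFrame({"time": [1, 5, 10], "quote": [100, 101, 102]})
trades = pd.DataFrame({"time": [2, 6], "trade": ["a", "b"]})

# Nearest-key left join: by default each trade matches the last quote
# with time <= the trade's time (both tables must be sorted on the key)
backward = pd.merge_asof(trades, quotes, on="time")

# direction='forward': the first quote with time >= the trade's time
forward = pd.merge_asof(trades, quotes, on="time", direction="forward")
```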
Both: Can set suffix for overlapping column names Can be used when working with ordered or time series data
Selecting data with .query()
df.query('condition'): accepts an input string that determines which rows are returned, similar to the statement after a WHERE clause in a SQL statement. Conditions can be combined with and/or: df.query('column == "value" or (column == "value" and column < number)')
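A sketch of a combined condition, on a made-up stock table:

```python
import pandas as pd

stocks = pd.DataFrame({"stock": ["disney", "nike", "nike"],
                       "close": [100, 90, 130]})

# The string works like a SQL WHERE clause; combine tests with and/or
subset = stocks.query('stock == "disney" or (stock == "nike" and close < 100)')
```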
Reshaping data with .melt()
The .melt() method allows us to unpivot our dataset: df.melt(id_vars=['column_x', 'column_y', ...], value_vars=['value_x', 'value_y', ...], var_name='name of the variable column', value_name='name of the value column')
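Unpivoting a wide table into long format, with made-up yearly temperatures:

```python
import pandas as pd

wide = pd.DataFrame({"city": ["Rome", "Oslo"],
                     "2019": [18, 5], "2020": [19, 6]})

# Unpivot the year columns into (variable, value) pairs; id_vars columns
# are repeated for each melted value
long = wide.melt(id_vars=["city"], value_vars=["2019", "2020"],
                 var_name="year", value_name="temp")
```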
Add your notes here
# Add your code snippets here
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
- What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
- Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
- Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.