Data Manipulation with pandas
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
Take Notes
df.pivot_table(values=, index=, columns=, fill_value=, margins=, aggfunc=)
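A minimal sketch of how the arguments fit together, using made-up sales data (the column names are illustrative, not from the course datasets):

```python
import pandas as pd

# Hypothetical sales data, not from the course datasets
sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "type": ["food", "toys", "food", "toys", "food"],
    "amount": [10, 20, 30, 40, 50],
})

# Mean amount per store/type; fill_value replaces missing cells,
# margins=True adds an "All" row/column with overall aggregates
pivot = sales.pivot_table(values="amount", index="store",
                          columns="type", fill_value=0,
                          margins=True, aggfunc="mean")
```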
Index
.set_index()  .reset_index()  .loc[list of labels]  .sort_index(level=[], ascending=[])
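How these fit together, sketched on a small made-up table (city/temperature names are illustrative):

```python
import pandas as pd

# Hypothetical temperature readings (illustrative only)
df = pd.DataFrame({
    "country": ["Italy", "Canada", "Italy", "Canada"],
    "city": ["Rome", "Toronto", "Milan", "Ottawa"],
    "temp": [18, 5, 16, 4],
})

# Build a two-level index, then sort it (required before label slicing)
indexed = df.set_index(["country", "city"]).sort_index(
    level=["country", "city"], ascending=[True, True])

# .loc with a list of outer-index labels
canada = indexed.loc[["Canada"]]

# .reset_index() moves the index levels back into ordinary columns
flat = indexed.reset_index()
```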
Slicing DataFrames
Rows
First, sort the index with .sort_index(). Outer-level slicing: pass index labels instead of positions; unlike positional slicing, the last label is included. Inner-level slicing: pass the first and last values as (outer index, inner index) tuples: df.loc[(outer index, inner index):(outer index, inner index)]
Columns
df.loc[rows, columns] — use : to keep all rows (or x:y for a specific range); x:y on the column side gives the range of columns to keep
Slicing with iloc
df.iloc[x:y (rows), x:y (columns)] — unlike .loc, the last position is not included
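The label-inclusive vs. position-exclusive difference, sketched on a toy frame (names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [5, 6, 7, 8],
                   "c": [9, 10, 11, 12]},
                  index=["w", "x", "y", "z"])

# .loc slices by label and INCLUDES the last label
by_label = df.loc["w":"y", "a":"b"]   # rows w, x, y; columns a, b

# .iloc slices by position and EXCLUDES the last position
by_position = df.iloc[0:2, 0:2]       # rows 0 and 1; columns a, b
```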
Use Boolean conditions to subset temperatures for rows in 2010 and 2011
temperatures_bool = temperatures[(temperatures["date"] >= "2010-01-01") & (temperatures["date"] <= "2011-12-31")]
Pivot Tables + Slicing
.mean(axis='index') aggregates down the rows (one value per column); axis='columns' aggregates across the columns (one value per row). To find the row with the minimum or maximum mean of a DataFrame, first compute the mean, then subset with a Boolean condition: df[df.mean(axis='columns') == df.mean(axis='columns').max()] (or .min()).
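A small sketch of the axis behavior and the Boolean max trick, with made-up monthly temperatures:

```python
import pandas as pd

# Hypothetical monthly temperatures per city (illustrative only)
temps = pd.DataFrame({"jan": [0, 10, 20], "jul": [20, 30, 40]},
                     index=["Oslo", "Paris", "Cairo"])

# axis="index" (the default): mean of each column, computed down the rows
col_means = temps.mean(axis="index")

# axis="columns": mean of each row, computed across the columns
row_means = temps.mean(axis="columns")

# Row(s) with the highest mean, selected with a Boolean mask
hottest = temps[row_means == row_means.max()]
```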
Visualizing your data
Bar plots
.plot(kind='type of plot', title='plot title', x='column for the x-axis', y='column for the y-axis', rot=int, rotates the tick labels for easier reading)
plt.legend(['labels to distinguish multiple overlaid plots'])
alpha=float between 0 and 1, regulates transparency; goes into .hist() or .plot()
plt.show() displays the plot
.scatterplot(data=df, x=, y=, hue='column used for the plot legend/colors') — seaborn function to get a scatter plot
Subset avocados for the conventional type, and the average price column. Create a histogram.
Create a histogram of avg_price for organic type avocados. Add a legend to your plot, with the names "conventional" and "organic". Show your plot.
Solution
# Subset conventional avocados, histogram of avg_price with 20 bins
avocados[avocados["type"] == "conventional"]["avg_price"].hist(alpha=0.5, bins=20)
# Histogram of avg_price for organic avocados
avocados[avocados["type"] == "organic"]["avg_price"].hist(alpha=0.5, bins=20)
# Add a legend
plt.legend(["conventional", "organic"])
# Show the plot
plt.show()
Missing values
df.isna() or df.isnull() detects NaN values. Chaining .sum() gives the number of missing values per column (True = 1, False = 0). Chaining .any() gives specifics: it reports, for each column, whether any NaN is present. We can chain .plot(kind='bar') onto the .sum() result to show the amount of missing values per column.
Removing missing values
df.dropna() deletes the rows where data is missing; how='all' deletes only the rows where every value is NaN
Replace missing values
df.fillna(0) Changes the NaN values to 0
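The detect/drop/fill workflow above, sketched on a tiny frame with made-up gaps:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, np.nan]})

missing_per_column = df.isna().sum()   # count of NaNs per column
any_missing = df.isna().any()          # True per column containing any NaN

# how="all" only drops rows where EVERY value is NaN (here, the middle row)
dropped = df.dropna(how="all")

# Replace every remaining NaN with 0
filled = df.fillna(0)
```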
Creating DataFrames
Methods
List of dictionaries: builds the DataFrame row by row.
list_of_dicts = [{key: value, key: value, key: value, key: value}, {key: value, key: value, key: value, key: value}]
df = pd.DataFrame(list_of_dicts)
Dictionary of lists: builds the DataFrame column by column.
dict_of_lists = {key: [value, value, value], key: [value, value, value], key: [value, value, value]}
df = pd.DataFrame(dict_of_lists)
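Both construction methods side by side, with made-up dog data (names are illustrative); they produce the same DataFrame:

```python
import pandas as pd

# Row by row: list of dictionaries
list_of_dicts = [
    {"name": "Ginger", "breed": "Dachshund", "height_cm": 22},
    {"name": "Scout", "breed": "Dalmatian", "height_cm": 59},
]
by_rows = pd.DataFrame(list_of_dicts)

# Column by column: dictionary of lists
dict_of_lists = {
    "name": ["Ginger", "Scout"],
    "breed": ["Dachshund", "Dalmatian"],
    "height_cm": [22, 59],
}
by_columns = pd.DataFrame(dict_of_lists)
```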
Reading and Writing CSVs
df = pd.read_csv('data.csv', nrows=number of rows to read, index_col=['name of the column that we want as index; can be more than one']) to import data. df.to_csv('df.csv') to export a DataFrame.
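A self-contained sketch using an in-memory buffer instead of a real file (a path like 'data.csv' works the same way):

```python
import io
import pandas as pd

csv_text = "city,temp\nRome,18\nOslo,5\n"

# nrows limits how many rows are read; index_col sets the index column(s)
df = pd.read_csv(io.StringIO(csv_text), nrows=1, index_col=["city"])

# to_csv with no path returns the CSV as a string (with a path, it
# writes to disk, e.g. df.to_csv('df.csv')); index=False drops the index
exported = df.reset_index().to_csv(index=False)
```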
Joining Data with Pandas
Inner Join
Only returns rows that have matching values in both DataFrames.
new_df = dfx.merge(dfy, on='column to match' (pass a list ['x', 'y'] to merge on two columns), suffixes=('_x', '_y') (defines the suffixes for columns that are repeated; the first position is for the left table, the second for the right))
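An inner-join sketch with hypothetical tables (not the course data):

```python
import pandas as pd

wards = pd.DataFrame({"ward": [1, 2, 3], "alderman": ["A", "B", "C"]})
census = pd.DataFrame({"ward": [1, 2, 4], "pop": [100, 200, 400]})

# Inner join: only wards present in BOTH tables survive (1 and 2)
inner = wards.merge(census, on="ward")

# suffixes label the overlapping column names: left gets _w, right gets _c
licenses = pd.DataFrame({"ward": [1, 2], "alderman": ["A", "X"]})
suffixed = wards.merge(licenses, on="ward", suffixes=("_w", "_c"))
```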
One to many relationships
One-to-one: every row in the left table is related to only one row in the right table.
One-to-many: every row in the left table is related to one or more rows in the right table.
Merging multiple DataFrames
new_df = dfx.merge(dfy, on='...')
.merge(dfz, on='...', suffixes=('',''))
Left Join
Returns all the rows from the left table and only those rows from the right table with a match. new_df = dfx.merge(dfy, on='column', how='left'). In a one-to-one relationship, the resulting DataFrame has the same number of rows as the left DataFrame.
Other Joins
Right: returns all the rows from the right table and includes only rows from the left table with a match. If the matching column has a different name in each table, we write it like this: new_df = dfx.merge(dfy, how='right', left_on='name in the left table', right_on='name in the right table')
Outer: returns all of the rows from both tables, regardless of whether there is a match between them.
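The three join types compared on toy tables with differently named key columns (names are illustrative):

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2, 3], "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "r": ["x", "y", "z"]})

# Left join: all rows from left, matches from right (NaN where no match)
left_join = left.merge(right, how="left", left_on="id", right_on="key")

# Right join: all rows from right, matches from left
right_join = left.merge(right, how="right", left_on="id", right_on="key")

# Outer join: all rows from both tables
outer_join = left.merge(right, how="outer", left_on="id", right_on="key")
```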
Merging a table to itself (self-join)
new_df = dfx.merge(dfx, left_on='', right_on='', suffixes=('x', 'y'), how='')
Merging on Indexes
If the index has the same name in both tables, we use on= to give the name of the index: new_df = dfx.merge(dfy, on='index'...). If the index has different names in the two tables, we use left_on= and right_on=, and add left_index=True and right_index=True accordingly: new_df = dfx.merge(dfy, left_on='index_name', right_on='index_name', left_index=True, right_index=True)
Filtering joins
Filter observations from one table based on whether or not they match an observation in another table.
Semi join: returns the intersection, only the columns from the left table, and no duplicates.
new_df = dfx[dfx['column'].isin(dfy['column'])]
Anti join: returns the left table excluding the intersection; only the columns from the left table.
1. new_dfx = dfx.merge(dfy, on='column', how='left', indicator=True)
2. new_dfy = new_dfx.loc[new_dfx['_merge'] == 'left_only', 'column to return']
3. new_dfz = dfx[dfx['column'].isin(new_dfy)]
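Both filtering joins sketched on hypothetical genre/track tables:

```python
import pandas as pd

genres = pd.DataFrame({"gid": [1, 2, 3], "name": ["rock", "jazz", "pop"]})
tracks = pd.DataFrame({"gid": [1, 1, 3]})

# Semi join: genres that DO appear in tracks; left columns only, no duplicates
semi = genres[genres["gid"].isin(tracks["gid"])]

# Anti join: genres that do NOT appear in tracks
merged = genres.merge(tracks, on="gid", how="left", indicator=True)
only_left = merged.loc[merged["_merge"] == "left_only", "gid"]
anti = genres[genres["gid"].isin(only_left)]
```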
Concatenate DataFrames together vertically
pd.concat([dfx, dfy, ...]) takes the list of tables to concatenate. Other arguments:
ignore_index=True: resets the index, because the tables otherwise keep their own indexes after concat
keys=['x', 'y', 'z']: labels each source table; careful here, keys requires ignore_index=False
sort=True: for tables with a different number of columns, sorts the combined columns alphabetically
join='inner': only brings the matching columns; the default value is 'outer'; leaves sort without effect
axis=0: vertical, is the default
dfx.append([dfy, dfz], ignore_index=True, sort=True) (note: .append() is deprecated and was removed in pandas 2.0; use pd.concat() instead)
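A sketch of the concat options above, on two made-up monthly tables with mismatched columns:

```python
import pandas as pd

jan = pd.DataFrame({"id": [1, 2], "amount": [10, 20]})
feb = pd.DataFrame({"id": [3], "amount": [30], "extra": ["x"]})

# Vertical concat (axis=0 is the default); ignore_index renumbers the rows,
# sort=True orders the combined columns alphabetically
stacked = pd.concat([jan, feb], ignore_index=True, sort=True)

# keys labels each source table (requires ignore_index=False)
keyed = pd.concat([jan, feb], keys=["jan", "feb"])

# join='inner' keeps only the columns shared by every table
shared = pd.concat([jan, feb], join="inner", ignore_index=True)
```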
Verifying integrity
pd.merge(..., validate=None): pass 'one_to_one', 'one_to_many', 'many_to_one' or 'many_to_many' and the merge will check that the relationship holds; if it is not true, an error (MergeError) is raised.
pd.concat(..., verify_integrity=False): the default value; if set to True, it checks whether there are duplicate values in the index (and only in the index) and raises an error if there are.
Using merge_ordered()
Method for merging time series and other ordered data; the result is sorted automatically and the default join is 'outer'. pd.merge_ordered(dfx, dfy, fill_method='ffill'): to fill missing data we can use forward fill, which fills missing values with the previous value. If the missing value is in the first row, it stays missing because there is no previous value. merge_ordered() allows a right join during the merge, and when it cannot match the rows of the tables exactly, it can use forward fill to interpolate the missing data.
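A minimal sketch with made-up GDP/population series sampled in different years:

```python
import pandas as pd

gdp = pd.DataFrame({"year": [2015, 2017, 2019], "gdp": [100, 110, 120]})
pop = pd.DataFrame({"year": [2016, 2018], "pop": [50, 55]})

# Ordered outer merge (the default), sorted by the key; ffill copies the
# previous row's value into gaps. The first pop value stays missing
# because there is no previous value to copy.
ordered = pd.merge_ordered(gdp, pop, on="year", fill_method="ffill")
```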
Using merge_asof()
Similar to merge_ordered(), but performs a left join. Use it when you are working with data sampled from a process and the dates or times may not align exactly: it matches on the nearest value in the key column instead of requiring exact matches. By default it matches values that are equal or less; the argument direction='forward' reverses this behavior, selecting the first row in the right table whose key column is greater than or equal to the left's. The merged-on columns must be sorted. It can be used to do fuzzy matching of dates between tables, and after matching two tables, if there are missing values at the top of the right table's columns, this function can fill them in.
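The backward/forward matching behavior, sketched with made-up quote and trade timestamps:

```python
import pandas as pd

quotes = pd.DataFrame({"time": [1, 5, 10], "quote": [100, 101, 102]})
trades = pd.DataFrame({"time": [2, 6], "trade": ["a", "b"]})

# Nearest-key left join: by default each trade matches the last quote
# with time <= the trade's time (both tables must be sorted on the key)
backward = pd.merge_asof(trades, quotes, on="time")

# direction='forward': the first quote with time >= the trade's time
forward = pd.merge_asof(trades, quotes, on="time", direction="forward")
```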
Both: Can set suffix for overlapping column names Can be used when working with ordered or time series data
Selecting data with .query()
df.query('condition'): accepts an input string that determines which rows are returned, similar to the statement after a WHERE clause in a SQL statement. Conditions can be combined with and/or: df.query('column == "value" or (column == "value" and column < number)')
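A sketch of a combined condition, on a made-up stock table:

```python
import pandas as pd

stocks = pd.DataFrame({"stock": ["disney", "nike", "nike"],
                       "close": [100, 90, 130]})

# The string works like a SQL WHERE clause; combine tests with and/or
subset = stocks.query('stock == "disney" or (stock == "nike" and close < 100)')
```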
Reshaping data with .melt()
The .melt() method allows us to unpivot our dataset: df.melt(id_vars=['column_x', 'column_y', ...], value_vars=['value_x', 'value_y', ...], var_name='name of the variable column', value_name='name of the value column')
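Unpivoting a wide table into long format, with made-up yearly temperatures:

```python
import pandas as pd

wide = pd.DataFrame({"city": ["Rome", "Oslo"],
                     "2019": [18, 5], "2020": [19, 6]})

# Unpivot the year columns into (variable, value) pairs; id_vars columns
# are repeated for each melted value
long = wide.melt(id_vars=["city"], value_vars=["2019", "2020"],
                 var_name="year", value_name="temp")
```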
Add your notes here
# Add your code snippets here
Explore Datasets
Use the DataFrames imported in the first cell to explore the data and practice your skills!
- Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
- What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
- Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
- Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.