Skip to content

Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.


1 hidden cell

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

Add your notes here

#drop duplicates
df.drop_duplicates(subset = 'column')
#for multiple columns
subset = ['columna','columnb']
#to count each 'type'
df['column'].value_counts()
#if you want to sort them add
.value_counts(sort = True)
#to normalize add
.value_counts(normalize = True)
#summary by group
df[df['column']=='type within column']['second column comparing'].mean()
#grouped summaries
df.groupby('columna')['columnb'].mean*()
#multiple grouped summaries
df.groupby('columna')['columnb'].agg([min,max,sum])
#grouping by multiple variables

df.groupby(['columna', 'columnb'])['columnc'].mean()
#there is a more efficient way to do this
df.pivot_table(index='column to groupby',values='column you want summarized')
#the default takes the mean
#Aggregation works differently here
df.pivot_table(index='columna',values='columnb',aggfunc=np.median)
#additional summary statistics can be passed
aggfunc[np.mean,np.median]
#grouping by two variables needs the second to be passed in a columns argument
df.pivot_table(index='columna',values='columnb',columns ='columnc')
#A=left,C=top,B=value to be summarized
#think A compared with C
#NaN is the default placeholder for sections without values edit by adding
fill_value = 0
#doesnt have to be 0
#to get a general summary add
margins = True
#the gives the mean of all columns and rows excluding NANs
#bottom right is the mean of everything, whereas the ends represent that row/column

#set a column as an index to align it to the left
variable = df.set_index('column name')
#reset it by using reset index
variable.reset_index()
#adding drop = true to paranthesis removes said column
variable.reset_index(drop = True)
#subsetting on hard mode
df[df['column'].isin(['rows'])]
#subsetting on easy mode when a column is indexed
variable.loc[['indexed columns']]

variable.loc[[('outer level index','inner level column')]]
#sort_index gets more complicated
variable.sort_index(level =['outer index','outer index'],ascending = [True,False])

#slice/subset index numbers 1-3
listname[1:4]
#to start at the beginning [:4]
#to print the whole list[:] 
#to slice dataframes, index by columns first then
variable.loc['index1':'index2']
#to index by inner index layers
variable.loc[('indexa1','indexb1'):('indexa2','indexb2')]
#to slice by columns
variable.loc[:, 'column1':'column2']
#to slice by rows
variable.loc[('rowa1','rowb1'):('rowa2','rowb2'), :]
#pivot table
variable = df.pivot_table('columna', index = 'columnb',columns ='columnc')
#a = value your comparing, b = y value, c = x value
#subset pivot table
variable.loc['start row','end row']
#axis argument
variable.mean(axis = 'index')
#this calculates the mean of all values for each row, so mean of row 1,2,3....

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.