Skip to content

Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.


1 hidden cell

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

dogs.groupby(["color","breed"])["weight_kg"].sum() is the usage of multiple grouping with groupby

PIVOT TABLES

dogs.pivot_table(values="weight_kg",index="color")

takes values and index parameters -- index is the parameter that we want to groupby

!! pivot tables calculate the mean of the parameters by default

IF WE WANT TO ADD OTHER SUMMARY STATISTICS...

--we use aggfunc parameter in the pivot_table()

dogs.pivot_table(values="weight_kg",index="color",aggfunc=[np.mean,np.median,max,min])

IF WE WANT TO ADD ANOTHER PARAMETER TO GROUPBY...

we can use "columns" parameter

dogs.pivot_table(values="weight_kg",index="color",columns="breed",aggfunc=[np.mean,np.median,max,min])

MISSING VALUES

there are parameters to fill in missing values in the dataset, which are fill_value and margins.

in pivot_table --> fill_value=True

fills the value with 0s.

SETTING COLUMNS AS INDEX

-- set_index("nameofindex")

--reset_index() **this keeps the contents of the index

--reset_index(drop=True) **this drops the contents

.loc[] is a subsetting method that takes index.

SLICING

!! we must sort the rows before slicing --> .sort_index()

  • to slice outer level, first and last should be string
  • to slice inner level, first and last should be tuple

dataFrame["columnname"].dt.year --> takes the year part of the date

PIVOT TABLE + SLICING

we can do slicing on the pivot tables like this:

mean_temp_by_year = temp_by_country_city_vs_year.mean(axis="columns") -- for columns mean_temp_by_year = temp_by_country_city_vs_year.mean(axis="index") -- for rows

# Add your code snippets here

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.