
Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

#DATA MANIPULATION WITH PANDAS
"Introduction to pandas" 
#Exploring a DataFrame 

#Chapter 1: DataFrames 
"Sorting and subsetting" 
"Creating new columns"

#Chapter 2: Aggregating Data 
'Summary statistics'
'Counting'
'Grouped summary statistics'

#Chapter 3: Slicing and Indexing Data
'Subsetting using slicing'
'Indexes and subsetting using indexes'

#Chapter 4: Creating and Visualizing Data
'Plotting'
'Handling missing data'
'Reading data into a DataFrame'

#Exploring a DataFrame:

dogs.head()
# the first few rows (the “head” of the DataFrame)
dogs.info()
# shows information on each column, such as the data type and the number of non-null (non-missing) values
dogs.shape
# returns the number of rows and columns of the DataFrame
dogs.describe()
# calculates a few summary statistics for each numeric column (the default behavior)
dogs.values
# A two-dimensional NumPy array of values.
dogs.columns
# An index of columns: the column names.
dogs.index
# An index for the rows: either row numbers or row names.
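
The dogs DataFrame used throughout these notes is never defined in the workspace, so the snippets cannot run as written. A minimal sketch with made-up values is shown here; the column names follow the ones used in the notes, but every row is hypothetical and not taken from the course data.

```python
import pandas as pd

# Hypothetical sample rows; the course's real dogs data is not part of these notes.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black"],
    "height_cm": [56, 43, 46, 49, 59],
    "weight_kg": [24, 24, 24, 17, 29],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25",
                      "2011-12-11", "2017-01-20"],
})

print(dogs.head(3))        # first three rows
dogs.info()                # dtypes and non-null counts, printed directly
print(dogs.shape)          # (number of rows, number of columns)
print(dogs.describe())     # summary statistics for the numeric columns
print(list(dogs.columns))  # column names as a plain list
print(dogs.index)          # default row index (a RangeIndex)

# .to_numpy() returns the same kind of 2-D array as .values
array = dogs.to_numpy()
```

Note that .shape, .values, .columns and .index are attributes, so they take no parentheses, while .head(), .info() and .describe() are methods.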

#Chapter 1: DataFrames
'Sorting and subsetting'

dogs.sort_values("weight_kg")
# Sort by weight, lightest dog at the top
dogs.sort_values("weight_kg", ascending=False)
# Sort by weight, heaviest dog at the top
dogs.sort_values(["weight_kg", "height_cm"])
# Sort by weight (lightest first), breaking ties by height (shortest first)
dogs.sort_values(["weight_kg", "height_cm"], ascending=[True, False])
# Sort by weight (lightest first), breaking ties by height (tallest first)
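
To see how the tie-breaking works, here is a small sketch on hypothetical data (the rows are made up for illustration). Three dogs share the same weight, so the second sort key decides their order. It also shows that sort_values() returns a new DataFrame and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

# Hypothetical sample rows, not the course data.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "height_cm": [56, 43, 46, 49, 59],
    "weight_kg": [24, 24, 24, 17, 29],
})

# Lightest first; among equal weights, shortest first
lightest_then_shortest = dogs.sort_values(["weight_kg", "height_cm"])

# Lightest first; among equal weights, tallest first
lightest_then_tallest = dogs.sort_values(
    ["weight_kg", "height_cm"], ascending=[True, False]
)

print(lightest_then_shortest)
print(lightest_then_tallest)

# The original row order is unchanged because the result was not assigned back
print(dogs)
```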

"Subsetting Columns"

dogs["name"]
# Subset a single column: DataFrame["column_name"] returns a Series
dogs[["name", "weight_kg"]]
# Subset multiple columns: pass a list of column names to get a DataFrame

"Subsetting Rows"


dogs["height_cm"] > 50
# Returns a Series of True/False values, one per row

dogs[dogs["height_cm"] > 50]
# Get the rows for dogs taller than 50 cm

dogs[dogs["breed"] == "Labrador"]
# Subsetting based on text data (the comparison is case-sensitive)

dogs[dogs["date_of_birth"] < "2015-01-01"]
#Subsetting based on dates

is_lab = dogs["breed"] == "Labrador"
is_brown = dogs["color"] == "Brown"
dogs[is_lab & is_brown]
# Subsetting based on multiple conditions

is_black_or_brown = dogs["color"].isin(["Black", "Brown"])
dogs[is_black_or_brown]
# Subsetting on several allowed values of one column using .isin()
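
The conditions above can also be written inline. The sketch below uses hypothetical rows (not the course data) and assumes the date column is converted with pd.to_datetime() first. When conditions are combined inline, each one needs its own parentheses, because & and | bind more tightly than comparisons such as > and ==.

```python
import pandas as pd

# Hypothetical sample rows, not the course data.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Gray", "Black"],
    "height_cm": [56, 43, 46, 49, 59],
    "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25",
                      "2011-12-11", "2017-01-20"],
})
# Convert the text dates to real datetimes so comparisons are chronological
dogs["date_of_birth"] = pd.to_datetime(dogs["date_of_birth"])

# AND: both conditions must hold
tall_brown = dogs[(dogs["height_cm"] > 50) & (dogs["color"] == "Brown")]

# OR: at least one condition must hold
lab_or_tall = dogs[(dogs["breed"] == "Labrador") | (dogs["height_cm"] > 48)]

# NOT: invert a condition with ~
not_lab = dogs[~(dogs["breed"] == "Labrador")]

# Dates: compare against a Timestamp
born_before_2015 = dogs[dogs["date_of_birth"] < pd.Timestamp("2015-01-01")]

# Several allowed values for one column
black_or_brown = dogs[dogs["color"].isin(["Black", "Brown"])]

print(tall_brown, lab_or_tall, not_lab, born_before_2015, black_or_brown, sep="\n\n")
```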


"Creating a new column"

dogs["height_m"] = dogs["height_cm"] / 100
# Add a new column by assigning to a column name that does not exist yet
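
Creating a column combines naturally with the sorting and subsetting steps above. The sketch below uses hypothetical rows, and the bmi column and the threshold of 100 are illustrative choices, not something defined in these notes.

```python
import pandas as pd

# Hypothetical sample rows, not the course data.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "height_cm": [56, 43, 46, 49, 59],
    "weight_kg": [24, 24, 24, 17, 29],
})

# New column derived from an existing one
dogs["height_m"] = dogs["height_cm"] / 100

# New column derived from two columns (illustrative body mass index)
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

# Subset rows, sort them, then keep only the columns of interest
bmi_lt_100 = dogs[dogs["bmi"] < 100]
bmi_lt_100_sorted = bmi_lt_100.sort_values("bmi", ascending=False)
result = bmi_lt_100_sorted[["name", "height_m", "bmi"]]

print(result)
```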

# Add a total column as the sum of individuals and family_members
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]

# Add a p_individuals column: the proportion of the total that are individuals
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]

# See the result
print(homelessness)
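
To check the arithmetic without loading the CSV, the same two columns can be computed on a tiny stand-in table. This is only a sketch: the state column, its labels, the counts, and the 0.7 cutoff are all hypothetical, and only the individuals and family_members column names come from the code above.

```python
import pandas as pd

# Hypothetical stand-in for the homelessness data; all values are made up.
homelessness_sample = pd.DataFrame({
    "state": ["A", "B", "C"],            # hypothetical label column
    "individuals": [300, 100, 450],
    "family_members": [100, 100, 50],
})

homelessness_sample["total"] = (
    homelessness_sample["individuals"] + homelessness_sample["family_members"]
)
homelessness_sample["p_individuals"] = (
    homelessness_sample["individuals"] / homelessness_sample["total"]
)

# Keep rows where individuals make up more than 70% of the total,
# highest proportion first
high_individuals = homelessness_sample[homelessness_sample["p_individuals"] > 0.7]
high_individuals_sorted = high_individuals.sort_values("p_individuals", ascending=False)

print(high_individuals_sorted[["state", "p_individuals"]])
```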

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame?
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines.
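
One possible approach to the first exercise is to group, take the maximum, sort, and keep the first five rows. The sketch below runs on a made-up table; the column names department and weekly_sales are assumptions, so check walmart.columns for the real names before adapting it.

```python
import pandas as pd

# Hypothetical stand-in for the walmart data; column names and values are assumed.
walmart_sample = pd.DataFrame({
    "department": [1, 1, 2, 2, 3, 3, 4, 5, 6, 6],
    "weekly_sales": [100.0, 250.0, 80.0, 60.0, 300.0,
                     120.0, 40.0, 500.0, 90.0, 95.0],
})

top_five = (
    walmart_sample
    .groupby("department")["weekly_sales"]  # one group per department
    .max()                                  # highest weekly sales in each group
    .sort_values(ascending=False)           # largest first
    .head(5)                                # keep the top five departments
)

print(top_five)
```

The result is a Series indexed by department, so the department numbers appear as the index rather than as a regular column.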