Data Manipulation with pandas

Transforming DataFrames

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
sales = pd.read_csv("datasets/walmart.csv")

Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

  • .head() returns the first few rows (the “head” of the DataFrame).
  • .info() shows information on each of the columns, such as the data type and number of missing values.
  • .shape returns the number of rows and columns of the DataFrame.
  • .describe() calculates a few summary statistics for each column.

homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018.

  • The individuals column is the number of homeless individuals who are not part of a family with children.
  • The family_members column is the number of homeless individuals who are part of a family with children.
  • The state_pop column is the state's total population.

print(homelessness.head())
homelessness.info() # prints its report directly and returns None, so no print() is needed
print(homelessness.shape) # a tuple: (number of rows, number of columns)
print(homelessness.describe())
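
For readers without the course datasets, the same four inspection tools can be tried on a small stand-in. This is only a sketch: the `toy` DataFrame and all of its values are made up for illustration and are not taken from homelessness.csv.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "state": ["A", "B", "C"],
    "individuals": [100, 250, 75],
    "state_pop": [1000, 5000, 800],
})

print(toy.head(2))          # first two rows instead of the default five
toy.info()                  # prints column dtypes and non-null counts
n_rows, n_cols = toy.shape  # .shape is an attribute, so no parentheses
print(n_rows, n_cols)

# By default, .describe() summarises only the numeric columns
summary = toy.describe()
print(summary.loc["mean", "individuals"])
```

Note that .shape is an attribute while .head(), .info() and .describe() are methods, which is why only the latter take parentheses.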

Parts of a DataFrame

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

  • .values: A two-dimensional NumPy array of values.
  • .columns: An index of columns: the column names.
  • .index: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options.

print(homelessness.columns)
print(homelessness.index)
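
The three components can be seen side by side on a small example. This is a sketch under assumptions: the `toy` DataFrame and its row labels are hypothetical, and `.to_numpy()` is used here as the method form that returns the same array as `.values`.

```python
import pandas as pd

# Hypothetical toy data with explicit row names instead of row numbers
toy = pd.DataFrame(
    {"state": ["A", "B"], "individuals": [100, 250]},
    index=["row_a", "row_b"],
)

values = toy.to_numpy()   # two-dimensional NumPy array, same data as toy.values
print(values.shape)       # rows by columns
print(list(toy.columns))  # the column names
print(list(toy.index))    # the row names
```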

Sorting rows

The sort_values() method in pandas sorts the rows of a DataFrame by the values in one or more columns. Its two most commonly used arguments are:

  • by: The column name, or list of column names, to sort by.
  • ascending: Whether to sort in ascending or descending order. The default is True, which sorts from smallest to largest. To sort in descending order, set ascending to False.

# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values('individuals', ascending = True)

# Print the top few rows
print(homelessness_ind.head())
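
To see the effect of ascending=False, here is a minimal sketch on made-up data; the `toy` DataFrame is hypothetical and stands in for homelessness.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "state": ["A", "B", "C"],
    "individuals": [100, 250, 75],
})

# Largest values first; sort_values() returns a new DataFrame
toy_desc = toy.sort_values("individuals", ascending=False)
print(toy_desc)

# The original DataFrame keeps its row order
print(toy)
```

The sorted result keeps the original row labels, so its index is no longer in numeric order.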

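The sort_values() method can also sort by multiple columns. Pass a list of column names to by and, optionally, a matching list of True/False values to ascending. For example, the following code sorts homelessness by region in ascending order and, within each region, by family_members in descending order: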

# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region','family_members'], ascending = [True, False])

# Print the top few rows
print(homelessness_reg_fam.head())
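
The way the second column only breaks ties within the first is easier to see on a few rows. The sketch below uses a hypothetical `toy` DataFrame with made-up values, not the course data.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "region": ["West", "East", "West", "East"],
    "family_members": [10, 30, 40, 20],
})

# Regions A to Z; within each region, largest family_members first
toy_sorted = toy.sort_values(
    ["region", "family_members"],
    ascending=[True, False],
)
print(toy_sorted)
```

Both East rows come before both West rows, and inside each region the larger family_members value is listed first.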