Data Manipulation with pandas with Kyesswa Steven (Kyessvar Stevenz)
Run the hidden code cell below to import the data used in this course.
# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")
Notes Taken By Kyesswa Steven On 11/5/2023
Inspecting a DataFrame Exercise
When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.
- .head() returns the first few rows (the “head” of the DataFrame).
- .info() shows information on each of the columns, such as the data type and number of missing values.
- .shape returns the number of rows and columns of the DataFrame.
- .describe() calculates a few summary statistics for each column.
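A minimal sketch of these four tools on a small made-up DataFrame. The dogs data and every variable name here are hypothetical and are not part of the course datasets. One detail worth noticing: .info() prints its report itself and returns None, and .shape is an attribute, so it takes no parentheses.

```python
import pandas as pd

# Hypothetical toy data, not one of the course datasets
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [24.0, 24.0, None, 17.0],  # one missing value on purpose
})

first_rows = dogs.head(2)    # first 2 rows instead of the default 5
n_rows, n_cols = dogs.shape  # attribute, so no parentheses
summary = dogs.describe()    # by default, statistics for numeric columns only

# .info() prints its own report and returns None,
# so it is called directly rather than wrapped in print()
info_result = dogs.info()

print(first_rows)
print(n_rows, n_cols)
print(summary)
```

The count row of the summary reflects only non-missing values, so it is one way to spot columns with gaps.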
homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individuals column is the number of homeless individuals not part of a family with children. The family_members column is the number of homeless individuals part of a family with children. The state_pop column is the state's total population.
Instructions 1/4
- Print the head of the homelessness DataFrame.
- Print information about the column types and missing values in homelessness.
- Print the number of rows and columns in homelessness.
- Print some summary statistics that describe the homelessness DataFrame.
# Name : Kyesswa Steven AKA Kyessvar Stevenz
# Date : 11/5/2023
# Linktree : linktr.ee/kyesswasteven
# Website : https://kyesswasteven.me
# Print the head of the homelessness data
print( "Printing the head of the homelessness data ", homelessness.head() )
# Print information about homelessness
# (.info() prints its own report and returns None, so it is not wrapped in print())
print( "Printing information about homelessness" )
homelessness.info()
# Print the shape of homelessness
print( "Printing the shape of homelessness ", homelessness.shape )
# Print a description of homelessness
print( "Print a description of homelessness ", homelessness.describe() )
Parts of a DataFrame Exercise
To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:
- .values: A two-dimensional NumPy array of values.
- .columns: An index of columns: the column names.
- .index: An index for the rows: either row numbers or row names.
You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)
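The three components can be pulled apart on a tiny example. This is only an illustrative sketch: the dogs DataFrame and its row names are made up, and the index is set to names here simply to show that row labels need not be numbers.

```python
import pandas as pd

# Hypothetical toy data with row names instead of the default row numbers
dogs = pd.DataFrame(
    {"breed": ["Labrador", "Poodle"], "height_cm": [56, 43]},
    index=["Bella", "Charlie"],
)

values = dogs.values    # 2D NumPy array holding the cell values
columns = dogs.columns  # Index holding the column names
index = dogs.index      # Index holding the row labels

print(values)
print(columns)
print(index)
```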
Instructions
- Import pandas using the alias pd.
- Print a 2D NumPy array of the values in homelessness.
- Print the column names of homelessness.
- Print the index of homelessness.
# Import pandas using the alias pd
import pandas as pd
# Print the values of homelessness
print( homelessness.values )
# Print the column index of homelessness
print( homelessness.columns )
# Print the row index of homelessness
print( homelessness.index )
Sorting rows Exercise
Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().
In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.
Sort on …           Syntax
one column          df.sort_values("breed")
multiple columns    df.sort_values(["breed", "weight_kg"])
By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".
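As a rough illustration of the "top cases" pattern, the sketch below sorts a made-up dogs DataFrame (hypothetical data and names) by one column and then by two. When sorting on several columns, ascending can also be a list with one True or False per column.

```python
import pandas as pd

# Hypothetical toy data, not one of the course datasets
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Labrador"],
    "weight_kg": [24, 20, 29, 17, 26],
})

# "What are the two heaviest dogs?": sort descending, then take the head
heaviest = dogs.sort_values("weight_kg", ascending=False).head(2)

# Sort by breed (ascending); ties within a breed are broken by weight (descending)
by_breed_weight = dogs.sort_values(
    ["breed", "weight_kg"], ascending=[True, False]
)

print(heaviest)
print(by_breed_weight)
```

Note that .sort_values() returns a new DataFrame and leaves dogs itself unchanged, which is why the result is assigned to a new name.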
Instructions 1/3
- Sort homelessness by the number of homeless individuals, from smallest to largest, and save this as homelessness_ind. Print the head of the sorted DataFrame.
- Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam. Print the head of the sorted DataFrame.
- Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam. Print the head of the sorted DataFrame.
Code Below
# Name : Kyesswa Steven AKA Kyessvar Stevenz
# Date : 11/7/2023
# Linktree : linktr.ee/kyesswasteven
# Website : https://kyesswasteven.me
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values( "individuals" )
# Print the top few rows
print( "Print the top few rows ", homelessness_ind.head() )
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values( "family_members", ascending = False )
# Print the top few rows
print( "Print the top few rows ", homelessness_fam.head() )
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(by=[ "region", "family_members" ], ascending=[ True, False ] )
# Print the top few rows
print( "Print the top few rows ", homelessness_reg_fam.head() )
Subsetting columns Exercise
When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can be used to select only the columns that matter to you in an order that makes sense to you. To select only "col_a" of the DataFrame df, use
df["col_a"]
To select "col_a" and "col_b" of df, use
df[["col_a", "col_b"]]
homelessness is available and pandas is loaded as pd.
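One point that is easy to miss: a single pair of brackets with one column name gives back a Series, while a list of names inside the brackets gives back a DataFrame, even when the list has only one name. The sketch below shows this on a made-up dogs DataFrame (hypothetical data and names).

```python
import pandas as pd

# Hypothetical toy data, not one of the course datasets
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Chow Chow"],
    "weight_kg": [24, 20, 29],
})

breed_series = dogs["breed"]             # single brackets: a Series
breed_frame = dogs[["breed"]]            # list of one name: a one-column DataFrame
reordered = dogs[["weight_kg", "name"]]  # columns come back in the order listed

print(type(breed_series).__name__)
print(type(breed_frame).__name__)
print(reordered.head())
```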
Instructions 1/3
- Create a DataFrame called individuals that contains only the individuals column of homelessness. Print the head of the result.
- Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order. Print the head of the result.
- Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order. Print the head of the result.
Code Below:
# Select the individuals column
individuals = homelessness[ ["individuals"] ]
# Print the head of the result
print( "Print the head of the result in individuals ", individuals.head() )
# Select the state and family_members columns
state_fam = homelessness[ [ "state", "family_members" ] ]
# Print the head of the result
print( "Print the head of the result in state_fam ", state_fam.head() )
# Select only the individuals and state columns, in that order
ind_state = homelessness[ [ "individuals", "state" ] ]
# Print the head of the result
print( "Print the head of the result in ind_state ", ind_state.head() )
Subsetting rows Exercise
A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.
There are many ways to subset a DataFrame; perhaps the most common is to use relational operators to return True or False for each row, then pass that inside square brackets.
dogs[dogs["height_cm"] > 60]
dogs[dogs["color"] == "tan"]
You can filter for multiple conditions at once by using the "bitwise and" operator, &.
dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
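It can help to look at the intermediate True/False values before using them as a filter. The following sketch uses a made-up dogs DataFrame (hypothetical data and names). The parentheses around each condition matter: & binds more tightly than > and ==, so leaving them out changes how the expression is grouped.

```python
import pandas as pd

# Hypothetical toy data, not one of the course datasets
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "height_cm": [56, 43, 61, 65],
    "color": ["brown", "black", "tan", "gray"],
})

# The comparison yields one True/False per row (a boolean Series)
is_tall = dogs["height_cm"] > 60

# Passing that Series inside square brackets keeps only the True rows
tall = dogs[is_tall]

# Two conditions combined with &, each wrapped in its own parentheses
tall_tan = dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

print(is_tall)
print(tall)
print(tall_tan)
```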
homelessness is available and pandas is loaded as pd.
Instructions 1/3
- Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k. View the printed result.
- Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg. View the printed result.
- Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac. View the printed result.
Code Below:
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[ homelessness[ "individuals" ] > 10000 ]
# See the result
print( "Filter for rows where individuals is greater than 10000 in ind_gt_10k ", ind_gt_10k )
# Filter for rows where region is Mountain
mountain_reg = homelessness[ homelessness[ "region" ] == "Mountain" ]
# See the result
print( "Filter for rows where region is Mountain in mountain_reg ", mountain_reg )
# Filter for rows where family_members is less than 1000 and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000)
& (homelessness["region"] == "Pacific")]
# See the result
print("Filter for rows where family_members is less than 1000 and region is Pacific in fam_lt_1k_pac ", fam_lt_1k_pac)
Subsetting rows by categorical variables Exercise
Subsetting data based on a categorical variable often involves using the "or" operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.
colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]
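To make the comparison concrete, the sketch below builds the same filter twice on a made-up dogs DataFrame (hypothetical data and names), once with chained | conditions and once with .isin(), and checks that the two results match.

```python
import pandas as pd

# Hypothetical toy data, not one of the course datasets
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "color": ["brown", "black", "tan", "gray"],
})

# One comparison per category, joined with the "or" operator
with_or = dogs[(dogs["color"] == "brown") | (dogs["color"] == "black")]

# The same filter as a single condition
colors = ["brown", "black"]
with_isin = dogs[dogs["color"].isin(colors)]

# Both approaches select the same rows
print(with_or.equals(with_isin))
print(with_isin)
```

With .isin(), adding another category only means extending the list, rather than writing another comparison.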
homelessness is available and pandas is loaded as pd.
Instructions 1/2
- Filter homelessness for cases where the USA census region is "South Atlantic" or it is "Mid-Atlantic", assigning to south_mid_atlantic. View the printed result.
- Filter homelessness for cases where the USA census state is in the list of Mojave states, canu, assigning to mojave_homelessness. View the printed result.
Code Below:
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[ homelessness[ "region" ].isin( [ "South Atlantic", "Mid-Atlantic" ] ) ]
# See the result
print( south_mid_atlantic )
# The Mojave Desert states
canu = [ "California", "Arizona", "Nevada", "Utah" ]
# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[ homelessness[ "state" ].isin( canu ) ]
# See the result
print( mojave_homelessness )
Adding new columns Exercise
You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.
You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
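A small sketch of deriving columns from existing ones, using a made-up dogs DataFrame. The data, the column names height_m and bmi, and the choice of formula are all assumptions for illustration. Assigning to a column name that does not exist yet creates the column.

```python
import pandas as pd

# Hypothetical toy data, not one of the course datasets
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 46],
    "weight_kg": [24, 20, 17],
})

# Change units: centimetres to metres
dogs["height_m"] = dogs["height_cm"] / 100

# Combine two columns into a new one (weight divided by height squared)
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(dogs)
```

The arithmetic is applied row by row across the whole column, so no explicit loop is needed.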
homelessness is available and pandas is loaded as pd.
Instructions
- Add a new column to homelessness, named total, containing the sum of the individuals and family_members columns.
- Add another column to homelessness, named p_individuals, containing the proportion of homeless people in each state who are individuals.
Code Below: