Skip to content
Data Manipulation with dplyr
Run the hidden code cell below to import the data used in this course.
# Load the Tidyverse
library(tidyverse)
# Load the course datasets
babynames <- read_rds("datasets/babynames.rds")
counties <- read_rds("datasets/counties.rds")
Dplyr Data Manipulation Notes
Can be installed on its own, or with the entire Tidyverse package (see code below)
Chapter 1 Verbs
- select() - can pair down the number of variables in the dataset (code below)
- filter() - filter observations based on logical operators
- arrange() - sorts data based on one or more variables (defaults to ascending - see below)
- mutate() - add new variables or change existing variables
glimpse() function
- used to view the first few values from each variable, along with the data type (useful)
# Loading in dplyr package only
install.packages("dplyr")
# Loading in tidyverse package
install.packages("tidyverse")
# Using select() to pair down the number of variables in the dataset
counties %>%
select(state, county, population, unemployment)
# Can assign to a new variable and print it
counties_selected <- counties %>%
select(state, county, population, unemployment)
# Using arrange to sort data based on one or more variables
counties_selected %>%
arrange(population)
# Have to specify if you want to arrange in descending order
counties_selected %>%
arrange(desc(population))
# filter for counties with unemployment less than 6% in the state of NY, arranged by descending pop
counties_selected %>%
arrange(desc(population)) %>%
filter(state == "New York",
unemployment < 6)
# Use mutate to transform percent unemployment rate to the total number of unemployed in the pop and save it as a new variable
counties_selected %>%
mutate(unemployed_population = population * unemployment / 100)
# Which counties have the highest number of unemployed people?
counties_selected %>%
mutate(unemployed_population = population * unemployment / 100)
arrange(desc(unemployed_population))
Count Verb
- One way to aggregate data to find out the NUMBER OF OBSERVATIONS Generic use of the verb results in a 1x1 table called "n" that tells us the number of observations (see code below)
Counting a specific variable
counties %>% count(state) will give you the number of counties in each state.
Counting and sorting
- Allows us to aggregate data and sort by it counties %>% count(state, sort = TRUE) will give you the number of counties in each state sorted from most common observations to least.
Weighting your counts
Counting citizens by state
- You can weigh your count by particular variables rather than finding the number of counties. In this case, you'll find the number of citizens in each state.
# Count verb
counties %>%
count()
# Counting a specific variable
counties %>%
count(state)
# Counting and sorting
counties %>%
count(state, sort = TRUE)
# Adding weight
counties %>%
count(state, wt = population, sort = TRUE)
# Add weight - example: find the number of citizens in each state
counties %>%
count(state, wt = citizens, sort = TRUE)
# Nestle count, mutate, sort, weight...
counties_selected %>%
mutate(population_walk = walk * population / 100) %>%
count(state, wt = population_walk, sort = TRUE)
Summarize verb
Takes many observations and turns them into one observation
Can combine summary functions and create multiple variables in a single line
# Summarize
summarize(total_population = sum(population))
# Combining summaries
counties %>%
group_by(state) %>%
summarize(total_pop = sum(population),
average_unemployment = mean(unemployment))
# Arrange the results
counties %>%
group_by(state) %>%
summarize(total_pop = sum(population),
average_unemployment = mean(unemployment))
arrange(desc(average_unemployment))
# Group by multiple columns at onces - results in 1 column for each combination of state & metro
counties %>%
group_by(state, metro) %>%
summarize(total_pop = sum(population),
average_unemployment = mean(unemployment))
arrange(desc(average_unemployment))
# Can add ungroup() to remove a grouping that was added
Selecting Data
Can add "helpers" when selecting data...
- Contains("word") will select any columns with that word in it
- starts_with("word")
- ends_with("word")
- last_col()
- matches()
run ?select_helpers for more info
Selecting ranges
counties %>% select(state, county, drive:work_at_home)
-
will select every variable in the range of columns between "Drive" and "work_at_home"
-
can then arrange by a variable that was selected by adding:
-
%>% arrange(drive)
removing variables
select(-variable_name) <- by adding the minus sign before variable name, it deletes the variable