Skip to content

Data Manipulation with dplyr

Run the hidden code cell below to import the data used in this course.

# Load the Tidyverse
library(tidyverse)

# Load the course datasets
babynames <- read_rds("datasets/babynames.rds")
counties <- read_rds("datasets/counties.rds")

Dplyr Data Manipulation Notes

Can be installed on its own, or with the entire Tidyverse package (see code below)

Chapter 1 Verbs

  1. select() - can pair down the number of variables in the dataset (code below)
  2. filter() - filter observations based on logical operators
  3. arrange() - sorts data based on one or more variables (defaults to ascending - see below)
  4. mutate() - add new variables or change existing variables

glimpse() function

  • used to view the first few values from each variable, along with the data type (useful)
# Loading in dplyr package only
install.packages("dplyr")

# Loading in tidyverse package
install.packages("tidyverse")

# Using select() to pair down the number of variables in the dataset 
counties %>%
	select(state, county, population, unemployment)

# Can assign to a new variable and print it
counties_selected <- counties %>%
	select(state, county, population, unemployment)

# Using arrange to sort data based on one or more variables
counties_selected %>%
	arrange(population)

# Have to specify if you want to arrange in descending order
counties_selected %>%
	arrange(desc(population))

# filter for counties with unemployment less than 6% in the state of NY, arranged by descending pop
counties_selected %>%
	arrange(desc(population)) %>%
	filter(state == "New York",
          unemployment < 6)

# Use mutate to transform percent unemployment rate to the total number of unemployed in the pop and save it as a new variable
counties_selected %>%
	mutate(unemployed_population = population * unemployment / 100)

# Which counties have the highest number of unemployed people?
counties_selected %>%
	mutate(unemployed_population = population * unemployment / 100)
	arrange(desc(unemployed_population))

Count Verb

  • One way to aggregate data to find out the NUMBER OF OBSERVATIONS Generic use of the verb results in a 1x1 table called "n" that tells us the number of observations (see code below)

Counting a specific variable

counties %>% count(state) will give you the number of counties in each state.

Counting and sorting

  • Allows us to aggregate data and sort by it counties %>% count(state, sort = TRUE) will give you the number of counties in each state sorted from most common observations to least.

Weighting your counts

Counting citizens by state

  • You can weigh your count by particular variables rather than finding the number of counties. In this case, you'll find the number of citizens in each state.
# Count verb
counties %>%
	count()

# Counting a specific variable
counties %>%
	count(state)

# Counting and sorting
counties %>%
	count(state, sort = TRUE)

# Adding weight
counties %>%
	count(state, wt = population, sort = TRUE)

# Add weight - example: find the number of citizens in each state
counties %>%
	count(state, wt = citizens, sort = TRUE)

# Nestle count, mutate, sort, weight...
counties_selected %>%
	mutate(population_walk = walk * population / 100) %>%
  	count(state, wt = population_walk, sort = TRUE)  

Summarize verb

Takes many observations and turns them into one observation

Can combine summary functions and create multiple variables in a single line

# Summarize
summarize(total_population = sum(population))

# Combining summaries
counties %>%
	group_by(state) %>%
	summarize(total_pop = sum(population),
             average_unemployment = mean(unemployment))

# Arrange the results
counties %>%
	group_by(state) %>%
	summarize(total_pop = sum(population),
             average_unemployment = mean(unemployment))
	arrange(desc(average_unemployment))

# Group by multiple columns at onces - results in 1 column for each combination of state & metro
counties %>%
	group_by(state, metro) %>%
	summarize(total_pop = sum(population),
             average_unemployment = mean(unemployment))
	arrange(desc(average_unemployment))

# Can add ungroup() to remove a grouping that was added

Selecting Data

Can add "helpers" when selecting data...

  • Contains("word") will select any columns with that word in it
  • starts_with("word")
  • ends_with("word")
  • last_col()
  • matches()

run ?select_helpers for more info

Selecting ranges

counties %>% select(state, county, drive:work_at_home)

  • will select every variable in the range of columns between "Drive" and "work_at_home"

  • can then arrange by a variable that was selected by adding:

  • %>% arrange(drive)

removing variables

select(-variable_name) <- by adding the minus sign before variable name, it deletes the variable