Data Manipulation with dplyr

Run the hidden code cell below to import the data used in this course.

# Load the Tidyverse
library(tidyverse)

# Load the course datasets
babynames <- read_rds("datasets/babynames.rds")
counties <- read_rds("datasets/counties.rds")

Dplyr Data Manipulation Notes

Can be installed on its own, or as part of the Tidyverse collection of packages (see code below)

Chapter 1 Verbs

select() - can pare down the number of variables in the dataset (code below)
filter() - filter observations based on logical operators
arrange() - sorts data based on one or more variables (defaults to ascending - see below)
mutate() - add new variables or change existing variables

glimpse() function

used to view the first few values from each variable, along with the data type (useful)
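A minimal sketch of glimpse() in use. The toy_counties table and its values are made up for illustration; they only stand in for the course's counties dataset.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy_counties <- tibble(
  state = c("New York", "New York", "Texas"),
  county = c("Kings", "Queens", "Harris"),
  population = c(2600000, 2300000, 4500000),
  unemployment = c(10.0, 8.6, 7.5)
)

# Prints one line per variable: its name, its type, and its first few values
glimpse(toy_counties)
```

Unlike printing the table directly, glimpse() shows every column, however wide the table is.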

# Install the dplyr package only (run once; load it afterwards with library(dplyr))
install.packages("dplyr")

# Install the whole tidyverse collection, which includes dplyr
install.packages("tidyverse")

# Using select() to pare down the number of variables in the dataset
counties %>%
	select(state, county, population, unemployment)

# Can assign the result to a new variable and print it
counties_selected <- counties %>%
	select(state, county, population, unemployment)
counties_selected

# Using arrange to sort data based on one or more variables
counties_selected %>%
	arrange(population)

# Have to specify if you want to arrange in descending order
counties_selected %>%
	arrange(desc(population))

# filter for counties with unemployment less than 6% in the state of NY, arranged by descending pop
counties_selected %>%
	arrange(desc(population)) %>%
	filter(state == "New York",
          unemployment < 6)

# Use mutate to transform percent unemployment rate to the total number of unemployed in the pop and save it as a new variable
counties_selected %>%
	mutate(unemployed_population = population * unemployment / 100)

# Which counties have the highest number of unemployed people?
counties_selected %>%
	mutate(unemployed_population = population * unemployment / 100) %>%
	arrange(desc(unemployed_population))

Count Verb

One way to aggregate data to find out the NUMBER OF OBSERVATIONS. Used with no arguments, the verb returns a 1x1 table whose single column, "n", holds the total number of observations (see code below)

Counting a specific variable

counties %>% count(state) will give you the number of counties in each state.

Counting and sorting

Allows us to aggregate data and sort by the result. counties %>% count(state, sort = TRUE) will give you the number of counties in each state, sorted from the most common observations to the least.

Weighting your counts

Counting citizens by state

You can weight your count by a particular variable rather than counting rows (here, counties). With wt = citizens, each county contributes its number of citizens, so the result is the total number of citizens in each state.

# Count verb
counties %>%
	count()

# Counting a specific variable
counties %>%
	count(state)

# Counting and sorting
counties %>%
	count(state, sort = TRUE)

# Adding weight
counties %>%
	count(state, wt = population, sort = TRUE)

# Add weight - example: find the number of citizens in each state
counties %>%
	count(state, wt = citizens, sort = TRUE)

# Combine mutate and count with weight and sort
# (uses counties, because counties_selected as defined above has no walk column)
counties %>%
	mutate(population_walk = walk * population / 100) %>%
	count(state, wt = population_walk, sort = TRUE)
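One way to read the wt argument: a weighted count gives the same totals as grouping and summing the weight column. A small sketch of that equivalence, using a made-up toy_counties table (not the course data):

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy_counties <- tibble(
  state = c("A", "A", "B"),
  population = c(100, 200, 50)
)

# Weighted count: sums population within each state, largest first
weighted <- toy_counties %>%
  count(state, wt = population, sort = TRUE)

# The same totals written out with group_by() and summarize()
summed <- toy_counties %>%
  group_by(state) %>%
  summarize(n = sum(population)) %>%
  arrange(desc(n))

weighted
summed
```

Both tables have one row per state, with n equal to 300 for A and 50 for B.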

Summarize verb

Takes many observations and turns them into one observation

Can combine summary functions and create multiple variables in a single line

# Summarize
counties %>%
	summarize(total_population = sum(population))

# Combining summaries
counties %>%
	group_by(state) %>%
	summarize(total_pop = sum(population),
             average_unemployment = mean(unemployment))

# Arrange the results
counties %>%
	group_by(state) %>%
	summarize(total_pop = sum(population),
             average_unemployment = mean(unemployment)) %>%
	arrange(desc(average_unemployment))

# Group by multiple columns at once - results in 1 row for each combination of state & metro
counties %>%
	group_by(state, metro) %>%
	summarize(total_pop = sum(population),
             average_unemployment = mean(unemployment)) %>%
	arrange(desc(average_unemployment))

# Can add ungroup() to remove a grouping that was added
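A minimal sketch of ungroup() in use. The toy table is made up for illustration, and the explicit .groups argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  state = c("A", "A", "B", "B"),
  metro = c("Metro", "Nonmetro", "Metro", "Metro"),
  population = c(100, 200, 50, 25)
)

# Summarizing two grouping columns peels off only the last one (metro),
# so the result is still grouped by state
by_state_metro <- toy %>%
  group_by(state, metro) %>%
  summarize(total_pop = sum(population), .groups = "drop_last")

group_vars(by_state_metro)

# ungroup() removes the remaining grouping, so the next summarize()
# collapses the whole table to a single row
overall <- by_state_metro %>%
  ungroup() %>%
  summarize(total_pop = sum(total_pop))

overall
```

Without the ungroup() call, the final summarize() would return one row per state instead of one row overall.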

Selecting Data

Can add "helpers" when selecting data...

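contains("word") - selects any columns with that word in the name
starts_with("word") - selects columns whose names start with that word
ends_with("word") - selects columns whose names end with that word
last_col() - selects the last column
matches("pattern") - selects columns whose names match a regular expression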

run ?select_helpers for more info
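A short sketch of each helper on a one-row toy table. The column names are chosen only to exercise the helpers and are not claimed to match the course data exactly.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  state = "A",
  county = "X",
  income = 50000,
  income_err = 1200,
  work_at_home = 4.5,
  unemployment = 6.1
)

with_contains <- toy %>% select(state, contains("work"))   # state, work_at_home
with_starts   <- toy %>% select(starts_with("income"))     # income, income_err
with_ends     <- toy %>% select(ends_with("err"))          # income_err
with_last     <- toy %>% select(last_col())                # unemployment
with_matches  <- toy %>% select(matches("^(un|in)"))       # income, income_err, unemployment

names(with_matches)
```

Helpers can be mixed with plain column names in the same select() call, as in the contains() line.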

Selecting ranges

counties %>% select(state, county, drive:work_at_home)

will select state, county, and every column from "drive" through "work_at_home", inclusive
can then arrange by a variable that was selected by adding:
%>% arrange(drive)
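A small sketch of a range selection followed by arrange(). The toy table is made up, and its column order between drive and work_at_home is an assumption for illustration.

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  state = c("A", "A", "B"),
  county = c("X", "Y", "Z"),
  income = c(1, 2, 3),
  drive = c(80, 60, 75),
  carpool = c(10, 20, 10),
  transit = c(5, 15, 5),
  work_at_home = c(5, 5, 10)
)

# drive:work_at_home picks drive, work_at_home, and every column between them,
# following the column order of the table; income is left out
selected <- toy %>%
  select(state, county, drive:work_at_home) %>%
  arrange(drive)

selected
```

The range depends on column position, so reordering the table changes which columns fall inside it.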

removing variables

select(-variable_name) - adding a minus sign before a variable name drops that variable from the result (the original dataset is unchanged unless you reassign it)
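A minimal sketch of dropping columns with a minus sign, again on a made-up toy table:

```r
library(dplyr)

# Hypothetical toy data (values invented for illustration)
toy <- tibble(
  state = c("A", "B"),
  county = c("X", "Z"),
  population = c(100, 50),
  unemployment = c(6.1, 4.2)
)

# Drop a single column
dropped_one <- toy %>% select(-unemployment)

# Drop several columns at once by wrapping them in c()
dropped_two <- toy %>% select(-c(population, unemployment))

names(dropped_one)
names(dropped_two)
names(toy)  # the original table still has all four columns
```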
