Skip to content

Data Manipulation with dplyr

Run the hidden code cell below to import the data used in this course.

# Load the Tidyverse
library(tidyverse)

# Load the course datasets
babynames <- read_rds("datasets/babynames.rds")
counties <- read_rds("datasets/counties.rds")

Take Notes

Add notes about the concepts you've learned and code cells with code you want to keep.

install dplyr only: install.packages("dplyr")

install entire tidyverse: install.packages("tidyverse")

glimpse(dataset) shows first few values from each variable, along with data type

select() extracts particular variables from a dataset dataset %>% select(fields you want to keep go here) to make a new table only with what you want to keep with assignment new_table <- dataset %>% select(data)

arrange() - sorts data based on variable provided arrange(desc/asc(variable))

mutate() create new variable from dataset to match what you need

count() - number of observations real value of the verb comes with having it count a specific variable sort shows most common observations sort = TRUE weight - wt weighs the observations based on a variable passed

counties %>%
summarize(total_population =sum(population), average_unemployment = mean(unemployment))

Summary functions sum() mean() median() min() max() n()

Aggregate within groups counties %>%
group_by(state)%>% summarize(total_pop =sum(population), average_unemployment = mean(unemployment))

Sorting summaries counties %>% group_by(state)%>% summarize(total_pop =sum(population), average_unemployment = mean(unemployment))%>% arrange(desc(average_unemployment))

Grouping on multiple columns counties %>% group_by(state, metro)%>% summarize(total_pop =sum(population))

Ungroup counties %>% group_by(state, metro)%>% summarize(total_pop =sum(population))%>% ungroup()

When you use summarize on a table that has multiple groups, only the last group gets "peeled off". This is useful when you want to continue doing additional summaries or aggregations.

If you don't want to keep state as a group, you can add another dplyr verb: ungroup().

slice_max() Returns the largest observations in each group counties_selected <- counties %>% select(state, county, population, unemployment, income)

counties_selected %>% group_by(state)%>% slice_max(population, n = 1)

n = number of observations per whatever you grouped by

Select a range counties %>% select(state, county, drive:work_at_home)

Select and arrange counties %>% select(state, county, drive:work_at_home)%>% arrange(drive)

dplyr provides "select helpers", which are functions that specify criteria for choosing columns. Contains counties %>% select(state, county, contains("work"))

Starts with counties %>% select(state, county, starts_with("income"))

Other helpers contains() starts_with() ends_with() last_col() matches() For more: ?select_helpers

Removing a variable counties %>%
select(-census_id)

Select columns counties_selected <- counties %>%
select(state, county, population, unemployment)

counties_selected

Rename a column counties_selected %>% rename(unemployment_rate = unemployment) New name on left, old on right Combine verbs counties_selected %>% select(state, county, population, unemployment_rate = unemployment) Second one does the same thing as the first without the extra step, sometimes you want the extra step though

Transmute verb Transmute is like a combination of select and mutate: it returns a subset of the columns like select, but it can also transform and change the columns, like mutate, at the same time.

Select and calculate counties %>% transmute(state, county, fraction_men = men / population)

Filter for multiple names %in%: used to filter for multiple values lll babynames_multiple <- babynames %>% filter(name %in% c("Amy","Christopher"))

Combining group_by() and mutate() babynames %>%
group_by(year)%>%
mutate(year_total =sum(number))

Add the fraction column babynames %>% group_by(year)%>% mutate(year_total =sum(number))%>% ungroup()%>% mutate(fraction = number / year_total)