Data Manipulation with dplyr
Run the hidden code cell below to import the data used in this course.
# Load the Tidyverse
library(tidyverse)
# Load the course datasets
babynames <- read_rds("datasets/babynames.rds")
counties <- read_rds("datasets/counties.rds")Take Notes
Add notes about the concepts you've learned and code cells with code you want to keep.
install dplyr only: install.packages("dplyr")
install entire tidyverse: install.packages("tidyverse")
glimpse(dataset) shows first few values from each variable, along with data type
select() extracts particular variables from a dataset dataset %>% select(fields you want to keep go here) to make a new table only with what you want to keep with assignment new_table <- dataset %>% select(data)
arrange() - sorts data based on variable provided arrange(desc/asc(variable))
mutate() create new variable from dataset to match what you need
count() - number of observations real value of the verb comes with having it count a specific variable sort shows most common observations sort = TRUE weight - wt weighs the observations based on a variable passed
counties %>%
summarize(total_population =sum(population),
average_unemployment = mean(unemployment))
Summary functions sum() mean() median() min() max() n()
Aggregate within groups
counties %>%
group_by(state)%>%
summarize(total_pop =sum(population),
average_unemployment = mean(unemployment))
Sorting summaries counties %>% group_by(state)%>% summarize(total_pop =sum(population), average_unemployment = mean(unemployment))%>% arrange(desc(average_unemployment))
Grouping on multiple columns counties %>% group_by(state, metro)%>% summarize(total_pop =sum(population))
Ungroup counties %>% group_by(state, metro)%>% summarize(total_pop =sum(population))%>% ungroup()
When you use summarize on a table that has multiple groups, only the last group gets "peeled off". This is useful when you want to continue doing additional summaries or aggregations.
If you don't want to keep state as a group, you can add another dplyr verb: ungroup().
slice_max() Returns the largest observations in each group counties_selected <- counties %>% select(state, county, population, unemployment, income)
counties_selected %>% group_by(state)%>% slice_max(population, n = 1)
n = number of observations per whatever you grouped by
Select a range counties %>% select(state, county, drive:work_at_home)
Select and arrange counties %>% select(state, county, drive:work_at_home)%>% arrange(drive)
dplyr provides "select helpers", which are functions that specify criteria for choosing columns. Contains counties %>% select(state, county, contains("work"))
Starts with counties %>% select(state, county, starts_with("income"))
Other helpers contains() starts_with() ends_with() last_col() matches() For more: ?select_helpers
Removing a variable
counties %>%
select(-census_id)
Select columns
counties_selected <- counties %>%
select(state, county, population, unemployment)
counties_selected
Rename a column counties_selected %>% rename(unemployment_rate = unemployment) New name on left, old on right Combine verbs counties_selected %>% select(state, county, population, unemployment_rate = unemployment) Second one does the same thing as the first without the extra step, sometimes you want the extra step though
Transmute verb Transmute is like a combination of select and mutate: it returns a subset of the columns like select, but it can also transform and change the columns, like mutate, at the same time.
Select and calculate counties %>% transmute(state, county, fraction_men = men / population)
Filter for multiple names %in%: used to filter for multiple values lll babynames_multiple <- babynames %>% filter(name %in% c("Amy","Christopher"))
Combining group_by() and mutate()
babynames %>%
group_by(year)%>%
mutate(year_total =sum(number))
Add the fraction column babynames %>% group_by(year)%>% mutate(year_total =sum(number))%>% ungroup()%>% mutate(fraction = number / year_total)