Skip to content
Data Manipulation with dplyr
  • AI Chat
  • Code
  • Report
  • Data Manipulation with dplyr

    Run the hidden code cell below to import the data used in this course.

    # Load the Tidyverse
    library(tidyverse)
    
    # Load the course datasets
    babynames <- read_rds("datasets/babynames.rds")
    counties <- read_rds("datasets/counties.rds")

    Dplyr Data Manipulation Notes

    Can be installed on its own, or with the entire Tidyverse package (see code below)

    Chapter 1 Verbs

    1. select() - can pair down the number of variables in the dataset (code below)
    2. filter() - filter observations based on logical operators
    3. arrange() - sorts data based on one or more variables (defaults to ascending - see below)
    4. mutate() - add new variables or change existing variables

    glimpse() function

    • used to view the first few values from each variable, along with the data type (useful)
    # Loading in dplyr package only
    install.packages("dplyr")
    
    # Loading in tidyverse package
    install.packages("tidyverse")
    
    # Using select() to pair down the number of variables in the dataset 
    counties %>%
    	select(state, county, population, unemployment)
    
    # Can assign to a new variable and print it
    counties_selected <- counties %>%
    	select(state, county, population, unemployment)
    
    # Using arrange to sort data based on one or more variables
    counties_selected %>%
    	arrange(population)
    
    # Have to specify if you want to arrange in descending order
    counties_selected %>%
    	arrange(desc(population))
    
    # filter for counties with unemployment less than 6% in the state of NY, arranged by descending pop
    counties_selected %>%
    	arrange(desc(population)) %>%
    	filter(state == "New York",
              unemployment < 6)
    
    # Use mutate to transform percent unemployment rate to the total number of unemployed in the pop and save it as a new variable
    counties_selected %>%
    	mutate(unemployed_population = population * unemployment / 100)
    
    # Which counties have the highest number of unemployed people?
    counties_selected %>%
    	mutate(unemployed_population = population * unemployment / 100)
    	arrange(desc(unemployed_population))

    Count Verb

    • One way to aggregate data to find out the NUMBER OF OBSERVATIONS Generic use of the verb results in a 1x1 table called "n" that tells us the number of observations (see code below)

    Counting a specific variable

    counties %>% count(state) will give you the number of counties in each state.

    Counting and sorting

    • Allows us to aggregate data and sort by it counties %>% count(state, sort = TRUE) will give you the number of counties in each state sorted from most common observations to least.

    Weighting your counts

    Counting citizens by state

    • You can weigh your count by particular variables rather than finding the number of counties. In this case, you'll find the number of citizens in each state.
    # Count verb
    counties %>%
    	count()
    
    # Counting a specific variable
    counties %>%
    	count(state)
    
    # Counting and sorting
    counties %>%
    	count(state, sort = TRUE)
    
    # Adding weight
    counties %>%
    	count(state, wt = population, sort = TRUE)
    
    # Add weight - example: find the number of citizens in each state
    counties %>%
    	count(state, wt = citizens, sort = TRUE)
    
    # Nestle count, mutate, sort, weight...
    counties_selected %>%
    	mutate(population_walk = walk * population / 100) %>%
      	count(state, wt = population_walk, sort = TRUE)  

    Summarize verb

    Takes many observations and turns them into one observation

    Can combine summary functions and create multiple variables in a single line

    # Summarize
    summarize(total_population = sum(population))
    
    # Combining summaries
    counties %>%
    	group_by(state) %>%
    	summarize(total_pop = sum(population),
                 average_unemployment = mean(unemployment))
    
    # Arrange the results
    counties %>%
    	group_by(state) %>%
    	summarize(total_pop = sum(population),
                 average_unemployment = mean(unemployment))
    	arrange(desc(average_unemployment))
    
    # Group by multiple columns at onces - results in 1 column for each combination of state & metro
    counties %>%
    	group_by(state, metro) %>%
    	summarize(total_pop = sum(population),
                 average_unemployment = mean(unemployment))
    	arrange(desc(average_unemployment))
    
    # Can add ungroup() to remove a grouping that was added
    

    Selecting Data

    Can add "helpers" when selecting data...

    • Contains("word") will select any columns with that word in it
    • starts_with("word")
    • ends_with("word")
    • last_col()
    • matches()

    run ?select_helpers for more info

    Selecting ranges

    counties %>% select(state, county, drive:work_at_home)

    • will select every variable in the range of columns between "Drive" and "work_at_home"

    • can then arrange by a variable that was selected by adding:

    • %>% arrange(drive)

    removing variables

    select(-variable_name) <- by adding the minus sign before variable name, it deletes the variable