Skip to content
Data Manipulation with dplyr
  • AI Chat
  • Code
  • Report
  • Spinner
    # Start coding here...

    Exploring data with dplyr

    The dplyr package is one of several packages included in the Tidyverse collection.
    dplyr can be installed on its own, or with the other Tidyverse packages by installing tidyverse.
    Install dplyr only: install.packages ("dplyr")
    Install entire tidyverse, inc. dplyr: install.packages ("tidyverse")

    Transforming Data with dplyr

    Select verb: to select variables or columns from a dataset

    database %>% select(variable 1, variable 2, variable 3,...)

    Filter and arrange verbs:

    counties_selected <- counties %>%
    select (state, county, population, unemployment)


    Arrange verb: sorts data based on one or more variables

    counties_selected %>%
    arrange(population) or arrange(desc(population))

    Filter verb: extract only particular observations from a dataset

    counties_selected %>%
    arrange(desc(population)) %>%
    filter(state == "New York")

    counties_selected %>%
    arrange(desc(population)) %>%
    filter(state == "New Yor", unemployment < 6)

    Mutate verb: we use the mutate() verb to calculate add new variables or change existing variables

    counties_selected %>%
    mutate (unemployed_population = population * unemployment / 100) %>%

    Aggregating Data

    Count verb: one way we can aggregate data is to count it. The result is a table with one row and one column, called n.

    When we explore datasets, we're often interested in sorting the counted data to find the most common observations. The argument sort allows us to do that.
    counties %>%
    count(state, sort = TRUE)
    We can add the argument wt, which stands for weight. As a result, instead of seeing the number of counties in each state, we see the total population
    counties %>%
    count(state, wt=population, sort = TRUE)

    The group_by, summarize and ungroup verbs

    The summarize verb takes many observations and turns them into one observation.
    counties %>%
    summarize (total_population = sum(population), average_unemployment = mean(unemployment))
    sum(), mean(), median(), min(), max(), n()
    The verb group_by() allows us to aggregate data within groups.
    counties %>%
    group_by(state,metro) %>%
    summarize(total_pop = sum(population), average_unemployment=mean(unemployment))
    The verb ungroup() allows us to not keep state as a group.

    The slice_min and slice_max verbs: allow us to extract the most extreme observations from each group.

    Like summarize(), slice_max() operates on a grouped table, and returns the largest observations in each group.
    counties_selected %>%
    group_by(state) %>%
    slice_max(population, n=1)
    The function takes two arguments: the column we want to base the ordering on, and the number of observations to extract from each group, specified with the n argument.
    slice_min() returns the smallest observations in each group
    counties_selected %>%
    group_by(state) %>%
    slice_max(unemployment, n=3)

    Selecting and Transforming Data


    We can select a range of columns. For exapmle, there are a series of columns containing information on how people get to work. If we wanted to select this range of columns from drive to work_at_home, we can use drive:work_at_home
    counties %>%
    select (state, county, drive:work_at_home)
    dplyr provides 'select helpers', which are functions that specify criteria for choosing columns. Select helpers take strings, which must be specified using quotes ""

    Contains function

    counties %>%
    select (state, county, contains("work"))

    Starts with function

    starts_with selects only the columns that start with a particular prefix.
    counties %>%
    select (state, county, starts_with("income")

    Other helpers

    last_col: grabs the last column
    matches: selects columns that hve a specified pattern

    We can use select to remove variables from a table by adding a minus sign - in front of the column name to remove.
    counties %>% select(-census_id)

    counties %>%
    Select state, county, population, and industry-related columns
    select(state, county, population,professional:production) %>%
    Arrange service in descending order

    In both cases, the professional:production notation selects all columns from 'professional' to 'production' inclusive, assuming those variables are consecutively positioned in the data frame.

    The resulting selected_columns data frame will contain the specified columns: 'state', 'county', 'population', 'professional', 'service', 'office', 'construction', and 'production'.

    The rename verb

    Helps us rename the columns that we already have

    counties_selected %>% rename(unemployment_rate = unemployment)

    This can also be done with select

    counties_selected %>% select(state, county, population, unemployment_rate = unemployment)

    It is like select and rename in one step.

    The transmute verb

    It is a combination of select() and mutate (). Returns a subset of columns that are transformed and changed.

    suppose we want only one piece of information about each state and county: the fraction of the population made up of men.

    counties %>% transmute (state, county, fraction_men = men/population)

    counties %>% transmute (state, county, unemployed_people = population * unemployment / 100)

    Keeps only specified variables Keeps other variables
    Can't change values select rename
    Can change values transmute mutate

    Case Study: The babynames Dataset

    Filter for multiple names

    %in%: used to filter for multiple values

    babynames_multiple <- babynames %>% filter (name %in% c("Amy", "Christopher"))

    Grouped mutates

    Combining group_by() and mutate ()

    babynames %>% group_by (year) %>% mutate (year_total = sum(number)) %>% ungroup() %>%

    The group_by tells dplyr that we only want to add up within each year. Then, the mutate creates a new column called year_total, with the total number of people born in that year in this dataset. It is good practice to ungroup (), since we are done with grouped calculations and could cause errors in the future.

    babynames %>% group_by (year) %>% mutate (year_total = sum(number)) %>% ungroup() %>% mutate (fraction = number/year_total)

    Now that we have the total in each year, we can calculate the fraction of people born in each year that have each name. This is number / year_total.

    Window functions

    The window function takes a vector and returns another vector of the same length.

    lag () function: moves the items from a list to the right by one.

    v <- c(1,3,6,14) lag (v) returns: NA 1 3 6

    With v - lag (v) we are asking 'What is each value once we've substracted the previous one?'

    This allows us to calculate what is the difference between consecutive values in a vector. We can use that in a grouped mutate to find the changes in the popularity of one name in consecutive years.

    babynames_fraction %>% filter (name == "Matthew") %>% arrange (year) %>% mutate (difference = fraction - lag(fraction))

    This allows to quantify the difference between the fraction column values for each year with the previous one.

    To do this for every name, we can do this as a grouped mutate.

    babynames_fraction %>% arrange (name, year) %>% group_by (name) %>% mutate (difference = fraction - lag(fraction)) %>% ungroup () %>% arrange (desc(difference))