Data Manipulation with dplyr

# Start coding here...

Exploring data with dplyr

The dplyr package is one of several packages included in the Tidyverse collection.
dplyr can be installed on its own, or with the other Tidyverse packages by installing tidyverse.
Install dplyr only: install.packages ("dplyr")
Install entire tidyverse, inc. dplyr: install.packages ("tidyverse")

Transforming Data with dplyr

Select verb: to select variables or columns from a dataset

database %>% select(variable 1, variable 2, variable 3,...)

Filter and arrange verbs:

counties_selected <- counties %>%
select (state, county, population, unemployment)

counties_selected

Arrange verb: sorts data based on one or more variables

counties_selected %>%
arrange(population) or arrange(desc(population))

Filter verb: extract only particular observations from a dataset

counties_selected %>%
arrange(desc(population)) %>%
filter(state == "New York")

counties_selected %>%
arrange(desc(population)) %>%
filter(state == "New Yor", unemployment < 6)

Mutate verb: we use the mutate() verb to calculate add new variables or change existing variables

counties_selected %>%
mutate (unemployed_population = population * unemployment / 100) %>%
arrange(desc(unemployed_population))

Aggregating Data

Count verb: one way we can aggregate data is to count it. The result is a table with one row and one column, called n.

When we explore datasets, we're often interested in sorting the counted data to find the most common observations. The argument sort allows us to do that.
counties %>%
count(state, sort = TRUE)
We can add the argument wt, which stands for weight. As a result, instead of seeing the number of counties in each state, we see the total population
counties %>%
count(state, wt=population, sort = TRUE)

The group_by, summarize and ungroup verbs

The summarize verb takes many observations and turns them into one observation.
counties %>%
summarize (total_population = sum(population), average_unemployment = mean(unemployment))
sum(), mean(), median(), min(), max(), n()
The verb group_by() allows us to aggregate data within groups.
counties %>%
group_by(state,metro) %>%
summarize(total_pop = sum(population), average_unemployment=mean(unemployment))
The verb ungroup() allows us to not keep state as a group.

The slice_min and slice_max verbs: allow us to extract the most extreme observations from each group.

Like summarize(), slice_max() operates on a grouped table, and returns the largest observations in each group.
counties_selected %>%
group_by(state) %>%
slice_max(population, n=1)
The function takes two arguments: the column we want to base the ordering on, and the number of observations to extract from each group, specified with the n argument.
slice_min() returns the smallest observations in each group
counties_selected %>%
group_by(state) %>%
slice_max(unemployment, n=3)

Selecting and Transforming Data

Selecting

We can select a range of columns. For exapmle, there are a series of columns containing information on how people get to work. If we wanted to select this range of columns from drive to work_at_home, we can use drive:work_at_home
counties %>%
select (state, county, drive:work_at_home)
dplyr provides 'select helpers', which are functions that specify criteria for choosing columns. Select helpers take strings, which must be specified using quotes ""

Contains function

counties %>%
select (state, county, contains("work"))

Starts with function

starts_with selects only the columns that start with a particular prefix.
counties %>%
select (state, county, starts_with("income")

Other helpers

contains()
starts_with
ends_with
last_col: grabs the last column
matches: selects columns that hve a specified pattern

We can use select to remove variables from a table by adding a minus sign - in front of the column name to remove.
counties %>% select(-census_id)

counties %>%
Select state, county, population, and industry-related columns
select(state, county, population,professional:production) %>%
Arrange service in descending order
arrange(desc(service))

In both cases, the professional:production notation selects all columns from 'professional' to 'production' inclusive, assuming those variables are consecutively positioned in the data frame.

The resulting selected_columns data frame will contain the specified columns: 'state', 'county', 'population', 'professional', 'service', 'office', 'construction', and 'production'.

The rename verb

Helps us rename the columns that we already have

counties_selected %>% rename(unemployment_rate = unemployment)

This can also be done with select

counties_selected %>% select(state, county, population, unemployment_rate = unemployment)

It is like select and rename in one step.

The transmute verb

It is a combination of select() and mutate (). Returns a subset of columns that are transformed and changed.

suppose we want only one piece of information about each state and county: the fraction of the population made up of men.

counties %>% transmute (state, county, fraction_men = men/population)

counties %>% transmute (state, county, unemployed_people = population * unemployment / 100)

	Keeps only specified variables	Keeps other variables
Can't change values	select	rename
Can change values	transmute	mutate

Case Study: The babynames Dataset

Filter for multiple names

%in%: used to filter for multiple values

babynames_multiple <- babynames %>% filter (name %in% c("Amy", "Christopher"))

Grouped mutates

Combining group_by() and mutate ()

babynames %>% group_by (year) %>% mutate (year_total = sum(number)) %>% ungroup() %>%

The group_by tells dplyr that we only want to add up within each year. Then, the mutate creates a new column called year_total, with the total number of people born in that year in this dataset. It is good practice to ungroup (), since we are done with grouped calculations and could cause errors in the future.

babynames %>% group_by (year) %>% mutate (year_total = sum(number)) %>% ungroup() %>% mutate (fraction = number/year_total)

Now that we have the total in each year, we can calculate the fraction of people born in each year that have each name. This is number / year_total.

Window functions

The window function takes a vector and returns another vector of the same length.

lag () function: moves the items from a list to the right by one.

v <- c(1,3,6,14) lag (v) returns: NA 1 3 6

With v - lag (v) we are asking 'What is each value once we've substracted the previous one?'

This allows us to calculate what is the difference between consecutive values in a vector. We can use that in a grouped mutate to find the changes in the popularity of one name in consecutive years.

babynames_fraction %>% filter (name == "Matthew") %>% arrange (year) %>% mutate (difference = fraction - lag(fraction))

This allows to quantify the difference between the fraction column values for each year with the previous one.

To do this for every name, we can do this as a grouped mutate.

babynames_fraction %>% arrange (name, year) %>% group_by (name) %>% mutate (difference = fraction - lag(fraction)) %>% ungroup () %>% arrange (desc(difference))