Tutorials
r programming

# Subsetting Datasets in R

Subsetting datasets is a crucial skill for any data professional. Learn and practice subsetting data in this quick interactive tutorial!

Whether you're comparing how different demographics respond to marketing campaigns, zooming in on a specific time frame, or pulling information about a select few products from the inventory, subsetting datasets enables you to extract useful observations in your dataset. R is a great tool that makes subsetting data easy and intuitive. By the end of this tutorial, you'll have the know-how to extract the information you want from your dataset.

Subsetting your data does not change the content of your data, but simply selects the portion most relevant to the goal you have in mind. In general, there are three ways to subset the rows and columns of your dataset—by index, by name, and by value.

### Subsetting rows and columns by index

One way to subset your rows and columns is by your dataset's indices. This is the same as describing your rows and columns as "the first row", "all rows in second and fifth columns", or "the first row in second to fifth columns". Let's specify such phrases using a dataset called iris in R. From its documentation, "[t]his famous (Fisher's or Anderson's) iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica."

# "The first row": iris[1, ] # "All rows in second and fifth columns": iris[, c(2, 5)] # "The first row in second to fifth columns": iris[1, 2:5]

To subset your data, square brackets are used after your dataset object. The rows of your dataset are specified as the first element inside the square brackets, and the columns of your dataset are specified as the second, separated by a comma:

data[rows, columns]

### Subsetting rows and columns by name

In R, the rows and columns of your dataset have name attributes. Row names are rarely used and by default provide indices—integers numbering from 1 to the number of rows of your dataset—just like what you saw in the previous section. In fact, if you called rownames() on the iris dataset, you will see that these are just indexed from 1 to 150:

> rownames(iris)
[1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14"
[15] "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"  "25"  "26"  "27"  "28"
[29] "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"
[43] "43"  "44"  "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"
[57] "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"
[71] "71"  "72"  "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"
[85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"
[99] "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110" "111" "112"
[113] "113" "114" "115" "116" "117" "118" "119" "120" "121" "122" "123" "124" "125" "126"
[127] "127" "128" "129" "130" "131" "132" "133" "134" "135" "136" "137" "138" "139" "140"
[141] "141" "142" "143" "144" "145" "146" "147" "148" "149" "150"

> nrow(iris)
[1] 150

Row names are more common in smaller datasets and are used to make observations in your dataset easily identifiable. For example, for a small dataset containing health information of a doctor's patients, the row names of this dataset could be the full names of the patients.

Column names on the other hand, are ubiquitous to almost any dataset. You can access these with the colnames() function or the names() function:

colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

To subset your dataset by the names of your rows and columns, simply use the square brackets again, prefixed by your dataset object:

# Sepal width of the fifth observation iris["5", "Sepal.Width"] # Sepal width and petal width iris[, c("Sepal.Width", "Petal.Width")]

It's important to note that both the row and column names are characters, so using single or double quotes is absolutely necessary!

### Subsetting rows and columns by value

Subsetting your rows and columns by value often allows the most flexibility. For example, you can extract the data on Iris setosa using a conditional statement like this:

> iris[iris$Species == "setosa", ] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 1 5.1 3.5 1.4 0.2 setosa 2 4.9 3.0 1.4 0.2 setosa 3 4.7 3.2 1.3 0.2 setosa 4 4.6 3.1 1.5 0.2 setosa ... 47 5.1 3.8 1.6 0.2 setosa 48 4.6 3.2 1.4 0.2 setosa 49 5.3 3.7 1.5 0.2 setosa 50 5.0 3.3 1.4 0.2 setosa Conditional statements like iris$Species == "setosa" belong in the row element in the square brackets (i.e., the first element before the comma). In addition to the conditional statement in the first element, you can specify columns by index or name in the second element. In the console below, try selecting just the sepal measurements of Iris setosa:

# Print sepal measurements for setosa # Print sepal measurements for setosa (by name) iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")] # Print sepal measurements for setosa (by index) iris[iris$Species == "setosa", 1:2] test_or( test_output_contains('iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")]', incorrect_msg = "Select just the Sepal.Length and Sepal.Width columns for the setosa iris either by name or index."), test_output_contains('iris[iris$Species == "setosa", c("Sepal.Width", "Sepal.Length")]', incorrect_msg = "Select just the Sepal.Length and Sepal.Width columns for the setosa iris either by name or index.")) success_msg("Great work!")

### Recap

In this tutorial, you:

• Learned about subsetting your data frame by index. Rows and columns are indexed as integers from 1 to the number of rows and columns, respectively.
• Learned about subsetting your data frame by name. You learned that row names are rarely specified, and that column names are of character types.
• Learned to use conditional statements in the row element inside square brackets to subset your data frame by value.
• Learned to combine these methods to allow more flexible subsetting (e.g., using conditionals for rows and subsetting by index or name for columns).

Below are some exercises to help reinforce what you've learned. Practice makes perfect, so give these a try!

### Practice exercises

Select all observations for which the sepal width is greater than the petal length.

# Which observations have greater sepal width than petal length? # Which observations have greater sepal width than petal length? iris[iris$Sepal.Width > iris$Petal.Length, ] test_output_contains('iris[iris$Sepal.Width > iris$Petal.Length, ]', incorrect_msg = paste("Select the observations for which Sepal.Width values", "are larger than Petal.Length values.")) success_msg("Nice! Turns out only Iris setosa have this shape.")
Use iris$Sepal.Width > iris$Petal.Length inside the square brackets. Don't forget the comma!

Select all Iris versicolor whose sepal width is larger than the average sepal width among all species.

# Store the versicolor data as an object called versicolor # Select for all Iris versicolor that have bigger than average sepal width # Store the versicolor data as an object called versicolor versicolor <- iris[iris$Species == "versicolor", ] # Select for all Iris versicolor that have bigger than average sepal width versicolor[versicolor$Sepal.Width > mean(iris$Sepal.Width), ] test_object('versicolor', undefined_msg = "Did you define the versicolor object?", incorrect_msg = paste("Select where Species is equal to \"versicolor\".", "Remember to use the == operator!")) test_output_contains('versicolor[versicolor$Sepal.Width > mean(iris$Sepal.Width), ]', incorrect_msg = paste("Not quite! Are you using the mean() function", "to find the average sepal width across all species", "and not just for Iris versicolor?")) success_msg("Great work!") Use the mean() function to find the average sepal width in iris, then use a conditional statement prefixed by versicolor to select only those with bigger-than-average values. Report the average sepal width for all Iris virginica that have petal lengths smaller than 5 centimeters. # You're free to approach this problem however you like! # Store the virginica data as an object called virginica virginica <- iris[iris$Species == "virginica", ] # Average sepal width for Iris virginica that have petal lengths smaller than 5 mean(virginica[virginica$Petal.Length < 5, ]$Sepal.Width) test_output_contains('2.8', incorrect_msg = "Incorrect! This is a tough one!") success_msg("Excellent!")
First, try subsetting for all Iris virginica then for where Petal.Length is smaller than 5. You can find the average by calling the mean() function.