HomeTutorialsR Programming

# Subsetting Datasets in R

Subsetting datasets is a crucial skill for any data professional. Learn and practice subsetting data in this quick interactive tutorial!
Oct 2018  · 16 min read

Whether you're comparing how different demographics respond to marketing campaigns, zooming in on a specific time frame, or pulling information about a select few products from the inventory, subsetting datasets enables you to extract useful observations in your dataset. R is a great tool that makes subsetting data easy and intuitive. By the end of this tutorial, you'll have the know-how to extract the information you want from your dataset.

Subsetting your data does not change the content of your data, but simply selects the portion most relevant to the goal you have in mind. In general, there are three ways to subset the rows and columns of your dataset—by index, by name, and by value.

## Subsetting Rows and Columns by Index

One way to subset your rows and columns is by your dataset's indices. This is the same as describing your rows and columns as "the first row", "all rows in second and fifth columns", or "the first row in second to fifth columns". Let's specify such phrases using a dataset called `iris` in R. From its documentation, "[t]his famous (Fisher's or Anderson's) iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica."

``` # "The first row": iris[1, ] # "All rows in second and fifth columns": iris[, c(2, 5)] # "The first row in second to fifth columns": iris[1, 2:5] ```

To subset your data, square brackets are used after your dataset object. The rows of your dataset are specified as the first element inside the square brackets, and the columns of your dataset are specified as the second, separated by a comma:

``````data[rows, columns]
``````

## Subsetting Rows and Columns by Name

In R, the rows and columns of your dataset have name attributes. Row names are rarely used and by default provide indices—integers numbering from 1 to the number of rows of your dataset—just like what you saw in the previous section. In fact, if you called `rownames()` on the `iris` dataset, you will see that these are just indexed from 1 to 150:

`````` > rownames(iris)
[1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14"
[15] "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"  "25"  "26"  "27"  "28"
[29] "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"
[43] "43"  "44"  "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"
[57] "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"
[71] "71"  "72"  "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"
[85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"
[99] "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110" "111" "112"
[113] "113" "114" "115" "116" "117" "118" "119" "120" "121" "122" "123" "124" "125" "126"
[127] "127" "128" "129" "130" "131" "132" "133" "134" "135" "136" "137" "138" "139" "140"
[141] "141" "142" "143" "144" "145" "146" "147" "148" "149" "150"

> nrow(iris)
[1] 150``````

Row names are more common in smaller datasets and are used to make observations in your dataset easily identifiable. For example, for a small dataset containing health information of a doctor's patients, the row names of this dataset could be the full names of the patients.

Column names on the other hand, are ubiquitous to almost any dataset. You can access these with the `colnames()` function or the `names()` function:

``````colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"``````

To subset your dataset by the names of your rows and columns, simply use the square brackets again, prefixed by your dataset object:

``` # Sepal width of the fifth observation iris["5", "Sepal.Width"] # Sepal width and petal width iris[, c("Sepal.Width", "Petal.Width")] ```

It's important to note that both the row and column names are characters, so using single or double quotes is absolutely necessary!

## Subsetting Rows and Columns by Value

Subsetting your rows and columns by value often allows the most flexibility. For example, you can extract the data on Iris setosa using a conditional statement like this:

``````> iris[iris\$Species == "setosa", ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa

...

47          5.1         3.8          1.6         0.2  setosa
48          4.6         3.2          1.4         0.2  setosa
49          5.3         3.7          1.5         0.2  setosa
50          5.0         3.3          1.4         0.2  setosa``````

Conditional statements like `iris\$Species == "setosa"` belong in the row element in the square brackets (i.e., the first element before the comma). In addition to the conditional statement in the first element, you can specify columns by index or name in the second element. In the console below, try selecting just the sepal measurements of Iris setosa:

``` # Print sepal measurements for setosa ``` ``` # Print sepal measurements for setosa (by name) iris[iris\$Species == "setosa", c("Sepal.Length", "Sepal.Width")] # Print sepal measurements for setosa (by index) iris[iris\$Species == "setosa", 1:2] ``` ``` test_or( test_output_contains('iris[iris\$Species == "setosa", c("Sepal.Length", "Sepal.Width")]', incorrect_msg = "Select just the Sepal.Length and Sepal.Width columns for the setosa iris either by name or index."), test_output_contains('iris[iris\$Species == "setosa", c("Sepal.Width", "Sepal.Length")]', incorrect_msg = "Select just the Sepal.Length and Sepal.Width columns for the setosa iris either by name or index.")) success_msg("Great work!") ```

### Recap

In this tutorial, you:

• Learned about subsetting your data frame by index. Rows and columns are indexed as integers from 1 to the number of rows and columns, respectively.
• Learned about subsetting your data frame by name. You learned that row names are rarely specified, and that column names are of character types.
• Learned to use conditional statements in the row element inside square brackets to subset your data frame by value.
• Learned to combine these methods to allow more flexible subsetting (e.g., using conditionals for rows and subsetting by index or name for columns).

Below are some exercises to help reinforce what you've learned. Practice makes perfect, so give these a try!

## Practice Exercises

Select all observations for which the sepal width is greater than the petal length.

``` # Which observations have greater sepal width than petal length? ``` ``` # Which observations have greater sepal width than petal length? iris[iris\$Sepal.Width > iris\$Petal.Length, ] ``` ``` test_output_contains('iris[iris\$Sepal.Width > iris\$Petal.Length, ]', incorrect_msg = paste("Select the observations for which Sepal.Width values", "are larger than Petal.Length values.")) success_msg("Nice! Turns out only Iris setosa have this shape.") ```
Use `iris\$Sepal.Width > iris\$Petal.Length` inside the square brackets. Don't forget the comma!

Select all Iris versicolor whose sepal width is larger than the average sepal width among all species.

``` # Store the versicolor data as an object called versicolor # Select for all Iris versicolor that have bigger than average sepal width ``` ``` # Store the versicolor data as an object called versicolor versicolor <- iris[iris\$Species == "versicolor", ] # Select for all Iris versicolor that have bigger than average sepal width versicolor[versicolor\$Sepal.Width > mean(iris\$Sepal.Width), ] ``` ``` test_object('versicolor', undefined_msg = "Did you define the versicolor object?", incorrect_msg = paste("Select where Species is equal to \"versicolor\".", "Remember to use the == operator!")) test_output_contains('versicolor[versicolor\$Sepal.Width > mean(iris\$Sepal.Width), ]', incorrect_msg = paste("Not quite! Are you using the mean() function", "to find the average sepal width across all species", "and not just for Iris versicolor?")) success_msg("Great work!") ```
Use the `mean()` function to find the average sepal width in `iris`, then use a conditional statement prefixed by `versicolor` to select only those with bigger-than-average values.

Report the average sepal width for all Iris virginica that have petal lengths smaller than 5 centimeters.

``` # You're free to approach this problem however you like! ``` ``` # Store the virginica data as an object called virginica virginica <- iris[iris\$Species == "virginica", ] # Average sepal width for Iris virginica that have petal lengths smaller than 5 mean(virginica[virginica\$Petal.Length < 5, ]\$Sepal.Width) ``` ``` test_output_contains('2.8', incorrect_msg = "Incorrect! This is a tough one!") success_msg("Excellent!") ```
First, try subsetting for all Iris virginica then for where `Petal.Length` is smaller than 5. You can find the average by calling the `mean()` function.

Want to learn more about other ways to extract insights from your data? Check out these courses on DataCamp!

Take a look at our Merging Datasets in R tutorial.

Whether you're comparing how different demographics respond to marketing campaigns, zooming in on a specific time frame, or pulling information about a select few products from the inventory, subsetting datasets enables you to extract useful observations in your dataset. R is a great tool that makes subsetting data easy and intuitive. By the end of this tutorial, you'll have the know-how to extract the information you want from your dataset.

Subsetting your data does not change the content of your data, but simply selects the portion most relevant to the goal you have in mind. In general, there are three ways to subset the rows and columns of your dataset—by index, by name, and by value.

## Subsetting Rows and Columns by Index

One way to subset your rows and columns is by your dataset's indices. This is the same as describing your rows and columns as "the first row", "all rows in second and fifth columns", or "the first row in second to fifth columns". Let's specify such phrases using a dataset called `iris` in R. From its documentation, "[t]his famous (Fisher's or Anderson's) iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica."

``` # "The first row": iris[1, ] # "All rows in second and fifth columns": iris[, c(2, 5)] # "The first row in second to fifth columns": iris[1, 2:5] ```

To subset your data, square brackets are used after your dataset object. The rows of your dataset are specified as the first element inside the square brackets, and the columns of your dataset are specified as the second, separated by a comma:

``````data[rows, columns]
``````

## Subsetting Rows and Columns by Name

In R, the rows and columns of your dataset have name attributes. Row names are rarely used and by default provide indices—integers numbering from 1 to the number of rows of your dataset—just like what you saw in the previous section. In fact, if you called `rownames()` on the `iris` dataset, you will see that these are just indexed from 1 to 150:

`````` > rownames(iris)
[1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14"
[15] "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"  "25"  "26"  "27"  "28"
[29] "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"
[43] "43"  "44"  "45"  "46"  "47"  "48"  "49"  "50"  "51"  "52"  "53"  "54"  "55"  "56"
[57] "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"
[71] "71"  "72"  "73"  "74"  "75"  "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"
[85] "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"
[99] "99"  "100" "101" "102" "103" "104" "105" "106" "107" "108" "109" "110" "111" "112"
[113] "113" "114" "115" "116" "117" "118" "119" "120" "121" "122" "123" "124" "125" "126"
[127] "127" "128" "129" "130" "131" "132" "133" "134" "135" "136" "137" "138" "139" "140"
[141] "141" "142" "143" "144" "145" "146" "147" "148" "149" "150"

> nrow(iris)
[1] 150``````

Row names are more common in smaller datasets and are used to make observations in your dataset easily identifiable. For example, for a small dataset containing health information of a doctor's patients, the row names of this dataset could be the full names of the patients.

Column names on the other hand, are ubiquitous to almost any dataset. You can access these with the `colnames()` function or the `names()` function:

``````colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

names(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"``````

To subset your dataset by the names of your rows and columns, simply use the square brackets again, prefixed by your dataset object:

``` # Sepal width of the fifth observation iris["5", "Sepal.Width"] # Sepal width and petal width iris[, c("Sepal.Width", "Petal.Width")] ```

It's important to note that both the row and column names are characters, so using single or double quotes is absolutely necessary!

## Subsetting Rows and Columns by Value

Subsetting your rows and columns by value often allows the most flexibility. For example, you can extract the data on Iris setosa using a conditional statement like this:

``````> iris[iris\$Species == "setosa", ]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa

...

47          5.1         3.8          1.6         0.2  setosa
48          4.6         3.2          1.4         0.2  setosa
49          5.3         3.7          1.5         0.2  setosa
50          5.0         3.3          1.4         0.2  setosa``````

Conditional statements like `iris\$Species == "setosa"` belong in the row element in the square brackets (i.e., the first element before the comma). In addition to the conditional statement in the first element, you can specify columns by index or name in the second element. In the console below, try selecting just the sepal measurements of Iris setosa:

``` # Print sepal measurements for setosa ``` ``` # Print sepal measurements for setosa (by name) iris[iris\$Species == "setosa", c("Sepal.Length", "Sepal.Width")] # Print sepal measurements for setosa (by index) iris[iris\$Species == "setosa", 1:2] ``` ``` test_or( test_output_contains('iris[iris\$Species == "setosa", c("Sepal.Length", "Sepal.Width")]', incorrect_msg = "Select just the Sepal.Length and Sepal.Width columns for the setosa iris either by name or index."), test_output_contains('iris[iris\$Species == "setosa", c("Sepal.Width", "Sepal.Length")]', incorrect_msg = "Select just the Sepal.Length and Sepal.Width columns for the setosa iris either by name or index.")) success_msg("Great work!") ```

### Recap

In this tutorial, you:

• Learned about subsetting your data frame by index. Rows and columns are indexed as integers from 1 to the number of rows and columns, respectively.
• Learned about subsetting your data frame by name. You learned that row names are rarely specified, and that column names are of character types.
• Learned to use conditional statements in the row element inside square brackets to subset your data frame by value.
• Learned to combine these methods to allow more flexible subsetting (e.g., using conditionals for rows and subsetting by index or name for columns).

Below are some exercises to help reinforce what you've learned. Practice makes perfect, so give these a try!

## Practice Exercises

Select all observations for which the sepal width is greater than the petal length.

``` # Which observations have greater sepal width than petal length? ``` ``` # Which observations have greater sepal width than petal length? iris[iris\$Sepal.Width > iris\$Petal.Length, ] ``` ``` test_output_contains('iris[iris\$Sepal.Width > iris\$Petal.Length, ]', incorrect_msg = paste("Select the observations for which Sepal.Width values", "are larger than Petal.Length values.")) success_msg("Nice! Turns out only Iris setosa have this shape.") ```
Use `iris\$Sepal.Width > iris\$Petal.Length` inside the square brackets. Don't forget the comma!

Select all Iris versicolor whose sepal width is larger than the average sepal width among all species.

``` # Store the versicolor data as an object called versicolor # Select for all Iris versicolor that have bigger than average sepal width ``` ``` # Store the versicolor data as an object called versicolor versicolor <- iris[iris\$Species == "versicolor", ] # Select for all Iris versicolor that have bigger than average sepal width versicolor[versicolor\$Sepal.Width > mean(iris\$Sepal.Width), ] ``` ``` test_object('versicolor', undefined_msg = "Did you define the versicolor object?", incorrect_msg = paste("Select where Species is equal to \"versicolor\".", "Remember to use the == operator!")) test_output_contains('versicolor[versicolor\$Sepal.Width > mean(iris\$Sepal.Width), ]', incorrect_msg = paste("Not quite! Are you using the mean() function", "to find the average sepal width across all species", "and not just for Iris versicolor?")) success_msg("Great work!") ```
Use the `mean()` function to find the average sepal width in `iris`, then use a conditional statement prefixed by `versicolor` to select only those with bigger-than-average values.

Report the average sepal width for all Iris virginica that have petal lengths smaller than 5 centimeters.

``` # You're free to approach this problem however you like! ``` ``` # Store the virginica data as an object called virginica virginica <- iris[iris\$Species == "virginica", ] # Average sepal width for Iris virginica that have petal lengths smaller than 5 mean(virginica[virginica\$Petal.Length < 5, ]\$Sepal.Width) ``` ``` test_output_contains('2.8', incorrect_msg = "Incorrect! This is a tough one!") success_msg("Excellent!") ```
First, try subsetting for all Iris virginica then for where `Petal.Length` is smaller than 5. You can find the average by calling the `mean()` function.

Want to learn more about other ways to extract insights from your data? Check out these courses on DataCamp!

Take a look at our Merging Datasets in R tutorial.

Topics

Course

### .css-1531qan{-webkit-text-decoration:none;text-decoration:none;color:inherit;}Data Manipulation with dplyr

4 hr
131.5K
Delve further into the Tidyverse by learning to transform and manipulate data with dplyr.
See Details
Start Course

Course

### Joining Data with dplyr

4 hr
64.4K
Learn to combine data across multiple tables to answer more complex questions with dplyr.

Course

### Data Manipulation with data.table in R

4 hr
23.7K
Master core concepts about data manipulation such as filtering, selecting and calculating groupwise statistics using data.table.
See More
Related

tutorial

### Subsetting in R Tutorial

Find out how to access your dataframe's data with subsetting. Follow our tutorial and learn how to use R's subset() function today!

DataCamp Team

4 min

tutorial

### Merging Datasets in R

In this tutorial, you'll learn to join multiple datasets in R.

Tom Jeon

8 min

tutorial

### Utilities in R Tutorial

Learn about several useful functions for data structure manipulation, nested-lists, regular expressions, and working with times and dates in the R programming language.

18 min

tutorial

### Introduction to Data frames in R

This tutorial takes course material from DataCamp's Introduction to R course and allows you to practice data frames.

Ryan Sheehy

5 min

tutorial

### Facets for ggplot2 in R

In this tutorial, you'll learn how to make the most of ggplots facetting functions.

DataCamp Team

8 min

tutorial

### Importing Data Into R - Part Two

A tutorial on importing data into R. The focus is on reading data from sources like statistical software, databases, webscraping, and more.

Karlijn Willems

34 min

See MoreSee More