A Guide to R Regular Expressions
The concept of regular expressions, usually referred to as regex, exists in many programming languages, such as R, Python, C, C++, Perl, Java, and JavaScript. You can access the functionality of regex either in the base version of those languages or via libraries. For most programming languages, the syntax of regex patterns is similar.
In this tutorial, we'll explore what regular expressions in R are, why they're important, which tools and functions let us work with them, which regex patterns are the most common, and how to use them. At the end, we'll give an overview of some advanced applications of R regex.
What Are R Regex and Why Should You Use Them?
A regular expression (regex) in R is a sequence of characters (or even a single character) that describes a certain pattern found in a text. Regex patterns can be as short as ‘a’ or as long as the one mentioned in this StackOverflow thread.
Broadly speaking, this definition applies not only to R but also to any other programming language that supports regular expressions.
Regular expressions are a very flexible and powerful tool widely used for processing and mining unstructured text data. For example, they are used in search engines, lexical analysis, spam filtering, and text editors.
Tools and Functions to Work with R Regex
While regex patterns are similar for the majority of programming languages, the functions for working with them are different.
In R, we can use base R functions to detect, match, locate, extract, and replace regex matches. Below are the main functions that search for regex matches in a character vector and then do the following:
- grep(), grepl() – return the indices of strings containing a match (grep()) or a logical vector showing which strings contain a match (grepl()).
- regexpr(), gregexpr() – return the index for each string where the match begins and the length of that match. While regexpr() provides this information only for the first match (from the left), gregexpr() does the same for all the matches.
- sub(), gsub() – replace a detected match in each string with a specified string (sub() – only for the first match, gsub() – for all the matches).
- regexec() – works like regexpr() but also returns the position and length of each parenthesized sub-expression inside the match.
- regmatches() – takes the match data produced by regexpr(), gregexpr(), or regexec() and returns the actual matched strings (for regexec(), both the overall match and each sub-expression).
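The functions above can be sketched on a tiny example. This is a minimal illustration that uses only base R; the sample vector `x` and the word `popcorn` are made up for the demonstration.

```r
# Hypothetical sample data, used only for illustration
x <- c("unicorn", "popcorn", "horse")

idx <- grep("corn", x)     # indices of the elements with a match: 1 2
hit <- grepl("corn", x)    # logical vector: TRUE TRUE FALSE

# regexpr(): position and length of the FIRST match only
m <- regexpr("o", "popcorn")
as.integer(m)              # 2
attr(m, "match.length")    # 1

# gregexpr(): positions of ALL matches (returned as a list, one item per string)
all_m <- gregexpr("o", "popcorn")[[1]]
as.integer(all_m)          # 2 5

first_only <- sub("o", "0", "popcorn")    # "p0pcorn"
every_one  <- gsub("o", "0", "popcorn")   # "p0pc0rn"

# regexec() records the full match plus each parenthesized sub-expression;
# regmatches() turns that match data into the matched strings
parts <- regmatches("popcorn", regexec("(pop)(corn)", "popcorn"))
parts[[1]]                 # "popcorn" "pop" "corn"
```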
However, instead of using the native R functions, a more convenient and consistent way to work with R regex is to use the specialized stringr package from the tidyverse collection. This library is built on top of the stringi package. In the stringr library, all the functions start with str_, and both the functions and their optional parameters have much more intuitive names than their base R counterparts.
To install and load the stringr package, run the following:
install.packages('stringr')
library(stringr)
The table below shows the correspondence between the stringr functions and those of the base R that we've discussed earlier in this section:
| stringr | Base R |
| --- | --- |
| str_subset() | grep(..., value = TRUE) |
| str_detect() | grepl() |
| str_extract() | regexpr(), regmatches(), grep() |
| str_match() | regexec() |
| str_locate() | regexpr() |
| str_locate_all() | gregexpr() |
| str_replace() | sub() |
| str_replace_all() | gsub() |
You can find a full list of the stringr functions and regular expressions in these cheat sheets, but we'll discuss some of them further in this tutorial.
Note: in the stringr functions, we pass in the data first and then the regex, while in most base R functions (such as grep() and sub()), the order is the opposite: the pattern comes first.
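A short sketch of this difference in argument order, assuming stringr is installed; the vector `words` is a made-up example:

```r
library(stringr)

# Hypothetical sample data
words <- c("unicorn", "horse")

by_stringr <- str_detect(words, "corn")   # data first, pattern second
by_base    <- grepl("corn", words)        # pattern first, data second

by_stringr   # TRUE FALSE
by_base      # TRUE FALSE

# Data-first functions combine naturally with the native pipe (R 4.1.0 or later)
piped <- words |> str_detect("corn")
```

Naming the arguments explicitly, as in grepl(pattern = "corn", x = words), is one way to avoid mixing up the two conventions.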
R Regex Patterns
Now, we're going to go over the most popular R regex patterns and their usage and, at the same time, practice some of the stringr functions.
Before doing so, let's take a look at a very basic example. Namely, let's check if a unicorn has at least one corn 😉
str_detect('unicorn', 'corn')
Output:
TRUE
In this example, we used the str_detect() stringr function to check the presence of the string corn in the string unicorn.
However, usually, we aren't looking for a certain literal string in a piece of text but rather for a certain pattern – a regular expression. Let's dive in and explore such patterns.
Character Escapes
There are a few characters that have a special meaning when used in R regular expressions. More precisely, they don't match themselves, as all letters and digits do, but they do something different:
str_extract_all('unicorn', '.')
Output:
1. 'u' 'n' 'i' 'c' 'o' 'r' 'n'
We clearly see that there are no dots in our unicorn. However, the str_extract_all() function extracted every single character from this string. This is the exact mission of the . character – to match any single character except for a new line.
What if we want to extract a literal dot? For this purpose, we have to use a regex escape character before the dot – a backslash (\). However, there is a pitfall here to keep in mind: a backslash is also used in the strings themselves as an escape character. This means that we first need to "escape the escape character," by using a double backslash. Let's see how it works:
str_extract_all('Eat. Pray. Love.', '\\.')
Output:
1. '.' '.' '.'
Hence, the backslash removes the special meaning of some symbols in R regular expressions so that they are interpreted literally. It also has the opposite mission: to give a special meaning to some characters that otherwise would be interpreted literally. Below is a table of the most used character escape sequences:
| R regex | What matches |
| --- | --- |
| \b | A word boundary (a boundary between a \w and a \W) |
| \B | A non-word boundary (\w-\w or \W-\W) |
| \n | A new line |
| \t | A tab |
| \v | A vertical tab |
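To make the "escape the escape character" rule more concrete, here is a small sketch, assuming stringr is loaded. It shows what the regex engine actually receives and two ways to avoid doubled backslashes: raw strings (available since R 4.0.0) and stringr's fixed() helper. The variable names are illustrative only.

```r
library(stringr)

pattern <- "\\."        # R string literal holding two characters: a backslash and a dot
nchar(pattern)          # 2
writeLines(pattern)     # prints \.  (the text the regex engine receives)

# Raw strings (R 4.0.0 or later) need no doubled backslash
raw_pattern <- r"(\.)"
identical(pattern, raw_pattern)   # TRUE

x <- "Eat. Pray. Love."
n_escaped <- str_count(x, pattern)      # 3: literal dots only
n_fixed   <- str_count(x, fixed("."))   # 3: fixed() treats the pattern as plain text
n_any     <- str_count(x, ".")          # 16: an unescaped dot matches every character
```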
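Let's take a look at some examples, keeping in mind that in these cases we also have to use a double backslash. At the same time, we'll introduce two more stringr functions: str_view() and str_view_all(), which display an HTML rendering of the first match or of all matches, respectively. Note that in stringr 1.5.0 and later, str_view() highlights all matches by default and str_view_all() is deprecated; the outputs in this tutorial follow the older behavior.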
str_view('Unicorns are so cute!', 's\\b')
str_view('Unicorns are so cute!', 's\\B')
Output:
Unicorn<s> are so cute!
Unicorns are <s>o cute!
In the string Unicorns are so cute!, there are two instances of the letter s. Above, the first R regex pattern highlighted the first instance of the letter s (since it's followed by a space), while the second one – the second instance (since it's followed by another letter, not a word boundary).
A couple more examples:
cat('Unicorns are\nso cute!')
str_view_all('Unicorns are\nso cute!', '\\n')
Output:
Unicorns are
so cute!
Unicorns are<\n>so cute!
cat('Unicorns are\tso cute!')
str_view_all('Unicorns are\tso cute!', '\\t')
Output:
Unicorns are so cute!
Unicorns are<\t>so cute!
Character Classes
A character class matches any character of a predefined set of characters. Built-in character classes have the same syntax as the character escape sequences we saw in the previous section: a backslash followed by a letter to which it gives a special meaning rather than its literal one. The most popular of these constructions are given below:
| R regex | What matches |
| --- | --- |
| \w | Any word character (any letter, digit, or underscore) |
| \W | Any non-word character |
| \d | Any digit |
| \D | Any non-digit |
| \s | Any space character (a space, a tab, a new line, etc.) |
| \S | Any non-space character |
Let's take a look at some self-explanatory examples:
str_view_all('Unicorns are so cute!', '\\w')
str_view_all('Unicorns are so cute!', '\\W')
Output:
<U><n><i><c><o><r><n><s> <a><r><e> <s><o> <c><u><t><e>!
Unicorns< >are< >so< >cute<!>
str_view_all('Unicorns are\nso cute!', '\\s')
str_view_all('Unicorns are\nso cute!', '\\S')
Output:
Unicorns< >are<\n>so< >cute!
<U><n><i><c><o><r><n><s> <a><r><e>\n<s><o> <c><u><t><e><!>
str_detect('Unicorns are so cute!', '\\d')
Output:
FALSE
Built-in character classes can also appear in an alternative form – [:character_class_name:]. Some of these character classes have an equivalent among those with a backslash, others don't. The most common ones are:
| R regex | What matches |
| --- | --- |
| [:alpha:] | Any letter |
| [:lower:] | Any lowercase letter |
| [:upper:] | Any uppercase letter |
| [:digit:] | Any digit (equivalent to \d) |
| [:alnum:] | Any letter or number |
| [:xdigit:] | Any hexadecimal digit |
| [:punct:] | Any punctuation character |
| [:graph:] | Any letter, number, or punctuation character |
| [:space:] | A space, a tab, a new line, etc. (equivalent to \s) |
Let's explore some examples keeping in mind that we have to put any of the above R regex patterns inside square brackets:
str_view('Unicorns are so cute!', '[[:upper:]]')
str_view('Unicorns are so cute!', '[[:lower:]]')
Output:
<U>nicorns are so cute!
U<n>icorns are so cute!
str_detect('Unicorns are so cute!', '[[:digit:]]')
Output:
FALSE
str_extract_all('Unicorns are so cute!', '[[:punct:]]')
Output:
1. '!'
str_view_all('Unicorns are so cute!', '[[:space:]]')
Output:
Unicorns< >are< >so< >cute!
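Because these classes live inside square brackets, several of them can share one pair of outer brackets, and they can be negated with a caret. A brief sketch, assuming stringr is loaded; the sample string `x` is made up:

```r
library(stringr)

# Hypothetical sample string
x <- "Room 42, Floor 7!"

# A quantifier after the class extracts whole runs of digits
digits <- str_extract_all(x, "[[:digit:]]+")[[1]]
digits          # "42" "7"

# Two classes combined in one set: uppercase letters OR digits
upper_or_digit <- str_extract_all(x, "[[:upper:][:digit:]]")[[1]]
upper_or_digit  # "R" "4" "2" "F" "7"

# Remove the punctuation
no_punct <- str_replace_all(x, "[[:punct:]]", "")
no_punct        # "Room 42 Floor 7"

# Negated set: anything that is neither alphanumeric nor whitespace
n_other <- str_count(x, "[^[:alnum:][:space:]]")
n_other         # 2 (the comma and the exclamation mark)
```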
It's also possible to create a user-defined character class by putting inside square brackets any set of characters from which we want to match any one character. We can enclose in square brackets a range of letters or numbers (in Unicode order), several different ranges, or any sequential or nonsequential set of characters or groups of characters.
For example, [A-D] will match any uppercase letter from A to D inclusive, [k-r] – any lowercase letter from k to r inclusive, [0-7] – any digit from 0 to 7 inclusive, and [aou14%9] – any of the characters given inside square brackets. If we put the caret (^) as the first character inside square brackets, our R regex pattern will match anything but the provided characters. Note that the above matching mechanisms are case-sensitive.
str_view_all('Unicorns Are SOOO Cute!', '[O-V]')
str_view_all('Unicorns Are SOOO Cute!', '[^O-V]')
Output:
<U>nicorns Are <S><O><O><O> Cute!
U<n><i><c><o><r><n><s>< ><A><r><e>< >SOOO< ><C><u><t><e><!>
str_view_all('3.14159265359', '[0-2]')
Output:
3.<1>4<1>59<2>65359
str_view_all('The number pi is equal to 3.14159265359', '[n2e9&]')
Output:
Th<e> <n>umb<e>r pi is <e>qual to 3.1415<9><2>6535<9>
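The same ideas can be checked in text form with str_extract_all() and str_count(). The sketch below assumes stringr is loaded and reuses the sample string from above; the regex(..., ignore_case = TRUE) call is one way to relax the case sensitivity mentioned earlier.

```r
library(stringr)

x <- "Unicorns Are SOOO Cute!"

in_range <- str_extract_all(x, "[O-V]")[[1]]
in_range        # "U" "S" "O" "O" "O"

# The caret negates the set: every character that is NOT an uppercase O to V
n_outside <- str_count(x, "[^O-V]")
n_outside       # 18

# Several ranges can share one pair of brackets
mixed <- str_extract_all(x, "[A-Ca-c]")[[1]]
mixed           # "c" "A" "C"

# Case-insensitive matching via stringr's regex() modifier
n_any_case <- str_count(x, regex("[o-v]", ignore_case = TRUE))
n_any_case      # 11
```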
Quantifiers
Often, we need to match a certain R regex pattern repeatedly, instead of strictly once. For this purpose, we use quantifiers. A quantifier always goes after the regex pattern to which it's related. The most common quantifiers are given in the table below:
| R regex | Number of pattern repetitions |
| --- | --- |
| * | 0 or more |
| + | at least 1 |
| ? | at most 1 |
| {n} | exactly n |
| {n,} | at least n |
| {n,m} | at least n and at most m |
Let's try all of them:
str_extract('dog', 'dog\\d*')
Output:
'dog'
We got the initial string dog: there are no digits at the end of that string, but we're ok with it (0 or more instances of digits).
str_extract('12345', '\\d+')
Output:
'12345'
str_extract('12345', '\\d?')
Output:
'1'
str_extract('12345', '\\d{3}')
Output:
'123'
str_extract('12345', '\\d{7,}')
Output:
NA
We got NA because we don't have at least 7 digits in the string, only 5 of them.
str_extract('12345', '\\d{2,4}')
Output:
'1234'
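Quantifiers become more useful when combined with character classes. The following sketch, assuming stringr is loaded, checks a made-up vector of strings for dates in the YYYY-MM-DD layout and shows that a quantifier applies only to the item immediately before it unless that item is a group.

```r
library(stringr)

# Hypothetical sample data
dates <- c("2024-01-15", "24-1-15", "released 2023-12-01")

date_pattern <- "\\d{4}-\\d{2}-\\d{2}"   # 4 digits, dash, 2 digits, dash, 2 digits
has_date <- str_detect(dates, date_pattern)
has_date     # TRUE FALSE TRUE
found <- str_extract(dates, date_pattern)
found        # "2024-01-15" NA "2023-12-01"

# The quantifier binds to the preceding item only
single <- str_extract("hahaha", "ha+")      # "ha": + repeats just the letter a
grouped <- str_extract("hahaha", "(ha)+")   # "hahaha": + repeats the whole group
```

Note that this pattern only checks the shape of the text; it does not verify that the date itself is valid.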
Anchors
By default, an R regex will match anywhere in the provided string. We can change this behavior by anchoring the pattern to a certain position inside the string. Most often, we want the match to occur at the start or end of the string. For this purpose, we use the two main anchors in R regular expressions:
- ^ – matches at the beginning of the string (or, with multiline matching enabled, e.g., via regex(pattern, multiline = TRUE) in stringr, at the beginning of each line)
- $ – matches at the end of the string (or, with multiline matching enabled, at the end of each line)
Let's see how they work on the example of a palindrome stella won no wallets:
str_view('stella won no wallets', '^s')
str_view('stella won no wallets', 's$')
Output:
<s>tella won no wallets
stella won no wallet<s>
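For strings that contain several lines, the anchors refer to the whole string unless multiline matching is switched on. A minimal sketch, assuming stringr is loaded; the two-line string `x` is made up:

```r
library(stringr)

# Hypothetical two-line string
x <- "seven cats\nsix dogs"

# Default: ^ and $ anchor to the start and end of the whole string
start_default <- str_count(x, "^s")    # 1
end_default   <- str_count(x, "s$")    # 1

# multiline = TRUE: ^ and $ anchor to the start and end of every line
start_multi <- str_count(x, regex("^s", multiline = TRUE))   # 2
end_multi   <- str_count(x, regex("s$", multiline = TRUE))   # 2

# First word of each line
first_words <- str_extract_all(x, regex("^\\w+", multiline = TRUE))[[1]]
first_words   # "seven" "six"
```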
If we want to match the characters ^ or $ themselves, we need to precede the character of interest with a backslash (doubling it):
str_view_all('Do not have 100$, have 100 friends', '\\$')
Output:
Do not have 100<$>, have 100 friends
It's also possible to anchor matches to word or non-word boundaries inside the string (\b and \B respectively):
str_view_all('road cocoa oasis oak boa coach', '\\boa')
str_view_all('road cocoa oasis oak boa coach', 'oa\\b')
str_view_all('road cocoa oasis oak boa coach', 'oa\\B')
Output:
road cocoa <oa>sis <oa>k boa coach
road coc<oa> oasis oak b<oa> coach
r<oa>d cocoa <oa>sis <oa>k boa c<oa>ch
Above, we matched the combination of letters oa:
- 1st example – at the beginning of the words
- 2nd example – at the end of the words
- 3rd example – whenever it's followed by a word character (in our case – by a letter)
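A common use of word boundaries is whole-word matching: wrapping a word in \b on both sides keeps it from matching inside longer words. A small sketch, assuming stringr is loaded; the sentence is made up:

```r
library(stringr)

# Hypothetical sample sentence
x <- "A cat sat in the catalog room with a bobcat."

n_substring <- str_count(x, "cat")          # 3: cat, catalog, bobcat
n_whole     <- str_count(x, "\\bcat\\b")    # 1: only the standalone word

# Replace the standalone word and leave the longer words untouched
replaced <- str_replace_all(x, "\\bcat\\b", "dog")
replaced   # "A dog sat in the catalog room with a bobcat."
```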
Alternation
Applying the alternation operator (|), we can match more than one R regex pattern in the same string. Note that if we use this operator as a part of a user-defined character class, it's interpreted literally, hence doesn't perform any alternation.
str_view_all('coach koala board oak cocoa road boa load coat oasis boat', 'boa|coa')
Output:
<coa>ch koala <boa>rd oak co<coa> road <boa> load <coa>t oasis <boa>t
In the above example, we matched all the instances of either boa or coa.
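Two details of alternation are easy to verify with a short sketch, assuming stringr is loaded: inside a character class the | character is literal, and when several alternatives could match at the same position, stringr's regex engine takes the first one that succeeds rather than the longest. The sample strings are made up.

```r
library(stringr)

# As an operator, | separates alternatives
as_operator <- str_extract_all("a|b", "a|b")[[1]]
as_operator     # "a" "b"

# Inside square brackets, | is just one more character in the set
in_class <- str_extract_all("a|b", "[a|b]")[[1]]
in_class        # "a" "|" "b"

# Alternatives are tried from left to right
short_first <- str_extract("boat", "boa|boat")   # "boa"
long_first  <- str_extract("boat", "boat|boa")   # "boat"
```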
Grouping
R regex patterns follow certain precedence rules. For example, repetition (using quantifiers) is prioritized over anchoring, while anchoring takes precedence over alternation. To override these rules and increase the precedence of a certain operation, we should use grouping. This is done by enclosing the subexpression of interest in parentheses.
Grouping works best in combination with the alternation operator. The examples below clearly demonstrate the effect of such a combination:
str_view_all('code rat coat cot cat', 'co|at')
str_view_all('code rat coat cot cat', 'c(o|a)t')
Output:
<co>de r<at> <co><at> <co>t c<at>
code rat coat <cot> <cat>
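The precedence of anchoring over alternation can be seen in the following sketch, which assumes stringr is loaded and uses a made-up vector. Without parentheses, each anchor belongs to only one of the alternatives. The last lines show that str_match() also returns what each group captured.

```r
library(stringr)

# Hypothetical sample data
animals <- c("cat", "dog", "catalog", "hotdog")

# Read as (^cat) OR (dog$): starts with cat, or ends with dog
loose <- str_detect(animals, "^cat|dog$")
loose     # TRUE TRUE TRUE TRUE

# Grouping makes both anchors apply to the whole alternation
strict <- str_detect(animals, "^(cat|dog)$")
strict    # TRUE TRUE FALSE FALSE

# str_match() returns a matrix: full match first, then one column per group
m <- str_match("cat", "c(o|a)t")
m[1, 1]   # "cat"
m[1, 2]   # "a"
```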
Advanced Applications of R Regular Expressions
Everything we've discussed so far gives us a good basis to start working with R regular expressions. However, there are many more things we can do with this powerful tool. Without getting into detail, let's just mention some advanced operations that we can perform with R regex:
- Overriding the defaults of the stringr functions
- Matching grapheme clusters
- Group backreferencing
- Matching Unicode properties
- Applying advanced character escaping
- Verifying a pattern's existence without including it in the output (so-called lookarounds)
- Making the pattern repetition mechanism lazy rather than greedy
- Working with atomic groups
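To give a flavor of a few items from this list, here is a brief sketch, assuming stringr is loaded. It touches on overriding defaults, backreferences, lookarounds, and lazy quantifiers; the sample strings are made up, and each topic has more nuances than shown here.

```r
library(stringr)

# Overriding a default: case-insensitive matching
ci <- str_detect("UNICORN", regex("unicorn", ignore_case = TRUE))
ci          # TRUE

# Backreference: \\1 repeats whatever group 1 matched (a doubled character)
doubled <- str_extract("hello", "(.)\\1")
doubled     # "ll"

# Lookbehind: digits preceded by a dollar sign, without returning the sign
after_dollar <- str_extract("price: $100", "(?<=\\$)\\d+")
after_dollar     # "100"

# Lookahead: digits followed by " friends", without returning that text
before_word <- str_extract("100 friends", "\\d+(?= friends)")
before_word      # "100"

# Greedy vs. lazy repetition
greedy <- str_extract("<b>bold</b>", "<.+>")    # "<b>bold</b>"
lazy   <- str_extract("<b>bold</b>", "<.+?>")   # "<b>"
```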
Learn more in our course Intermediate Regular Expressions in R.
R Regex Challenge
Now it's your turn to practice the R regex patterns we've discussed in this tutorial. To do so, use our Dataset: Internet News and Consumer Engagement, and try to do the following: extract the top-level domains (TLDs) from all the URLs. Some examples of TLDs are com, net, uk, etc.
There is more than one approach to complete this task, including elegant and compact one-line solutions (hint: you can learn more about lookarounds in R regex mentioned in the previous section and use them to solve this problem). Consider the following very loose guidance for a relatively "rookie" approach:
- Inspect some URLs in the dataset and notice which patterns are always present in any URL before the TLD and which ones are optional
- Notice if there are any mandatory patterns in any URL after the TLD (and if so, which ones) and which patterns are optional
- Remove everything before the TLD of the URL
- Remove everything after the TLD of the URL
To solve the problem using the above algorithm, you don't need to do or learn anything additional. Just refresh everything that we've discussed in this tutorial and put your knowledge into practice. In particular, you'll need the stringr functions, character escapes, character classes, quantifiers, and grouping.
Bonus: Challenge #2
If you want more practice, this time without any hints, try to do the following: using the same dataset from the previous challenge, extract the domain names from all the URLs. An example of a domain name is google in the URL www.google.com.
Conclusion
To conclude, in this tutorial, we learned plenty of things about R regular expressions. In particular, we discussed:
- What R regex are
- Why they're important and where they're applied
- What functions (from both native R and a specialized library) are used for working with R regular expressions
- The most common R regex patterns and their scope, nuances, and pitfalls
- When and how to use character escapes, character classes, quantifiers, anchors, alternation, and grouping with R regex
- Various examples of applying R regex and functions
- Some advanced operations with R regular expressions
These skills and notions will help you work in R and in many other programming languages, since the concept of regular expressions is the same across those languages and the regex syntax is rather similar in most of them.