Course

Intermediate Regular Expressions in R

Skill Level: Intermediate

4.8+

Updated 11/2024

Manipulate text data, analyze it and more by mastering regular expressions and string distances in R.

R Programming

4 hr

14 videos

48 Exercises

3,650 XP

4,740

Statement of Accomplishment

Course Description

Analyzing data that comes in tables is fun. But what if the things we find most interesting are not available as a neatly organized dataset, only as plain text? Do not despair: in this course, you'll learn how to write regular expressions that pull the information you need for your analyses out of a blob of text. Using the concept of string distances, you'll also learn to work with text that contains typos or scanning errors by matching it to its correct counterpart in other data sources (record linkage). As learning material, you'll analyze real documents about box office figures in Swiss cinemas.

Prerequisites

Introduction to the Tidyverse

String Manipulation with stringr in R

1

Regular Expressions: Writing Custom Patterns

Regular expressions can be intimidating at first because they contain so many special characters. In this chapter, you'll learn to decipher these characters and write your own patterns to find exactly what you're looking for.

Starts with, ends with

If you don't know what you're looking for

Character classes and repetitions

Digits, words and spaces

Match repetitions

Which special character did what again?

The pipe and the question mark

This or that

The question mark and its two meanings

You can now read this!
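
The lesson topics above (anchors, character classes, repetition, the pipe, and the question mark) can be sketched in a few lines. This is a minimal illustration using base R's `grepl()` so that it runs without extra packages; the course itself builds on stringr, and the sample titles are hypothetical, not taken from the course data.

```r
# Hypothetical sample strings (an assumption, not the course dataset).
titles <- c("The Lion King", "Star Wars 7", "Frozen 2", "Karate Kid")

# Anchors: ^ matches the start of a string, $ matches the end.
starts_with_the <- grepl("^The", titles)
ends_with_digit <- grepl("[0-9]$", titles)

# Character class plus repetition: one or more digits anywhere.
has_number <- grepl("[0-9]+", titles)

# Alternation: the pipe matches either alternative.
king_or_kid <- grepl("King|Kid", titles)

# Optional character: ? makes the preceding "s" optional.
war_or_wars <- grepl("Wars?", titles)

print(data.frame(titles, starts_with_the, ends_with_digit,
                 has_number, king_or_kid, war_or_wars))
```

Each call returns one logical value per title, which makes it easy to compare what the individual special characters do.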

2

Creating Strings with Data

In this chapter, we step away from regular expressions for a moment and focus on building strings from other data structures such as vectors or lists.

Getting to know glue

Stop pasting, start gluing

Gluing data frames

How many arguments can glue take?

Collapsing multiple elements into a string

Formulating a question from a list

Collapsing data frames

Glue and Collapse, what's the difference?

Gluing regular expressions

Construct "or patterns" with glue

Using the "or pattern" with a larger dataset

Make advanced patterns more readable
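
The ideas in this chapter (filling a template from data, collapsing several elements into one string, and building an "or pattern" from a vector) can be sketched as follows. The course teaches these with the glue package; the sketch below swaps in base R's `sprintf()` and `paste()` so that it runs without extra packages, and all names and sample values are hypothetical.

```r
# Hypothetical input vector (an assumption for illustration).
first_names <- c("Anna", "Ben", "Cleo")

# Fill a template once per element (the base R analogue of gluing).
greetings <- sprintf("Hello, %s!", first_names)

# Collapse all elements into a single string.
listed <- paste(first_names, collapse = ", ")

# Build an "or pattern" by collapsing with the pipe character.
or_pattern <- paste0("(", paste(first_names, collapse = "|"), ")")

# Use the generated pattern like any handwritten regular expression.
sentences <- c("Ben bought a ticket", "Dora stayed home")
mentions_known_name <- grepl(or_pattern, sentences)

print(greetings)
print(listed)
print(or_pattern)
print(mentions_known_name)
```

Generating the pattern from the vector means the regular expression stays in sync with the data when names are added or removed.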

3

Extracting Structured Data From Text

One task where regular expressions really shine is making sense of a blob of text. In this chapter, you'll learn to extract information from messy data that comes not in neatly arranged tables but in plain text.

Capturing groups

Match all capturing groups

Search and replace

Can you nest capturing groups?

tidyr's extract

Creating a regex that matches your needs

Why does this fail?

Extracting an advanced regular expression

Extracting matches and surroundings from a text

Extract names with context

So many special characters
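
Capturing groups and search-and-replace, as listed above, can be sketched on a single line of text. This is a minimal illustration in base R (`regexec()`, `regmatches()`, and `sub()`) rather than the stringr and tidyr functions the course uses; the input line and its layout are hypothetical and do not come from the course documents.

```r
# Hypothetical line of text (an assumption, not the course data).
line <- "Frozen 2 (2019): 1234 tickets"

# Three capturing groups: title, four-digit year, ticket count.
pattern <- "^(.+) \\(([0-9]{4})\\): ([0-9]+) tickets$"

# regexec() returns the full match followed by one entry per group.
parts <- regmatches(line, regexec(pattern, line))[[1]]
title   <- parts[2]
year    <- as.integer(parts[3])
tickets <- as.integer(parts[4])

# Search and replace: \\1 and \\2 refer back to the captured groups.
swapped <- sub(pattern, "\\2: \\1", line)

print(list(title = title, year = year, tickets = tickets))
print(swapped)
```

The first element of `parts` is the whole match, so the groups start at index 2.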

4

Similarities Between Strings

In the last chapter, we shift gears from regular expressions to string distances. By calculating the distance between strings, we can match those that are similar. This helps us find duplicates even when they contain small errors such as typos. It is an important part of record linkage, where we combine datasets from multiple sources.

Understanding string distances

Calculating a string distance

Finding a match to a search typo

Methods of string distances

Edit distances vs. q-gram methods

Trying out different methods

Is one distance better than the other?

Fuzzy joins

Performing a string distance join

String distances of short strings

Custom Fuzzy Matching

Finding matches based on two conditions

Why join on multiple columns?

Congratulations
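
Matching a misspelled search term to its closest correct counterpart, the core idea of this chapter, can be sketched with an edit distance. The page does not name the package the course uses for string distances, so this sketch assumes base R's `utils::adist()` (generalized Levenshtein distance) as a stand-in; the titles and the typo are hypothetical.

```r
# Hypothetical reference list and misspelled query (assumptions).
correct_titles <- c("Bohemian Rhapsody", "Black Panther", "Incredibles 2")
typo <- "Bohemian Rapsody"

# adist() returns a matrix of edit distances: rows for the first
# argument, columns for the second.
distances <- utils::adist(typo, correct_titles)

# The closest candidate is the one with the smallest distance.
best_match <- correct_titles[which.min(distances)]

print(distances)
print(best_match)
```

An edit distance counts insertions, deletions, and substitutions, so a single missing letter yields a distance of 1. Very short strings can reach small distances by chance, which is why a distance threshold usually needs care.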

Earn Statement of Accomplishment

Add this credential to your LinkedIn profile, resume, or CV
Share it on social media and in your performance review

Don’t just take our word for it

4.8

from 33 reviews

88%

12%

0%

0%

0%

Tung

last month

.

Natiah

2 months ago

great

Alex

3 months ago

Błażej

4 months ago

Thomas

4 months ago

Julia

5 months ago

FAQs

What makes this course different from a beginner regex course?

This intermediate course goes beyond basic pattern matching to cover extracting structured data from plain text, building strings programmatically, and matching similar strings using string distances.

What real-world data is used in this course?

You analyze real documents about box office figures in Swiss cinemas, learning to extract and structure information from messy text sources.

Does the course cover record linkage and fuzzy matching?

Yes. Chapter 4 teaches string distance calculations to match similar strings, even those containing typos or scanning errors, which is essential for combining datasets from multiple sources.

What R packages are used in this course?

You use stringr for string manipulation along with R's regex capabilities; the chapter outline also covers glue and tidyr's extract. The course builds on skills from the String Manipulation with stringr in R prerequisite course.

What should I already know about regular expressions before enrolling?

You should be comfortable with basic regex patterns and stringr functions. The prerequisites are Introduction to the Tidyverse and String Manipulation with stringr in R.
