Course

Intermediate Regular Expressions in R

Skill level: Intermediate

Updated November 2024

Manipulate text data, analyze it and more by mastering regular expressions and string distances in R.

R, Programming

4 hours

14 videos

48 exercises

3,650 XP

4,740

Statement of Accomplishment

Course Description

Analyzing data that comes in tables is fun. But what if the most interesting information is not available as a neatly organized dataset but only as plain text? Do not despair: in this course, you'll learn what you need to write powerful regular expressions that pull the information for your analyses out of a blob of text. And that's not all. Using the concept of string distances, you will learn to work even with text that contains typos or scanning errors, because you will be able to match such strings to their correct counterparts in other data sources (record linkage). As learning material, you will analyze real documents about box office figures in Swiss cinemas.

Prerequisites

Introduction to the Tidyverse

String Manipulation with stringr in R

1

Regular Expressions: Writing Custom Patterns

Regular expressions can be intimidating at first because they contain so many special characters. In this chapter, you'll learn to decipher these characters and write your own patterns to find exactly what you're looking for.
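
As a rough preview of the special characters this chapter covers, here is a minimal sketch in base R. The sample strings are invented for illustration, and the base functions `grepl()`, `regexpr()` and `regmatches()` are used only so the snippet runs without extra packages; the course exercises may use different helpers.

```r
# Hypothetical sample lines, invented for illustration only
lines <- c(
  "Movie: Example Film, 2019",
  "Cinema: Kino Beispiel",
  "Movie: Another Film, 1998"
)

# ^ anchors the pattern to the start of the string
starts_with_movie <- grepl("^Movie", lines, perl = TRUE)

# \\d matches a digit, {4} repeats it four times, $ anchors to the end
ends_with_year <- grepl("\\d{4}$", lines, perl = TRUE)

# ? makes the preceding character optional; | means "this or that"
spellings <- c("color", "colour", "theater", "theatre")
matches_colour <- grepl("^colou?r$", spellings, perl = TRUE)
matches_theatre <- grepl("^theat(er|re)$", spellings, perl = TRUE)

# After a quantifier, ? switches from greedy to lazy matching
html <- "<b>bold</b> and <i>italic</i>"
greedy <- regmatches(html, regexpr("<.+>", html, perl = TRUE))
lazy <- regmatches(html, regexpr("<.+?>", html, perl = TRUE))

print(starts_with_movie)
print(ends_with_year)
print(greedy)
print(lazy)
```

The greedy pattern runs to the last `>` in the string, while the lazy one stops at the first, which is why the two results differ.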

Starts with, ends with

If you don't know what you're looking for

Character classes and repetitions

Digits, words and spaces

Match repetitions

Which special character did what again?

The pipe and the question mark

This or that

The question mark and its two meanings

You can now read this!

2

Creating Strings with Data

In this chapter, we will step slightly away from regular expressions and focus on string manipulation: creating strings from other data structures such as vectors or lists.
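
The idea of building strings from data can be sketched as follows. This assumes the `glue` package is installed; the variable names and values are hypothetical and not taken from the course material.

```r
library(glue)  # assumption: the glue package is installed

# Hypothetical values, invented for illustration
film <- "Example Film"
tickets <- 1200

# Expressions inside {} are evaluated and inserted into the string
sentence <- glue("{film} sold {tickets} tickets.")

# glue_collapse() joins the elements of a vector into a single string
people <- c("Anna", "Ben", "Clara")
listing <- glue_collapse(people, sep = ", ", last = " and ")

# Collapsing with "|" builds an "or pattern" for a regular expression
or_pattern <- as.character(glue_collapse(people, sep = "|"))
found <- grepl(or_pattern, c("Directed by Ben", "Directed by Dora"))

print(sentence)
print(listing)
print(or_pattern)
print(found)
```

Collapsing a vector of names into a single alternation keeps the pattern in sync with the data instead of requiring it to be typed by hand.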

Getting to know glue

Stop pasting, start gluing

Gluing data frames

How many arguments can glue take?

Collapsing multiple elements into a string

Formulating a question from a list

Collapsing data frames

Glue and Collapse, what's the difference?

Gluing regular expressions

Construct "or patterns" with glue

Using the "or pattern" with a larger dataset

Make advanced patterns more readable

3

Extracting Structured Data From Text

One task where regular expressions really shine is making sense of a blob of text. In this chapter, you'll learn to extract information from messy data that comes not in neatly arranged tables but in plain text.
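
A minimal sketch of turning one line of plain text into structured fields with capturing groups is shown below. The line format and column names are assumptions made up for this example, and only base R is used; the lesson titles mention tidyr's `extract`, which is not used here.

```r
# Hypothetical line of text, invented for illustration
text <- "Example Film (2019): 1200 tickets"

# Three capturing groups: title, four-digit year, ticket count
pattern <- "^(.+) \\((\\d{4})\\): (\\d+) tickets$"

# regexec() + regmatches() return the full match followed by each group
parts <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]

# Assemble the captured pieces into a one-row data frame
record <- data.frame(
  title = parts[2],
  year = as.integer(parts[3]),
  tickets = as.integer(parts[4])
)

# In a replacement string, \\1 and \\2 refer back to the captured groups
swapped <- sub("^(.+) \\((\\d{4})\\).*$", "\\2: \\1", text, perl = TRUE)

print(record)
print(swapped)
```

Literal parentheses in the text have to be escaped as `\\(` and `\\)` so they are not read as group delimiters.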

Capturing groups

Match all capturing groups

Search and replace

Can you nest capturing groups?

tidyr's extract

Creating a regex that matches your needs

Why does this fail?

Extracting an advanced regular expression

Extracting matches and surroundings from a text

Extract names with context

So many special characters

4

Similarities Between Strings

In the last chapter, we will shift gears from regular expressions to string distances. By calculating the distances between strings, we can match those that are similar. This helps us find duplicates even when they contain small errors like typos. It is an important part of record linkage, where we combine datasets from multiple sources.
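
To illustrate matching a misspelled string to its closest counterpart, here is a small sketch using base R's `adist()`, which computes a generalized Levenshtein (edit) distance. The titles are invented, and this shows only one edit-distance method; q-gram methods and fuzzy joins, which the lessons in this chapter name, are not shown.

```r
# Hypothetical titles, invented for illustration
typed <- "Exmaple Film"  # contains a typo
candidates <- c("Example Film", "Another Film", "Sample Reel")

# adist() returns a matrix of edit distances: one row per element of
# `typed`, one column per candidate
distances <- adist(typed, candidates)

# The candidate with the smallest distance is the most similar string
best_match <- candidates[which.min(distances)]

print(distances)
print(best_match)
```

In practice a maximum allowed distance is usually set as well, so that strings with no reasonable counterpart are left unmatched rather than paired with a poor candidate.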

Understanding string distances

Calculating a string distance

Finding a match to a search typo

Methods of string distances

Edit distances vs. q-gram methods

Trying out different methods

Is one distance better than the other?

Fuzzy joins

Performing a string distance join

String distances of short strings

Custom Fuzzy Matching

Finding matches based on two conditions

Why join on multiple columns?

Congratulations
