Text Data In R Cheat Sheet

Welcome to our cheat sheet for working with text data in R! This resource is designed for R users who need a quick reference guide for common tasks related to cleaning, processing, and analyzing text data. The cheat sheet includes a list of useful functions and packages for these tasks and examples of how to use them.
Dec 2022  · 5 min read

If you're an R user who needs to clean, process, or analyze text data, this resource is for you. We've compiled common functions and packages for working with text data in R, each with a short example you can copy and adapt.

Some examples of what you'll find in the cheat sheet include:

  • Getting string lengths and substrings
  • Methods for converting text to lowercase or uppercase
  • Techniques for splitting or joining text

Whether you're a beginner or an experienced R programmer, we hope you'll find this cheat sheet to be a valuable resource for your text data projects. Ready to get started with text data in R? Download our cheat sheet now and have all the information you need at your fingertips!

Packages to install for this cheat sheet

Some functionality is available in base R, but the following packages are also used throughout this cheat sheet.

library(stringr)
library(snakecase)
library(glue)

Functions with names starting str_ are from stringr; those with names starting to_ are from snakecase; those with glue in the name are from glue.

Example data

Throughout this cheat sheet, we'll use the following character vector of card suits.

suits <- c("Clubs", "Diamonds", "Hearts", "Spades")

Get string lengths and substrings

# Get the number of characters with nchar()
nchar(suits) # Returns 5 8 6 6

# Get substrings by position with str_sub()
stringr::str_sub(suits, 1, 4) # Returns "Club" "Diam" "Hear" "Spad"

# Remove whitespace from the start/end with str_trim()
str_trim(" Lost in Whitespace ") # Returns "Lost in Whitespace"

# Truncate strings to a maximum width with str_trunc()
str_trunc(suits, width = 5) # Returns "Clubs" "Di..." "He..." "Sp..."

# Pad strings to a constant width with str_pad()
str_pad(suits, width = 8) # Returns "   Clubs" "Diamonds" "  Hearts" "  Spades"

# Pad strings on right with str_pad(side="right")
str_pad(suits, width = 8, side = "right", pad = "!")
# Returns "Clubs!!!" "Diamonds" "Hearts!!" "Spades!!"
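
A few related variations on the calls above, as a minimal sketch. Negative positions in str_sub() count from the end of the string, and base R's substr() covers the simple positional case. The variable names here are illustrative and not part of the original cheat sheet.

```r
library(stringr)

suits <- c("Clubs", "Diamonds", "Hearts", "Spades")

# Negative positions count backwards from the end of each string
last_three <- str_sub(suits, -3, -1) # "ubs" "nds" "rts" "des"

# Base R equivalent of str_sub() for positive positions
first_four <- substr(suits, 1, 4) # "Club" "Diam" "Hear" "Spad"

# str_length() is stringr's counterpart to nchar()
suit_lengths <- str_length(suits) # 5 8 6 6
```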

Changing case

# Convert to lowercase with tolower()
tolower(suits) # Returns "clubs" "diamonds" "hearts" "spades"

# Convert to uppercase with toupper()
toupper(suits) # Returns "CLUBS" "DIAMONDS" "HEARTS" "SPADES"

# Convert to title case with to_title_case()
# (by default, snakecase treats punctuation as a word separator and drops it)
to_title_case("hello, world!") # Returns "Hello World"

# Convert to sentence case with to_sentence_case()
to_sentence_case("hello, world!") # Returns "Hello world"
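
stringr also ships its own case converters, which only change letter case and leave punctuation in place. The sketch below shows them on the same input; it assumes stringr 1.4.0 or later, where str_to_sentence() is available, and the variable names are illustrative.

```r
library(stringr)

greeting <- "hello, world!"

# Capitalize the first letter of every word
title_case <- str_to_title(greeting) # "Hello, World!"

# Capitalize only the first letter of the string
sentence_case <- str_to_sentence(greeting) # "Hello, world!"

# stringr counterparts of toupper() and tolower()
upper_case <- str_to_upper(greeting) # "HELLO, WORLD!"
lower_case <- str_to_lower("HELLO, WORLD!") # "hello, world!"
```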

Formatting strings

# Format numbers with sprintf()
sprintf("%.3e", pi) # "3.142e+00"

# Substitute value in a string with an expression
glue('The answer is {ans}', ans = 30 + 10) # The answer is 40

# Substitute values from each row of a data frame with glue_data()
cards <- data.frame(value = c("8", "Queen", "Ace"),
                    suit = c("Diamonds", "Hearts", "Spades"))
cards %>% glue_data("{value} of {suit}")

# 8 of Diamonds
# Queen of Hearts
# Ace of Spades

# Wrap strings across multiple lines
str_wrap('The answer to the universe is 42', width = 25)
# The answer to the
# universe is 42
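
As a further sketch of the formatting tools above: sprintf() is vectorized over its arguments, and glue() can read variables from the calling environment instead of taking them as named arguments. The variable names (n_suits, msg, and so on) are illustrative assumptions.

```r
library(glue)

suits <- c("Clubs", "Diamonds", "Hearts", "Spades")

# sprintf() recycles the scalar 5L across the whole vector
labels <- sprintf("%d of %s", 5L, suits)
# "5 of Clubs" "5 of Diamonds" "5 of Hearts" "5 of Spades"

# Fixed number of decimals, and zero-padded integers
rounded <- sprintf("%.2f", pi) # "3.14"
padded <- sprintf("%03d", 7L)  # "007"

# glue() looks up {n_suits} in the environment it is called from
n_suits <- length(suits)
msg <- glue("There are {n_suits} suits") # There are 4 suits
```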

Splitting strings

# Split strings into list of characters with str_split(pattern = "")
str_split(suits, pattern = "")

# "C" "l" "u" "b" "s"
# "D" "i" "a" "m" "o" "n" "d" "s"
# "H" "e" "a" "r" "t" "s"
# "S" "p" "a" "d" "e" "s"

# Split strings by a separator with str_split()
str_split(suits, pattern = "a")

# "Clubs"
# "Di" "monds"
# "He" "rts"
# "Sp" "des"

# Split strings into matrix of n pieces with str_split_fixed()
str_split_fixed(suits, pattern = 'a', n = 2)

#      [,1]    [,2]
# [1,] "Clubs" ""
# [2,] "Di"    "monds"
# [3,] "He"    "rts"
# [4,] "Sp"    "des"
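
Because str_split() returns a list with one element per input string, the pieces are usually pulled out with [[ ]] or summarized with lengths(). A minimal sketch, with illustrative variable names; base R's strsplit() is shown for comparison.

```r
library(stringr)

suits <- c("Clubs", "Diamonds", "Hearts", "Spades")

# One list element per input string
pieces <- str_split(suits, pattern = "a")

# Extract the pieces for a single string with [[ ]]
hearts_pieces <- pieces[[3]] # "He" "rts"

# Count how many pieces each string was split into
n_pieces <- lengths(pieces) # 1 2 2 2

# Base R counterpart
base_pieces <- strsplit(suits, "a")
```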

Joining or concatenating strings

# Combine two strings with paste0()
paste0(suits, '5') # "Clubs5" "Diamonds5" "Hearts5" "Spades5"

# Combine strings with a separator with paste()
paste(5, suits, sep = " of ") # "5 of Clubs" "5 of Diamonds" "5 of Hearts" "5 of Spades"

# Collapse character vector to string with paste() or paste0()
paste(suits, collapse = ", ") # "Clubs, Diamonds, Hearts, Spades"

# Duplicate and concatenate strings with str_dup()
str_dup(suits, 2) # "ClubsClubs" "DiamondsDiamonds" "HeartsHearts" "SpadesSpades" 
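
The sep and collapse arguments can be combined in one call, and stringr's str_c() is a stricter alternative to paste0(). The sketch below also shows how the two treat missing values differently; the variable names are illustrative.

```r
library(stringr)

suits <- c("Clubs", "Diamonds", "Hearts", "Spades")

# sep joins the arguments element-wise, then collapse joins the results
hand <- paste(5, suits, sep = " of ", collapse = "; ")
# "5 of Clubs; 5 of Diamonds; 5 of Hearts; 5 of Spades"

# str_c() behaves like paste0() and also accepts collapse
joined <- str_c(suits, collapse = ", ") # "Clubs, Diamonds, Hearts, Spades"

# paste() converts NA to the text "NA"; str_c() propagates the missing value
paste_na <- paste("Ace", NA) # "Ace NA"
str_c_na <- str_c("Ace", NA) # NA
```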

Detecting matches

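# Highlight all string matches with str_view_all()
# (deprecated since stringr 1.5.0, where str_view() highlights all matches)
str_view_all(suits, "[ae]")

# Highlights the "a" in Diamonds, the "e" and "a" in Hearts,
# and the "a" and "e" in Spades; Clubs has no match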

# Detect if a regex pattern is present in strings with str_detect()
str_detect(suits, "[ae]") # FALSE TRUE TRUE TRUE

# Find the index of strings that match a regex with str_which()
str_which(suits, "[ae]") # 2 3 4

# Count the number of matches with str_count()
str_count(suits, "[ae]") # 0 1 2 2

# Locate the position of matches within strings with str_locate()
str_locate(suits, "[ae]")

#      start end
# [1,]    NA  NA
# [2,]     3   3
# [3,]     2   2
# [4,]     3   3
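
Matching is case-sensitive by default, and str_locate() reports only the first match. The sketch below shows a few variations under those assumptions: wrapping the pattern in regex(ignore_case = TRUE), inverting str_detect() with negate (stringr 1.4.0 or later), and locating every match with str_locate_all(). Variable names are illustrative.

```r
library(stringr)

suits <- c("Clubs", "Diamonds", "Hearts", "Spades")

# Case-sensitive: the capital "S" in Spades is not counted
lower_s <- str_count(suits, "s") # 1 1 1 1

# Case-insensitive matching with regex()
any_s <- str_count(suits, regex("s", ignore_case = TRUE)) # 1 1 1 2

# negate = TRUE flags the strings that do NOT match
no_a_or_e <- str_detect(suits, "[ae]", negate = TRUE) # TRUE FALSE FALSE FALSE

# str_locate_all() returns a list of matrices, one row per match
all_hits <- str_locate_all(suits, "[ae]")
hearts_starts <- unname(all_hits[[3]][, "start"]) # 2 3
```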

Extracting matches

# Extract matches from strings with str_extract()
str_extract(suits, ".[ae].") # NA "iam" "Hea" "pad"

# Extract matches and capture groups with str_match()
str_match(suits, ".([ae])(.)")

#      [,1]  [,2] [,3]
# [1,] NA    NA   NA
# [2,] "iam" "a"  "m"
# [3,] "Hea" "e"  "a"
# [4,] "pad" "a"  "d"

# Get subset of strings that match with str_subset()
str_subset(suits, "d") # "Diamonds" "Spades"
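
str_extract() returns only the first match per string. To collect every match, str_extract_all() can be used instead, as in this minimal sketch (variable names are illustrative).

```r
library(stringr)

suits <- c("Clubs", "Diamonds", "Hearts", "Spades")

# A list with one character vector of matches per input string
all_matches <- str_extract_all(suits, "[ae]")
hearts_matches <- all_matches[[3]] # "e" "a"
clubs_matches <- all_matches[[1]]  # character(0), since there is no match

# simplify = TRUE returns a character matrix instead,
# padding shorter rows with empty strings
match_matrix <- str_extract_all(suits, "[ae]", simplify = TRUE)
```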

Replacing matches

# Replace the first regex match with another string with str_replace()
str_replace(suits, "a", "4") # "Clubs" "Di4monds" "He4rts" "Sp4des"

# Remove the first match with str_remove()
str_remove(suits, "s") # "Club" "Diamond" "Heart" "Spade"

# Replace a substring with `str_sub<-`()
str_sub(suits, start = 1, end = 3) <- c("Bi", "Al", "Yu", "Hi")
suits # Returns "Bibs" "Almonds" "Yurts" "Hides"
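
str_replace() and str_remove() act on the first match only. The sketch below contrasts them with the _all variants and with base R's gsub(). It redefines suits first, because the `str_sub<-`() example above modified the vector in place; the other variable names are illustrative.

```r
library(stringr)

# Reset the example data
suits <- c("Clubs", "Diamonds", "Hearts", "Spades")

# First match only
first_only <- str_replace(suits, "[ae]", "_")
# "Clubs" "Di_monds" "H_arts" "Sp_des"

# Every match
all_replaced <- str_replace_all(suits, "[ae]", "_")
# "Clubs" "Di_monds" "H__rts" "Sp_d_s"

# Remove every lowercase vowel
no_vowels <- str_remove_all(suits, "[aeiou]")
# "Clbs" "Dmnds" "Hrts" "Spds"

# Base R counterpart of str_replace_all()
base_replaced <- gsub("[ae]", "_", suits)
```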
