Regular expressions, often shortened to regex, are sequences of characters used to check whether a pattern exists in a given text (string). If you've ever used search engines or the search-and-replace tools of word processors and text editors, you've already seen regular expressions in use. They are used on the server side to validate the format of email addresses or passwords during registration, and for parsing text data files to find, replace, or delete certain strings. They help in manipulating textual data, which is often a prerequisite for data science projects involving text mining.
This tutorial will walk you through the important concepts of regular expressions with Python. You will start with importing re, the Python library that supports regular expressions. Then you will see how basic/ordinary characters are used for performing matches, followed by wild or special characters. Next, you'll learn about using repetitions in your regular expressions. You'll also learn how to create groups and named groups within your search for ease of access to matches. Next, you'll get familiar with the concept of greedy vs. non-greedy matching.
This already seems like a lot, and hence, there is a handy summary table included to help you remember what you've seen so far with short definitions. Do check it out!
This tutorial also covers some very useful functions provided by the re library, such as compile(), search(), findall(), sub() for search and replace, split(), and some more. You will also learn about compilation flags that you can use to make your regex better.
In the end, there is a case study - where you can put your knowledge in use! So let's regex...
Regular Expressions in Python
In Python, regular expressions are supported by the re module. That means that if you want to start using them in your Python scripts, you have to import this module with the help of import:
import re
The re library in Python provides several functions that make it a skill worth mastering. You will see some of them closely in this tutorial.
Basic Patterns: Ordinary Characters
You can easily tackle many basic patterns in Python using ordinary characters. Ordinary characters are the simplest regular expressions. They match themselves exactly and do not have a special meaning in their regular expression syntax.
Examples are 'A', 'a', 'X', '5'.
Ordinary characters can be used to perform simple exact matches:
pattern = r"Cookie"
sequence = "Cookie"
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match!")
Match!
Most letters and characters will match themselves, as you saw in the example.
The match() function returns a match object if the text matches the pattern. Otherwise, it returns None. The re module also contains several other functions, and you will learn some of them later on in the tutorial.
For now, let's focus on ordinary characters!
Do you notice the r at the start of the pattern Cookie?
This is called a raw string literal. It changes how the string literal is interpreted. Such literals are stored as they appear.
For example, \ is just a backslash when prefixed with an r rather than being interpreted as an escape sequence. You will see what this means with special characters. Sometimes, the syntax involves backslash-escaped characters, and to prevent these characters from being interpreted as escape sequences, you use the raw r prefix.
TIP: You don't actually need it for this example; however, it is a good practice to use it for consistency.
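To make the effect of the r prefix concrete, here is a small sketch comparing a normal and a raw string literal. The variable names and the sample path are made up for illustration.

```python
import re

# In a normal string literal, "\n" is a single newline character.
normal = "\n"
# In a raw string literal, r"\n" is two characters: a backslash and an "n".
raw = r"\n"

print(len(normal))  # 1
print(len(raw))     # 2

# To match one literal backslash, the regex itself needs an escaped backslash.
# As a raw string that is r"\\"; without the prefix you would have to write "\\\\".
path = r"C:\temp"  # hypothetical sample text
backslash_match = re.search(r"\\", path)
print(backslash_match.start())  # 2
print(r"\\" == "\\\\")          # True
```

The raw form keeps the pattern readable because each backslash you type is a backslash the regex engine sees.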
Wild Card Characters: Special Characters
Special characters are characters that do not match themselves as seen but have a special meaning when used in a regular expression. For simple understanding, they can be thought of as reserved metacharacters that denote something else and not what they look like.
Let's check out some examples to see the special characters in action...
But before you do, note that the examples below make use of two functions, namely search() and group().
With the search function, you scan through the given string/sequence, looking for the first location where the regular expression produces a match.
The group function returns the string matched by the regular expression. You will see both these functions in more detail later.
Back to the special characters now.
.
- A period. Matches any single character except the newline character.
re.search(r'Co.k.e', 'Cookie').group()
'Cookie'
^
- A caret. Matches the start of the string.
This is helpful if you want to make sure a document/sentence starts with certain characters.
re.search(r'^Eat', "Eat cake!").group()
## However, the code below will not give the same result. Try it for yourself:
# re.search(r'^eat', "Let's eat cake!").group()
'Eat'
$
- Matches the end of string.
This is helpful if you want to make sure a document/sentence ends with certain characters.
re.search(r'cake$', "Cake! Let's eat cake").group()
## The next search returns None, so calling .group() on it raises an AttributeError. Try it:
# re.search(r'cake$', "Let's get some cake on our way home!").group()
'cake'
[abc]
- Matches a or b or c.
[a-zA-Z0-9]
- Matches any letter from (a to z) or (A to Z) or any digit from (0 to 9).
TIP: Characters that are not within a range can be matched by complementing the set. If the first character of the set is ^, all the characters that are not in the set will be matched.
re.search(r'[0-6]', 'Number: 5').group()
'5'
## Matches any character except 5
re.search(r'Number: [^5]', 'Number: 0').group()
## This will not match, so search() returns None and .group() would raise an AttributeError
#re.search(r'Number: [^5]', 'Number: 5').group()
'Number: 0'
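As a further illustration of sets, ranges, and complemented sets, the following sketch uses findall() (covered later in this tutorial) on a made-up sample string.

```python
import re

text = "Order 66 shipped to Unit B-12"  # hypothetical sample text

# A range inside a set matches any single character in that range.
digits = re.findall(r"[0-9]", text)
print(digits)  # ['6', '6', '1', '2']

# A leading ^ inside the set complements it:
# here, anything that is not a letter, a digit, or a space.
others = re.findall(r"[^a-zA-Z0-9 ]", text)
print(others)  # ['-']

# Outside a set, ^ keeps its usual meaning (start of the string).
starts_upper = re.search(r"^[A-Z]", text) is not None
print(starts_upper)  # True
```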
\
- Backslash.
Perhaps the most diverse metacharacter!
- If the character following the backslash is a recognized escape character, then the special meaning of the term is taken (Scenario 1).
- Else, if the character following the \ is not a recognized escape character, then the \ is treated like any other character and passed through (Scenario 2). Note that since Python 3.6, an unknown escape made of \ and an ASCII letter raises an error instead.
- \ can be used in front of all the metacharacters to remove their special meaning (Scenario 3).
## (Scenario 1) '\s' is a recognized escape sequence: it matches a whitespace character, here the space
re.search(r'Not a\sregular character', 'Not a regular character').group()
'Not a regular character'
## (Scenario 2) Note: '\r' is in fact a recognized escape (carriage return). This matches because the non-raw subject string also contains a carriage return
re.search(r'Just a \regular character', 'Just a \regular character').group()
'Just a \regular character'
## (Scenario 3) The backslash is escaped with an extra `\`, so '\\s' in the pattern matches a literal backslash followed by 's'
re.search(r'Just a \\sregular character', r'Just a \sregular character').group()
'Just a \\sregular character'
There is a predefined set of special sequences that begin with '\' and are also very helpful when performing search and match. Let's look at some of them up close...
\w
- Lowercase 'w'. Matches any single letter, digit, or underscore.
\W
- Uppercase 'W'. Matches any character not part of \w (lowercase w).
print("Lowercase w:", re.search(r'Co\wk\we', 'Cookie').group())
## Matches any character except single letter, digit or underscore
print("Uppercase W:", re.search(r'C\Wke', 'C@ke').group())
## Uppercase W won't match single letter, digit
print("Uppercase W won't match, and return:", re.search(r'Co\Wk\We', 'Cookie'))
Lowercase w: Cookie
Uppercase W: C@ke
Uppercase W won't match, and return: None
\s
- Lowercase 's'. Matches a single whitespace character like: space, newline, tab, return.
\S
- Uppercase 'S'. Matches any character not part of \s (lowercase s).
print("Lowercase s:", re.search(r'Eat\scake', 'Eat cake').group())
print("Uppercase S:", re.search(r'cook\Se', "Let's eat cookie").group())
Lowercase s: Eat cake
Uppercase S: cookie
\d
- Lowercase d. Matches decimal digit 0-9.
\D
- Uppercase D. Matches any character that is not a decimal digit.
# Example for \d
print("How many cookies do you want? ", re.search(r'\d+', '100 cookies').group())
How many cookies do you want? 100
The + symbol used after the \d in the example above is used for repetition. You will see this in some more detail in the repetition section later on...
\t
- Lowercase t. Matches tab.
\n
- Lowercase n. Matches newline.
\r
- Lowercase r. Matches return.
\A
- Uppercase A. Matches only at the start of the string, even when the string spans multiple lines.
\Z
- Uppercase Z. Matches only at the end of the string.
TIP: ^ and \A are effectively the same, and so are $ and \Z, except when dealing with MULTILINE mode. Learn more about it in the flags section.
\b
- Lowercase b. Matches only the beginning or end of the word.
# Example for \t
print("\\t (TAB) example: ", re.search(r'Eat\tcake', 'Eat\tcake').group())
# Example for \b
print("\\b match gives: ",re.search(r'\b[A-E]ookie', 'Cookie').group())
\t (TAB) example: Eat cake
\b match gives: Cookie
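The word boundary \b is easiest to appreciate when a word also occurs inside a longer word. A minimal sketch, using a made-up sentence:

```python
import re

sentence = "A pancake is not a cake"  # hypothetical sample text

# Without boundaries, 'cake' is also found inside 'pancake'.
anywhere = re.findall(r"cake", sentence)
print(anywhere)  # ['cake', 'cake']

# With \b on both sides, only the stand-alone word matches.
whole_word = re.search(r"\bcake\b", sentence)
print(whole_word.span())  # (19, 23)
```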
Repetitions
Writing out long patterns character by character quickly becomes tedious. Fortunately, the re module handles repetitions using the following special characters:
+
- Checks if the preceding character appears one or more times starting from that position.
re.search(r'Co+kie', 'Cooookie').group()
'Cooookie'
*
- Checks if the preceding character appears zero or more times starting from that position.
# Checks for any occurrence of a or o or both in the given sequence
re.search(r'Ca*o*kie', 'Cookie').group()
'Cookie'
?
- Checks if the preceding character appears exactly zero or one time starting from that position.
# Checks for exactly zero or one occurrence of u after 'Colo'
re.search(r'Colou?r', 'Color').group()
'Color'
But what if you want to check for an exact number of repetitions?
For example, checking the validity of a phone number in an application. The re module handles this very gracefully as well using the following regular expressions:
{x}
- Repeat exactly x number of times.
{x,}
- Repeat at least x times or more.
{x,y}
- Repeat at least x times but no more than y times. Write it without a space after the comma, otherwise the braces are treated as literal characters.
re.search(r'\d{9,10}', '0987654321').group()
'0987654321'
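As a sketch of the phone number idea, assume (purely for illustration) that a valid number is exactly 10 digits. The helper is_valid_phone is a made-up name, and re.fullmatch() is used so that the whole string has to fit the pattern, not just a part of it.

```python
import re

def is_valid_phone(number):
    """Hypothetical rule: a valid phone number is exactly 10 digits."""
    return re.fullmatch(r"\d{10}", number) is not None

print(is_valid_phone("0987654321"))   # True
print(is_valid_phone("12345"))        # False: too short
print(is_valid_phone("09876543210"))  # False: too long

# With a range, the quantifier takes as many repetitions as it can, up to the maximum.
longest = re.search(r"\d{2,4}", "123456").group()
print(longest)  # 1234
```

Note that search() with \d{9,10} on its own would also accept a longer digit string, because it only needs to find 9 or 10 digits somewhere inside it.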
The + and * quantifiers are said to be greedy. You will see what this means later on.
Grouping in Regular Expressions
The group feature of regular expressions allows you to pick up parts of the matching text. Parts of a regular expression pattern bounded by parentheses () are called groups. The parentheses do not change what the expression matches, but rather form groups within the matched sequence. You have been using the group() function all along in this tutorial's examples. The plain match.group() without any argument is still the whole matched text as usual.
Let's understand this concept with a simple example. Imagine you were validating email addresses and wanted to check the user name and host. This is when you would want to create separate groups within your matched text.
statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'([\w\.-]+)@([\w\.-]+)', statement)
if match:
    print("Email address:", match.group()) # The whole matched text
    print("Username:", match.group(1)) # The username (group 1)
    print("Host:", match.group(2)) # The host (group 2)
Email address: support@datacamp.com
Username: support
Host: datacamp.com
Another way of doing the same is to use named groups, which are written with <> brackets. Named groups will make your code more readable. The syntax for creating a named group is (?P<name>...). Replace the name part with the name you want to give to your group. The ... represents the rest of the matching syntax. See this in action using the same example as before...
statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'(?P<email>(?P<username>[\w\.-]+)@(?P<host>[\w\.-]+))', statement)
if match:
    print("Email address:", match.group('email'))
    print("Username:", match.group('username'))
    print("Host:", match.group('host'))
Email address: support@datacamp.com
Username: support
Host: datacamp.com
TIP: You can always access the named groups using numbers instead of the name. But as the number of groups increases, it gets harder to handle them using numbers alone. So, always make it a habit to use named groups instead.
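Named groups can also be collected in one go with the match object's groupdict() method. A minimal sketch, reusing the email example with a slightly simplified pattern:

```python
import re

statement = 'Please contact us at: support@datacamp.com'
match = re.search(r'(?P<username>[\w.-]+)@(?P<host>[\w.-]+)', statement)

if match:
    # groupdict() maps every group name to the text it captured.
    print(match.groupdict())  # {'username': 'support', 'host': 'datacamp.com'}
    # Named groups remain reachable by number as well.
    print(match.group(1) == match.group('username'))  # True
```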
Greedy vs. Non-Greedy Matching
When a special character matches as much of the search sequence (string) as possible, it is said to be a "Greedy Match". It is the normal behavior of a regular expression, but sometimes this behavior is not desired:
heading = r'<h1>TITLE</h1>'
re.match(r'<.*>', heading).group()
'<h1>TITLE</h1>'
The pattern <.*> matched the whole string, right up to the second occurrence of >.
However, if you only wanted to match the first <h1> tag, you could have used the non-greedy quantifier *? that matches as little text as possible.
Adding ? after the quantifier makes it perform the match in a non-greedy or minimal fashion; that is, as few characters as possible will be matched. When you run <.*?>, you will only get a match with <h1>.
heading = r'<h1>TITLE</h1>'
re.match(r'<.*?>', heading).group()
'<h1>'
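The difference becomes even clearer when you collect every match with findall() (covered later in this tutorial). A short sketch on the same heading:

```python
import re

heading = '<h1>TITLE</h1>'

# Greedy: .* runs to the last '>' in the string, giving one long match.
greedy = re.findall(r'<.*>', heading)
print(greedy)  # ['<h1>TITLE</h1>']

# Non-greedy: .*? stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r'<.*?>', heading)
print(lazy)  # ['<h1>', '</h1>']

# The same ? suffix works for + as well.
shortest = re.search(r'o+?', 'Cooookie').group()
print(shortest)  # o
```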
Summary Table
You have already come a long way with regular expressions. It is a lot of information and concepts to grasp! The following table summarizes all that you've seen so far in this tutorial. Don't worry if you can't wrap your head around all the metacharacters just yet. With time and practice, you will be able to see the uniqueness of these characters and learn when to use what...
This tutorial does not discuss all the special sequences provided in Python. Check out the Standard Library reference for a complete list.
Character(s) | What it does |
---|---|
. | A period. Matches any single character except the newline character. |
^ | A caret. Matches a pattern at the start of the string. |
\A | Uppercase A. Matches only at the start of the string. |
$ | Dollar sign. Matches the end of the string. |
\Z | Uppercase Z. Matches only at the end of the string. |
[ ] | Matches the set of characters you specify within it. |
\ | ∙ If the character following the backslash is a recognized escape character, then the special meaning of the term is taken. ∙ Else the backslash (\) is treated like any other character and passed through. ∙ It can be used in front of all the metacharacters to remove their special meaning. |
\w | Lowercase w. Matches any single letter, digit, or underscore. |
\W | Uppercase W. Matches any character not part of \w (lowercase w). |
\s | Lowercase s. Matches a single whitespace character like: space, newline, tab, return. |
\S | Uppercase S. Matches any character not part of \s (lowercase s). |
\d | Lowercase d. Matches decimal digit 0-9. |
\D | Uppercase D. Matches any character that is not a decimal digit. |
\t | Lowercase t. Matches tab. |
\n | Lowercase n. Matches newline. |
\r | Lowercase r. Matches return. |
\b | Lowercase b. Matches only the beginning or end of the word. |
+ | Checks if the preceding character appears one or more times. |
* | Checks if the preceding character appears zero or more times. |
? | ∙ Checks if the preceding character appears exactly zero or one time. ∙ Specifies a non-greedy version of +, * |
{ } | Checks for an explicit number of times. |
( ) | Creates a group when performing matches. |
(?P<name>...) | Creates a named group when performing matches. |
You have tackled the basics of regex! However, there are some more concepts that can help you on your way of creating some beautiful regular expressions to do search and match.
TIP: Although regular expressions are very powerful and helpful, be wary of long, confusing expressions that are hard for others, and for you, to understand and maintain over time.
Functions Provided by 're'
The re library in Python provides several functions to make your tasks easier. You have already seen some of them, such as re.search() and re.match(). Let's check out more...
compile(pattern, flags=0)
Regular expressions are handled as strings by Python. However, with compile(), you can compile a regular expression pattern into a regular expression object. When you need to use an expression several times in a single program, using compile() to save the resulting regular expression object for reuse is more efficient than passing the pattern string each time. That said, the compiled versions of the most recent patterns passed to compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time need not worry about compiling them.
pattern = re.compile(r"cookie")
sequence = "Cake and cookie"
pattern.search(sequence).group()
'cookie'
# This is equivalent to:
re.search(pattern, sequence).group()
'cookie'
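A compiled pattern pays off most when the same expression is applied to many strings. The sketch below reuses one pattern object in a loop; the sample lines and variable names are made up for illustration.

```python
import re

# Compile once, reuse for every line.
email_pattern = re.compile(r'[\w.-]+@[\w.-]+')

lines = [
    "Write to support@datacamp.com",
    "No address here",
    "Or try xyz@datacamp.com today",
]

found = []
for line in lines:
    match = email_pattern.search(line)
    if match:  # search() returns None when a line has no address
        found.append(match.group())

print(found)  # ['support@datacamp.com', 'xyz@datacamp.com']
# The original pattern string stays available on the compiled object.
print(email_pattern.pattern)
```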
search(pattern, string, flags=0)
With this function, you scan through the given string/sequence, looking for the first location where the regular expression produces a match. It returns a corresponding match object if found, or None if no position in the string matches the pattern. Note that None is different from finding a zero-length match at some point in the string.
pattern = "cookie"
sequence = "Cake and cookie"
re.search(pattern, sequence)
<re.Match object; span=(9, 15), match='cookie'>
match(pattern, string, flags=0)
Returns a corresponding match object if zero or more characters at the beginning of the string match the pattern. Otherwise, it returns None.
pattern = "C"
sequence1 = "IceCream"
sequence2 = "Cake"
# No match since "C" is not at the start of "IceCream"
print("Sequence 1: ", re.match(pattern, sequence1))
print("Sequence 2: ", re.match(pattern,sequence2).group())
Sequence 1: None
Sequence 2: C
search() versus match()
The match() function checks for a match only at the beginning of the string (by default), whereas the search() function checks for a match anywhere in the string.
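The contrast can be seen side by side on the "Cake and cookie" string used earlier. This sketch also shows that anchoring search() with \A makes it behave like match() for this pattern.

```python
import re

sequence = "Cake and cookie"

# match() only looks at the beginning of the string.
at_start = re.match(r"cookie", sequence)
print(at_start)  # None

# search() scans the whole string.
anywhere = re.search(r"cookie", sequence)
print(anywhere.span())  # (9, 15)

# Anchoring search() at the start of the string gives the same answer as match().
anchored = re.search(r"\Acookie", sequence)
print(anchored)  # None
```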
findall(pattern, string, flags=0)
Finds all the possible matches in the entire sequence and returns them as a list of strings. Each returned string represents one match.
statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"
# 'addresses' is a list that stores all the possible matches
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', statement)
for address in addresses:
    print(address)
support@datacamp.com
xyz@datacamp.com
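One detail worth knowing: when the pattern contains groups, findall() returns the groups rather than the whole match, as a list of tuples when there is more than one group. A short sketch on the same statement:

```python
import re

statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# Two groups in the pattern: each result is a (username, host) tuple.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', statement)
print(pairs)  # [('support', 'datacamp.com'), ('xyz', 'datacamp.com')]

for username, host in pairs:
    print(username, "at", host)
```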
finditer(pattern, string, flags=0)
Similar to findall(): it finds all the possible matches in the entire sequence but returns regex match objects as an iterator.
TIP: finditer() might be an excellent choice when you want to have more information returned to you about your search. The returned regex match objects hold not only the sequences that matched but also their positions in the original text.
statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"
# 'addresses' is an iterator that yields a match object for each match
addresses = re.finditer(r'[\w\.-]+@[\w\.-]+', statement)
for address in addresses:
    print(address)
<re.Match object; span=(22, 42), match='support@datacamp.com'>
<re.Match object; span=(44, 60), match='xyz@datacamp.com'>
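Instead of printing the match objects themselves, you can pull out the matched text and its position from each one. A minimal sketch on the same statement:

```python
import re

statement = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

spans = []
for match in re.finditer(r'[\w.-]+@[\w.-]+', statement):
    # Each match object knows what it matched and where.
    print(match.group(), "found at", match.start(), "to", match.end())
    spans.append(match.span())

print(spans)  # [(22, 42), (44, 60)]
```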
sub(pattern, repl, string, count=0, flags=0)
subn(pattern, repl, string, count=0, flags=0)
sub() is the substitute function. It returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with the replacement repl. If the pattern is not found, then the string is returned unchanged.
subn() is similar to sub(). However, it returns a tuple containing the new string value and the number of replacements that were performed in the statement.
statement = "Please contact us at: xyz@datacamp.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', statement)
print(new_email_address)
Please contact us at: support@datacamp.com
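The following sketch shows subn(), the count argument, and a backreference (\1) in the replacement, which reuses the text captured by the first group. The sample statement is made up for illustration.

```python
import re

statement = "Please contact us at: xyz@datacamp.com, abc@datacamp.com"
pattern = r'[\w.-]+@([\w.-]+)'

# \1 in the replacement inserts whatever group 1 (the host) captured.
new_statement, replacements = re.subn(pattern, r'support@\1', statement)
print(new_statement)  # Please contact us at: support@datacamp.com, support@datacamp.com
print(replacements)   # 2

# count=1 limits the substitution to the first (leftmost) match.
first_only = re.sub(pattern, r'support@\1', statement, count=1)
print(first_only)  # Please contact us at: support@datacamp.com, abc@datacamp.com
```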
split(pattern, string, maxsplit=0, flags=0)
This splits the string wherever the pattern matches and returns a list. If the optional argument maxsplit is nonzero, then at most maxsplit splits are performed, and the remainder of the string is returned as the final element of the list.
statement = "Please contact us at: xyz@datacamp.com, support@datacamp.com"
pattern = re.compile(r'[:,]')
address = pattern.split(statement)
print(address)
['Please contact us at', ' xyz@datacamp.com', ' support@datacamp.com']
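Two variations on the example above, as a sketch: limiting the number of splits with maxsplit, and letting the pattern swallow the surrounding whitespace so the pieces come back trimmed.

```python
import re

statement = "Please contact us at: xyz@datacamp.com, support@datacamp.com"

# maxsplit=1: split once, keep the rest of the string as the last element.
once = re.split(r'[:,]', statement, maxsplit=1)
print(once)  # ['Please contact us at', ' xyz@datacamp.com, support@datacamp.com']

# Including optional whitespace in the pattern removes the leading spaces.
trimmed = re.split(r'\s*[:,]\s*', statement)
print(trimmed)  # ['Please contact us at', 'xyz@datacamp.com', 'support@datacamp.com']
```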
start()
- Returns the starting index of the match.
end()
- Returns the index where the match ends.
span()
- Returns a tuple containing the (start, end) positions of the match.
pattern = re.compile('COOKIE', re.IGNORECASE)
match = pattern.search("I am not a cookie monster")
print("Start index:", match.start())
print("End index:", match.end())
print("Tuple:", match.span())
Start index: 11
End index: 17
Tuple: (11, 17)
Compilation Flags
Did you notice the term re.IGNORECASE in the last example? Did you figure out its importance?
An expression's behavior can be modified by specifying a flag value. You can add flags as an extra argument to the different functions that you have seen in this tutorial. Some of the more useful ones are:
IGNORECASE (I) - Allows case-insensitive matches.
DOTALL (S) - Allows . to match any character, including newline.
MULTILINE (M) - Allows the start of string (^) and end of string ($) anchors to also match at the beginning and end of each line (that is, right after and right before each newline).
VERBOSE (X) - Allows you to write whitespace and comments within a regular expression to make it more readable.
statement = "Please contact us at: support@DataCamp.com, xyz@DATACAMP.com"
# Using the VERBOSE flag helps understand complex regular expressions
pattern = re.compile(r"""
[\w\.-]+ #First part
@ #Matches @ sign within email addresses
datacamp\.com #Domain (the dot is escaped so it matches a literal '.')
""", re.X | re.I)
addresses = re.findall(pattern, statement)
for address in addresses:
    print("Address: ", address)
Address: support@DataCamp.com
Address: xyz@DATACAMP.com
TIP: You can also combine multiple flags by using bitwise OR |.
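The MULTILINE and DOTALL flags are easiest to understand on a string that contains a newline. A minimal sketch, using a made-up two-line text:

```python
import re

text = "first line\nsecond line"  # hypothetical two-line sample

# Without MULTILINE, ^ only matches at the very start of the string.
default_starts = re.findall(r'^\w+', text)
print(default_starts)  # ['first']

# With MULTILINE, ^ also matches right after each newline.
line_starts = re.findall(r'^\w+', text, re.M)
print(line_starts)  # ['first', 'second']

# \A ignores MULTILINE and still matches only at the start of the string.
string_start = re.findall(r'\A\w+', text, re.M)
print(string_start)  # ['first']

# Without DOTALL, '.' refuses to match the newline; with it, the match succeeds.
no_dotall = re.search(r'line.second', text)
with_dotall = re.search(r'line.second', text, re.S)
print(no_dotall)                  # None
print(repr(with_dotall.group()))  # 'line\nsecond'
```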
Case Study: Working with Regular Expressions
Now that you have seen how regular expressions work in Python by studying some examples, it's time to get your hands dirty! In this case study, you'll put all your knowledge to work.
You will work with the first part of a free e-book titled "The Idiot", written by Fyodor Dostoyevsky, from Project Gutenberg. The novel is about Prince (Knyaz) Lev Nikolayevich Myshkin, a guileless man whose good, kind, simple nature mistakenly leads many to believe he lacks intelligence and insight. The title is an ironic reference to this young man.
You shall be writing some regular expressions to parse through the text and complete some exercises.
import re
import requests
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'
def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Discards the text starting Part 2 of the book
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()
book = get_book(the_idiot_url)
processed_book = preprocess(book)
#print(processed_book)
- Exercise: Find the number of occurrences of "the" in the corpus. Hint: Use the len() function.
len(re.findall(r'the', processed_book))
302
- Exercise: Try to convert every single stand-alone instance of 'i' to 'I' in the corpus. Make sure not to change the 'i' occurring within a word:
processed_book = re.sub(r'\si\s', " I ", processed_book)
#print(processed_book)
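One caveat with the patterns used in these two exercises: r'the' also counts "the" inside longer words, and r'\si\s' misses an 'i' at the very start or end of the text. The sketch below, on made-up sample strings rather than the book itself, shows how the word boundary \b tightens both.

```python
import re

sample = "i think i am what i am. am i"  # hypothetical sample text

# \s on both sides misses the first and the last 'i'.
with_spaces = re.sub(r'\si\s', ' I ', sample)
print(with_spaces)  # i think I am what I am. am i

# \b matches at word edges without consuming any character.
with_boundaries = re.sub(r'\bi\b', 'I', sample)
print(with_boundaries)  # I think I am what I am. am I

# The same idea separates the word 'the' from words that merely contain it.
phrase = "the other theme"  # hypothetical sample text
print(len(re.findall(r'the', phrase)))      # 3
print(len(re.findall(r'\bthe\b', phrase)))  # 1
```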
- Exercise: Find the number of times anyone was quoted ("") in the corpus.
len(re.findall(r'\”', book))
0
- Exercise: What are the words connected by '--' in the corpus?
Try this out yourself! Feel free to share your answer in the comments below.
Congrats!
You have made it to the end of this Python regular expressions tutorial! There is much more to cover in your data science journey with Python.
Regex can play an important role in the data pre-processing phase. Check out DataCamp's Cleaning Data in Python course. This course teaches you ways to better explore your data by tidying and cleaning it for data analysis purposes. It also includes a case study in the end where you can put your newly-acquired knowledge to use.
Take a look at our Python String Replace Tutorial.