Using Regular Expressions to Clean Strings
If you want to take our free Intro to R course, here is the link.
Extracting numerical values from strings
Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.
Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'. It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.
When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the re.findall() function. Dan did not discuss this in the video, but it is straightforward to use: you pass in a pattern and a string to re.findall(), and it will return a list of the matches.
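As a minimal sketch of that call, the snippet below applies re.findall() to the example string from the start of this section. The single-digit pattern and the variable names are chosen here for illustration only. Note that the matches come back as strings, so they need converting before any arithmetic.

```python
import re

text = 'the recipe calls for 6 strawberries and 2 bananas'

# re.findall() returns every non-overlapping match as a list of strings
quantities = re.findall(r'\d', text)
print(quantities)  # ['6', '2']

# Convert to integers before computing something like a ratio
strawberries, bananas = (int(q) for q in quantities)
ratio = strawberries / bananas
print(ratio)  # 3.0
```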
Instructions
- Import re.
- Write a pattern that will find all the numbers in the following string: 'the recipe calls for 10 strawberries and 1 banana'. To do this:
  - Use the re.findall() function and pass it two arguments: the pattern, followed by the string.
  - \d is the pattern required to find digits. This should be followed with a + so that the previous element is matched one or more times. This ensures that 10 is viewed as one number and not as 1 and 0.
- Print the matches to confirm that your regular expression found the values 10 and 1.
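To see why the + matters, here is a small sketch comparing \d with \d+ on a hypothetical string that is not part of the exercise. The string and variable names are assumptions made for illustration.

```python
import re

# Hypothetical example string, not the one used in the exercise
text = 'order 42 has 7 items'

# \d alone matches one digit at a time, so 42 is split into two matches
single_digits = re.findall(r'\d', text)
print(single_digits)  # ['4', '2', '7']

# \d+ matches runs of one or more digits, keeping 42 intact
whole_numbers = re.findall(r'\d+', text)
print(whole_numbers)  # ['42', '7']
```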
# Import the regular expression module
____
# Find the numeric values: matches
matches = re.findall('____', '____')
# Print the matches
print(____)
# Import the regular expression module
import re
# Find the numeric values: matches
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
# Print the matches
print(matches)
Ex().test_import('re')
Ex().test_correct(test_object('matches'), test_function('re.findall'))
Ex().test_function('print')
success_msg('Excellent work - your regular expression successfully extracted the numeric values 10 and 1 from the string!')
- Use the command import x to import the module x.
- The first argument to re.findall() should be \d+, and the second argument should be the string: 'the recipe calls for 10 strawberries and 1 banana'.
- Use the provided print() function to print matches.
If that makes sense, keep going to the next exercise! If not, here is an overview video.
Overview video on using regular expressions to clean strings in Python.
Pattern matching
In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.
Instructions
- Write patterns to match:
  - A telephone number of the format xxx-xxx-xxxx. You already did this in a previous exercise.
  - A string of the format: A dollar sign, an arbitrary number of digits, a decimal point, 2 digits.
    - Use \$ to match the dollar sign, \d* to match an arbitrary number of digits, \. to match the decimal point, and \d{x} to match x number of digits.
  - A capital letter, followed by an arbitrary number of alphanumeric characters.
    - Use [A-Z] to match any capital letter followed by \w* to match an arbitrary number of alphanumeric characters.
import re
# Write the first pattern
pattern1 = bool(re.match(pattern='____', string='123-456-7890'))
print(pattern1)
# Write the second pattern
pattern2 = bool(re.match(pattern='____', string='$123.45'))
print(pattern2)
# Write the third pattern
pattern3 = bool(re.match(pattern='____', string='Australia'))
print(pattern3)
import re
# Write the first pattern
pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
print(pattern1)
# Write the second pattern
pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
print(pattern2)
# Write the third pattern
pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
print(pattern3)
Ex().test_correct(test_object('pattern1'), test_function('re.match', index=1))
Ex().test_function('print', index=1)
Ex().test_correct(test_object('pattern2'), test_function('re.match', index=2))
Ex().test_function('print', index=2)
Ex().test_correct(test_object('pattern3'), test_function('re.match', index=3))
Ex().test_function('print', index=3)
success_msg("Great work! You're mastering the fundamentals of writing regular expressions!")
- There are three components to the first pattern your regular expression needs to match: xxx, xxx, and xxxx. The first two are matched by \d{3}, while the last one is matched by \d{4}. Each of these components must be separated by a -.
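One detail worth keeping in mind with the patterns in this exercise: re.match() only checks that the pattern matches at the start of the string, so any trailing characters are ignored. The sketch below uses hypothetical input strings (not from the exercise) to contrast it with re.fullmatch(), which requires the whole string to fit the pattern.

```python
import re

# Dollar sign, any number of digits, a decimal point, exactly two digits
price_pattern = r'\$\d*\.\d{2}'

# re.match() anchors at the start only; the extra trailing digit is ignored
loose = bool(re.match(price_pattern, '$123.456'))
print(loose)  # True

# re.fullmatch() requires the entire string to fit the pattern
strict = bool(re.fullmatch(price_pattern, '$123.456'))
print(strict)  # False

exact = bool(re.fullmatch(price_pattern, '$123.45'))
print(exact)  # True

# \d* allows zero digits, so a price with no leading digits also matches
no_leading_digits = bool(re.fullmatch(price_pattern, '$.99'))
print(no_leading_digits)  # True
```

Which function fits depends on the task: re.match() is enough to check how a string begins, while re.fullmatch() is the safer choice when validating an entire value.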
If you want to learn more from this course, here is the link.