
Using Regular Expressions to Clean Strings

This tutorial takes course material from DataCamp's Cleaning Data in Python course and shows you how to clean strings using regular expressions.
Sep 28, 2018  · 4 min read

If you want to take our free Intro to R course, here is the link.

Extracting Numerical Values from Strings

Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'.

It would be useful to extract the 6 and the 2 from this string to be saved for later use when comparing strawberry to banana ratios.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the re.findall() function. Dan did not discuss this in the video, but it is straightforward to use: You pass in a pattern and a string to re.findall(), and it will return a list of the matches.
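As a quick illustration of that call, here is a minimal sketch that applies re.findall() to the recipe sentence quoted above. The variable names (text, strawberries, bananas) are only illustrative and are not part of the course material.

```python
import re

# The example sentence quoted earlier in this section
text = 'the recipe calls for 6 strawberries and 2 bananas'

# \d+ matches one or more consecutive digits;
# re.findall() returns every match as a string
matches = re.findall(r'\d+', text)
print(matches)  # ['6', '2']

# The matches are strings, so convert them before doing arithmetic
# (the variable names here are illustrative)
strawberries, bananas = (int(m) for m in matches)
print(strawberries / bananas)  # 3.0
```

Note that re.findall() returns strings rather than numbers, which is why the conversion with int() is needed before computing a ratio.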

Instructions

  • Import re.
  • Write a pattern that will find all the numbers in the following string: 'the recipe calls for 10 strawberries and 1 banana'. To do this:
    • Use the re.findall() function and pass it two arguments: the pattern, followed by the string.
    • \d is the pattern required to find digits. This should be followed with a + so that the previous element is matched one or more times. This ensures that 10 is viewed as one number and not as 1 and 0.
  • Print the matches to confirm that your regular expression found the values 10 and 1.
Exercise template

    # Import the regular expression module
    ____

    # Find the numeric values: matches
    matches = re.findall('____', '____')

    # Print the matches
    print(____)

Solution

    # Import the regular expression module
    import re

    # Find the numeric values: matches
    # (a raw string keeps the backslash in \d+ intact)
    matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

    # Print the matches
    print(matches)

Hint

  • Use the command import x to import the module x.
  • The first argument to re.findall() should be \d+, and the second argument should be the string: 'the recipe calls for 10 strawberries and 1 banana'.
  • Use the provided print() function to print matches.
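
To see why the + quantifier matters, the following sketch compares \d with \d+ on the exercise string. It is only an illustration of the point made in the instructions; the variable names are assumptions.

```python
import re

text = 'the recipe calls for 10 strawberries and 1 banana'

# Without +, every digit is matched on its own, so 10 is split in two
single_digits = re.findall(r'\d', text)
print(single_digits)  # ['1', '0', '1']

# With +, a run of consecutive digits is kept together as one match
whole_numbers = re.findall(r'\d+', text)
print(whole_numbers)  # ['10', '1']
```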

If that makes sense, keep going to the next exercise! If not, here is an overview video.

Overview Video on Using Regular Expressions to Clean Strings in Python

Pattern Matching

In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

Instructions

  • Write patterns to match:
    • A telephone number of the format xxx-xxx-xxxx. You already did this in a previous exercise.
    • A string of the format: a dollar sign, an arbitrary number of digits, a decimal point, and 2 digits.
      • Use \$ to match the dollar sign, \d* to match an arbitrary number of digits, \. to match the decimal point, and \d{x} to match x number of digits.
    • A capital letter, followed by an arbitrary number of alphanumeric characters.
      • Use [A-Z] to match any capital letter followed by \w* to match an arbitrary number of alphanumeric characters.
Exercise template

    import re

    # Write the first pattern
    pattern1 = bool(re.match(pattern='____', string='123-456-7890'))
    print(pattern1)

    # Write the second pattern
    pattern2 = bool(re.match(pattern='____', string='$123.45'))
    print(pattern2)

    # Write the third pattern
    pattern3 = bool(re.match(pattern='____', string='Australia'))
    print(pattern3)

Solution

    import re

    # Write the first pattern: three digits, three digits, four digits, separated by hyphens
    pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
    print(pattern1)

    # Write the second pattern: dollar sign, any number of digits, decimal point, two digits
    pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
    print(pattern2)

    # Write the third pattern: a capital letter followed by any number of alphanumeric characters
    pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
    print(pattern3)

Hint

  • There are three components to the first pattern your regular expression needs to match: xxx, xxx, and xxxx. The first two are matched by \d{3}, while the last one is matched by \d{4}. Each of these components must be separated by a -.
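
One detail worth keeping in mind with these patterns: re.match() only requires the pattern to match at the start of the string, so trailing characters are ignored. The sketch below uses the dollar-amount pattern from the exercise and contrasts re.match() with re.fullmatch(), which requires the whole string to match. The test strings '$123.456' and '$.99' are made-up inputs for illustration, not part of the course exercise.

```python
import re

# The dollar-amount pattern from the exercise, written as a raw string
price_pattern = r'\$\d*\.\d{2}'

# re.match() anchors only at the start, so the extra trailing digit is ignored
loose = bool(re.match(price_pattern, '$123.456'))
print(loose)  # True

# re.fullmatch() requires the entire string to fit the pattern
strict = bool(re.fullmatch(price_pattern, '$123.456'))
print(strict)  # False

# The exercise string still matches in full
exact = bool(re.fullmatch(price_pattern, '$123.45'))
print(exact)  # True

# \d* allows zero digits, so an amount with no leading digits also matches
no_leading_digits = bool(re.fullmatch(price_pattern, '$.99'))
print(no_leading_digits)  # True
```

Whether the looser or stricter behavior is appropriate depends on how clean the input data is expected to be; when validating whole fields, re.fullmatch() is usually the safer choice.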

If you want to learn more from this course, here is the link.

Check out DataCamp's Python String Tutorial.
