Using Regular Expressions to Clean Strings

This tutorial draws on material from DataCamp's Cleaning Data in Python course and lets you practice cleaning strings with regular expressions.
Updated Sep 2018  · 4 min read

Extracting Numerical Values from Strings

Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'.

It would be useful to extract the 6 and the 2 from this string and save them for later use, for example to compute the ratio of strawberries to bananas.

To extract multiple numbers (or, more precisely, multiple pattern matches) with a regular expression, you can use the re.findall() function. Dan did not discuss this in the video, but it is straightforward to use: you pass a pattern and a string to re.findall(), and it returns a list of all the matches.
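As a minimal sketch of the idea above, the snippet below runs re.findall() on the example string and converts the matches to integers so they can be compared. It uses the pattern \d+ (one or more digits), which the exercise that follows explains. The variable names (text, strawberries, bananas, ratio) are illustrative assumptions, not part of the course material.

```python
import re

text = 'the recipe calls for 6 strawberries and 2 bananas'

# re.findall() returns every non-overlapping match as a list of strings
matches = re.findall(r'\d+', text)
print(matches)  # ['6', '2']

# The matches are strings, so convert them before doing arithmetic
strawberries, bananas = (int(m) for m in matches)
ratio = strawberries / bananas
print(ratio)  # 3.0
```

Note that re.findall() always returns strings, so a conversion step like int() is needed before any numeric comparison.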

Instructions

  • Import re.
  • Write a pattern that will find all the numbers in the following string: 'the recipe calls for 10 strawberries and 1 banana'. To do this:
    • Use the re.findall() function and pass it two arguments: the pattern, followed by the string.
    • \d is the pattern required to find digits. This should be followed with a + so that the previous element is matched one or more times. This ensures that 10 is viewed as one number and not as 1 and 0.
  • Print the matches to confirm that your regular expression found the values 10 and 1.
Exercise template:

    # Import the regular expression module
    ____

    # Find the numeric values: matches
    matches = re.findall('____', '____')

    # Print the matches
    print(____)

Solution:

    # Import the regular expression module
    import re

    # Find the numeric values: matches
    # (a raw string keeps Python from treating \d as a string escape)
    matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

    # Print the matches
    print(matches)

Hints:
  • Use the command import x to import the module x.
  • The first argument to re.findall() should be \d+, and the second argument should be the string: 'the recipe calls for 10 strawberries and 1 banana'.
  • Use the provided print() function to print matches.
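
To see why the + quantifier matters, here is a short sketch comparing \d with \d+ on the exercise string. The variable names single_digits and whole_numbers are chosen here for illustration only.

```python
import re

text = 'the recipe calls for 10 strawberries and 1 banana'

# Without the +, each digit is matched on its own
single_digits = re.findall(r'\d', text)
print(single_digits)  # ['1', '0', '1']

# With the +, consecutive digits are grouped into a single match
whole_numbers = re.findall(r'\d+', text)
print(whole_numbers)  # ['10', '1']
```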

If that makes sense, keep going to the next exercise! If not, here is an overview video.

Overview Video on Using Regular Expressions to Clean Strings in Python

Pattern Matching

In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

Instructions

  • Write patterns to match:
    • A telephone number of the format xxx-xxx-xxxx. You already did this in a previous exercise.
    • A string of the format: A dollar sign, an arbitrary number of digits, a decimal point, 2 digits.
      • Use \$ to match the dollar sign, \d* to match an arbitrary number of digits, \. to match the decimal point, and \d{x} to match x number of digits.
    • A capital letter, followed by an arbitrary number of alphanumeric characters.
      • Use [A-Z] to match any capital letter, followed by \w* to match an arbitrary number of alphanumeric characters.
Exercise template:

    import re

    # Write the first pattern
    pattern1 = bool(re.match(pattern='____', string='123-456-7890'))
    print(pattern1)

    # Write the second pattern
    pattern2 = bool(re.match(pattern='____', string='$123.45'))
    print(pattern2)

    # Write the third pattern
    pattern3 = bool(re.match(pattern='____', string='Australia'))
    print(pattern3)

Solution:

    import re

    # Write the first pattern: three digits, three digits, four digits, separated by hyphens
    pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
    print(pattern1)

    # Write the second pattern: dollar sign, any number of digits, decimal point, two digits
    pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
    print(pattern2)

    # Write the third pattern: a capital letter followed by any number of word characters
    pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
    print(pattern3)

Hint:
  • There are three components to the first pattern your regular expression needs to match: xxx, xxx, and xxxx. The first two are matched by \d{3}, while the last one is matched by \d{4}. Each of these components must be separated by a -.
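
One caveat worth illustrating: re.match() only anchors the pattern at the start of the string, so extra trailing characters do not prevent a match. The sketch below contrasts it with re.fullmatch(), which requires the whole string to match. The test strings '$123.456', 'Australia_2', and 'australia' are made-up examples, not part of the exercise.

```python
import re

price_pattern = r'\$\d*\.\d{2}'

# re.match() succeeds as long as the start of the string fits the pattern
loose = bool(re.match(price_pattern, '$123.456'))
print(loose)  # True

# re.fullmatch() requires the entire string to fit the pattern
strict = bool(re.fullmatch(price_pattern, '$123.456'))
print(strict)  # False
exact = bool(re.fullmatch(price_pattern, '$123.45'))
print(exact)  # True

# \w matches letters, digits, and underscores
with_extras = bool(re.fullmatch(r'[A-Z]\w*', 'Australia_2'))
print(with_extras)  # True

# The first character must be a capital letter
lowercase = bool(re.fullmatch(r'[A-Z]\w*', 'australia'))
print(lowercase)  # False
```

When validating that a whole field has a given format during data cleaning, re.fullmatch() (or an explicit $ at the end of the pattern) is usually the safer choice.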

