
Using Regular Expressions to Clean Strings

This tutorial draws on material from DataCamp's Cleaning Data in Python course and shows you how to clean strings using regular expressions.
Sep 2018  · 4 min read

If you want to take our free Intro to R course, here is the link.

Extracting Numerical Values from Strings

Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'.

It would be useful to extract the 6 and the 2 from this string so they can be saved for later use, for example to compare the ratio of strawberries to bananas.

When you need to extract multiple numbers (or, more precisely, multiple pattern matches) with a regular expression, you can use the re.findall() function. Dan did not discuss this in the video, but it is straightforward to use: you pass a pattern and a string to re.findall(), and it returns a list of the matches.
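As a minimal sketch of the strawberry-and-banana example above, the snippet below calls re.findall() on that string and converts the matches to integers. The variable names (recipe, strawberries, bananas, ratio) are illustrative and not part of the course material.

```python
import re

# The example sentence from the text above
recipe = 'the recipe calls for 6 strawberries and 2 bananas'

# re.findall() returns every non-overlapping match as a list of strings
matches = re.findall(r'\d+', recipe)
print(matches)  # ['6', '2']

# The matches are strings, so convert them before doing arithmetic
strawberries, bananas = (int(m) for m in matches)
ratio = strawberries / bananas
print(ratio)  # 3.0
```

Note that re.findall() always returns strings here, which is why the conversion with int() is needed before computing the ratio.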

Instructions

  • Import re.
  • Write a pattern that will find all the numbers in the following string: 'the recipe calls for 10 strawberries and 1 banana'. To do this:
    • Use the re.findall() function and pass it two arguments: the pattern, followed by the string.
    • \d is the pattern required to find digits. This should be followed with a + so that the previous element is matched one or more times. This ensures that 10 is viewed as one number and not as 1 and 0.
  • Print the matches to confirm that your regular expression found the values 10 and 1.
    # Import the regular expression module
    import re

    # Find the numeric values: matches
    matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

    # Print the matches
    print(matches)
  • Use the command import x to import the module x.
  • The first argument to re.findall() should be \d+, and the second argument should be the string: 'the recipe calls for 10 strawberries and 1 banana'.
  • Use the provided print() function to print matches.
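To see why the + matters, this short comparison runs the exercise string through \d alone and through \d+. It is only a sketch, and the variable names are illustrative.

```python
import re

text = 'the recipe calls for 10 strawberries and 1 banana'

# Without a quantifier, \d matches one digit at a time
single_digits = re.findall(r'\d', text)
print(single_digits)  # ['1', '0', '1']

# With +, consecutive digits are grouped into a single match
whole_numbers = re.findall(r'\d+', text)
print(whole_numbers)  # ['10', '1']
```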

If that makes sense, keep going to the next exercise! If not, here is an overview video.

Overview Video on Using Regular Expressions to Clean Strings in Python

Pattern Matching

In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

Instructions

  • Write patterns to match:
    • A telephone number of the format xxx-xxx-xxxx. You already did this in a previous exercise.
    • A string of the format: A dollar sign, an arbitrary number of digits, a decimal point, 2 digits.
      • Use \$ to match the dollar sign, \d* to match an arbitrary number of digits, \. to match the decimal point, and \d{x} to match x number of digits.
    • A capital letter, followed by an arbitrary number of alphanumeric characters.
      • Use [A-Z] to match any capital letter, followed by \w* to match an arbitrary number of alphanumeric characters (\w also matches the underscore).
    import re

    # Write the first pattern
    pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
    print(pattern1)

    # Write the second pattern
    pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
    print(pattern2)

    # Write the third pattern
    pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
    print(pattern3)
  • There are three components to the first pattern your regular expression needs to match: xxx, xxx, and xxxx. The first two are matched by \d{3}, while the last one is matched by \d{4}. Each of these components must be separated by a -.
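One detail worth keeping in mind when using these patterns for cleaning: re.match() only anchors the pattern at the start of the string, so any trailing characters after a match are ignored. The sketch below contrasts it with re.fullmatch(), which requires the entire string to match. The helper name is_phone and the extra test strings are hypothetical and not taken from the course.

```python
import re

PHONE_PATTERN = r'\d{3}-\d{3}-\d{4}'

def is_phone(value):
    """Return True only if the whole string has the form xxx-xxx-xxxx."""
    return re.fullmatch(PHONE_PATTERN, value) is not None

# re.match() succeeds because the string *starts* with a valid number
print(bool(re.match(PHONE_PATTERN, '123-456-78901')))  # True

# re.fullmatch() rejects it because of the extra trailing digit
print(is_phone('123-456-78901'))  # False
print(is_phone('123-456-7890'))   # True

# \d* allows zero digits, so '$.45' also satisfies the dollar pattern
print(bool(re.fullmatch(r'\$\d*\.\d{2}', '$.45')))  # True
```

If stricter validation is needed, swapping \d* for \d+ in the dollar pattern would require at least one digit before the decimal point.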

If you want to learn more from this course, here is the link.

Check out DataCamp's Python String Tutorial.
