
Using Regular Expressions to Clean Strings

This tutorial draws on material from DataCamp's Cleaning Data in Python course and walks you through cleaning strings with regular expressions.


Extracting numerical values from strings

Extracting numbers from strings is a common task, particularly when working with unstructured data or log files.

Say you have the following string: 'the recipe calls for 6 strawberries and 2 bananas'.

It would be useful to extract the 6 and the 2 from this string and save them for later use, for example to compute the ratio of strawberries to bananas.

When using a regular expression to extract multiple numbers (or multiple pattern matches, to be exact), you can use the re.findall() function. Dan did not discuss this function in the video, but it is straightforward to use: you pass a pattern and a string to re.findall(), and it returns a list of the matches as strings.
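As a minimal sketch of this idea, the snippet below applies re.findall() to the recipe string from above. The pattern \d+ (a run of one or more digits) and the variable names are illustrative choices, and the conversion to int is an added step needed before any arithmetic.

```python
import re

recipe = 'the recipe calls for 6 strawberries and 2 bananas'

# re.findall() returns every non-overlapping match as a list of strings.
matches = re.findall(r'\d+', recipe)
print(matches)  # ['6', '2']

# The matches are strings, so convert them to integers before doing arithmetic.
strawberries, bananas = (int(m) for m in matches)
ratio = strawberries / bananas
print(ratio)  # 3.0
```

Writing the pattern as a raw string (the r prefix) keeps Python from interpreting the backslash before the regular expression engine sees it.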

Instructions

  • Import re.
  • Write a pattern that will find all the numbers in the following string: 'the recipe calls for 10 strawberries and 1 banana'. To do this:
    • Use the re.findall() function and pass it two arguments: the pattern, followed by the string.
    • \d is the pattern required to find digits. This should be followed with a + so that the previous element is matched one or more times. This ensures that 10 is viewed as one number and not as 1 and 0.
  • Print the matches to confirm that your regular expression found the values 10 and 1.

Exercise template:

    # Import the regular expression module
    ____

    # Find the numeric values: matches
    matches = re.findall('____', '____')

    # Print the matches
    print(____)

Solution:

    # Import the regular expression module
    import re

    # Find the numeric values: matches
    matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')

    # Print the matches
    print(matches)

Hints

  • Use the command import x to import the module x.
  • The first argument to re.findall() should be \d+, and the second argument should be the string: 'the recipe calls for 10 strawberries and 1 banana'.
  • Use the provided print() function to print matches.
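
To see why the + matters, the following short comparison runs the exercise string through both \d and \d+. It is only an illustration; the variable names single_digits and whole_numbers are not part of the exercise.

```python
import re

text = 'the recipe calls for 10 strawberries and 1 banana'

# Without +, each digit is matched on its own, so 10 is split into 1 and 0.
single_digits = re.findall(r'\d', text)
print(single_digits)  # ['1', '0', '1']

# With +, consecutive digits are grouped into a single match.
whole_numbers = re.findall(r'\d+', text)
print(whole_numbers)  # ['10', '1']
```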

If that makes sense, keep going to the next exercise! If not, here is an overview video.

Overview video on using regular expressions to clean strings in Python.

Pattern matching

In this exercise, you'll continue practicing your regular expression skills. For each provided string, your job is to write the appropriate pattern to match it.

Instructions

  • Write patterns to match:
    • A telephone number of the format xxx-xxx-xxxx. You already did this in a previous exercise.
    • A string of the format: A dollar sign, an arbitrary number of digits, a decimal point, 2 digits.
      • Use \$ to match the dollar sign, \d* to match an arbitrary number of digits, \. to match the decimal point, and \d{x} to match exactly x digits.
    • A capital letter, followed by an arbitrary number of alphanumeric characters.
      • Use [A-Z] to match any capital letter, followed by \w* to match an arbitrary number of alphanumeric characters (\w also matches the underscore).
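
Before combining these building blocks, it can help to try each one in isolation. The sketch below does that with small test strings chosen for illustration; none of the strings or variable names come from the exercise itself.

```python
import re

# \$ is a literal dollar sign; an unescaped $ anchors the end of the string.
escaped_dollar = bool(re.match(r'\$', '$5'))   # True
bare_dollar = bool(re.match(r'$', '$5'))       # False

# \d* matches zero or more digits, so it can also match an empty string.
some_digits = re.match(r'\d*', '123abc').group()  # '123'
no_digits = re.match(r'\d*', 'abc').group()       # ''

# \d{2} requires exactly two digits.
two_digits = bool(re.match(r'\d{2}', '45'))    # True
one_digit = bool(re.match(r'\d{2}', '4'))      # False

# \. is a literal period; an unescaped . matches any character except a newline.
escaped_dot = bool(re.match(r'\.', 'x'))       # False
bare_dot = bool(re.match(r'.', 'x'))           # True

# \w covers letters, digits, and the underscore, but not punctuation.
word_chars = re.match(r'\w*', 'snake_case9!').group()  # 'snake_case9'

print(escaped_dollar, bare_dollar, some_digits, repr(no_digits))
print(two_digits, one_digit, escaped_dot, bare_dot, word_chars)
```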

Exercise template:

    import re

    # Write the first pattern
    pattern1 = bool(re.match(pattern='____', string='123-456-7890'))
    print(pattern1)

    # Write the second pattern
    pattern2 = bool(re.match(pattern='____', string='$123.45'))
    print(pattern2)

    # Write the third pattern
    pattern3 = bool(re.match(pattern='____', string='Australia'))
    print(pattern3)

Solution:

    import re

    # Write the first pattern
    pattern1 = bool(re.match(pattern=r'\d{3}-\d{3}-\d{4}', string='123-456-7890'))
    print(pattern1)

    # Write the second pattern
    pattern2 = bool(re.match(pattern=r'\$\d*\.\d{2}', string='$123.45'))
    print(pattern2)

    # Write the third pattern
    pattern3 = bool(re.match(pattern=r'[A-Z]\w*', string='Australia'))
    print(pattern3)

Hint

  • There are three components to the first pattern your regular expression needs to match: xxx, xxx, and xxxx. The first two are matched by \d{3}, while the last one is matched by \d{4}. Each of these components must be separated by a -.
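
One detail worth keeping in mind when validating strings this way: re.match() only anchors the pattern at the start of the string, so trailing characters are not rejected. The sketch below contrasts it with re.fullmatch(), which requires the whole string to match. The extra test strings and variable names are illustrative assumptions, not part of the exercise.

```python
import re

phone_pattern = r'\d{3}-\d{3}-\d{4}'

# A well-formed number matches with either function.
valid = bool(re.match(phone_pattern, '123-456-7890'))
print(valid)  # True

# re.match() ignores anything after the matched prefix.
trailing_match = bool(re.match(phone_pattern, '123-456-78901234'))
print(trailing_match)  # True

# re.fullmatch() succeeds only if the entire string fits the pattern.
trailing_fullmatch = bool(re.fullmatch(phone_pattern, '123-456-78901234'))
print(trailing_fullmatch)  # False

# Leading text makes re.match() fail, because it starts at position 0.
leading_match = bool(re.match(phone_pattern, 'call 123-456-7890'))
print(leading_match)  # False
```

For cleaning tasks where a column should contain nothing but the value, re.fullmatch() (or a pattern ending in $) is usually the safer check.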


If you want to learn more from this course, here is the link.
