Tutorials
python

Fuzzy String Matching in Python

In this tutorial, you will learn how to approximately match strings and determine how similar they are by going over various examples.

Have you ever wanted to compare strings that were referring to the same thing, but they were written slightly different, had typos or were misspelled? This is a problem that you can encounter in a variety of situations ranging from spelling check software to mapping databases that lack a common key (think of trying to join two tables by company name, and these appear differently in both tables).

For this tutorial, I will focus on a case study in which the database problem mentioned above was addressed. However, before we start, it would be beneficial to show how we can fuzzy match strings. Normally, when you compare strings in Python you can do the following:

Str1 = "Apple Inc."
Str2 = "Apple Inc."
Result = Str1 == Str2
print(Result)
True

In this case, the variable Result will print True since the strings are an exact match (100% similarity), but see what happens if the case of Str2 changes:

Str1 = "Apple Inc."
Str2 = "apple Inc."
Result = Str1 == Str2
print(Result)
False

This time we got a False even though the strings look quite the same to the human eye. However, this problem is very straightforward to solve as we can coerce both strings to lower case:

Str1 = "Apple Inc."
Str2 = "apple Inc."
Result = Str1.lower() == Str2.lower()
print(Result)
True

Nevertheless, this happiness ends as soon as a character is off. For example:

Str1 = "Apple Inc."
Str2 = "apple Inc"
Result = Str1.lower() == Str2.lower()
print(Result)
False

Situations like the one above can, at times, appear on databases that have been created based on human data entry and in these cases we need more powerful tools to compare strings. One of these tools is called the Levenshtein distance.

The Levenshtein Distance

The Levenshtein distance is a metric to measure how apart are two sequences of words. In other words, it measures the minimum number of edits that you need to do to change a one-word sequence into the other. These edits can be insertions, deletions or substitutions. This metric was named after Vladimir Levenshtein, who originally considered it in 1965.

The formal definition of the Levenshtein distance between two strings $a$ and $b$ can be seen as follows:

Where $1_{(a_i \neq b_j)}$ denotes 0 when $a = b$ and 1 otherwise. It is important to note that the rows on the minimum above correspond to a deletion, an insertion, and a substitution in that order.

It is also possible to calculate the Levenshtein similarity ratio based on the Levenshtein distance. This can be done using the following formula:

$$\frac{(|a| + |b|) - {lev}_{a,b}(i,j)}{|a| + |b|}$$

where $|a|$ and $|b|$ are the lengths of sequence $a$ and sequence $b$ respectively.

Below you can see how such a formula can be implemented from scratch using a Python function:

import numpy as np
def levenshtein_ratio_and_distance(s, t, ratio_calc = False):
    """ levenshtein_ratio_and_distance:
        Calculates levenshtein distance between two strings.
        If ratio_calc = True, the function computes the
        levenshtein distance ratio of similarity between two strings
        For all i and j, distance[i,j] will contain the Levenshtein
        distance between the first i characters of s and the
        first j characters of t
    """
    # Initialize matrix of zeros
    rows = len(s)+1
    cols = len(t)+1
    distance = np.zeros((rows,cols),dtype = int)

    # Populate matrix of zeros with the indeces of each character of both strings
    for i in range(1, rows):
        for k in range(1,cols):
            distance[i][0] = i
            distance[0][k] = k

    # Iterate over the matrix to compute the cost of deletions,insertions and/or substitutions    
    for col in range(1, cols):
        for row in range(1, rows):
            if s[row-1] == t[col-1]:
                cost = 0 # If the characters are the same in the two strings in a given position [i,j] then the cost is 0
            else:
                # In order to align the results with those of the Python Levenshtein package, if we choose to calculate the ratio
                # the cost of a substitution is 2. If we calculate just distance, then the cost of a substitution is 1.
                if ratio_calc == True:
                    cost = 2
                else:
                    cost = 1
            distance[row][col] = min(distance[row-1][col] + 1,      # Cost of deletions
                                 distance[row][col-1] + 1,          # Cost of insertions
                                 distance[row-1][col-1] + cost)     # Cost of substitutions
    if ratio_calc == True:
        # Computation of the Levenshtein Distance Ratio
        Ratio = ((len(s)+len(t)) - distance[row][col]) / (len(s)+len(t))
        return Ratio
    else:
        # print(distance) # Uncomment if you want to see the matrix showing how the algorithm computes the cost of deletions,
        # insertions and/or substitutions
        # This is the minimum number of edits needed to convert string a to string b
        return "The strings are {} edits away".format(distance[row][col])

If we apply this function to the earlier example where we where tryng to compare "Apple Inc." to "apple Inc" we will see that these two strinsg are very likley the same since the Levenshtein distance is very small.

Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1,Str2)
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1,Str2,ratio_calc = True)
print(Ratio)
The strings are 2 edits away
0.8421052631578947

As you can see, the function found the 2 differences between the two strings. These were the upper/lower case a and the full stop (period) at the end of the first string as well as a similarity ratio of 84%, which is pretty high. Before moving on further though, I would like to highlight an important notion. If you do very simple string preprocessing before calculating distance you will see the following:

Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower())
print(Distance)
Ratio = levenshtein_ratio_and_distance(Str1.lower(),Str2.lower(),ratio_calc = True)
print(Ratio)
The strings are 1 edits away
0.9473684210526315

Just like that, the distance has been reduced by 1 simply by turning the strings to lower case before comparing and the similarity ratio to almost 95%. This emphasizes the relevance of string preprocessing before performing calculations. If you were, say, choosing if a string is similar to another one based on a similarity threshold of 90%, then "Apple Inc." and "apple Inc" without preprocessing would be marked as not similar.

Even though the example above is a valid way of implementing a function to calculate Levenshtein distance, there is a simpler alternative in Python in the form of the Levenshtein package. The Levenshtein package contains two functions that do the same as the user-defined function above. An example is shown below.

import Levenshtein as lev
Str1 = "Apple Inc."
Str2 = "apple Inc"
Distance = lev.distance(Str1.lower(),Str2.lower()),
print(Distance)
Ratio = lev.ratio(Str1.lower(),Str2.lower())
print(Ratio)
(1,)
0.9473684210526315
Note: You may have noticed that in the user-defined function above there is a comment mentioning that, when the similarity ratio is calculated, the cost of a substitution is 2 instead of 1. That is because lev.ratio() assigns such a cost for a substitution (Think about a substitution as having to do a deletion and an insertion at the same time). However, it is important to note that a substitution has a cost of 1 in lev.distance(). Thus, it is important to keep this in mind if you try to verify the function's outputs manually.

The FuzzyWuzzy Package

This package may have a funny name, but it can be your best friend when the standard Levenshtein distance ratio of similarity between two strings falls short. So far the example that I have been using with "Apple Inc." and "apple Inc" has been relatively simple. After all, there is just one full stop/period of difference if you turn both strings to lower case. However, what happens when something is spelled out of order? What happens when something has considerable spelling variation, but yet it refers to the same thing? That's where the FuzzyWuzzy package comes in since it has functions that allow our fuzzy matching scripts to handle these sorts of cases.

Let's start simple. FuzzyWuzzy has, just like the Levenshtein package, a ratio function that computes the standard Levenshtein distance similarity ratio between two sequences. You can see an example below:

from fuzzywuzzy import fuzz
Str1 = "Apple Inc."
Str2 = "apple Inc"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
print(Ratio)
95

That ratio of similarity is the same as we expected given the other examples above. However, fuzzywuzzy has more powerful functions that allow us to deal with more complex situations such as substring matching. Here is an example:

Str1 = "Los Angeles Lakers"
Str2 = "Lakers"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
print(Ratio)
print(Partial_Ratio)
50
100

fuzz.partial_ratio() is capable of detecting that both strings are referring to the Lakers. Thus, it yields 100% similarity. The way this works is by using an "optimal partial" logic. In other words, if the short string has length $k$ and the longer string has the length $m$, then the algorithm seeks the score of the best matching length-$k$ substring.

Nevertheless, this approach is not foolproof. What happens when the strings comparison the same, but they are in a different order? Luckily for us, fuzzywuzzy has a solution. You can see the example below:

Str1 = "united states v. nixon"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
59
74
100

The fuzz.token functions have an important advantage over ratio and partial_ratio. They tokenize the strings and preprocess them by turning them to lower case and getting rid of punctuation. In the case of fuzz.token_sort_ratio(), the string tokens get sorted alphabetically and then joined together. After that, a simple fuzz.ratio() is applied to obtain the similarity percentage. This allows cases such as court cases in this example to be marked as being the same.

Still, what happens if these two strings are of widely differing lengths? Thats where fuzz.token_set_ratio() comes in. Here is an example:

Str1 = "The supreme court case of Nixon vs The United States"
Str2 = "Nixon v. United States"
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)
Token_Set_Ratio = fuzz.token_set_ratio(Str1,Str2)
print(Ratio)
print(Partial_Ratio)
print(Token_Sort_Ratio)
print(Token_Set_Ratio)
57
77
58
95

95% similarity is that magic? No, it's just string preprocessing under the hood. In particular, fuzz.token_set_ratio() takes a more flexible approach than fuzz.token_sort_ratio(). Instead of just tokenizing the strings, sorting and then pasting the tokens back together, token_set_ratio performs a set operation that takes out the common tokens (the intersection) and then makes fuzz.ratio() pairwise comparisons between the following new strings:

  • s1 = Sorted_tokens_in_intersection
  • s2 = Sorted_tokens_in_intersection + sorted_rest_of_str1_tokens
  • s3 = Sorted_tokens_in_intersection + sorted_rest_of_str2_tokens

The logic behind these comparisons is that since Sorted_tokens_in_intersection is always the same, the score will tend to go up as these words make up a larger chunk of the original strings or the remaining tokens are closer to each other.

Finally, the fuzzywuzzy package has a module called process that allows you to calculate the string with the highest similarity out of a vector of strings. You can see how this works below:

from fuzzywuzzy import process
str2Match = "apple inc"
strOptions = ["Apple Inc.","apple park","apple incorporated","iphone"]
Ratios = process.extract(str2Match,strOptions)
print(Ratios)
# You can also select the string with the highest matching percentage
highest = process.extractOne(str2Match,strOptions)
print(highest)
[('Apple Inc.', 100), ('apple incorporated', 90), ('apple park', 67), ('iphone', 30)]
('Apple Inc.', 100)

In this tutorial, you have learned how to approximately match strings and determine how similar they are. The examples presented here may be simple, but they are enough to illustrate how to handle various cases of what a computer thinks are mismatching strings. I have approached this tutorial based on a case in which I had to use fuzzy string matching to map manually entered company names to the account names present in my employer's Salesforce CRM ("Apple Inc." to "apple inc" was actually one of the mappings). However, the usefulness of this technique does not end up here. There are applications in fields such as spell checking or bioinformatics to match DNA sequences, so then there is definitely more uses for fuzzy matching.

I hope you have thoroughly enjoyed the tutorial, and that you have learned from it. Keep learning; the sky is the limit!

If you want to learn more about Python functions, take DataCamp's Python Data Science Toolbox (Part 1) course.

Want to leave a comment?