
Fuzzy string matching

Files needed = ('gb_stats.csv', 'gb_salaries.csv')

Often we want to merge datasets based on strings, but we encounter strings that should be matches but that are not exact matches. For example, suppose American Express receives a charge from 'Frank L. Wright,' who pays with a credit card issued to 'Frank Lloyd Wright.' Is this a legitimate charge? It likely is, but a strict matching routine such as str.match() would fail.

This kind of problem comes up often when we are trying to combine data from different sources. Unless both data sets are using an agreed upon standard identifier, we may have problems. [The social security number and the ISO country code are examples of standardized identifiers.] Fuzzy matching is also the basis for spell checkers and optical character recognition.

We will work with the fuzzywuzzy package that implements fuzzy matching. Fuzzy matching techniques try to find likely matches among strings that are not identical. The goal is to measure the similarity between two strings and decide if the match is 'close enough.'

What we want is a methodology that measures how similar two strings are but is also robust to small changes. Let's take a look at two of them.

1. Jaccard Distance

Chances are, if you spent a couple of minutes thinking about how to do this, you would come up with something like the 'Jaccard distance.' It can be generalized to a distance measure for any two sets and uses the following formula:

$$ \begin{align*} Dist(A,B) &= \frac{\text{Number of Elements in Either Set But Not Shared by Both}}{\text{Total Number of Distinct Elements}}\\ &= 1 - \frac{|A\cap B|}{|A\cup B|}\\ &= \frac{|A\cup B| - |A\cap B|}{|A\cup B|} \end{align*} $$

So if two strings are similar, they share many of the same elements and have a small 'distance.' In my example above, 'Frank L. Wright' and 'Frank Lloyd Wright' would have a small distance (a high similarity) since the two share most of their distinct letters.

There's a problem with this measure, though. The Jaccard distance doesn't account for the number of words or their order. Consider, for example, anagrams like 'new york times' and 'monkeys write.' The Jaccard distance between anagrams is always zero (a 100 percent match), but we know the two strings refer to different things. We need a metric that remedies this.
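To make this concrete, here is a minimal sketch (a hand-rolled helper, not part of any library) that computes the Jaccard distance over the distinct characters of two strings:

def jaccard_distance(u, v):
    """Jaccard distance over the sets of distinct characters in u and v."""
    a, b = set(u), set(v)
    return 1 - len(a & b) / len(a | b)

# Similar strings have a small distance...
print(jaccard_distance('Frank L. Wright', 'Frank Lloyd Wright'))

# ...but anagrams have a distance of zero, which is the problem
print(jaccard_distance('new york times', 'monkeys write'))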

2. Levenshtein Distance

Invented by the Russian scientist Vladimir Levenshtein in the 1960s, the 'Levenshtein Distance' is the measure used by the fuzzywuzzy library. The metric is 'simply' the number of substitutions needed to transform a string u into another string v. When there are few substitutions, the similarity is high.

A substitution is defined as one of the following:

  1. Erasing a character (cost = 1)
  2. Adding a character (cost = 1)
  3. Replacing a character with another one (cost = 2, i.e., one erase and one add)

The Levenshtein distance is the minimum number of operations required to change string u into v.

u = Frank L. Wright vs v = Frank Lloyd Wright

  1. Replace '.' with 'l' (cost = 2)
  2. Add 'o', 'y', 'd' (cost = 1 each)

So the Levenshtein distance is 5.
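To fix ideas, here is a minimal dynamic-programming sketch of this calculation (my own implementation, using the costs defined above):

def lev(u, v):
    """Levenshtein distance with erase = 1, add = 1, replace = 2."""
    m, n = len(u), len(v)
    # dist[i][j] = cost of transforming u[:i] into v[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                               # erase i characters
    for j in range(n + 1):
        dist[0][j] = j                               # add j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if u[i-1] == v[j-1] else 2       # replace = one erase + one add
            dist[i][j] = min(dist[i-1][j] + 1,       # erase u[i-1]
                             dist[i][j-1] + 1,       # add v[j-1]
                             dist[i-1][j-1] + sub)   # match or replace
    return dist[m][n]

print(lev('Frank L. Wright', 'Frank Lloyd Wright'))   # 5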

Package installation

The fuzzywuzzy package was developed at SeatGeek to find event ticket listings on the web. It is not included in Anaconda, so we have to install it.

On Winstat, open an Anaconda prompt and use

pip install --user fuzzywuzzy

You can omit the --user if installing on your own computer.

import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

You likely received a pink-box warning. This is because we are using the pure-Python implementation of Levenshtein matching, which can be slow for big jobs. To speed things up, you would also install

pip install python-Levenshtein

which requires the (free) Microsoft Visual Studio C++ build tools. This implementation farms out some of the calculations to code written in C++, which is faster than Python for this kind of thing.

Winstat is not set up for this (yet) so we will stick with the slower implementation and keep our examples small enough that our code runs easily. If you are working on your own computer, you could try installing the faster package.

Let's see what we have.

fuzz.ratio("Frank L. Wright", "Frank Lloyd Wright")

What is this ratio? It is defined as

$$ \text{ratio}(u,v) = \frac{\text{len}(u)+\text{len}(v)-\text{lev}(u,v)}{\text{len}(u)+\text{len}(v)}\times 100 $$

In this example, we have

$$ \text{ratio}(u,v) = \frac{15+18-5}{15+18} \times 100 = 84.8$$

So it normalizes the Levenshtein score by the length of the strings we are comparing.
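We can check the formula against the library. Note that fuzz.ratio() returns a rounded integer, so 84.8 shows up as 85:

u = 'Frank L. Wright'
v = 'Frank Lloyd Wright'

# by hand: (15 + 18 - 5) / (15 + 18) * 100, using the Levenshtein distance of 5
print((len(u) + len(v) - 5) / (len(u) + len(v)) * 100)   # 84.84...

# the library rounds to the nearest integer
print(fuzz.ratio(u, v))                                  # 85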

print(fuzz.ratio("The Lord of the Rings II: The Two Towers",
           "The Lord of the Rings 2: the 2 Towers"))
print(fuzz.ratio("the Lord of the Rings II: The two Towers",
           "The Lord of the Rings 2: the 2 Towers"))

What just happened? The ratio() function is case sensitive.
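If case should not matter in your application, a common fix is to normalize the strings before scoring:

# lowercase both strings so case differences do not count against the match
print(fuzz.ratio('the Lord of the Rings II: The two Towers'.lower(),
           'The Lord of the Rings 2: the 2 Towers'.lower()))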

Other matching methods

The fuzzywuzzy library also provides modifications to the Levenshtein distance. Why do we need modifications? Well, there is no perfect way to fuzzy match (otherwise, it wouldn't be fuzzy), so different applications may call for different matching considerations: Do we care about 'stop words' (e.g., the, and, a, it)? Do we care about repetition? Do we care about word order? Your answers to these kinds of questions depend on what you are trying to achieve.

partial_ratio. The partial_ratio method calculates the fuzzywuzzy ratio between the shorter string and every substring of the longer string that is the same length as the shorter one, and then returns the highest score. For example:

s1 = 'hello world'
s2 = 'worl'

The length of the shorter string is 4, so we search all the 4-character substrings of s1. In this case, we would find worl in s1 (as part of world) and return 100.
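Conceptually, the search looks something like the following sketch (my own illustration; fuzzywuzzy's actual implementation is smarter about which substrings it tries, so its scores can differ slightly):

def partial_ratio_sketch(s1, s2):
    """Best fuzz.ratio between the shorter string and every
    equal-length substring of the longer string."""
    shorter, longer = sorted((s1, s2), key=len)
    n = len(shorter)
    return max(fuzz.ratio(shorter, longer[i:i + n])
               for i in range(len(longer) - n + 1))

print(partial_ratio_sketch('hello world', 'worl'))   # finds 'worl' inside 'world'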

Let's see it at work:

fuzz.partial_ratio('hello world', 'worl')
fuzz.partial_ratio("Superman Batman vs ", "Batman")

Let's compare the partial_ratio to the ratio. The strings

s1 = 'Peanut Butter and Jelly'
s2 = 'Jelly and Peanut Butter'

are probably close enough to be a match in most cases.

Which method does a better job of finding this match?

print('partial ratio score:', fuzz.partial_ratio("Peanut Butter and Jelly", "Jelly and Peanut Butter") )
print('ratio score:', fuzz.ratio("Peanut Butter and Jelly", "Jelly and Peanut Butter") )

But nothing is perfect. The strings

s1 = 'Peanut Butter and Jelly'
s2 = 'Otter and Hell'

are probably not a match in most cases. But...

# Otter and Hell matched with utter and Jelly
print('partial_ratio score:' ,fuzz.partial_ratio("Peanut Butter and Jelly", "Otter and Hell") )
print('ratio score:', fuzz.ratio("Peanut Butter and Jelly", "Otter and Hell") )

token_sort_ratio. token_sort_ratio breaks the strings into distinct words (tokenizes them), sorts them, and then computes the ratio. This method ignores the order in which words appear.

fuzz.token_sort_ratio("Batman vs Superman", "Superman vs Batman")
fuzz.token_sort_ratio("a b c", "c b a")

token_set_ratio. The token_set_ratio method separates each string into distinct words (tokenizes them), turns the lists of words into sets (discarding repeats), sorts them, and then computes the ratio. The net effect is that we ignore not only word order but also repetition.

print('token set ratio score:', fuzz.token_set_ratio("fun","fun fun fun"))
print('ratio score:', fuzz.ratio("fun","fun fun fun"))

Fuzzy merging

We have looked at how to score the similarity of two strings. Let's now apply this idea to something we might need to do in our data cleaning: merging two data sets on imperfectly matched keys.

We will use data from Pro Football Reference on the Green Bay Packers' roster. We have two data files that are "matched" on the player names, but I have inserted some errors into the names (maybe my autocorrect got out of hand).

In theory, each file should have the same players. This should be a perfect match. Let's see how close we can get.

money = pd.read_csv('gb_salaries.csv')
money.head(2)
stats = pd.read_csv('gb_stats.csv')
stats.head(2)

Notice that in the second observation, 'Montravius Adams' is spelled differently in the two files. There are several of these mistakes floating around.

An exact merge

Let's start with our standard merge.

merge_exact = pd.merge(left=money, right=stats, on='Player', how='outer', indicator=True)
merge_exact['_merge'].value_counts()

Only 32 matches! Let's take a look at the non-matches.

merge_exact[merge_exact['_merge']!='both']

We see Montravius did not get matched, as we could have predicted. I see Ty Summers vs Tyler Summers and a bunch of other typos.

A fuzzy merge

Here's our algorithm:

  1. For each player name in 'stats', find the most similar name in 'money'
  2. Save the most-similar names from 'money' into a column in 'stats'
  3. Use the most-similar names in 'stats' and the player names in 'money' as the merge keys

Steps 2 and 3 are straightforward. To complete step 1, we will write a function that takes a target name and a list of all the potential matches. Inside the function we will use the process.extractOne() method of fuzzywuzzy to extract the best match. This method computes a similarity score for the target and each potential match and returns the match with the highest score.

process.extractOne() can also be given a cutoff level (the minimum acceptable match score), and we can tell it which scorer to use: e.g., ratio, token_sort_ratio, etc.

def fuzzymatcher(target, choices, threshold):
    """
    target: the string we are trying to match
    choices: a series of potential matching strings
    threshold: the minimum score to be considered a match
    """
    return process.extractOne(target, choices, scorer=fuzz.token_set_ratio, score_cutoff=threshold)

Let's test it out on Montravius Adams. The money['Player'] Series contains all the possible matches in the 'money' DataFrame.

fuzzymatcher('Montravius Adams', money['Player'], 80)

That returned the best match (which is correct in this case), the score, and the index in money where the match was found.

money.iloc[1]

Apply the matcher to each observation

As usual, we use .apply() to execute fuzzymatcher() on each name in stats. The args argument is a tuple that holds the choices and threshold values.

stats['fuzz_res'] = stats['Player'].apply(fuzzymatcher, args=(money['Player'], 80))
stats.head(2)

Now we have a column of tuples returned by fuzzymatcher. Let's make that into three columns.

stats[['best_name', 'score', 'match_index']] = pd.DataFrame(stats['fuzz_res'].to_list(), index=stats.index)
stats.head(7)

Notice above that 'Hunter Bradley' did not match anything. 'David Batiari' matched with (the correct) 'David Bakhtiari' and (the correct) 'Montravius Adams' matched with 'Montravvius Adamses'.

Now the 'stats' DataFrame has a name variable that will match exactly to the names in 'money'. Let's merge.

merge_fuzz = pd.merge(left=stats, right=money, left_on='best_name', right_on='Player', how='outer', indicator=True)
merge_fuzz['_merge'].value_counts()

We matched 50 out of 53. Not bad. Let's see who we missed.

merge_fuzz[merge_fuzz['_merge']!='both'][['Player_x','Player_y']]

'Mason Cosby' might have a chance at matching if we lowered the cutoff threshold. I'm not sure we can ever get 'Hunter Bradley' to match with 'Harry Bradford' or 'Aaron Jones' to match with 'yaron Johnson'. Who are those guys?

Practice

Redo the fuzzy matches above changing the scorer type and the thresholds. How do the matches change? Can you improve on 50/53? Can you make things worse?