Fuzzy Match with Power Query - Excel University (original) (raw)
A few years ago, I wrote a post about how to perform fuzzy lookups using the “Fuzzy Lookup Add-In for Excel.” However, this capability is now available in Power Query!! Wait, what? Yes! Performing a fuzzy match is so much easier and far more intuitive in Power Query. This post walks through the basics … and a huge THANK YOU to the Microsoft developers for this feature 🙂
Objective
Before we get to the mechanics, let’s back up and understand what we are trying to do and what a fuzzy match is.
Generally speaking, Excel lookup functions expect that the lookup values match on both lists. The case can be different (eg, “Microsoft” and “microsoft” would match), but generally, the idea is that the text strings need to be the same. There are some exceptions to this as noted below, but generally, the idea is that the lookup values are the same between both lists.
When the lookup values are different, for example ABC Company and ABC Company, Inc., traditional lookup functions aren’t always reliable. Now, to us humans, these two names represent the same underlying company. But to Excel, these represents different lookup values. The idea with a fuzzy lookup is to enable Excel to match them.
For example, here is a table (named CompanyName) with some company names:
We would like Excel to match them to this table (named CompanyList):
You’ll notice there are some differences: ABC Company vs ABC Company, Inc., Micro soft vs Microsoft, and Xeon vs Xeon Products.
With a traditional lookup function such as VLOOKUP, they don’t match:
With Power Query’s fuzzy merge option, they do:
Now, let’s get to the mechanics.
Note: some lookup functions provide an option for “approximate” match. For example, when you set the 4th argument of VLOOKUP to TRUE, or the 3rd argument of MATCH to 1 or -1, it enables us to perform range lookups. And, depending on the data set, lookup values, and sort order, this may or may not provide the expected approximate match desired. Additionally, some lookup functions support the use of a wildcard which has helped in some situations as well. But, we can now use the built-in fuzzy merge option in Power Query to perform a true fuzzy match 🙂
Details
We will walk through this process in three steps:
- Import List 1
- Import List 2
- Fuzzy merge
Let’s do this thing.
Note: depending on when you are reading this and your version of Excel, you may not have the fuzzy merge option, which was recently released via O365 subscription (Office Insiders update channel).
Import List 1
We begin by getting our first list (CompanyName) into Power Query by selecting the table (or other data source) and using the Data > From Table/Range command. A preview is displayed in the Power Query Editor:
Next, we Home > Close & Load To … a connection only query … and no need to load it into the data model:
Now, it is time to get our second list.
Import List 2
We basically rinse and repeat to import the lookup table (CompanyList). We again Close & Load To … and create a connection only query.
At this point, we should see both queries listed in the Queries & Connections pane on the right side of the Excel window:
With our first and second tables in Power Query, we can now perform the fuzzy merge.
Fuzzy merge
To start this step, from within Excel we select Data > Get Data > Combine Queries > Merge. The Merge dialog appears, and we select our first list (CompanyName) and our second list (CompanyList). We also identify the lookup columns by clicking the column header from both, as shown below:
You’ll notice the bottom of the screenshot says, “The selection matches 0 of 3 rows from the first table.” This is because ABC Company is not the same as ABC Company, Inc., and so on.
But … and here is the good part … you’ll notice a checkbox called Use fuzzy matching to perform the merge.
Check it and … interesting! We now have 2 of 3 rows that match:
Wow! Now let’s see which items matched by clicking OK.
Note: don’t worry, we’ll explore Fuzzy merge options shortly.
Back in the Power Query Editor, we Expand the CompanyList Table field and we can see the matches below:
Our first two companies match … awesome!
Can we get our third company, Xeon, to match as well? Time to explore a few fuzzy merge options. Click the gear icon on the Source applied step and expand the Fuzzy merge options:
- The Similarity threshold field allows us to specify how similar the two values must be to make a match. The range is 0 to 1, and you can set as desired based on your data.
- The Ignore case is checked by default (so MICROSOFT will match microsoft).
- The Match by combining text parts is checked by default (so Micro soft would match Microsoft).
- The Maximum number of matches allows us to set how many matches are returned for each lookup value.
- The Transformation table allows us to specify a separate “mapping table” that increases reliability for lookups we perform frequently (more on this below).
Let’s change the Similarity threshold to 0.5, and bam … 3 of 3 items now match:
We click OK and … yes … it worked!
We Close & Load To … a Table, and the results are now back in Excel:
And that, my friends, is a fuzzy lookup in Excel!
Now, one last note. If we perform a similar lookup on similar lists over and over, we can increase the reliability of our match by incorporating a transformation “mapping” table.
Transformation Table
To create a Transformation Table, you create a new table with a From and To column, as shown below:
Then, you pull it into Power Query (Data > From Table/Range) and save the query as a connection-only query as we did previously.
Now, we open the Merge dialog by editing the fuzzy merge query (double-click the merge query in the Queries & Connections pane). Once again, click the gear icon in the Source applied step. Expand Fuzzy merge options and select your table in the Transformation table drop-down:
Think of this table as providing a reliable match for known differences, and if a value can’t be found in it, then Power Query uses the fuzzy match logic.
Well, for me, this is a very exciting capability in Power Query as it solves a common issue that has historically been hard to solve. Now that it is built-in to Power Query, it is easy to use and will really help in practice. Again, thanks Microsoft for this gift 🙂
Sample File: FuzzyPQ.xlsx