Skip to Main Content

Fuzzy Merge in Stata: Matching Fuzzy Text/String using Stata

This tutorial provides a step-by-step guide to conduct fuzzy matching using Stata

Matching Fuzzy Text/String using Stata

1. Introduction

When we merge two datasets, we usually have at least one key (or common) variable in each dataset that we use as an identifier. The observations of the key variable in each dataset usually match precisely. For example, if we want to match two datasets based on the key variable country_name, the country names of each of the datasets should match exactly with each other. However, this may not always happen in practice. For example, we may have "United States" in one dataset and "United States of America" in another. We may use the fuzzy match / fuzzy merge technique in that case.

In short, we use fuzzy merge when the strings of the key variables in two datasets do not match exactly.

We use either reclink or matchit commands of Stata to conduct fuzzy merge. Both of the commands are useful for fuzzy merge. However, they differ in terms of functionalities. 

2. Fuzzy Merge using "reclink"

The reclink function matches observations between two datasets without perfect key identifying variables. For example, in dataset 1, the key variable "Name" may have "Princeton University", whereas in dataset 2, the key variable "Name" may have "Princeton U". The reclink function helps us to merge the two datasets by using a matching algorithm for these types of dissimilar strings.

Below, we will show step-by-step how to use the reclink function to match two datasets with key variables containing dissimilar strings (e.g., " Princeton University" and " Princeton  U").

Step 1: open a new do file and set the working directory. 

cd "path_of_your_directory\FolderName"

Notes: here are the codes we used to set our working directory:

cd "C:\Users\alamin\Desktop\DSS\FuzzyMerge"

Step 2: install the "reclink" package in Stata. Use the following codes:

ssc install reclink

Step 3: import our first example dataset stored in Excel in the DSS repository. Use the following codes:

import excel using "https://dss.princeton.edu/training/sp500.xlsx", firstrow sheet("sp500") clear

Notice that the dataset contains two columns: "Name" and "Volume". The "Name" column contains the names of different companies, which are recorded as strings. 

Step 4: generate an id column in the dataset. To do so, type:

gen id_sp500 = _n

Step 5: save the dataset in Stata (.dta) format. Type:

save sp500.dta, replace

Step 6: import the second example dataset (Excel) from the DSS repository. Type:

import excel using "https://dss.princeton.edu/training/nyse.xlsx", firstrow sheet("nyse") clear

Step 7: generate an id column in the dataset. To do so, type:

gen id_nyse = _n

Step 8: save the dataset in Stata (.dta) format. Type:

save nyse.dta, replace

Step 9: type the following codes to fuzzy merge two datasets.

reclink Name using sp500, idm(id_nyse) idu(id_sp500) gen(myscore)

Here:

reclink: is the reclink function

Name: the variable contains the dissimilar strings we want to match.  This variable name should be the same in the master and using the dataset.

idm(id_nyse): id name in the master dataset

idu(id_sp500): id name in the using dataset

gen (myscore): will generate a column named myscore that will give us the matching scores range between 0-1. 1 indicates perfect matching.  

After running the above codes, Stata gives us the following error message:

. reclink Name using sp500, idm(id_nyse) idu(id_sp500) gen(myscore)
112 perfect matches found
Going through 3141 observations to assess fuzzy matches, each .=5% complete
..) required
r(100);

The error message suggests that our data needs more basic cleaning before running the "reclink" command. It says, somewhere in our master dataset, about 10% of the way through (use the dots to get the range), we have a "(" in our matching variable, but no ")".  To find the cases where we have this problem, type:

list Name if strpos(Name, "(") & (strpos(Name, ")")==0)

Note: Name is the matching variable in the master and using dataset.

Stata will give us the following results:

. list Name if strpos(Name, "(") & (strpos(Name, ")")==0)
      +----------------------------------------------------------------+
      |                                                           Name |
      |----------------------------------------------------------------|
 461. | BlackRock New York Investment Quality Municipal Trust Inc. (Th |
 605. |                           CBX (Listing Market NYSE Networks AE |
      +----------------------------------------------------------------+

The message tells us that the closing brackets ")" are missing in rows 461 and 605 in the column Name.

To correct this error, use the following codes:

replace Name = "BlackRock New York Investment Quality Municipal Trust Inc. (Th)" in 461
replace Name = "CBX (Listing Market NYSE Networks AE)" in 605

Stata will give us the following message:

. replace Name = "BlackRock New York Investment Quality Municipal Trust Inc. (Th)" in 461
variable Name was str62 now str63
(1 real change made)
. replace Name = "CBX (Listing Market NYSE Networks AE)" in 605
(1 real change made)

We are now ready to rerun the reclink command as our data is adequately cleaned. To rerun the reclink command, type:

reclink Name using sp500, idm(id_nyse) idu(id_sp500) gen(myscore) 

Stata will give us the following message:

. reclink Name using sp500, idm(id_nyse) idu(id_sp500) gen(myscore) 
112 perfect matches found
Going through 3141 observations to assess fuzzy matches, each .=5% complete
....................
Added: id_sp500= identifier from sp500   myscore = matching score
Observations:  Master N = 3253    sp500 N= 504 
  Unique Master Cases: matched = 1334 (exact = 112), unmatched = 1919

The above message tells us that the reclink command successfully finished the fuzzy merge task.  The command has matched 1334 cells of the Name variable, where 112 are exact matches. 

A part of the fuzzy-matched dataset looks as follows:

Notice that the company names in the Name and UName columns are the same where myscore is = 1 (for example, see rows 20 and 26). 

You will have to manually inspect at what myscore level you get your desired matching. Some researcher even finds that 0.60 myscore provides the desired matching.

After manual inspection, if you think that the myscore of greater than 0.75 provides a satisfactory match, drop the observations with myscore values lower than 0.75. To do so, type:

keep if myscore >= 0.75

Step 10: to save the dataset, type:

save mergedData1.dta, replace

Note: Fuzzy matching can produce a lot of incorrect matches. Therefore, it is always advised to inspect the results manually.

3. Fuzzy Merge using "matchit"

The matchit command is similar to reclink. However, the advantage of it is that it allows selecting from a larger variety of matching algorithms, and it also allows the use of string weights.

Below, we will show step-by-step how to use the matchit function to match two datasets with key variables containing dissimilar strings (e.g., " Princeton University" and " Princeton  U").

Step 1: set the working directory. 

cd "path_of_your_directory\FolderName"

Note: here are the codes we used to set our working directory:

cd "C:\Users\alamin\Desktop\DSS\FuzzyMerge"

Step 2: install the "matchit" and "freqindex" functions that are necessary for fuzzy matching. To install, use the following codes:

ssc install matchit
ssc install freqindex

Step 3: import our first example dataset stored in Excel in the DSS repository. Use the following codes:

import excel using "https://dss.princeton.edu/training/sp500.xlsx", firstrow sheet("sp500") clear

Notice that the dataset contains two columns: "Name" and "Volume". The "Name" column contains the names of different companies, which are recorded as strings.  

Step 4: generate an id column in the dataset. To do so, type:

gen id_sp500 = _n

Step 5: save the dataset in Stata (.dta) format. Type:

save sp500.dta, replace

Step 6: import the second example dataset (Excel) from the DSS repository. Type:

import excel using "https://dss.princeton.edu/training/nyse.xlsx", firstrow sheet("nyse") clear

Step 7: generate an id column in the dataset. To do so, type:

gen id_nyse = _n

Step 8: save the dataset in Stata (.dta) format. Type:

save nyse.dta, replace

Step 9: type the following codes to fuzzy merge the two datasets.

matchit id_nyse Name using sp500.dta, idu(id_sp500) txtu(Name)

Stata will give us the following message:

. matchit id_nyse Name using sp500.dta, idu(id_sp500) txtu(Name)
Matching the current dataset with sp500.dta
Similarity function: bigram
Loading USING file: sp500.dta
Indexing USING file.
0%
20%
40%
60%
80%
Done!
Computing results
        Percent completed ...   (search space saved by index so far)
        20%               ...   (24%)
        40%               ...   (24%)
        60%               ...   (24%)
        80%               ...   (24%)
        Done!
Total search space saved by index: 24%

 

The first few rows of the fuzzy-matched data look as follows:

After some manual inspection, it seems that the data did not match well. Let's keep only that matches >.75. Type:

keep if similscore>0.75

Note that the researchers have to decide carefully about the cut-off point of the similscore. Some researchers even find that the similscore of 0.65 provides a good match. 

The new data looks as follows:

The above table indicates a better match.

The matchit command does not merge the two datasets (nyse.dta and sp500.dta) automatically. To merge the two datasets, use the following codes: 

joinby id_nyse using nyse.dta
joinby id_sp500 using sp500.dta

The new dataset looks as follows:

Step 10: save the merged dataset. Type: 

save mergedData2.dta, replace

Note: Fuzzy matching can produce a lot of incorrect matches. Therefore, it is always advised to inspect the results manually.

4. Useful Resources

Data Consultant

Profile Photo
Muhammad Al Amin
He/Him/His
Contact:
Firestone Library, A-12-F.1
609-258-6051

Data Consultant

Profile Photo
Yufei Qin
Contact:
Firestone Library, A.12F.2
6092582519

Comments or Questions?

If you have questions or comments about this guide or method, please email data@Princeton.edu.