When we merge two datasets, we usually have at least one key (or common) variable in each dataset that we use as an identifier. The observations of the key variable in each dataset usually match precisely. For example, if we want to match two datasets based on the key variable country_name, the country names of each of the datasets should match exactly with each other. However, this may not always happen in practice. For example, we may have "United States" in one dataset and "United States of America" in another. We may use the fuzzy match / fuzzy merge technique in that case.
In short, we use fuzzy merge when the strings of the key variables in two datasets do not match exactly.
We use either reclink or matchit commands of Stata to conduct fuzzy merge. Both of the commands are useful for fuzzy merge. However, they differ in terms of functionalities.
The reclink function matches observations between two datasets without perfect key identifying variables. For example, in dataset 1, the key variable "Name" may have "
Below, we will show step-by-step how to use the reclink function to match two datasets with key variables containing dissimilar strings (e.g., " Princeton University" and " Princeton U").
Step 1: open a new do file and set the working directory.
cd "path_of_your_directory\FolderName"
Notes: here are the codes we used to set our working directory:
cd "C:\Users\alamin\Desktop\DSS\FuzzyMerge"
Step 2: install the "reclink" package in Stata. Use the following codes:
ssc install reclink
Step 3: import our first example dataset stored in Excel in the DSS repository. Use the following codes:
import excel using "https://dss.princeton.edu/training/sp500.xlsx", firstrow sheet("sp500") clear
Notice that the dataset contains two columns: "Name" and "Volume". The "Name" column contains the names of different companies, which are recorded as strings.
Step 4: generate an id column in the dataset. To do so, type:
gen id_sp500 = _n
Step 5: save the dataset in Stata (.dta) format. Type:
save sp500.dta, replace
Step 6: import the second example dataset (Excel) from the DSS repository. Type:
import excel using "https://dss.princeton.edu/training/nyse.xlsx", firstrow sheet("nyse") clear
Step 7: generate an id column in the dataset. To do so, type:
gen id_nyse = _n
Step 8: save the dataset in Stata (.dta) format. Type:
save nyse.dta, replace
Step 9: type the following codes to fuzzy merge two datasets.
reclink Name using sp500, idm(id_nyse) idu(id_sp500) gen(myscore)
Here:
reclink: is the reclink function
Name: the variable contains the dissimilar strings we want to match. This variable name should be the same in the master and using the dataset.
idm(id_nyse): id name in the master dataset.
idu(id_sp500): id name in the using dataset.
gen (myscore): will generate a column named myscore that will give us the matching scores range between 0-1. 1 indicates perfect matching.
After running the above codes, Stata gives us the following error message:
The error message suggests that our data needs more basic cleaning before running the "reclink" command. It says, somewhere in our master dataset, about 10% of the way through (use the dots to get the range), we have a "(" in our matching variable, but no ")". To find the cases where we have this problem, type:
list Name if strpos(Name, "(") & (strpos(Name, ")")==0)
Note: Name is the matching variable in the master and using dataset.
Stata will give us the following results:
The message tells us that the closing brackets ")" are missing in rows 461 and 605 in the column Name.
To correct this error, use the following codes:
replace Name = "BlackRock New York Investment Quality Municipal Trust Inc. (Th)" in 461
replace Name = "CBX (Listing Market NYSE Networks AE)" in 605
Stata will give us the following message:
We are now ready to rerun the reclink command as our data is adequately cleaned. To rerun the reclink command, type:
reclink Name using sp500, idm(id_nyse) idu(id_sp500) gen(myscore)
Stata will give us the following message:
The above message tells us that the reclink command successfully finished the fuzzy merge task. The command has matched 1334 cells of the Name variable, where 112 are exact matches.
A part of the fuzzy-matched dataset looks as follows:
Notice that the company names in the Name and UName columns are the same where myscore is = 1 (for example, see rows 20 and 26).
You will have to manually inspect at what myscore level you get your desired matching. Some researcher even finds that 0.60 myscore provides the desired matching.
After manual inspection, if you think that the myscore of greater than 0.75 provides a satisfactory match, drop the observations with myscore values lower than 0.75. To do so, type:
keep if myscore >= 0.75
Step 10: to save the dataset, type:
save mergedData1.dta, replace
Note: Fuzzy matching can produce a lot of incorrect matches. Therefore, it is always advised to inspect the results manually.
The matchit command is similar to reclink. However, the advantage of it is that it allows selecting from a larger variety of matching algorithms, and it also allows the use of string weights.
Below, we will show step-by-step how to use the matchit function to match two datasets with key variables containing dissimilar strings (e.g., " Princeton University" and " Princeton U").
Step 1: set the working directory.
cd "path_of_your_directory\FolderName"
Note: here are the codes we used to set our working directory:
cd "C:\Users\alamin\Desktop\DSS\FuzzyMerge"
Step 2: install the "matchit" and "freqindex" functions that are necessary for fuzzy matching. To install, use the following codes:
ssc install matchit
ssc install freqindex
Step 3: import our first example dataset stored in Excel in the DSS repository. Use the following codes:
import excel using "https://dss.princeton.edu/training/sp500.xlsx", firstrow sheet("sp500") clear
Notice that the dataset contains two columns: "Name" and "Volume". The "Name" column contains the names of different companies, which are recorded as strings.
Step 4: generate an id column in the dataset. To do so, type:
gen id_sp500 = _n
Step 5: save the dataset in Stata (.dta) format. Type:
save sp500.dta, replace
Step 6: import the second example dataset (Excel) from the DSS repository. Type:
import excel using "https://dss.princeton.edu/training/nyse.xlsx", firstrow sheet("nyse") clear
Step 7: generate an id column in the dataset. To do so, type:
gen id_nyse = _n
Step 8: save the dataset in Stata (.dta) format. Type:
save nyse.dta, replace
Step 9: type the following codes to fuzzy merge the two datasets.
matchit id_nyse Name using sp500.dta, idu(id_sp500) txtu(Name)
Stata will give us the following message:
The first few rows of the fuzzy-matched data look as follows:
After some manual inspection, it seems that the data did not match well. Let's keep only that matches >.75. Type:
keep if similscore>0.75
Note that the researchers have to decide carefully about the cut-off point of the similscore. Some researchers even find that the similscore of 0.65 provides a good match.
The new data looks as follows:
The above table indicates a better match.
The matchit command does not merge the two datasets (nyse.dta and sp500.dta) automatically. To merge the two datasets, use the following codes:
joinby id_nyse using nyse.dta
joinby id_sp500 using sp500.dta
The new dataset looks as follows:
Step 10: save the merged dataset. Type:
save mergedData2.dta, replace
Note: Fuzzy matching can produce a lot of incorrect matches. Therefore, it is always advised to inspect the results manually.
DSS Data Analysis Guides. Available at https://libguides.princeton.edu/c.php?g=1415215
Global Poverty Research Lab. https://povertyaction.github.io/guides/cleaning/04%20Data%20Aggregation/02%20Fuzzy%20Merge/
Reclink Stata Help Guide. Available at http://fmwww.bc.edu/RePEc/bocode/r/reclink.html
Statalist resources. https://www.statalist.org/forums/forum/general-stata-discussion/general/1338206-what-are-the-differences-between-matchit-and-reclink
Stata: How to use reclink, part 1/2. Youtube. Available at https://www.youtube.com/watch?v=AfMu5v_JaYc
Stata: How to use reclink, part 2/2. Youtube. Available at https://www.youtube.com/watch?v=ebFiAY3rWy8
If you have questions or comments about this guide or method, please email data@Princeton.edu.