Skip to Main Content

Merge/Append in R: including fuzzy merge

Merge, append and fuzzy merge in R

Merge and Append

When working with data, we often need to combine information from multiple datasets. There are two common ways to do so: merge and append. They can both put different data frames together, but they serve different purposes.

Merge adds new variables(columns) to a dataset. It is used when the datasets share at least one common variable, such as a country name or a year, but each dataset contains different kinds of information about those units.

Append, on the other hand, adds new observations(rows) to an existing dataset. It is used when the datasets contain the same variables but represent different cases or time periods -- for example, when combining survey data from two different years.

In short, merge expands a dataset horizontally, while append expands it vertically.

 

Merge

Merge – adds variables to a dataset.

This document will use –merge– function. Merging two datasets require that both have at least one variable in common (either string or numeric). If string make sure the categories have the same spelling (i.e. country names, etc.) and R is Capital-sensitive.

Explore each dataset separately before merging. Make sure to use all possible common variables (for example, if merging two panel datasets you will need both country and years).

We will use the gdp data and population data from WDI (World Development Indicators) to experiment the merge. Here are the links to access them.

gdp_url <- "https://dss.princeton.edu/training/gdp_18to22.csv"
pop_url <- "https://dss.princeton.edu/training/population_18to22.csv"

gdp <- read.csv(gdp_url,  stringsAsFactors = FALSE)
pop <- read.csv(pop_url,  stringsAsFactors = FALSE)

The first dataset includes the GDP per capita(current US dollars) for China, India, and the United States from 2018 to 2022. The second datasets includes the population information for the same countries from the same time span.

We are going to use the common variables ‘country’ and ‘year’  to match each row on both files. Now we merge these two datasets.

merged <- merge(pop, gdp, by = c("country", "year"))

We can see the two datasets are now combined into one.  The new dataset has one more variable instead of three, and this is because two of the variables are used as matching keys in the merge process.

In fact, common variables do not need to have the same name, We can use by.x and by.y to indicate the column names respectively in two files.

We can rename some of the columns for them to have different names. There are multiple ways to rename columns and this is one example.

gdp_renamed <- gdp
names(gdp_renamed)[names(gdp_renamed) == "country"] <- "nations"
names(gdp_renamed)[names(gdp_renamed) == "year"]    <- "time"

Then we can merge them using by.x/ by.y to match different key names.

merged_xy <- merge(
  x = pop, y = gdp_renamed,
  by.x = c("country", "year"),
  by.y = c("nations", "time"),
  all  = TRUE
)

This code will give us the same output as the 'merged'.

Append

Append – adds cases/observations to a dataset.

This document will use the rbind function. We can also use – smartbind– function from the –gtools- package. Appending two datasets require that both have variables with exactly the same name and spelling. If using categorical data make sure the categories on both datasets refer to exactly the same thing (i.e. 1 “Agree”, 2”Disagree”, 3 “OK” on both).

Let's look at the example dataset.

gdp_url <- "https://dss.princeton.edu/training/gdp_18to22.csv"
gdp_1617_url <- "https://dss.princeton.edu/training/gdp_16to17.csv"

gdp_16_17 <- read.csv(gdp_1617_url, stringsAsFactors = FALSE)
gdp_18_22 <- read.csv(gdp_url,     stringsAsFactors = FALSE)

The two datasets both contain GDP per capita (current US dollars) data from WDI. The first dataset contains information for the three countries from 2018 to 2022, and the second includes 2016 and 2017 for the same three countries.

Now we append the two datafiles. 

gdp_all <- rbind(gdp_16_17, gdp_18_22)

Now these two datasets containing the same variables but different observations are appended to one. Let's try to inspect them.

nrow(gdp_16_17);
nrow(gdp_18_22);
nrow(gdp_all)
[1] 6
[1] 15
[1] 21

Unlike merge, the result of the appending should just be the summation of the row numbers of the two previous data frames.

table(gdp_all$year)
2016 2017 2018 2019 2020 2021 2022 
   3    3    3    3    3    3    3

We can see this data frame now contains information from 2016 to 2022.

 

Fuzzy merge

Sometimes the strings are not exactly the same and it's hard to identify which ones are different. We can use fuzzy merge for approximate mapping.

We use the agrep() function in this document. The main goal is to get a key file to merge the data files. Keep in mind that string merging/matching is not exact and require constant checking and some trial-and-error. The example presented here will try to merge two files which only common variable is company name.

First we load the two files 'sp500' and 'nyse'.

sp500 <- read.csv(("http://www.princeton.edu/~otorres/sandp500.csv"))
head(sp500)
                         Name     Volume
1        Agilent Technologies  1,723,233
2                   Alcoa Inc 25,876,128
3 American Airlines Group Inc  9,930,273
4      Advance Auto Parts Inc    783,826
5                   Apple Inc 34,553,089
6    Abbvie Inc. Common Stock  6,169,855 
nyse <- read.csv("http://www.princeton.edu/~otorres/nyse.csv")
head(nyse)
  Symbol                   Name IPOyear            Sector
1    DDD 3D Systems Corporation     n/a        Technology
2    MMM             3M Company     n/a       Health Care
3   WBAI        500.com Limited    2013 Consumer Services
4   WUBA            58.com Inc.    2013        Technology
5    AHC  A.H. Belo Corporation     n/a Consumer Services
6   ATEN     A10 Networks, Inc.    2014        Technology
                                         Industry
1         Computer Software: Prepackaged Software
2                      Medical/Dental Instruments
3           Services-Misc. Amusement & Recreation
4 Computer Software: Programming, Data Processing
5                            Newspapers/Magazines
6               Computer Communications Equipment 

Then we separate the string variable from each dataset.

sp500.name = data.frame(sp500$Name)
names(sp500.name)[names(sp500.name)=="sp500.Name"] = "name.sp"
sp500.name$name.sp = as.character(sp500.name$name.sp)
sp500.name = unique(sp500.name) # Removing duplicates
head(sp500.name) 
              name.sp
1        Agilent Technologies
2                   Alcoa Inc
3 American Airlines Group Inc
4      Advance Auto Parts Inc
5                   Apple Inc
6    Abbvie Inc. Common Stock
nyse.name = data.frame(nyse$Name)
names(nyse.name)[names(nyse.name)=="nyse.Name"] = "name.nyse"
nyse.name$name.nyse = as.character(nyse.name$name.nyse)
nyse.name = unique(nyse.name) # Removing duplicates
head(nyse.name) 
       name.nyse
1 3D Systems Corporation
2             3M Company
3        500.com Limited
4            58.com Inc.
5  A.H. Belo Corporation
6     A10 Networks, Inc.

Then we match the string variables from sp500 to nyse data.

sp500.name$name.nyse <- "" # Creating an empty column
for(i in 1:dim(sp500.name)[1]) {
  x <- agrep(sp500.name$name.sp[i], nyse.name$name.nyse,
             ignore.case=TRUE, value=TRUE,
             max.distance = 0.05, useBytes = TRUE)
  x <- paste0(x,"")
  sp500.name$name.nyse[i] <- x
} 

#note: The following warning may pop-up, : “There were 28 warnings (use warnings() to see them)”
head(sp500.name, 13)
   name.sp                                   
1            Agilent Technologies
2                       Alcoa Inc
3     American Airlines Group Inc
4          Advance Auto Parts Inc
5                       Apple Inc
6        Abbvie Inc. Common Stock
7          Amerisourcebergen Corp
8             Abbott Laboratories
9                         Ace Ltd
10                  Accenture Plc
11     Adobe Systems Incorporated
12                 Analog Devices
13 Archer Daniels Midland Company

 name.nyse
1                  Agilent Technologies, Inc.
2                                  Alcoa Inc.
3                                            
4                      Advance Auto Parts Inc
5                                            
6                                            
7  AmerisourceBergen Corporation (Holding Co)
8                         Abbott Laboratories
9                Third Point Reinsurance Ltd.
10                             Accenture plc.
11                                           
12                                           
13             Archer-Daniels-Midland Company

First column has the original names in the file sp500; second column has the corresponding matched names from the nyse file. This file is the key file to merge the full datasets (make sure to check it first). Notice that at Line 9 there is a mismatch. Further cleaning and data inspecation is needed when performing fuzzy matching.

Now we merge the key file.

sp500 = merge(sp500, sp500.name, by.x=c("Name"), by.y=c("name.sp"), all= TRUE)
head(sp500)
                      Name    Volume
1 21St Century Fox Class A 6,206,568
2 21St Century Fox Class B 1,756,450
3               3M Company 1,606,555
4      Abbott Laboratories 4,747,144
5 Abbvie Inc. Common Stock 6,169,855
6            Accenture Plc 1,497,234
            name.nyse
1                    
2                    
3          3M Company
4 Abbott Laboratories
5                    
6      Accenture plc.

Renaming the original variable for company name in dataset sp500 (this is for better tracking when merging the nyse file).

names(sp500)[names(sp500)=="Name"] = "name.sp"
  • Merge the two files (keep all)
companies = merge(sp500, nyse, by.x=c("name.nyse"), by.y=c("Name"), all = TRUE)
companies[sample(nrow(companies), 10), ] 

                              name.nyse                      name.sp     Volume Symbol IPOyear            Sector                      Industry
715      Campus Crest Communities, Inc.                                         CCG^A   n/a               n/a                           n/a
2093            Monster Worldwide, Inc.                                         MWW     n/a        Technology         Professional Services
436                       Autohome Inc.                                         ATHM    2013        Technology                  EDP Services
23                                      Discovery Communications Inc  1,763,873                                               
3400               Zions Bancorporation         Zions Bancorporation  2,791,235 ZB^A    n/a               n/a                           n/a
3159 United Dominion Realty Trust, Inc.                                         UDR     n/a Consumer Services Real Estate Investment Trusts
297      American Financial Group, Inc.                                         AFGH    2015           Finance    Property-Casualty Insurers
2986              Teleflex Incorporated                                         TFX     n/a       Health Care    Medical/Dental Instruments
495         Bank of America Corporation         Bank of America Corp 66,656,351 BML^G   n/a               n/a                           n/a
90                                             Wisconsin Energy Corp  1,161,041                                                             
  • Merge the two files (keep only perfect matches)
companies = merge(sp500, nyse, by.x=c("name.nyse"), by.y=c("Name"))
companies[sample(nrow(companies), 10), ] 
                          name.nyse                         name.sp     Volume Symbol IPOyear                 Sector                           Industry
390   Regions Financial Corporation          Regions Financial Corp 11,877,681   RF^B     n/a                   n/a                                 n/a
155           Delta Air Lines, Inc.             Delta Air Lines Inc  7,744,041    DAL     n/a        Transportation       Air Freight/Delivery Services
488           Wal-Mart Stores, Inc.                 Wal-Mart Stores  7,833,473    WMT     n/a     Consumer Services  Department/Specialty Retail Stores
360            PPG Industries, Inc.                  PPG Industries  1,244,825    PPG     n/a      Basic Industries                     Paints/Coatings
354 Philip Morris International Inc Philip Morris International Inc  4,502,861     PM     n/a Consumer Non-Durables               Farming/Seeds/Milling
315   Molson Coors Brewing  Company    Molson Coors Brewing Company    996,032    TAP     n/a Consumer Non-Durables Beverages (Production/Distribution)
513          Wyndham Worldwide Corp          Wyndham Worldwide Corp    817,299    WYN     n/a     Consumer Services                      Hotels/Resorts
30       Allstate Corporation (The)                   Allstate Corp  1,475,707  ALL^B     n/a                   n/a                                 n/a
411          Southern Company (The)                Southern Company  3,797,694     SO     n/a      Public Utilities         Electric Utilities: Central
187                   Equifax, Inc.                     Equifax Inc    755,471    EFX     n/a               Finance          Finance: Consumer Services

Note that since the key file was not cleaned or corrected, when merging the files we still have some wrong matches.

Comments or Questions?

If you have questions or comments about this guide or method, please email data@Princeton.edu.

Data Consultant

Profile Photo
Yufei Qin
Contact:
Firestone Library, A.12F.2
6092582519

Data Consultant

Profile Photo
Muhammad Al Amin
He/Him/His
Contact:
Firestone Library, A-12-F.1
609-258-6051