Research Guides: Merge/Append in R: including fuzzy merge

Append

Append – adds cases/observations to a dataset.

This document will use the – smartbind– function from the –gtools- package. Appending two datasets require that both have variables with exactly the same name and spelling. If using categorical data make sure the categories on both datasets refer to exactly the same thing (i.e. 1 “Agree”, 2”Disagree”, 3 “OK” on both).

Let's look at the example dataset.

mydata6080 = read.csv("http://www.princeton.edu/~otorres/mydata6080.csv",
                      header=TRUE,
                      stringsAsFactors = FALSE)

It contains 2850 observations of 6 variables. We can see from the table below that it contains observations from 1960 to 1989.

table(mydata6080$year)

1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 
  95   95   95   95   95   95   95   95   95   95   95   95   95   95 
1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 
  95   95   95   95   95   95   95   95   95   95   95   95   95   95 
1988 1989 
  95   95

The second dataset contains 2280 observations from 6 variables from1990 to 2013.

mydata9020 = read.csv("http://www.princeton.edu/~otorres/mydata9020.csv",
                      header=TRUE,
                      stringsAsFactors = FALSE)

table(mydata9020$year)
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 
  95   95   95   95   95   95   95   95   95   95   95   95   95   95 
2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
  95   95   95   95   95   95   95   95   95   95

Now we append the two datafiles.

library(gtools)
mydata = smartbind(mydata6080, mydata9020)
table(mydata$year)
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 
  95   95   95   95   95   95   95   95   95   95   95   95   95   95 
1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 
  95   95   95   95   95   95   95   95   95   95   95   95   95   95 
1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 
  95   95   95   95   95   95   95   95   95   95   95   95   95   95 
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 
  95   95   95   95   95   95   95   95   95   95   95   95

Now these two datasets containing the same variables but different observations are appended to one.

Merge

Merge – adds variables to a dataset.

This document will use –merge– function. Merging two datasets require that both have at least one variable in common (either string or numeric). If string make sure the categories have the same spelling (i.e. country names, etc.) and R is Capital-sensitive.

Explore each dataset separately before merging. Make sure to use all possible common variables (for example, if merging two panel datasets you will need both country and years).

We use the 'mydata' dataset we just appended to illustrate an example.

 str(mydata)
'data.frame':	5130 obs. of  6 variables:
 $ year   : int  1989 1980 1989 1981 1988 1989 1989 1980 1989 1987 ...
 $ country: chr  "Estonia" "Thailand" "Hong Kong SAR, China" "Thailand" ...
 $ gdppc  : num  NA 882 16973 915 16767 ...
 $ unemp  : num  0.6 0.9 1.1 1.3 1.4 1.4 1.5 1.6 1.6 1.7 ...
 $ export : num  NA 9.92e+09 NA 1.08e+10 NA ...
 $ import : num  NA 1.66e+10 NA 1.67e+10 NA ...

We can see that this dataset now has 5130 observation of 6 variables.

We are going to add one more variable from a dataset called 'mydatapol'. We are going to use the common variables ‘country’ and ‘year’ to match each row on both files.

mydatapol = read.csv("http://www.princeton.edu/~otorres/mydatapol.csv",
                    stringsAsFactors = FALSE)
str(mydatapol)
'data.frame':	3655 obs. of  3 variables:
 $ year    : int  1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ...
 $ country : chr  "Afghanistan" "Albania" "Algeria" "American Samoa" ...
 $ politics: num  1.12 24.05 16.69 NA 89.51 ...

Now we merge these two datasets.

mydata <- merge(mydata, mydatapol, by=c("country","year"), all=TRUE)
mydata <- mydata[order(mydata$year,mydata$country),] #Sorting data by year/country

We can see the new dataset has one more variable.

str(mydata)
'data.frame':	7170 obs. of  7 variables:
 $ country : chr  "Algeria" "Argentina" "Australia" "Austria" ...
 $ year    : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ gdppc   : num  1766 3732 13469 10862 NA ...
 $ unemp   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ export  : num  2.24e+10 3.98e+09 1.02e+10 1.03e+10 NA ...
 $ import  : num  1.61e+10 5.43e+09 1.01e+10 1.14e+10 NA ...
 $ politics: num  NA NA NA NA NA NA NA NA NA NA ...

Common variables do not need to have the same name, We can use by.x and by.y to indicate the column names respectively in two files.

We can rename some of the columns for them to have different names.

mydata1 <- read.dta("http://dss.princeton.edu/training/mydata1.dta")
mydata2 <- read.dta("http://dss.princeton.edu/training/mydata2.dta")
library(dplyr)
mydata2<-mydata2 %>% 
  rename(nations = country, time= year)

Then we can merge them

mydata <- merge(mydata1, mydata2,
                by.x=c("country","year"),
                by.y=c("nations","time"), all = TRUE)

Also, If one file has ‘country/year’, but the other only ‘country’, # you can still merge. Values in the data with only ‘country’ will repeat in the ‘country/year’ data.

mydata <- merge(mydata1, mydata2, by=c("country"), all=TRUE)

Fuzzy merge

Sometimes the strings are not exactly the same and it's hard to identify which ones are different. We can use fuzzy merge for approximate mapping.

We use the agrep() function in this document. The main goal is to get a key file to merge the data files. Keep in mind that string merging/matching is not exact and require constant checking and some trial-and-error. The example presented here will try to merge two files which only common variable is company name.

First we load the two files 'sp500' and 'nyse'.

sp500 <- read.csv(("http://www.princeton.edu/~otorres/sandp500.csv"))
head(sp500)
                         Name     Volume
1        Agilent Technologies  1,723,233
2                   Alcoa Inc 25,876,128
3 American Airlines Group Inc  9,930,273
4      Advance Auto Parts Inc    783,826
5                   Apple Inc 34,553,089
6    Abbvie Inc. Common Stock  6,169,855

nyse <- read.csv("http://www.princeton.edu/~otorres/nyse.csv")
head(nyse)
  Symbol                   Name IPOyear            Sector
1    DDD 3D Systems Corporation     n/a        Technology
2    MMM             3M Company     n/a       Health Care
3   WBAI        500.com Limited    2013 Consumer Services
4   WUBA            58.com Inc.    2013        Technology
5    AHC  A.H. Belo Corporation     n/a Consumer Services
6   ATEN     A10 Networks, Inc.    2014        Technology
                                         Industry
1         Computer Software: Prepackaged Software
2                      Medical/Dental Instruments
3           Services-Misc. Amusement & Recreation
4 Computer Software: Programming, Data Processing
5                            Newspapers/Magazines
6               Computer Communications Equipment

Then we separate the string variable from each dataset.

sp500.name = data.frame(sp500$Name)
names(sp500.name)[names(sp500.name)=="sp500.Name"] = "name.sp"
sp500.name$name.sp = as.character(sp500.name$name.sp)
sp500.name = unique(sp500.name) # Removing duplicates
head(sp500.name) 
              name.sp
1        Agilent Technologies
2                   Alcoa Inc
3 American Airlines Group Inc
4      Advance Auto Parts Inc
5                   Apple Inc
6    Abbvie Inc. Common Stock

nyse.name = data.frame(nyse$Name)
names(nyse.name)[names(nyse.name)=="nyse.Name"] = "name.nyse"
nyse.name$name.nyse = as.character(nyse.name$name.nyse)
nyse.name = unique(nyse.name) # Removing duplicates
head(nyse.name) 
       name.nyse
1 3D Systems Corporation
2             3M Company
3        500.com Limited
4            58.com Inc.
5  A.H. Belo Corporation
6     A10 Networks, Inc.

Then we match the string variables from sp500 to nyse data.

sp500.name$name.nyse <- "" # Creating an empty column
for(i in 1:dim(sp500.name)[1]) {
  x <- agrep(sp500.name$name.sp[i], nyse.name$name.nyse,
             ignore.case=TRUE, value=TRUE,
             max.distance = 0.05, useBytes = TRUE)
  x <- paste0(x,"")
  sp500.name$name.nyse[i] <- x
} 

#note: The following warning may pop-up, : “There were 28 warnings (use warnings() to see them)”
head(sp500.name, 13)
   name.sp                                   
1            Agilent Technologies
2                       Alcoa Inc
3     American Airlines Group Inc
4          Advance Auto Parts Inc
5                       Apple Inc
6        Abbvie Inc. Common Stock
7          Amerisourcebergen Corp
8             Abbott Laboratories
9                         Ace Ltd
10                  Accenture Plc
11     Adobe Systems Incorporated
12                 Analog Devices
13 Archer Daniels Midland Company

 name.nyse
1                  Agilent Technologies, Inc.
2                                  Alcoa Inc.
3                                            
4                      Advance Auto Parts Inc
5                                            
6                                            
7  AmerisourceBergen Corporation (Holding Co)
8                         Abbott Laboratories
9                Third Point Reinsurance Ltd.
10                             Accenture plc.
11                                           
12                                           
13             Archer-Daniels-Midland Company

First column has the original names in the file sp500; second column has the corresponding matched names from the nyse file. This file is the key file to merge the full datasets (make sure to check it first). Notice that at Line 9 there is a mismatch. Further cleaning and data inspecation is needed when performing fuzzy matching.

Now we merge the key file.

sp500 = merge(sp500, sp500.name, by.x=c("Name"), by.y=c("name.sp"), all= TRUE)
head(sp500)
                      Name    Volume
1 21St Century Fox Class A 6,206,568
2 21St Century Fox Class B 1,756,450
3               3M Company 1,606,555
4      Abbott Laboratories 4,747,144
5 Abbvie Inc. Common Stock 6,169,855
6            Accenture Plc 1,497,234
            name.nyse
1                    
2                    
3          3M Company
4 Abbott Laboratories
5                    
6      Accenture plc.

Renaming the original variable for company name in dataset sp500 (this is for better tracking when merging the nyse file).

names(sp500)[names(sp500)=="Name"] = "name.sp"

Merge the two files (keep all)

companies = merge(sp500, nyse, by.x=c("name.nyse"), by.y=c("Name"), all = TRUE)
companies[sample(nrow(companies), 10), ] 

                              name.nyse                      name.sp     Volume Symbol IPOyear            Sector                      Industry
715      Campus Crest Communities, Inc.                                         CCG^A   n/a               n/a                           n/a
2093            Monster Worldwide, Inc.                                         MWW     n/a        Technology         Professional Services
436                       Autohome Inc.                                         ATHM    2013        Technology                  EDP Services
23                                      Discovery Communications Inc  1,763,873                                               
3400               Zions Bancorporation         Zions Bancorporation  2,791,235 ZB^A    n/a               n/a                           n/a
3159 United Dominion Realty Trust, Inc.                                         UDR     n/a Consumer Services Real Estate Investment Trusts
297      American Financial Group, Inc.                                         AFGH    2015           Finance    Property-Casualty Insurers
2986              Teleflex Incorporated                                         TFX     n/a       Health Care    Medical/Dental Instruments
495         Bank of America Corporation         Bank of America Corp 66,656,351 BML^G   n/a               n/a                           n/a
90                                             Wisconsin Energy Corp  1,161,041

Merge the two files (keep only perfect matches)

companies = merge(sp500, nyse, by.x=c("name.nyse"), by.y=c("Name"))
companies[sample(nrow(companies), 10), ] 
                          name.nyse                         name.sp     Volume Symbol IPOyear                 Sector                           Industry
390   Regions Financial Corporation          Regions Financial Corp 11,877,681   RF^B     n/a                   n/a                                 n/a
155           Delta Air Lines, Inc.             Delta Air Lines Inc  7,744,041    DAL     n/a        Transportation       Air Freight/Delivery Services
488           Wal-Mart Stores, Inc.                 Wal-Mart Stores  7,833,473    WMT     n/a     Consumer Services  Department/Specialty Retail Stores
360            PPG Industries, Inc.                  PPG Industries  1,244,825    PPG     n/a      Basic Industries                     Paints/Coatings
354 Philip Morris International Inc Philip Morris International Inc  4,502,861     PM     n/a Consumer Non-Durables               Farming/Seeds/Milling
315   Molson Coors Brewing  Company    Molson Coors Brewing Company    996,032    TAP     n/a Consumer Non-Durables Beverages (Production/Distribution)
513          Wyndham Worldwide Corp          Wyndham Worldwide Corp    817,299    WYN     n/a     Consumer Services                      Hotels/Resorts
30       Allstate Corporation (The)                   Allstate Corp  1,475,707  ALL^B     n/a                   n/a                                 n/a
411          Southern Company (The)                Southern Company  3,797,694     SO     n/a      Public Utilities         Electric Utilities: Central
187                   Equifax, Inc.                     Equifax Inc    755,471    EFX     n/a               Finance          Finance: Consumer Services

Note that since the key file was not cleaned or corrected, when merging the files we still have some wrong matches.