When working with data, we often need to combine information from multiple datasets. There are two common ways to do so: merge and append. They can both put different data frames together, but they serve different purposes.
Merge adds new variables(columns) to a dataset. It is used when the datasets share at least one common variable, such as a country name or a year, but each dataset contains different kinds of information about those units.
Append, on the other hand, adds new observations(rows) to an existing dataset. It is used when the datasets contain the same variables but represent different cases or time periods -- for example, when combining survey data from two different years.
In short, merge expands a dataset horizontally, while append expands it vertically.
Merge – adds variables to a dataset.
This document will use –merge– function. Merging two datasets require that both have at least one variable in common (either string or numeric). If string make sure the categories have the same spelling (i.e. country names, etc.) and R is Capital-sensitive.
Explore each dataset separately before merging. Make sure to use all possible common variables (for example, if merging two panel datasets you will need both country and years).
We will use the gdp data and population data from WDI (World Development Indicators) to experiment the merge. Here are the links to access them.
gdp_url <- "https://dss.princeton.edu/training/gdp_18to22.csv" pop_url <- "https://dss.princeton.edu/training/population_18to22.csv" gdp <- read.csv(gdp_url, stringsAsFactors = FALSE) pop <- read.csv(pop_url, stringsAsFactors = FALSE)
The first dataset includes the GDP per capita(current US dollars) for China, India, and the United States from 2018 to 2022. The second datasets includes the population information for the same countries from the same time span.
We are going to use the common variables ‘country’ and ‘year’ to match each row on both files. Now we merge these two datasets.
merged <- merge(pop, gdp, by = c("country", "year"))

We can see the two datasets are now combined into one. The new dataset has one more variable instead of three, and this is because two of the variables are used as matching keys in the merge process.
In fact, common variables do not need to have the same name, We can use by.x and by.y to indicate the column names respectively in two files.
We can rename some of the columns for them to have different names. There are multiple ways to rename columns and this is one example.
gdp_renamed <- gdp names(gdp_renamed)[names(gdp_renamed) == "country"] <- "nations" names(gdp_renamed)[names(gdp_renamed) == "year"] <- "time"
Then we can merge them using by.x/ by.y to match different key names.
merged_xy <- merge(
x = pop, y = gdp_renamed,
by.x = c("country", "year"),
by.y = c("nations", "time"),
all = TRUE
)
This code will give us the same output as the 'merged'.
Append – adds cases/observations to a dataset.
This document will use the rbind function. We can also use – smartbind– function from the –gtools- package. Appending two datasets require that both have variables with exactly the same name and spelling. If using categorical data make sure the categories on both datasets refer to exactly the same thing (i.e. 1 “Agree”, 2”Disagree”, 3 “OK” on both).
Let's look at the example dataset.
gdp_url <- "https://dss.princeton.edu/training/gdp_18to22.csv" gdp_1617_url <- "https://dss.princeton.edu/training/gdp_16to17.csv" gdp_16_17 <- read.csv(gdp_1617_url, stringsAsFactors = FALSE) gdp_18_22 <- read.csv(gdp_url, stringsAsFactors = FALSE)
The two datasets both contain GDP per capita (current US dollars) data from WDI. The first dataset contains information for the three countries from 2018 to 2022, and the second includes 2016 and 2017 for the same three countries.
Now we append the two datafiles.
gdp_all <- rbind(gdp_16_17, gdp_18_22)
Now these two datasets containing the same variables but different observations are appended to one. Let's try to inspect them.
nrow(gdp_16_17); nrow(gdp_18_22); nrow(gdp_all) [1] 6 [1] 15 [1] 21
Unlike merge, the result of the appending should just be the summation of the row numbers of the two previous data frames.
table(gdp_all$year)
2016 2017 2018 2019 2020 2021 2022
3 3 3 3 3 3 3
We can see this data frame now contains information from 2016 to 2022.
Sometimes the strings are not exactly the same and it's hard to identify which ones are different. We can use fuzzy merge for approximate mapping.
We use the agrep() function in this document. The main goal is to get a key file to merge the data files. Keep in mind that string merging/matching is not exact and require constant checking and some trial-and-error. The example presented here will try to merge two files which only common variable is company name.
First we load the two files 'sp500' and 'nyse'.
sp500 <- read.csv(("http://www.princeton.edu/~otorres/sandp500.csv")) head(sp500) Name Volume 1 Agilent Technologies 1,723,233 2 Alcoa Inc 25,876,128 3 American Airlines Group Inc 9,930,273 4 Advance Auto Parts Inc 783,826 5 Apple Inc 34,553,089 6 Abbvie Inc. Common Stock 6,169,855
nyse <- read.csv("http://www.princeton.edu/~otorres/nyse.csv") head(nyse) Symbol Name IPOyear Sector 1 DDD 3D Systems Corporation n/a Technology 2 MMM 3M Company n/a Health Care 3 WBAI 500.com Limited 2013 Consumer Services 4 WUBA 58.com Inc. 2013 Technology 5 AHC A.H. Belo Corporation n/a Consumer Services 6 ATEN A10 Networks, Inc. 2014 Technology Industry 1 Computer Software: Prepackaged Software 2 Medical/Dental Instruments 3 Services-Misc. Amusement & Recreation 4 Computer Software: Programming, Data Processing 5 Newspapers/Magazines 6 Computer Communications Equipment
Then we separate the string variable from each dataset.
sp500.name = data.frame(sp500$Name)
names(sp500.name)[names(sp500.name)=="sp500.Name"] = "name.sp"
sp500.name$name.sp = as.character(sp500.name$name.sp)
sp500.name = unique(sp500.name) # Removing duplicates
head(sp500.name)
name.sp
1 Agilent Technologies
2 Alcoa Inc
3 American Airlines Group Inc
4 Advance Auto Parts Inc
5 Apple Inc
6 Abbvie Inc. Common Stock
nyse.name = data.frame(nyse$Name)
names(nyse.name)[names(nyse.name)=="nyse.Name"] = "name.nyse"
nyse.name$name.nyse = as.character(nyse.name$name.nyse)
nyse.name = unique(nyse.name) # Removing duplicates
head(nyse.name)
name.nyse
1 3D Systems Corporation
2 3M Company
3 500.com Limited
4 58.com Inc.
5 A.H. Belo Corporation
6 A10 Networks, Inc.
Then we match the string variables from sp500 to nyse data.
sp500.name$name.nyse <- "" # Creating an empty column for(i in 1:dim(sp500.name)[1]) { x <- agrep(sp500.name$name.sp[i], nyse.name$name.nyse, ignore.case=TRUE, value=TRUE, max.distance = 0.05, useBytes = TRUE) x <- paste0(x,"") sp500.name$name.nyse[i] <- x } #note: The following warning may pop-up, : “There were 28 warnings (use warnings() to see them)” head(sp500.name, 13) name.sp 1 Agilent Technologies 2 Alcoa Inc 3 American Airlines Group Inc 4 Advance Auto Parts Inc 5 Apple Inc 6 Abbvie Inc. Common Stock 7 Amerisourcebergen Corp 8 Abbott Laboratories 9 Ace Ltd 10 Accenture Plc 11 Adobe Systems Incorporated 12 Analog Devices 13 Archer Daniels Midland Company name.nyse 1 Agilent Technologies, Inc. 2 Alcoa Inc. 3 4 Advance Auto Parts Inc 5 6 7 AmerisourceBergen Corporation (Holding Co) 8 Abbott Laboratories 9 Third Point Reinsurance Ltd. 10 Accenture plc. 11 12 13 Archer-Daniels-Midland Company
First column has the original names in the file sp500; second column has the corresponding matched names from the nyse file. This file is the key file to merge the full datasets (make sure to check it first). Notice that at Line 9 there is a mismatch. Further cleaning and data inspecation is needed when performing fuzzy matching.
Now we merge the key file.
sp500 = merge(sp500, sp500.name, by.x=c("Name"), by.y=c("name.sp"), all= TRUE) head(sp500) Name Volume 1 21St Century Fox Class A 6,206,568 2 21St Century Fox Class B 1,756,450 3 3M Company 1,606,555 4 Abbott Laboratories 4,747,144 5 Abbvie Inc. Common Stock 6,169,855 6 Accenture Plc 1,497,234 name.nyse 1 2 3 3M Company 4 Abbott Laboratories 5 6 Accenture plc.
Renaming the original variable for company name in dataset sp500 (this is for better tracking when merging the nyse file).
names(sp500)[names(sp500)=="Name"] = "name.sp"
companies = merge(sp500, nyse, by.x=c("name.nyse"), by.y=c("Name"), all = TRUE) companies[sample(nrow(companies), 10), ] name.nyse name.sp Volume Symbol IPOyear Sector Industry 715 Campus Crest Communities, Inc.CCG^A n/a n/a n/a 2093 Monster Worldwide, Inc. MWW n/a Technology Professional Services 436 Autohome Inc. ATHM 2013 Technology EDP Services 23 Discovery Communications Inc 1,763,873 3400 Zions Bancorporation Zions Bancorporation 2,791,235 ZB^A n/a n/a n/a 3159 United Dominion Realty Trust, Inc. UDR n/a Consumer Services Real Estate Investment Trusts 297 American Financial Group, Inc. AFGH 2015 Finance Property-Casualty Insurers 2986 Teleflex Incorporated TFX n/a Health Care Medical/Dental Instruments 495 Bank of America Corporation Bank of America Corp 66,656,351 BML^G n/a n/a n/a 90 Wisconsin Energy Corp 1,161,041
companies = merge(sp500, nyse, by.x=c("name.nyse"), by.y=c("Name"))
companies[sample(nrow(companies), 10), ]
name.nyse name.sp Volume Symbol IPOyear Sector Industry
390 Regions Financial Corporation Regions Financial Corp 11,877,681 RF^B n/a n/a n/a
155 Delta Air Lines, Inc. Delta Air Lines Inc 7,744,041 DAL n/a Transportation Air Freight/Delivery Services
488 Wal-Mart Stores, Inc. Wal-Mart Stores 7,833,473 WMT n/a Consumer Services Department/Specialty Retail Stores
360 PPG Industries, Inc. PPG Industries 1,244,825 PPG n/a Basic Industries Paints/Coatings
354 Philip Morris International Inc Philip Morris International Inc 4,502,861 PM n/a Consumer Non-Durables Farming/Seeds/Milling
315 Molson Coors Brewing Company Molson Coors Brewing Company 996,032 TAP n/a Consumer Non-Durables Beverages (Production/Distribution)
513 Wyndham Worldwide Corp Wyndham Worldwide Corp 817,299 WYN n/a Consumer Services Hotels/Resorts
30 Allstate Corporation (The) Allstate Corp 1,475,707 ALL^B n/a n/a n/a
411 Southern Company (The) Southern Company 3,797,694 SO n/a Public Utilities Electric Utilities: Central
187 Equifax, Inc. Equifax Inc 755,471 EFX n/a Finance Finance: Consumer Services
Note that since the key file was not cleaned or corrected, when merging the files we still have some wrong matches.
Merge in R: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge
Rename in R: https://www.datanovia.com/en/lessons/rename-data-frame-columns-in-r/
World Development Indicators (World Bank): https://databank.worldbank.org/reports.aspx?source=world-development-indicators
https://stats.stackexchange.com/questions/3425/how-to-quasi-match-two-vectors-of-strings-in-r
https://stackoverflow.com/questions/8273313/sample-random-rows-in-dataframe
https://www.barchart.com/stocks/indices/sp/sp500
https://www.nasdaq.com/market-activity/stocks/screener?exchange=NYSE
Package ‘WDI’: https://cran.r-project.org/web/packages/WDI/WDI.pdf
If you have questions or comments about this guide or method, please email data@Princeton.edu.