Append – adds cases/observations to a dataset.
This document will use the – smartbind– function from the –gtools- package. Appending two datasets require that both have variables with exactly the same name and spelling. If using categorical data make sure the categories on both datasets refer to exactly the same thing (i.e. 1 “Agree”, 2”Disagree”, 3 “OK” on both).
Let's look at the example dataset.
mydata6080 = read.csv("http://www.princeton.edu/~otorres/mydata6080.csv",
header=TRUE,
stringsAsFactors = FALSE)
It contains 2850 observations of 6 variables. We can see from the table below that it contains observations from 1960 to 1989.
table(mydata6080$year)
1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
95 95 95 95 95 95 95 95 95 95 95 95 95 95
1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987
95 95 95 95 95 95 95 95 95 95 95 95 95 95
1988 1989
95 95
The second dataset contains 2280 observations from 6 variables from1990 to 2013.
mydata9020 = read.csv("http://www.princeton.edu/~otorres/mydata9020.csv",
header=TRUE,
stringsAsFactors = FALSE)
table(mydata9020$year)
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
95 95 95 95 95 95 95 95 95 95 95 95 95 95
2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
95 95 95 95 95 95 95 95 95 95
Now we append the two datafiles.
library(gtools) mydata = smartbind(mydata6080, mydata9020) table(mydata$year) 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973 95 95 95 95 95 95 95 95 95 95 95 95 95 95 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984 1985 1986 1987 95 95 95 95 95 95 95 95 95 95 95 95 95 95 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 95 95 95 95 95 95 95 95 95 95 95 95 95 95 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 95 95 95 95 95 95 95 95 95 95 95 95
Now these two datasets containing the same variables but different observations are appended to one.
Merge – adds variables to a dataset.
This document will use –merge– function. Merging two datasets require that both have at least one variable in common (either string or numeric). If string make sure the categories have the same spelling (i.e. country names, etc.) and R is Capital-sensitive.
Explore each dataset separately before merging. Make sure to use all possible common variables (for example, if merging two panel datasets you will need both country and years).
We use the 'mydata' dataset we just appended to illustrate an example.
str(mydata)
'data.frame': 5130 obs. of 6 variables:
$ year : int 1989 1980 1989 1981 1988 1989 1989 1980 1989 1987 ...
$ country: chr "Estonia" "Thailand" "Hong Kong SAR, China" "Thailand" ...
$ gdppc : num NA 882 16973 915 16767 ...
$ unemp : num 0.6 0.9 1.1 1.3 1.4 1.4 1.5 1.6 1.6 1.7 ...
$ export : num NA 9.92e+09 NA 1.08e+10 NA ...
$ import : num NA 1.66e+10 NA 1.67e+10 NA ...
We can see that this dataset now has 5130 observation of 6 variables.
We are going to add one more variable from a dataset called 'mydatapol'. We are going to use the common variables ‘country’ and ‘year’ to match each row on both files.
mydatapol = read.csv("http://www.princeton.edu/~otorres/mydatapol.csv", stringsAsFactors = FALSE) str(mydatapol) 'data.frame': 3655 obs. of 3 variables: $ year : int 1996 1996 1996 1996 1996 1996 1996 1996 1996 1996 ... $ country : chr "Afghanistan" "Albania" "Algeria" "American Samoa" ... $ politics: num 1.12 24.05 16.69 NA 89.51 ...
Now we merge these two datasets.
mydata <- merge(mydata, mydatapol, by=c("country","year"), all=TRUE) mydata <- mydata[order(mydata$year,mydata$country),] #Sorting data by year/country
We can see the new dataset has one more variable.
str(mydata)
'data.frame': 7170 obs. of 7 variables:
$ country : chr "Algeria" "Argentina" "Australia" "Austria" ...
$ year : int 1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
$ gdppc : num 1766 3732 13469 10862 NA ...
$ unemp : num NA NA NA NA NA NA NA NA NA NA ...
$ export : num 2.24e+10 3.98e+09 1.02e+10 1.03e+10 NA ...
$ import : num 1.61e+10 5.43e+09 1.01e+10 1.14e+10 NA ...
$ politics: num NA NA NA NA NA NA NA NA NA NA ...
Common variables do not need to have the same name, We can use by.x and by.y to indicate the column names respectively in two files.
We can rename some of the columns for them to have different names.
mydata1 <- read.dta("http://dss.princeton.edu/training/mydata1.dta") mydata2 <- read.dta("http://dss.princeton.edu/training/mydata2.dta") library(dplyr) mydata2<-mydata2 %>% rename(nations = country, time= year)
Then we can merge them
mydata <- merge(mydata1, mydata2, by.x=c("country","year"), by.y=c("nations","time"), all = TRUE)
Also, If one file has ‘country/year’, but the other only ‘country’, # you can still merge. Values in the data with only ‘country’ will repeat in the ‘country/year’ data.
mydata <- merge(mydata1, mydata2, by=c("country"), all=TRUE)
Sometimes the strings are not exactly the same and it's hard to identify which ones are different. We can use fuzzy merge for approximate mapping.
We use the agrep() function in this document. The main goal is to get a key file to merge the data files. Keep in mind that string merging/matching is not exact and require constant checking and some trial-and-error. The example presented here will try to merge two files which only common variable is company name.
First we load the two files 'sp500' and 'nyse'.
sp500 <- read.csv(("http://www.princeton.edu/~otorres/sandp500.csv")) head(sp500) Name Volume 1 Agilent Technologies 1,723,233 2 Alcoa Inc 25,876,128 3 American Airlines Group Inc 9,930,273 4 Advance Auto Parts Inc 783,826 5 Apple Inc 34,553,089 6 Abbvie Inc. Common Stock 6,169,855
nyse <- read.csv("http://www.princeton.edu/~otorres/nyse.csv") head(nyse) Symbol Name IPOyear Sector 1 DDD 3D Systems Corporation n/a Technology 2 MMM 3M Company n/a Health Care 3 WBAI 500.com Limited 2013 Consumer Services 4 WUBA 58.com Inc. 2013 Technology 5 AHC A.H. Belo Corporation n/a Consumer Services 6 ATEN A10 Networks, Inc. 2014 Technology Industry 1 Computer Software: Prepackaged Software 2 Medical/Dental Instruments 3 Services-Misc. Amusement & Recreation 4 Computer Software: Programming, Data Processing 5 Newspapers/Magazines 6 Computer Communications Equipment
Then we separate the string variable from each dataset.
sp500.name = data.frame(sp500$Name)
names(sp500.name)[names(sp500.name)=="sp500.Name"] = "name.sp"
sp500.name$name.sp = as.character(sp500.name$name.sp)
sp500.name = unique(sp500.name) # Removing duplicates
head(sp500.name)
name.sp
1 Agilent Technologies
2 Alcoa Inc
3 American Airlines Group Inc
4 Advance Auto Parts Inc
5 Apple Inc
6 Abbvie Inc. Common Stock
nyse.name = data.frame(nyse$Name)
names(nyse.name)[names(nyse.name)=="nyse.Name"] = "name.nyse"
nyse.name$name.nyse = as.character(nyse.name$name.nyse)
nyse.name = unique(nyse.name) # Removing duplicates
head(nyse.name)
name.nyse
1 3D Systems Corporation
2 3M Company
3 500.com Limited
4 58.com Inc.
5 A.H. Belo Corporation
6 A10 Networks, Inc.
Then we match the string variables from sp500 to nyse data.
sp500.name$name.nyse <- "" # Creating an empty column for(i in 1:dim(sp500.name)[1]) { x <- agrep(sp500.name$name.sp[i], nyse.name$name.nyse, ignore.case=TRUE, value=TRUE, max.distance = 0.05, useBytes = TRUE) x <- paste0(x,"") sp500.name$name.nyse[i] <- x } #note: The following warning may pop-up, : “There were 28 warnings (use warnings() to see them)” head(sp500.name, 13) name.sp 1 Agilent Technologies 2 Alcoa Inc 3 American Airlines Group Inc 4 Advance Auto Parts Inc 5 Apple Inc 6 Abbvie Inc. Common Stock 7 Amerisourcebergen Corp 8 Abbott Laboratories 9 Ace Ltd 10 Accenture Plc 11 Adobe Systems Incorporated 12 Analog Devices 13 Archer Daniels Midland Company name.nyse 1 Agilent Technologies, Inc. 2 Alcoa Inc. 3 4 Advance Auto Parts Inc 5 6 7 AmerisourceBergen Corporation (Holding Co) 8 Abbott Laboratories 9 Third Point Reinsurance Ltd. 10 Accenture plc. 11 12 13 Archer-Daniels-Midland Company
First column has the original names in the file sp500; second column has the corresponding matched names from the nyse file. This file is the key file to merge the full datasets (make sure to check it first). Notice that at Line 9 there is a mismatch. Further cleaning and data inspecation is needed when performing fuzzy matching.
Now we merge the key file.
sp500 = merge(sp500, sp500.name, by.x=c("Name"), by.y=c("name.sp"), all= TRUE) head(sp500) Name Volume 1 21St Century Fox Class A 6,206,568 2 21St Century Fox Class B 1,756,450 3 3M Company 1,606,555 4 Abbott Laboratories 4,747,144 5 Abbvie Inc. Common Stock 6,169,855 6 Accenture Plc 1,497,234 name.nyse 1 2 3 3M Company 4 Abbott Laboratories 5 6 Accenture plc.
Renaming the original variable for company name in dataset sp500 (this is for better tracking when merging the nyse file).
names(sp500)[names(sp500)=="Name"] = "name.sp"
companies = merge(sp500, nyse, by.x=c("name.nyse"), by.y=c("Name"), all = TRUE) companies[sample(nrow(companies), 10), ] name.nyse name.sp Volume Symbol IPOyear Sector Industry 715 Campus Crest Communities, Inc.CCG^A n/a n/a n/a 2093 Monster Worldwide, Inc. MWW n/a Technology Professional Services 436 Autohome Inc. ATHM 2013 Technology EDP Services 23 Discovery Communications Inc 1,763,873 3400 Zions Bancorporation Zions Bancorporation 2,791,235 ZB^A n/a n/a n/a 3159 United Dominion Realty Trust, Inc. UDR n/a Consumer Services Real Estate Investment Trusts 297 American Financial Group, Inc. AFGH 2015 Finance Property-Casualty Insurers 2986 Teleflex Incorporated TFX n/a Health Care Medical/Dental Instruments 495 Bank of America Corporation Bank of America Corp 66,656,351 BML^G n/a n/a n/a 90 Wisconsin Energy Corp 1,161,041
companies = merge(sp500, nyse, by.x=c("name.nyse"), by.y=c("Name"))
companies[sample(nrow(companies), 10), ]
name.nyse name.sp Volume Symbol IPOyear Sector Industry
390 Regions Financial Corporation Regions Financial Corp 11,877,681 RF^B n/a n/a n/a
155 Delta Air Lines, Inc. Delta Air Lines Inc 7,744,041 DAL n/a Transportation Air Freight/Delivery Services
488 Wal-Mart Stores, Inc. Wal-Mart Stores 7,833,473 WMT n/a Consumer Services Department/Specialty Retail Stores
360 PPG Industries, Inc. PPG Industries 1,244,825 PPG n/a Basic Industries Paints/Coatings
354 Philip Morris International Inc Philip Morris International Inc 4,502,861 PM n/a Consumer Non-Durables Farming/Seeds/Milling
315 Molson Coors Brewing Company Molson Coors Brewing Company 996,032 TAP n/a Consumer Non-Durables Beverages (Production/Distribution)
513 Wyndham Worldwide Corp Wyndham Worldwide Corp 817,299 WYN n/a Consumer Services Hotels/Resorts
30 Allstate Corporation (The) Allstate Corp 1,475,707 ALL^B n/a n/a n/a
411 Southern Company (The) Southern Company 3,797,694 SO n/a Public Utilities Electric Utilities: Central
187 Equifax, Inc. Equifax Inc 755,471 EFX n/a Finance Finance: Consumer Services
Note that since the key file was not cleaned or corrected, when merging the files we still have some wrong matches.
Merge in R: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/merge
Rename in R: https://www.datanovia.com/en/lessons/rename-data-frame-columns-in-r/
World Development Indicators (World Bank): https://databank.worldbank.org/reports.aspx?source=world-development-indicators
https://stats.stackexchange.com/questions/3425/how-to-quasi-match-two-vectors-of-strings-in-r
https://stackoverflow.com/questions/8273313/sample-random-rows-in-dataframe
https://www.barchart.com/stocks/indices/sp/sp500
https://www.nasdaq.com/market-activity/stocks/screener?exchange=NYSE
If you have questions or comments about this guide or method, please email data@Princeton.edu.