Research Guides: Loop in R: regress on multiple csv

Applying the same R code to multiple files

Sometimes data are saved in separate csv files that all share an identical structure. In that case we may want to apply the same R code to every file.

The following code applies the same processing to multiple *.csv files and appends the results into a single data frame. All files must contain the same variables, spelled the same way and in the same order.

For this example, download the three *.csv files from these links:

https://dss.princeton.edu/training/US2.csv

https://dss.princeton.edu/training/UK.csv

https://dss.princeton.edu/training/Mexico.csv

Make sure to save all files in the same folder.

Source of data is World Development Indicators from the World Bank.

# Set working directory to where the *.csv files are saved
# In RStudio, go to Session -> Set Working Directory -> Choose Directory
# or type and run:
setwd("write path here")

Now we make a list of all *.csv files in the working directory. Make sure the folder contains only the *.csv files you need.

files <- list.files(path = ".", pattern = "\\.csv$") # pattern is a regular expression: escape the dot, anchor at the end
files
[1] "Mexico.csv" "UK.csv" "US2.csv"
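The pattern argument of list.files() is a regular expression, not a shell wildcard, so an unanchored ".csv" can match more names than intended. The following is a minimal sketch of the difference, using a temporary folder and made-up file names (both are assumptions for illustration only):

```r
# Create a scratch folder with hypothetical file names (illustration only)
demo_dir <- file.path(tempdir(), "csv_pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("UK.csv", "UK.csv.bak", "notes_csv.txt")))

# Unanchored: "." matches any character, so all three names match
loose <- list.files(path = demo_dir, pattern = ".csv")

# Escaped dot plus end-of-string anchor: only names ending in ".csv" match
strict <- list.files(path = demo_dir, pattern = "\\.csv$")

print(loose)
print(strict)
```

With the stricter pattern, backup or text files that happen to contain "csv" in their name are left out of the list.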

Create a reference data frame with the structure of the final version of your data. Read one file first, then keep only its first row as a placeholder. If the final version differs from the original individual files, create an empty data frame with the expected final structure instead.

mydata <- read.csv(files[1], header = TRUE, stringsAsFactors = FALSE)
mydata <- data.frame(mydata[1, ]) # Keep first row as placeholder
mydata
  Year Country Countrycode    GDPpc Unempm Unempf Unempt     Exports     Imports
1 1997  Mexico         MEX 5289.168   3.12   6.32   4.24 1.21765e+11 1.22323e+11
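If the final data differ from the individual files, the empty reference data frame mentioned above could be sketched as follows. The column names are copied from the placeholder row printed above, "trade" is the column created later in the loop, and the column types are assumptions based on the printed values:

```r
# Zero-row data frame describing the expected final structure.
# Column types are assumptions inferred from the printed placeholder row.
template <- data.frame(
  Year        = integer(),
  Country     = character(),
  Countrycode = character(),
  GDPpc       = numeric(),
  Unempm      = numeric(),
  Unempf      = numeric(),
  Unempt      = numeric(),
  Exports     = numeric(),
  Imports     = numeric(),
  trade       = numeric(),   # column added during processing
  stringsAsFactors = FALSE
)

str(template) # shows the column names and types, with no rows
```

A template like this documents the intended columns up front and contains no placeholder row.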

# Process each data file, then append all files into one. This loop uses smartbind() from the gtools package.

library(gtools)
for (record in files) {
  # Mandatory line
  temp <- read.csv(record, header = TRUE, stringsAsFactors = FALSE)
  temp$trade <- temp$Exports + temp$Imports # Your code from here
  mydata <- smartbind(mydata, temp) # Mandatory last line
}

If your processing creates several versions of the data under different names, replace ‘temp’ in the mandatory last line with the name of the final version. Also make sure to drop the first row, which was used only as a placeholder.

mydata <- mydata[-1, ]
View(mydata)
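As an alternative that needs no extra package and no placeholder row, the same read-process-append pattern can be sketched in base R with lapply() and do.call(rbind, ...). This is a swapped-in technique, not the method used above, and the two tiny CSV files are made up so that the sketch runs on its own:

```r
# Write two tiny hypothetical CSV files into a scratch folder
demo_dir <- file.path(tempdir(), "append_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Country = "A", Exports = c(1, 2), Imports = c(3, 4)),
          file.path(demo_dir, "A.csv"), row.names = FALSE)
write.csv(data.frame(Country = "B", Exports = c(5, 6), Imports = c(7, 8)),
          file.path(demo_dir, "B.csv"), row.names = FALSE)

demo_files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read and process each file, collecting the results in a list
pieces <- lapply(demo_files, function(f) {
  temp <- read.csv(f, header = TRUE, stringsAsFactors = FALSE)
  temp$trade <- temp$Exports + temp$Imports
  temp
})

# Stack all pieces at once
combined <- do.call(rbind, pieces)
print(combined)
```

Note that rbind() requires every piece to have the same column names, which matches the requirement stated at the top of this guide.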

Same regression across subgroups

Now we look at another example of writing a loop. We want to run the same regression on different subgroups (for example, one regression per country).

First we read the sample dataset from:

http://dss.princeton.edu/training/loop_subgroup.csv

loop <- read.csv("http://dss.princeton.edu/training/loop_subgroup.csv")

Then we create a unique id per group, in this example one id per country.

library(plyr) # see the note below
loop$id <- id(loop[c("country")], drop = TRUE)

Note: 'dplyr' is another widely used package. If 'dplyr' is loaded first and 'plyr' is loaded afterwards, 'plyr' masks the 'dplyr' functions that share a name, which can cause errors. If that happens, restart R before running library(plyr).
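If you would rather not load 'plyr' at all, a numeric group id can also be built in base R. The sketch below is a swapped-in alternative, not the plyr method used above. It uses a made-up stand-in for the data (the data frame name 'toy' and its values are assumptions), and the numbering it produces may not match the one from id():

```r
# Hypothetical stand-in for the 'loop' data frame (illustration only)
toy <- data.frame(country = c("Mexico", "UK", "Mexico", "US", "UK"),
                  stringsAsFactors = FALSE)

# factor() assigns one level per distinct country (sorted);
# as.integer() converts those levels into ids 1, 2, 3, ...
toy$id <- as.integer(factor(toy$country))
print(toy)
```

Rows that belong to the same country receive the same id, which is all the regression loop needs.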

Then we run the same regression per group.

b <- lapply(seq_len(max(loop$id)), function(i) {
  reg <- lm(unempt ~ acc, data = subset(loop, id == i)) # fit the model on group i only
  coef(reg) # return the intercept and the slope
})

We now create a data frame that consists of the betas, with one row per group.

betas <- as.data.frame(do.call(rbind,b))

Finally we add the group id as a column, which gives us a regression table. Note that the variable "acc" is the Voice and Accountability estimate for a country.

betas$id = rownames(betas)
head(betas)
  (Intercept)         acc id
1   15.086145   -5.639099  1
2  -18.909522  -35.490677  2
3 -108.496407 -102.750321  3
4   11.872425   -3.519227  4
5   11.902054   -7.906690  5
6   -7.265655   13.325477  6
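A variation on the steps above is to split the data by country directly, so that the rows of the resulting table are labeled with country names rather than numeric ids. This is a minimal sketch on simulated data. The data frame 'toy', its values, and the name 'betas_toy' are all assumptions made for illustration, so the coefficients it prints say nothing about the real dataset:

```r
# Simulated stand-in for the 'loop' data (values are made up)
set.seed(1)
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 10),
  acc     = rnorm(30)
)
toy$unempt <- 5 + 2 * toy$acc + rnorm(30)

# split() yields one data frame per country; lapply() fits lm() on each
fits <- lapply(split(toy, toy$country), function(d) {
  coef(lm(unempt ~ acc, data = d))
})

# Bind the coefficient vectors into one table; row names are country names
betas_toy <- as.data.frame(do.call(rbind, fits))
betas_toy$country <- rownames(betas_toy)
print(betas_toy)
```

Because split() names each piece after its group, no separate id column has to be created or merged back afterwards.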

Source of data:

World Development Indicators: https://databank.worldbank.org/source/world-development-indicators#

Worldwide Governance Indicators: https://databank.worldbank.org/reports.aspx?source=worldwide-governance-indicators#