This guide will focus on descriptive analysis in R and touch on some basic data cleaning prior to your analysis.
As a first step, we will look at importing datasets into R. We use different functions for different file types. Note that these are coding examples of importing; RStudio also lets you import a dataset through the 'Import Dataset' button located in the upper-right corner of the interface:
A CSV (Comma-Separated Values) file is a widely used data format that separates values with commas or other delimiters. It is one of the most common types of datasets.
We can use the built-in read.csv() function in R to import CSV files. Look at the example here:
mydata6080 <- read.csv("http://www.princeton.edu/~otorres/mydata6080.csv",
                       header = TRUE,             # we want the headers to be column names
                       stringsAsFactors = FALSE)  # we want the strings treated as characters
Setting header = TRUE indicates that we want the first row to be used as column names.
The stringsAsFactors argument can be either TRUE or FALSE. If we want string variables treated as categorical variables for later analysis, such as regressions, we set it to TRUE. If we are not sure yet how the string variables will be used, we can set it to FALSE. Note that it is easy to change the data type later on if you want to.
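As a small illustration of changing the type later (with a made-up data frame and column name, 'region', not from the example dataset), a character column can be converted into a factor at any point with as.factor():

```r
# Hypothetical data frame whose string column was imported with stringsAsFactors = FALSE
df <- data.frame(region = c("North", "South", "South", "North"),
                 stringsAsFactors = FALSE)
class(df$region)                   # starts out as "character"
df$region <- as.factor(df$region)  # treat it as a categorical variable from here on
levels(df$region)                  # the categories R now recognizes
```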
XLSX (Excel) file is a popular spreadsheet format that is also widely used.
The example file we are going to use is available here: https://dss.princeton.edu/training/WB_Data.xlsx.
The first sheet ('Annual Prices (Nominal)') looks like this in the spreadsheet format:
We can use the read_excel() function from the 'readxl' package in R.
Note that this function IS NOT in base R, so we need to install the package first and then load it with library().
# install.packages('readxl')
library(readxl)
mydata_excel <- read_excel("WB_Data.xlsx",  # note: set your working directory first
                           sheet = 'Annual Prices (Nominal)',
                           skip = 6)
The sheet argument allows us to specify which sheet we want to import, since Excel files can have multiple sheets.
The skip argument indicates that we want to skip the first 6 rows. We can see from the previous screenshot that we don't need the information in those rows of the spreadsheet.
Here is what the data looks like now:
It serves the purpose, but it does not look clean enough, so we can clean it further.
names <- c('Year', 'Coffee_$/kg', 'Wheat_$/mt', 'Beef_$/kg', 'Sugar_$/kg', 'Gold_$/oz')
mydata_excel <- read_excel("WB_Data.xlsx",  # note: set your working directory first
                           sheet = 'Annual Prices (Nominal)',
                           skip = 9,
                           col_names = names)
Now it looks like this:
Appears to be clean enough.
Small trick:
Sometimes it's easier to do the data cleaning in Excel first, and then import the specific sheet that you want into R.
To import Stata (.dta) files, we need the 'haven' package.
Note that this function IS NOT in base R so we need to install the package first and then library it.
# install.packages("haven")  # if this is a new package we need to install it first
library(haven)               # we have to load the package every session
We are using the factor dataset.
mydata_stata <- read_dta('https://dss.princeton.edu/training/factor.dta')
Here is a brief preview:
Addressing missing values is crucial in data analysis as well.
Note that missing answers are recorded in different ways in different datasets. For example, when marking missing values for age, some datasets use negative values and others use high numbers (like 999) so that we know these answers are actually missing. There are no set rules for how missing values are coded. Therefore, it is important to read the documentation, pay attention to the minimum and maximum values of your variables, and make sure they make sense.
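For instance, if the documentation says that 999 and negative values mark missing answers for age, we can recode them to NA (a sketch with made-up values, not from any dataset in this guide):

```r
# Hypothetical age variable where 999 and negative values mark missing answers
age <- c(25, 34, 999, -1, 52, 999)
summary(age)                     # the max of 999 and min of -1 are red flags
age[age == 999 | age < 0] <- NA  # recode the sentinel codes to NA
summary(age)                     # now the range is plausible and the NA's are counted
```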
In R, missing values appear as NA, which is an indicator of missingness. To see which values in a vector R recognizes as missing, we can use the is.na() function. It returns a TRUE/FALSE vector with as many elements as the vector we provide.
is.na(mydata_stata$ches_id)
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
[47] FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[70] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
[93] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[116] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
[139] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[162] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[185] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
[208] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[231] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
We can see that some of the logical values are TRUE, indicating that this column contains missing values.
Now look at another column:
is.na(mydata_stata$country_id)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[24] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[47] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[70] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[93] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[116] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[139] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[162] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[185] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[208] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[231] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
It appears that there are no missing values in this column.
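For a single column, we do not have to scan the TRUE/FALSE vector by eye: because TRUE counts as 1, sum() and mean() summarize it directly, and anyNA() gives a yes/no answer. A minimal sketch with a made-up vector:

```r
x <- c(1, NA, 3, NA, 5)  # hypothetical vector with two missing values
sum(is.na(x))            # how many values are missing: 2
mean(is.na(x))           # share of values that are missing: 0.4
anyNA(x)                 # TRUE if at least one value is missing
```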
When the dataset is large, it can be hard to tell whether there is a single TRUE anywhere in the vector. In that case, we can use the colSums() function to see the missing-value count for every column of the dataset at once.
colSums(is.na(mydata_stata))
country_id party party_id cee ches_id family
0 0 0 0 34 0
cmp_id parlgov_id manichean indivisble generalwill peoplecentrism
39 10 3 3 11 3
antielitism complex emotional lroverall lrecon immigration
3 9 9 1 6 5
eu nativism laworder lifestyle intradem personalised
3 5 12 2 21 12
nr_experts mean_expert_response
0 0
We can see, for example, that the 'country_id' and 'party' columns do not have any missing values, while 'ches_id' and 'cmp_id' have some.
Now suppose we want to calculate the mean of a column using the mean() function from base R. Let's use 'complex' as an example.
mean(mydata_stata$complex)
[1] NA
It returns NA. The reason is that, as we saw above, this column contains missing values. We can solve this problem by adding the argument na.rm = TRUE.
mean(mydata_stata$complex, na.rm=TRUE)
[1] 4.827535
In this case, the command calculates the average value of the column EXCLUDING the missing values.
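If we prefer to drop incomplete observations altogether, base R provides na.omit() for a whole data frame and complete.cases() for a logical row filter. A minimal sketch with made-up data:

```r
# Hypothetical data frame with two incomplete rows
df <- data.frame(id = 1:4, score = c(10, NA, 8, NA))
na.omit(df)               # keeps only the complete rows (1 and 3)
df[complete.cases(df), ]  # same result via a logical row index
```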
We use the following sample dataset here:
library(foreign)
Panel <- read.dta("http://dss.princeton.edu/training/Panel101.dta")
The data overview:
We can use the summary() function to look at the summary statistics:
summary(Panel)
country year y y_bin x1 x2 x3 opinion op
A:10 Min. :1990 Min. :-7.863e+09 Min. :0.0 Min. :-0.5676 Min. :-1.6218 Min. :-1.16539 Str agree:20 Min. :0.0
B:10 1st Qu.:1992 1st Qu.: 2.466e+08 1st Qu.:1.0 1st Qu.: 0.3290 1st Qu.:-1.2156 1st Qu.:-0.07931 Agree :15 1st Qu.:0.0
C:10 Median :1994 Median : 1.898e+09 Median :1.0 Median : 0.6413 Median :-0.4621 Median : 0.51419 Disag :19 Median :0.5
D:10 Mean :1994 Mean : 1.845e+09 Mean :0.8 Mean : 0.6480 Mean : 0.1339 Mean : 0.76185 Str disag:16 Mean :0.5
E:10 3rd Qu.:1997 3rd Qu.: 3.372e+09 3rd Qu.:1.0 3rd Qu.: 1.0958 3rd Qu.: 1.6078 3rd Qu.: 1.15486 3rd Qu.:1.0
F:10 Max. :1999 Max. : 8.941e+09 Max. :1.0 Max. : 1.4464 Max. : 2.5303 Max. : 7.16892 Max. :1.0
G:10
We can see that for numeric variables, the summary() command reports the mean, median, minimum, maximum, and quartiles.
For string or categorical variables, it reports the frequency of each category.
We can also use the glimpse() function from the 'dplyr' package to look at an overview of the dataset:
library(dplyr)
glimpse(Panel)
Rows: 70
Columns: 9
$ country <fct> A, A, A, A, A, A, A, A, A, A, B, B, B, B, B, B, B, B, B, B, C, C, C, C, C, C, C, C, C, C, D, D, D, D, D, D, D, D, D, D, E, E, …
$ year <int> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 1990, …
$ y <dbl> 1342787840, -1899660544, -11234363, 2645775360, 3008334848, 3229574144, 2756754176, 2771810560, 3397338880, 39770336, -5934699…
$ y_bin <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, …
$ x1 <dbl> 0.27790365, 0.32068470, 0.36346573, 0.24614404, 0.42462304, 0.47721413, 0.49980500, 0.05162839, 0.36641079, 0.39584252, -0.081…
$ x2 <dbl> -1.1079559, -0.9487200, -0.7894840, -0.8855330, -0.7297683, -0.7232460, -0.7815716, -0.7048455, -0.6983712, -0.6431540, 1.4251…
$ x3 <dbl> 0.28255358, 0.49253848, 0.70252335, -0.09439092, 0.94613063, 1.02968037, 1.09228814, 1.41590083, 1.54872274, 1.79419804, 0.023…
$ opinion <fct> Str agree, Disag, Disag, Disag, Disag, Str agree, Disag, Str agree, Disag, Str disag, Agree, Str agree, Agree, Str disag, Disa…
$ op <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, …
We can see the data type of each column.
Here 'dbl' means double, i.e., numeric. We can also see that the country and opinion columns are already treated as factors ('fct').
Frequency tables are another common way to examine the data we have.
We can use table() to display the frequencies of the factor levels of a categorical variable, using 'opinion' as an example:
table(Panel$opinion)
Str agree Agree Disag Str disag
20 15 19 16
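If we want shares rather than raw counts, we can wrap the same kind of table in prop.table(). The sketch below rebuilds the opinion counts shown above so that it runs on its own:

```r
# Recreate the opinion frequencies from the table above (20, 15, 19, 16)
opinion <- factor(c(rep("Str agree", 20), rep("Agree", 15),
                    rep("Disag", 19), rep("Str disag", 16)))
prop.table(table(opinion))            # proportions instead of frequencies
round(prop.table(table(opinion)), 2)  # rounded for readability
```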
We can also combine two factor variables to create crosstabs.
table(Panel$opinion, Panel$country)
A B C D E F G
Str agree 3 2 2 4 2 5 2
Agree 0 3 4 1 3 3 1
Disag 6 2 2 4 1 1 3
Str disag 1 3 2 1 4 1 4
Let's create a frequency plot using the 'ggplot2' package:
library(ggplot2)
library(tidyverse)
p <- Panel %>%
  select(opinion) %>%
  ggplot(aes(x = opinion))
p + geom_bar(fill = "darkblue", na.rm = TRUE) +
  xlab("Opinions") +
  ylab("Frequency")
The frequency plot of opinions looks like this:
Also, it's possible to plot two factor variables together.
p1 <- Panel %>%
  select(opinion, country) %>%
  ggplot(aes(x = opinion, fill = country))
p1 + geom_bar(position = "dodge", na.rm = TRUE) +
  xlab("Opinions") +
  ylab("Frequency") +
  theme_classic()
The plot:
Sometimes we want to see the summary statistics by group. This can be done using 'dplyr' package in R.
Grouping
Note that this package uses the pipe operator '%>%', which passes the result of the expression on its left to the function on its right, so that multiple steps can be chained together.
library(dplyr)
Panel %>%
  group_by(country) %>%
  summarise(Mean = mean(x1, na.rm = TRUE))
# A tibble: 7 × 2
  country  Mean
  <fct>   <dbl>
1 A       0.342
2 B       0.514
3 C       1.24
4 D       0.193
5 E       0.536
6 F       0.626
7 G       1.09
In this example, we calculated the average value of x1 throughout all these years for each country.
Note that the result is a tibble instead of a data frame. If we want to keep it for later use, we can assign the result to a name.
country_x1 <- Panel %>%
  group_by(country) %>%
  summarise(Mean = mean(x1, na.rm = TRUE))
Then it appears in the R environment.
This is a clean data frame that can be used later. However, it does not carry any of the other information. If we want to add the country mean back to the original dataset, we can use the mutate() function. mutate() creates new columns or modifies existing ones in a data frame based on specified transformations.
Panel_withcountryx1 <- Panel %>%
  group_by(country) %>%
  mutate(Mean = mean(x1, na.rm = TRUE))
Now we can see that the mean appears in the last column.
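Because mutate() keeps all the original columns, it is also handy for derived variables. As a sketch (continuing with the same Panel dataset), we can center x1 on its country mean in one chain; the column name 'x1_centered' is our own choice:

```r
library(foreign)
library(dplyr)
Panel <- read.dta("http://dss.princeton.edu/training/Panel101.dta")
Panel_centered <- Panel %>%
  group_by(country) %>%
  mutate(Mean = mean(x1, na.rm = TRUE),  # country-level mean, repeated on every row
         x1_centered = x1 - Mean) %>%    # deviation of each year from that mean
  ungroup()
```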
Filtering
Another powerful function in the package is filter(). For example, if we are only interested in the observations whose opinion is 'Str agree', we can write:
filtered_mydata<- Panel %>%
filter(opinion == 'Str agree')
Now we have only the observations that we want.
Together
We can also use the filter() and group_by() functions together. One reason the package is powerful is that we can chain multiple functions. For example, suppose we are interested in the average x1 value of each country between 1993 and 1996:
filtered_and_grouped_Panel <- Panel %>%
  filter(year >= 1993 & year <= 1996) %>%
  group_by(country) %>%
  summarise(mean_value = mean(x1))

# Similarly, if we want to add the information back:
mydata_withall <- Panel %>%
  group_by(country) %>%
  filter(year >= 1993 & year <= 1996) %>%
  mutate(Mean_export = mean(x1, na.rm = TRUE))
Now the dataset looks like:
If you have questions or comments about this guide or method, please email data@Princeton.edu.