
Descriptive Statistics in R: Introduction to descriptive statistics


1. Introduction

This guide will focus on descriptive analysis in R and touch on some basic data cleaning prior to your analysis.

2. Importing datasets into R

As a first step, we will look at importing datasets into R. We use different functions for different file types. Note that the examples below show how to import data with code; RStudio also lets you import a dataset with the 'Import Dataset' button located in the upper-right pane of the interface:

2.1 Importing CSV files

A CSV (Comma-Separated Values) file is a widely used data format that separates values with commas or other delimiters. It is one of the most common dataset formats.

We can use the built-in function read.csv() in R to import CSV files. Look at the example here:

mydata6080 = read.csv("http://www.princeton.edu/~otorres/mydata6080.csv",
                      header=TRUE, # We want the headers to be column names
                      stringsAsFactors = FALSE) # We want the strings to be treated as characters

Setting the header argument to TRUE indicates that we want the first row to be used as column names.

We can set the stringsAsFactors argument to either TRUE or FALSE. If we want string variables treated as categorical variables (factors) for later analysis, such as regressions, we set it to TRUE. If we are not sure how the string variables will be used, we can set it to FALSE. Note that it is easy to change the data type later on if you want to.
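If you chose stringsAsFactors = FALSE and later decide that a string variable should be categorical, as.factor() converts it. A minimal sketch with a made-up 'gender' column (not part of mydata6080):

```r
# Toy data frame with a character column (the 'gender' column is hypothetical)
df <- data.frame(gender = c("male", "female", "female"),
                 stringsAsFactors = FALSE)
class(df$gender)   # "character"

# Convert the string variable to a categorical (factor) variable
df$gender <- as.factor(df$gender)
class(df$gender)   # "factor"
levels(df$gender)  # "female" "male"
```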

2.2 Importing Excel files

An XLSX (Excel) file is a popular spreadsheet format that is also widely used.

The example file we are going to use is available here: https://dss.princeton.edu/training/WB_Data.xlsx

The first sheet('Annual Prices (Nominal)') looks like this in the spreadsheet format:

We can use the read_excel() function from the 'readxl' package in R.

Note that this function IS NOT in base R, so we need to install the package first and then load it with library().

# install.packages('readxl')
library(readxl)
mydata_excel <- read_excel("WB_Data.xlsx", # Note: set your working directory first
                      sheet = 'Annual Prices (Nominal)',
                      skip=6)

The sheet argument allows us to specify which sheet we want to import into R, since Excel files can have multiple sheets.

The skip argument indicates that we want to skip the first 6 rows. We can see from the screenshot above that we do not need the information in the first 6 rows of the spreadsheet.

Here is what the data looks like now:

It serves the purpose, but it does not look clean enough.

We can clean it further.

names <- c('Year', 'Coffee_$/kg', 'Wheat_$/mt', 'Beef_$/kg', 'Sugar_$/kg', 'Gold_$/oz')
mydata_excel <- read_excel("WB_Data.xlsx", # Note: set your working directory first
                      sheet = 'Annual Prices (Nominal)',
                      skip=9,
                      col_names = names)

Now it looks like this:

This appears clean enough.

Small trick:

Sometimes it's easier to do the data cleaning in Excel first, and then import the specific sheet that you want into R.

2.3 Importing Stata files

We need to use the 'haven' package.

Note that its read_dta() function IS NOT in base R, so we need to install the package first and then load it with library().

# install.packages("haven") # if this is a new package, we need to install it first
library(haven) # we have to load the package in every new R session

We are using the factor dataset.

mydata_stata <- read_dta('https://dss.princeton.edu/training/factor.dta')

Here is a brief preview:

 

3. Missing values

Addressing missing values is crucial in data analysis as well.

Note that missing answers are coded differently in different datasets. For example, when marking missing values for age, some datasets use negative values, while others use implausibly high numbers (like 999), so we know these answers are actually missing. There are no set rules for how missing values are coded. It is therefore important to read the documentation, pay attention to the minimum and maximum values of your variables, and make sure they make sense.
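As a sketch of how such sentinel codes can be recoded to NA once you have spotted them (the ages below are invented for illustration):

```r
# Invented ages where 999 marks a missing answer
age <- c(34, 27, 999, 45, 999, 52)
max(age)                # 999 -- an implausible maximum is a red flag

age[age == 999] <- NA   # recode the sentinel value to NA
max(age, na.rm = TRUE)  # 52 -- the statistics are no longer distorted
```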

In R, missing values appear as NA, which is an indicator of missingness. To see which values in a vector R recognizes as missing, we can use the is.na() function. It returns a TRUE/FALSE vector with as many elements as the vector we provide.

is.na(mydata_stata$ches_id)
  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [24]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE
 [47] FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [70] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
 [93] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[116] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
[139] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[162] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[185] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
[208] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[231] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

We can see that some of the logical values are true, indicating that there are missing values in this column.
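Because TRUE counts as 1 and FALSE as 0, summing the logical vector is a quick way to count the missing values in a single column. A toy vector stands in for the column here:

```r
# Toy vector standing in for a column such as mydata_stata$ches_id
x <- c(1, NA, 3, NA, 5)
sum(is.na(x))  # 2 missing values
```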

Now look at another column:

is.na(mydata_stata$country_id)
  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [24] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [47] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [70] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [93] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[116] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[139] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[162] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[185] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[208] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[231] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

It appears that there is no missing value in this column.

Sometimes, when the dataset is large, it is hard to tell whether there is any single TRUE in the vector. In that case, we can use the colSums() function to see the missing-value count for every column of the dataset.

colSums(is.na(mydata_stata))
          country_id                party             party_id                  cee              ches_id               family 
                   0                    0                    0                    0                   34                    0 
              cmp_id           parlgov_id            manichean           indivisble          generalwill       peoplecentrism 
                  39                   10                    3                    3                   11                    3 
         antielitism              complex            emotional            lroverall               lrecon          immigration 
                   3                    9                    9                    1                    6                    5 
                  eu             nativism             laworder            lifestyle             intradem         personalised 
                   3                    5                   12                    2                   21                   12 
          nr_experts mean_expert_response 
                   0                    0 

We can see, for example, that the columns 'country_id' and 'party' do not have any missing values, while the columns 'ches_id' and 'cmp_id' have some.

Now suppose we want to calculate the mean of a column using the mean() function from base R, taking the 'complex' column as an example.

mean(mydata_stata$complex) 
[1] NA

It returns NA. The reason is that, as we observed above, this column contains missing values. We can solve this problem by adding the argument na.rm=TRUE.

mean(mydata_stata$complex, na.rm=TRUE)
[1] 4.827535

In this case, the function calculates the average value of the column WITHOUT the missing values.
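The na.rm = TRUE argument works the same way in other base R summary functions, such as median() and sd(). A small illustration with an invented vector:

```r
x <- c(2, 4, NA, 8)
mean(x)                   # NA -- the missing value propagates
mean(x, na.rm = TRUE)     # 4.666667
median(x, na.rm = TRUE)   # 4
sd(x, na.rm = TRUE)       # standard deviation, ignoring the NA
```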

4. Descriptive statistics

We use the following sample dataset here:

library(foreign)
Panel <- read.dta("http://dss.princeton.edu/training/Panel101.dta")

The data overview:

We can use the summary() function to look at the summary statistics:

summary(Panel)
 country      year            y                  y_bin           x1                x2                x3                opinion         op     
 A:10    Min.   :1990   Min.   :-7.863e+09   Min.   :0.0   Min.   :-0.5676   Min.   :-1.6218   Min.   :-1.16539   Str agree:20   Min.   :0.0  
 B:10    1st Qu.:1992   1st Qu.: 2.466e+08   1st Qu.:1.0   1st Qu.: 0.3290   1st Qu.:-1.2156   1st Qu.:-0.07931   Agree    :15   1st Qu.:0.0  
 C:10    Median :1994   Median : 1.898e+09   Median :1.0   Median : 0.6413   Median :-0.4621   Median : 0.51419   Disag    :19   Median :0.5  
 D:10    Mean   :1994   Mean   : 1.845e+09   Mean   :0.8   Mean   : 0.6480   Mean   : 0.1339   Mean   : 0.76185   Str disag:16   Mean   :0.5  
 E:10    3rd Qu.:1997   3rd Qu.: 3.372e+09   3rd Qu.:1.0   3rd Qu.: 1.0958   3rd Qu.: 1.6078   3rd Qu.: 1.15486                  3rd Qu.:1.0  
 F:10    Max.   :1999   Max.   : 8.941e+09   Max.   :1.0   Max.   : 1.4464   Max.   : 2.5303   Max.   : 7.16892                  Max.   :1.0  
 G:10         

We can see that for numeric variables, summary() reports the mean, median, minimum and maximum values, and the quartiles.

For string or categorical variables, it reports the frequency of each category.

We can also use the glimpse() function from the 'dplyr' package to look at an overview of the dataset:

library(dplyr)
glimpse(Panel)
Rows: 70
Columns: 9
$ country <fct> A, A, A, A, A, A, A, A, A, A, B, B, B, B, B, B, B, B, B, B, C, C, C, C, C, C, C, C, C, C, D, D, D, D, D, D, D, D, D, D, E, E, …
$ year    <int> 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 1990, …
$ y       <dbl> 1342787840, -1899660544, -11234363, 2645775360, 3008334848, 3229574144, 2756754176, 2771810560, 3397338880, 39770336, -5934699…
$ y_bin   <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, …
$ x1      <dbl> 0.27790365, 0.32068470, 0.36346573, 0.24614404, 0.42462304, 0.47721413, 0.49980500, 0.05162839, 0.36641079, 0.39584252, -0.081…
$ x2      <dbl> -1.1079559, -0.9487200, -0.7894840, -0.8855330, -0.7297683, -0.7232460, -0.7815716, -0.7048455, -0.6983712, -0.6431540, 1.4251…
$ x3      <dbl> 0.28255358, 0.49253848, 0.70252335, -0.09439092, 0.94613063, 1.02968037, 1.09228814, 1.41590083, 1.54872274, 1.79419804, 0.023…
$ opinion <fct> Str agree, Disag, Disag, Disag, Disag, Str agree, Disag, Str agree, Disag, Str disag, Agree, Str agree, Agree, Str disag, Disa…
$ op      <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, …

We can see the datatype of each column.

Here 'dbl' means the column is numeric (double). We can also see that the country and opinion columns are already treated as factors.

5. Frequency and plots

Frequency tables are another common way of examining the data.

We can use table() to display the frequencies of the factor levels of a categorical variable, using opinion as an example.

table(Panel$opinion)

Str agree     Agree     Disag Str disag 
       20        15        19        16 

We can also combine two factor variables to create crosstabs.

 table(Panel$opinion, Panel$country)         
            A B C D E F G
  Str agree 3 2 2 4 2 5 2
  Agree     0 3 4 1 3 3 1
  Disag     6 2 2 4 1 1 3
  Str disag 1 3 2 1 4 1 4
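A crosstab of counts can also be turned into proportions with prop.table(); margin = 1 gives row proportions and margin = 2 column proportions. A sketch with a toy crosstab (invented values, not the Panel data):

```r
# Toy crosstab of invented opinion/country pairs
tab <- table(opinion = c("Agree", "Agree", "Disag"),
             country = c("A", "B", "A"))
prop.table(tab)              # each cell as a share of the grand total
prop.table(tab, margin = 2)  # each cell as a share of its column
```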

Let's create a frequency plot using the ggplot2 package:

library(ggplot2)
library(tidyverse)
p <- Panel %>% 
  select(opinion) %>% 
  ggplot(aes(x = opinion))
p + geom_bar(fill = "darkblue", na.rm = TRUE) +
  xlab("Opinions") +
  ylab("Frequency")

The frequency plot of opinions looks like this:

Also, it's possible to plot two factor variables together.

p1 <- Panel %>% 
  select(opinion, country) %>% 
  ggplot(aes(x = opinion, fill = country))
p1 + geom_bar(position = "dodge", na.rm = TRUE) +
  xlab("Opinions") +
  ylab("Frequency") +
  theme_classic() 

The plot:

6. Grouping and filtering using dplyr package

Sometimes we want to see the summary statistics by group. This can be done using 'dplyr' package in R.

Grouping

Note that this package uses the pipe operator '%>%', which passes the result of one function on to the next.

library(dplyr)
Panel %>%
  group_by(country)%>%
  summarise(Mean=mean(x1, na.rm=T))
# A tibble: 7 × 2
  country  Mean
  <fct>   <dbl>
1 A       0.342
2 B       0.514
3 C       1.24 
4 D       0.193
5 E       0.536
6 F       0.626
7 G       1.09 

In this example, we calculated the average value of x1 throughout all these years for each country.

Note that the result is a tibble, not a plain data frame. If we want to keep it for later use, we can assign it a name at the beginning.

country_x1<-Panel %>%
  group_by(country)%>%
  summarise(Mean=mean(x1, na.rm=T))

Then it appears in the R environment.
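If a plain data frame is needed (some older functions expect one), as.data.frame() converts the tibble. A sketch with invented numbers standing in for the real summary:

```r
library(dplyr)

# Invented stand-in for the group_by()/summarise() result
country_x1 <- tibble(country = c("A", "B"), Mean = c(0.342, 0.514))
class(country_x1)      # "tbl_df" "tbl" "data.frame"

country_x1_df <- as.data.frame(country_x1)
class(country_x1_df)   # "data.frame"
```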

This is a clean data frame that can be used later. However, it does not carry any of the other information. If we want to add the country means back to the original dataset, we can use the mutate() function. mutate() creates new columns, or modifies existing ones, in a data frame based on specified transformations.

Panel_withcountryx1<-Panel %>%
  group_by(country)%>%
  mutate(Mean=mean(x1, na.rm=T))

Now we can see that the mean appears as the last column.

Filtering

Another powerful function in the package is filter(). For example, if we are only interested in the observations whose opinion is 'Str agree', we can write:

filtered_mydata<- Panel %>%
  filter(opinion == 'Str agree')

Now we have only the observations that we want.

Together

We can also use the filter() and group_by() functions together. One reason the package is powerful is that we can chain multiple functions. For example, suppose we are interested in the average x1 value for each country between 1993 and 1996:

filtered_and_grouped_Panel<-Panel%>%
  filter(year>=1993 & year <= 1996) %>%
  group_by(country) %>%
  summarise(mean_value=mean(x1))

# Similarly, if we want to add the information back
mydata_withall<-Panel %>%
  group_by(country)%>%
  filter(year>=1993 & year <= 1996) %>%
  mutate(Mean_export=mean(x1, na.rm=T))

Now the dataset looks like:


Comments or Questions?

If you have questions or comments about this guide or method, please email data@Princeton.edu.
