Data Visualization in R: Introduction

An overview of data visualization in R

Introduction

This guide is designed to introduce fundamental techniques for creating effective visualizations using R, a critical skill in presenting data analysis findings clearly and succinctly. Data visualization serves as an indispensable tool in data exploration, inference making, and results presentation. It transforms complex data sets into intuitive graphical representations, facilitating a deeper understanding of the data and enabling the communication of insights in a universally comprehensible manner.

In this guide, we will look at data visualization both in base R and with add-on packages such as 'ggplot2'. R has many other useful packages for more specialized charts; the R Graph Gallery collects examples of them: https://r-graph-gallery.com/

1.1 Line charts

Line graphs display information as a series of data points connected by straight-line segments. They are primarily used for visualizing data trends over intervals or time.

Line graphs are good at showing changes and trends over time, making it easy to compare multiple series simultaneously. They are intuitive to understand and can effectively communicate the direction and pace of trends. However, they can become confusing when overloaded with too many data series or when the data points are too densely packed.

We will look at both base R plots and ggplot2 plots. 'ggplot2' is a powerful visualization package for R that lets users create a wide variety of charts for data exploration and presentation.

We will use R's built-in dataset 'mtcars', a data frame with fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.

First, let's see how to plot a line graph in base R. Because 'mtcars' has no time variable, the example uses each car's row position as a stand-in for time.

```r
# Simulating a trend line for MPG values in their given order
plot(mtcars$mpg, type = "o", col = "blue",
     xlab = "Car Index", ylab = "Miles Per Gallon",
     main = "Simulated Trend of MPG Over 'Time'")
```

Now, let's look at how to plot this graph using ggplot2.

Note that if the package is not yet installed, you have to install it first.

```r
install.packages("ggplot2") # Install the ggplot2 package; needed only once per computer
library(ggplot2) # Load the ggplot2 package; needed in every new R session
```

Now see the code:

```r
mtcars$Index <- 1:nrow(mtcars) # Create an index from the row order

ggplot(mtcars, aes(x = Index, y = mpg)) +
  geom_line(color = "blue") +
  geom_point(color = "red") +
  theme_minimal() +
  labs(title = "Simulated Trend of MPG Over 'Time'",
       x = "Car Index",
       y = "Miles Per Gallon")
```

Aside from plotting a trend, we can also use a line graph to see the relationship between two variables. The data should be sorted by the x variable first so that the line runs from left to right.

This is the example from base R:

```r
# Sorting mtcars by wt
mtcars_sorted <- mtcars[order(mtcars$wt), ]

# Line graph showing MPG vs. Weight
plot(mtcars_sorted$wt, mtcars_sorted$mpg, type = "o", col = "red",
     xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon",
     main = "MPG vs. Car Weight")
```

This is the example from ggplot2:

```r
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_line(aes(group = 1), color = "red") + # Ensure a single group for the line
  geom_point(color = "blue") +
  theme_minimal() +
  labs(title = "MPG vs. Car Weight",
       x = "Weight (1000 lbs)",
       y = "Miles Per Gallon")
```
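The start of this section notes that line graphs make it easy to compare multiple series. The sketch below is one possible way to do that in ggplot2. It is an illustration only and assumes a different built-in dataset, 'Orange' (trunk circumference of five orange trees measured at several ages), because 'mtcars' has no repeated measurements; the object names `n_series` and `p_multi` are made up for this example.

```r
library(ggplot2)

# Assumption: 'Orange' is used purely for illustration; it ships with R
# and holds one growth series per tree (column Tree).
n_series <- nlevels(Orange$Tree) # number of separate lines that will be drawn

p_multi <- ggplot(Orange, aes(x = age, y = circumference, color = Tree)) +
  geom_line() +  # one line per tree, because the color mapping defines the groups
  geom_point() +
  theme_minimal() +
  labs(title = "Growth of Orange Trees",
       x = "Age (days)",
       y = "Trunk Circumference (mm)",
       color = "Tree")
p_multi
```

With only five series the plot stays readable; with many more, the lines start to overlap and the caution above about overloaded line graphs applies.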

1.2 Scatter Plot

Scatter plots display values for two variables for a set of data so that we can get an idea of the trend or correlation.

Scatter plots are useful for identifying correlations, trends, and outliers in data. They show the strength and direction of a relationship between two variables, which makes them a standard tool in exploratory data analysis and regression analysis. Note that while scatter plots are excellent for bivariate analysis, they can become cluttered and less effective with large datasets. Also, a relationship seen in a scatter plot does not by itself imply causation.

The base R example graph:

```r
# Base R
# Scatter plot of MPG vs Weight
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles Per Gallon",
     main = "MPG vs. Car Weight",
     pch = 19, col = "blue")
```

The ggplot example:

```r
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  theme_minimal() +
  labs(title = "Scatter plot of MPG vs Car Weight",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon")
```
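To put a number on the strength and direction that the scatter plot shows, one option is the Pearson correlation coefficient from base R's `cor()`. This is a small sketch to accompany the plots above, not a full analysis; the variable name `r` is arbitrary.

```r
# Pearson correlation between car weight and fuel economy
r <- cor(mtcars$wt, mtcars$mpg)
round(r, 2) # about -0.87: heavier cars tend to have lower MPG
```

A value near -1 or 1 indicates a strong linear relationship, and the sign gives its direction. As with the plot itself, the coefficient describes association, not causation.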

1.3 Regression Plots

We can plot the regression line, indicating the linear relationship between two variables.

```r
# Basic regression line without confidence interval
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_minimal() +
  labs(title = "Regression Line of MPG vs Car Weight",
       subtitle = "Without confidence interval",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon")
```

We can also plot the confidence interval around the fitted line (95% by default) as a shaded area.

```r
# Regression line with confidence interval
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE) +
  theme_minimal() +
  labs(title = "Regression Line of MPG vs Car Weight",
       subtitle = "With confidence interval",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon")
```
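The line drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit of y on x. If you want the numbers behind the picture, a minimal sketch is to fit the same model with base R's `lm()`; the object name `fit` is just an example.

```r
# Fit the same straight line that geom_smooth(method = "lm") draws
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit) # intercept about 37.29, slope about -5.34 MPG per 1000 lbs

# Fitted values with 95% confidence limits for the first few cars;
# these limits correspond to the shaded band in the plot
head(predict(fit, interval = "confidence"), 3)
```

`summary(fit)` gives standard errors and R-squared if you need to report the fit alongside the plot.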

2.1 Histogram

A histogram represents the distribution of numerical data by showing the number of data points that fall within each of a series of value ranges, known as bins. It resembles a bar chart, but instead of plotting categories it groups a continuous variable into bins and displays the frequency within each bin, which makes it well suited to showing the shape and spread of continuous data. We often plot the histogram of a single continuous variable to get a rough idea of its distribution (for example, whether it looks approximately normal).

Base R example:

```r
# Histogram of MPG
hist(mtcars$mpg, breaks = 10, col = "skyblue",
     xlab = "Miles Per Gallon", main = "Histogram of MPG")
```

ggplot example:

```r
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5, fill = "yellow", color = "black") +
  theme_minimal() +
  labs(title = "Histogram of MPG",
       x = "Miles per Gallon",
       y = "Frequency")
```
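The two histograms above will not look identical, because the bins differ: the base R call suggests about 10 breaks, while the ggplot2 call fixes the bin width at 5. One way to see exactly which bins base R chose is to compute the histogram without drawing it. This is a small sketch; `h` is an arbitrary name.

```r
# Compute the histogram without plotting to inspect the bins
h <- hist(mtcars$mpg, breaks = 10, plot = FALSE)

h$breaks # bin edges R actually used (breaks = 10 is only a suggestion)
h$counts # number of cars falling in each bin
```

Trying a few different values of `breaks` or `binwidth` is worthwhile, since the apparent shape of a distribution can change with the bin choice.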

2.2 Box plot

Box plots (or box-and-whisker plots) summarize data using a five-number summary: minimum, first quartile (Q1), median (not the mean), third quartile (Q3), and maximum. They also highlight outliers: in both base R and ggplot2, points lying more than 1.5 times the interquartile range beyond the box are drawn individually by default, and the whiskers then stop at the most extreme non-outlying values. Box plots are an efficient way to depict a distribution, giving insight into its central tendency, variability, and skewness. They are particularly useful for comparing distributions across different categories. Note that they do not show the shape of a distribution as precisely as density plots or histograms, and they can be misread by those unfamiliar with their construction.

Base R example:

```r
# Box plot for MPG across different numbers of cylinders
# Convert cyl to a factor if it's not already
mtcars$cyl <- as.factor(mtcars$cyl)

# Box plot of MPG by number of cylinders
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of Cylinders", ylab = "Miles Per Gallon",
        main = "MPG by Number of Cylinders",
        col = rainbow(length(unique(mtcars$cyl))))
```

ggplot example:

```r
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Box Plot of MPG by Cylinder Count",
       x = "Number of Cylinders",
       y = "Miles per Gallon")
```
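The five-number summary that a box plot draws can also be computed directly, which helps when reading the plot. The sketch below uses base R's `fivenum()` and `boxplot.stats()`; the object names are illustrative. Note that `fivenum()` uses Tukey's hinges, which can differ slightly from the quartiles returned by `quantile()`.

```r
# Five-number summary of MPG: min, lower hinge, median, upper hinge, max
five <- fivenum(mtcars$mpg)
five

# The same summary within each cylinder group
five_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, fivenum)
five_by_cyl

# Values that a box plot of all MPG values would flag as outliers
# (more than 1.5 * IQR beyond the hinges)
outliers <- boxplot.stats(mtcars$mpg)$out
outliers
```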

2.3 Violin Plots

Violin plots combine the summary statistics of box plots with the density distributions of kernel density plots, showing data distribution shapes around the median. These plots are excellent for comparing distributions between groups, providing a deep insight into the data's spread, skewness, and multimodality. They can highlight differences in distribution shapes and central tendencies across categories.

Note that for small sample sizes, violin plots can be misleading, as they rely on kernel density estimation. Their interpretation is also not as straightforward as box plots for non-technical audiences.

Comparison with box plots:

While both box plots and violin plots are used to visualize and summarize statistical distributions, violin plots provide a more detailed picture of the data's structure, including its density. Box plots, on the other hand, offer a more straightforward summary focused on the range, quartiles, and outliers. The choice between them depends on the specific needs of the analysis: whether a quick summary is sufficient or a detailed distribution view is necessary.

```r
# Violin plots with a narrow box plot drawn inside each violin
ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_violin(fill = "skyblue") +
  geom_boxplot(width = 0.1, fill = "white") +
  labs(title = "MPG Distribution by Cylinder Count",
       x = "Number of Cylinders",
       y = "Miles Per Gallon")
```
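Because violin shapes come from kernel density estimates, it is worth checking how many observations sit behind each violin, as cautioned above. The sketch below is one way to do that and to overlay the raw points so that readers can judge the sample size themselves; `group_sizes` and `p_violin` are example names.

```r
library(ggplot2)

# How many cars are in each cylinder group?
group_sizes <- table(mtcars$cyl)
group_sizes # 11 four-cylinder, 7 six-cylinder, 14 eight-cylinder cars

# Violin plot with the individual observations drawn on top
p_violin <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_violin(fill = "skyblue") +
  geom_jitter(width = 0.05, height = 0, alpha = 0.6) + # raw data points
  labs(title = "MPG Distribution by Cylinder Count",
       x = "Number of Cylinders",
       y = "Miles Per Gallon")
p_violin
```

With only 7 six-cylinder cars, the smooth outline of that violin suggests more detail than the data can support, so showing the points is a reasonable safeguard.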

2.4 Heat Map

Heatmaps display data in a matrix as colors and are particularly useful for visualizing the magnitude of phenomena across two dimensions. They can represent various data types, including correlations, missing data patterns, or any matrix data. They can also reveal patterns or gradients within large datasets, making them ideal for spotting trends, clusters, and outliers. They can efficiently summarize complex information in an intuitive format, allowing for quick comparisons across categories and variables.

However, when datasets are very large, heatmaps can become cluttered or lose detail. The choice of color scale is critical and can significantly impact interpretation.

```r
library(reshape) # For melting data frames

# Example data: Volcano dataset (a matrix of topographic information)
# melt() from 'reshape' names the row and column indices X1 and X2
volcano_melted <- melt(volcano)

ggplot(volcano_melted, aes(X1, X2, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red") +
  labs(title = "Volcano Topography",
       x = "X coordinate", y = "Y coordinate", fill = "Height")
```
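Correlation matrices, mentioned above, are another common input for heatmaps. The following is a rough sketch under a few assumptions: the five 'mtcars' columns are picked arbitrarily for illustration, the matrix is reshaped with base R (`as.table()` plus `as.data.frame()`) instead of 'reshape', and a diverging color scale centered at zero is used because correlations run from -1 to 1.

```r
library(ggplot2)

# Assumption: these five numeric columns are chosen only as an example
num_vars <- mtcars[, c("mpg", "wt", "hp", "disp", "qsec")]
cor_mat <- cor(num_vars)

# Reshape the matrix to long format: one row per pair of variables
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("Var1", "Var2", "Correlation")

p_cor <- ggplot(cor_long, aes(Var1, Var2, fill = Correlation)) +
  geom_tile() +
  # Diverging scale: blue for negative, white near zero, red for positive
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap of Selected mtcars Variables",
       x = NULL, y = NULL)
p_cor
```

A sequential scale such as the white-to-red gradient above would hide the difference between strong negative and weak correlations here, which is one concrete case of the color scale changing the interpretation.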

2.5 Time-Series

Time series plots display data points over time, allowing for the visualization of trends, cycles, and seasonal variations. These plots are crucial for analyzing temporal data, highlighting trends, detecting outliers, and identifying seasonal effects. They are instrumental in forecasting and understanding historical data behavior.

We will use simulated data in these examples.

Note that for time-series plots, the x-axis variable should be stored as a date or date-time class (such as Date or POSIXct) rather than as text, so that R spaces and labels the axis correctly.

```r
# Generating example time series data
time_series_data <- data.frame(
  Date = seq(as.Date("2020-01-01"), by = "month", length.out = 24),
  Value = cumsum(runif(24, min = -5, max = 5))
)

# linewidth sets line thickness in ggplot2 >= 3.4.0 (use size in older versions)
ggplot(time_series_data, aes(Date, Value)) +
  geom_line(color = "blue", linewidth = 1) +
  geom_point(color = "red", size = 2) +
  labs(title = "Example Time Series Data", x = "Date", y = "Value")
```
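In practice, dates often arrive as text, for example when read from a CSV file. The sketch below shows one way to convert them with base R's `as.Date()` before plotting. The small data frame `raw` and its values are made up for illustration.

```r
# Hypothetical data where the dates were read in as character strings
raw <- data.frame(
  Date = c("2020-01-01", "2020-02-01", "2020-03-01"),
  Value = c(1.2, 0.8, 2.5),
  stringsAsFactors = FALSE
)
class(raw$Date) # "character": would be treated as categories on an axis

# Convert to the Date class, stating the format explicitly
raw$Date <- as.Date(raw$Date, format = "%Y-%m-%d")
class(raw$Date) # "Date"

# Other layouts need a matching format string, e.g. day/month/year
d <- as.Date("15/03/2020", format = "%d/%m/%Y")
d
```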

We can add vertical lines in time-series plots to indicate important event dates so that we can see the trend before and after the event.

```r
# Generate example time series data
set.seed(123) # For reproducibility
dates <- seq(as.Date("2020-01-01"), by = "month", length.out = 24)
values <- cumsum(runif(24, min = -10, max = 10)) # Simulated values

time_series_data <- data.frame(Date = dates, Value = values)

# Plotting with a vertical line to indicate a specific event date
# linewidth sets line thickness in ggplot2 >= 3.4.0 (use size in older versions)
ggplot(time_series_data, aes(Date, Value)) +
  geom_line(color = "blue", linewidth = 1) + # Draw the line for the time series
  geom_point(color = "red", size = 2) +      # Highlight each data point
  geom_vline(xintercept = as.numeric(as.Date("2021-01-01")),
             linetype = "dashed", color = "red", linewidth = 1) + # Event date
  labs(title = "Example Time Series Data with Event Indicator",
       x = "Date", y = "Value") +
  theme_minimal()
```

3.1 Faceted Plots

Faceted plots, or small multiples, divide data into subsets based on a categorical variable and plot each subset in its own panel. By default the panels share the same axis scales, which makes direct comparison easier. Faceting is powerful for comparing patterns across the levels of a categorical variable and for seeing how relationships between variables differ across subsets. It simplifies comparisons across groups without overcrowding a single plot. Choose the number of facets with care, though: with too many categories the panels become small and hard to read, and the layout of the panels also needs thought to keep the figure clear.

```r
ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_wrap(~cyl) +
  labs(title = "MPG vs. Weight by Cylinder Count",
       x = "Weight (1000 lbs)", y = "Miles Per Gallon")
```
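The layout considerations mentioned above can be handled through arguments of `facet_wrap()`. The sketch below is one possible arrangement, not a recommendation for every dataset: it stacks the panels in a single column and labels each panel with both the variable name and its value. `n_panels` and `p_facets` are example names.

```r
library(ggplot2)

# Number of panels the plot will need (one per cylinder count)
n_panels <- length(unique(mtcars$cyl))

p_facets <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_wrap(~cyl,
             ncol = 1,                # stack the panels in one column
             labeller = label_both) + # panel titles such as "cyl: 4"
  # Add scales = "free_y" inside facet_wrap() to give each panel its own
  # y-axis, at the cost of losing direct comparability between panels
  labs(title = "MPG vs. Weight by Cylinder Count",
       x = "Weight (1000 lbs)", y = "Miles Per Gallon")
p_facets
```

Stacking panels vertically keeps the x-axis aligned, which suits comparisons along weight; placing them side by side (the default for three panels) aligns the y-axis instead.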

3.2 Plotting Different Groups

We can plot different groups together in one plot. We can use different colors to indicate different groups.

```r
# Convert cyl to a factor for color grouping
mtcars$cyl <- as.factor(mtcars$cyl)

# Plotting with ggplot2 by group (cyl) using color
ggplot(mtcars, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(size = 3) +
  theme_minimal() +
  labs(title = "MPG vs. Car Weight by Number of Cylinders",
       x = "Weight (1000 lbs)",
       y = "Miles per Gallon",
       color = "Number of Cylinders") +
  scale_color_manual(values = c("4" = "blue", "6" = "green", "8" = "red")) # Customize colors
```

We can also plot based on more than one grouping variable. This is an example of a bar plot grouped by both 'cyl' and 'gear'.

```r
library(dplyr) # Provides %>%, group_by(), summarise(), and mutate()

# First, summarize the data
mtcars_summary <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop") %>%
  mutate(cyl = as.factor(cyl), gear = as.factor(gear)) # Ensure cyl and gear are treated as factors

# Now, create the bar plot
ggplot(mtcars_summary, aes(x = cyl, y = avg_mpg, fill = gear)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Average MPG by Cylinder and Gear",
       x = "Number of Cylinders",
       y = "Average Miles Per Gallon",
       fill = "Number of Gears") +
  scale_fill_brewer(palette = "Pastel1") # Use a color palette for better visual distinction
```

3.3 Error Bars

In this section, we give an example of a bar plot of the average mpg in different groups, with error bars. Error bars indicate the variability or uncertainty of each estimate; here they extend one standard error above and below each group mean.

```r
# First, calculate means, standard deviations, and standard errors
library(dplyr)
mpg_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), sd_mpg = sd(mpg), se_mpg = sd_mpg / sqrt(n()))

# Bar plot for mean MPG with error bars (using standard error)
ggplot(mpg_summary, aes(x = factor(cyl), y = mean_mpg)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  geom_errorbar(aes(ymin = mean_mpg - se_mpg, ymax = mean_mpg + se_mpg), width = .2) +
  theme_minimal() +
  labs(title = "Mean MPG by Cylinder Count with Error Bars",
       x = "Number of Cylinders",
       y = "Mean Miles per Gallon")
```
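Error bars of one standard error are only one convention; readers sometimes expect a 95% confidence interval instead. The sketch below shows how such limits could be computed in base R from the same group means, assuming a t-based interval is appropriate for these small groups. The names `ci`, `lower`, and `upper` are illustrative, and the resulting columns could be passed to `geom_errorbar()` in place of the standard-error limits.

```r
# Group means, standard deviations, and sizes by cylinder count
means <- tapply(mtcars$mpg, mtcars$cyl, mean)
sds   <- tapply(mtcars$mpg, mtcars$cyl, sd)
ns    <- tapply(mtcars$mpg, mtcars$cyl, length)

ses    <- sds / sqrt(ns)          # standard error of each mean
t_crit <- qt(0.975, df = ns - 1)  # t multiplier for a 95% interval, per group

ci <- data.frame(
  cyl      = names(means),
  mean_mpg = as.vector(means),
  lower    = as.vector(means - t_crit * ses),
  upper    = as.vector(means + t_crit * ses)
)
ci
```

Whichever convention you use, state it in the caption or title, since standard deviations, standard errors, and confidence intervals give bars of quite different lengths.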

Data Consultant

Yufei Qin
Contact:
Firestone Library, A.12F.2
6092582519