Data viz basics with R and ggplot2

Introduction

How to visualize your data with the R package ggplot2
Originally developed by Hadley Wickham, now developed and maintained by a team of experts
Good for everything from quick exploratory plots to carefully formatted publication-quality graphics
This lesson assumes beginner-level knowledge of R (functions, data frames)

How to follow the course

Slides and text version of lessons are online
Fill in code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind

Learning objectives

At the end of this course, you will know …

what the “grammar of graphics” is and how ggplot2 uses it
how to map a variable in a data frame to a graphical element in a plot using aes
how to use different geoms to make scatterplots, boxplots, histograms, density plots, and barplots
how to add trendlines to a plot
how to compute summary statistics and plot them with stat functions
how to make plots for different subsets of your data using facets
how to change the style and appearance of your plots

What is the grammar of graphics?

Theory underlying ggplot2 is the “grammar of graphics” (Leland Wilkinson)
Formal way of mapping variables in a dataset to graphical elements of a plot
For example, a dataset has age, weight, and sex of many individuals. Make a scatterplot:
- age variable in the data maps to the x axis of the dataset
- weight variable maps to the y axis
- sex variable maps to the color of the points

How does a ggplot work?

Image excerpted from ggplot2 cheatsheet, describing grammar of graphics

Template for making a ggplot

Image excerpted from ggplot2 cheatsheet, with function template

The data

Three datasets from Kaggle

World Happiness Report 2015
Nutritional values of 77 brands of cereal
Summer Olympics medals awarded in track and field (athletics) from 1896-2014. This is a subset of the original dataset

Loading packages and data

We will use only the ggplot2 package in this tutorial
Use read.csv() to read in each of the three datasets from the web (or from datasets directory if you are on Posit Cloud server)

library(ggplot2)

WHR <- read.csv('https://usda-ree-ars.github.io/SEAStats/data_viz_basics/datasets/WHR_2015.csv')
cereal <- read.csv('https://usda-ree-ars.github.io/SEAStats/data_viz_basics/datasets/cereal.csv')
olympics <- read.csv('https://usda-ree-ars.github.io/SEAStats/data_viz_basics/datasets/olympics.csv')

You can use head(), summary(), or str() to examine each dataset

Our first ggplot

Scatterplot: does money buy happiness?
WHR dataset has a row for each country
Plot GDP per capita on the x axis and happiness score on the y axis
Start by calling the ggplot() function with data argument

ggplot(data = WHR)

Add aesthetic mapping

Specify which columns of the data will map to the x and y elements of the plot

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score))

We see the two axes and coordinate system
But no data appears

Add a geom layer

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + geom_point()

Notice we use + to add each new piece of the plotting code.

Modifying the plot: changing the geom

Changing the geom plots the same data in a different way
Try replacing geom_point() with geom_line()

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + geom_line()

Doesn’t make a lot of sense in this case but it is great for time series data

We can add multiple geoms if we want
Plot a smoothing trendline (geom_smooth()) overlaid on the scatterplot

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point() +
  geom_smooth()

The geoms are drawn in the order they are added to the plot

Putting each piece on its own line makes the code easier to read!

Linear trendline

By default, geom_smooth() plots a locally weighted regression with standard error as a shaded area
Make trend linear by specifying method = lm
Get rid of the standard error shading by specifying se = FALSE

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

Modifying the plot: changing the aes

Adding to or changing the aes argument modifies what data are used to plot
Let’s add a color aesthetic to the point plot to color each country’s point by continent

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score, color = Continent)) + 
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

We automatically get a legend. But this also automatically groups the trendline by region as well

If we have multiple geoms, we can add an aes argument to a single geom
For example if we want the points to be colored by continent but a single overall trendline, add aes(color = Continent) inside geom_point()

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point(aes(color = Continent)) +
  geom_smooth(method = lm, se = FALSE)

If we add arguments to the geoms outside aes(), it will modify their appearance without mapping back to the data
For example bigger points (size = 2), and black trendline (color = 'black')

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point(aes(color = Continent), size = 2) +
  geom_smooth(method = lm, se = FALSE, color = 'black')

The default for geom_point is size = 1

Modifying the plot: changing the theme

Themes change the overall look of the plot

happy_gdp_plot <- ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point(aes(color = Continent), size = 2) +
  geom_smooth(method = lm, se = FALSE, color = 'black')

happy_gdp_plot + theme_bw()

theme_bw() changes the default theme to a black-and-white theme.
Note: we can assign the result of ggplot() to an object, in this case happy_gdp_plot
We can add things to an existing ggplot object and print it

You can set a global theme for all plots created in your current R session by using theme_set()
Do this now so we don’t have to look at the ugly gray default theme!

theme_set(theme_bw())

Use theme() to add specific theme arguments to a plot
Here we move the legend to the bottom, remove the gridlines, and change appearance of different text elements

happy_gdp_plot <- happy_gdp_plot +
  theme(panel.grid = element_blank(),
        legend.position = 'bottom',
        axis.text = element_text(color = 'black', size = 14),
        axis.title = element_text(face = 'bold', size = 14))

happy_gdp_plot

Modifying the plot: changing scales

Scales modify how data are mapped to graphics
For example, change the range, breaks, and labels of the x and y axes
Or change the color palette used to color the points
You can optionally add a scale for each aes mapping in your plot

Add x scale with a title and specific labeled breaks

happy_gdp_plot +
  scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5))

Add y scale with a title and specific range

happy_gdp_plot +
  scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
  scale_y_continuous(name = 'happiness score', limits = c(2, 8))

Change the color palette for continents

happy_gdp_plot +
  scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
  scale_y_continuous(name = 'happiness score', limits = c(2, 8)) +
  scale_color_viridis_d()

scale_color_viridis_d() is a color scheme that can be distinguished by most colorblind people.

Boxplots

Visualize distributions grouped by discrete categories
geom_boxplot() to see distribution of happiness by continent

ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
  geom_boxplot()

Modify appearance of boxplots
Fill color (fill), line color (color), line thickness (size = 1.5 where the default is 1)

ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
  geom_boxplot(fill = 'forestgreen', color = 'gray25', size = 1.5)

Modify the outlier points
Bigger (outlier.size = 2) and 70% transparent (outlier.alpha = 0.7)

ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
  geom_boxplot(fill = 'forestgreen', color = 'gray25', size = 1.5,
               outlier.size = 2, outlier.alpha = 0.7)

Histograms and density plots

Now using the cereal dataset
Distribution of grams of sugar per serving in each type of breakfast cereal (sugars)
We only need to map an x aesthetic
y value computed internally by ggplot()

ggplot(data = cereal, aes(x = sugars)) +
  geom_histogram()

Reduce the number of bins to 10 (default is 30)

ggplot(data = cereal, aes(x = sugars)) +
  geom_histogram(bins = 10)

By default, the y axis has a small gap between the highest and lowest value and the edge of the plot
Great for scatterplots but doesn’t look good for histograms
Change this by adding a y axis scale with an expand argument

ggplot(data = cereal, aes(x = sugars)) +
  geom_histogram(bins = 10) +
  scale_y_continuous(expand = expansion(add = c(0, 1)))

expansion(add = c(0, 1)) indicates 0 units of padding at the low end of the axis, and 1 unit at the high end

Change the fill color of the histogram bar
Add a title and subtitle to the plot with ggtitle()

sugar_hist <- ggplot(data = cereal, aes(x = sugars)) +
  geom_histogram(bins = 10, fill = 'slateblue') +
  scale_y_continuous(expand = expansion(add = c(0, 1))) +
  ggtitle('Distribution of grams of sugar per serving', 'for 77 brands of breakfast cereal')

sugar_hist

Use labs() function to change axis label(s) without having to specify the entire scale

sugar_hist +
  labs(x = 'sugar (g/serving)')

Kernel density plot

An alternative to the histogram is a smoothed kernel density plot
Look at the distribution of calories per serving with a density plot

ggplot(data = cereal, aes(x = calories)) +
  stat_density()

Functions beginning stat_*() will compute some summary statistic or function based on the data and plot it.

Tweak the bandwidth parameter (adjust) of the density smoothing algorithm
Higher adjust (default is 1) gives you a smoother curve

ggplot(data = cereal, aes(x = calories)) +
  stat_density(adjust = 2)

We can specify the geom to be used for stat_density()
Default is 'polygon' but we can change it to 'line'

ggplot(data = cereal, aes(x = calories)) +
  stat_density(adjust = 2, geom = 'line')

Get rid of the gaps on either end of the x-axis and below the 0 line on the y-axis
Change the width and color of the line
Add a title for the plot and x-axis

ggplot(data = cereal, aes(x = calories)) +
  stat_density(adjust = 2, geom = 'line', linewidth = 1.2, color = 'forestgreen') +
  scale_y_continuous(expand = expansion(mult = c(0, 0.02))) +
  scale_x_continuous(expand = expansion(mult = c(0, 0)), name = 'calories per serving') +
  ggtitle('Distribution of calories per serving', 'for 77 brands of breakfast cereal')

Facets

ggplot2 can easily group data and create multiple subplots
Subset of the olympics track and field medals dataset (four countries)

olympics_best <- subset(olympics, Country %in% c('USA', 'GBR', 'URS', 'JAM'))

Make a bar plot of the three types of medals awarded
geom_bar() function used to compute counts within each category

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count')

Use facet_wrap() function with a one-sided formula ~ Country to make a separate plot for each country

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count') +
  facet_wrap(~ Country)

By default, all the facets have the same axis limits
We can make each facet have its own limits by setting scales = 'free_y'
(You can also use scales = 'free_x', or scales = 'free' to let both x and y limits vary)

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count') +
  facet_wrap(~ Country, scales = 'free_y')

Facet in both directions using facet_grid() with a two-sided formula
We can show the medal tally for each country by gender in a 4x2 plot:

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count') +
  facet_grid(Country ~ Gender, scales = 'free_y')

To change order of medals from alphabetical to Bronze, Silver, Gold, we change the underlying data
Make Medal into a factor and specify order of factor levels
Now when you plot, the bars will appear in the order we set

olympics_best$Medal <- factor(olympics_best$Medal, levels = c('Bronze', 'Silver', 'Gold'))

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count') +
  facet_grid(Country ~ Gender, scales = 'free_y')

Add theme elements, title, and subtitle
Add custom fill scale to fill each medal bar with the appropriate color

ggplot(olympics_best, aes(x = Medal, fill = Medal)) +
  geom_bar(stat = 'count') +
  facet_grid(Country ~ Gender, scales = 'free_y') +
  ggtitle('Track and field medal tallies for selected countries, 1896-2014', subtitle = 'data source: The Guardian') +
  scale_y_continuous(expand = expansion(add = c(0, 2))) +
  scale_fill_manual(values = c(Bronze = 'chocolate4', Silver = 'gray60', Gold = 'goldenrod')) +
  theme(legend.position = 'none',
        strip.background = element_blank(),
        panel.grid = element_blank())

Writing your ggplot to a file

ggsave() writes your plot to a file.
Automatically determines the file type based on extension on the file name
Sets size and resolution of the image by default

# Not run
ggsave('~/plots/sugar_histogram.png', sugar_hist)

Specify the resolution in dpi (dots per inch)
Size of image using height and width
Default units are inches but we can change this

# Not run
ggsave('~/plots/sugar_histogram.png', sugar_hist, dpi = 400, height = 4, width = 4)
ggsave('~/plots/sugar_histogram.png', sugar_hist, dpi = 400, height = 9, width = 9, units = 'cm')

We can use other file types such as PDF (vector graphics, so you don’t supply a resolution)

# Not run
ggsave('~/plots/sugar_histogram.pdf', sugar_hist, height = 4, width = 4)

If you don’t specify a plot object, ggsave saves the last ggplot you plotted in the plotting window to the file

ggsave('~/plots/my_plot.pdf', height = 4, width = 4)

Going further

We’ve only scratched the surface of ggplot2
ggplot2 itself does not support every possible visualization
open-source ecosystem of extension packages people have written as add-ons

ggthemes

package with a bunch of themes not included in base ggplot2
example: draw our sugar histogram plot in the style of The Economist, FiveThirtyEight.com, and Edward Tufte

library(ggthemes)

sugar_hist + theme_economist()
sugar_hist + theme_fivethirtyeight()
sugar_hist + theme_tufte()

gghighlight

Highlight outliers or other specific bits of our data automatically
example: happiness versus GDP plot, “happiest” and “saddest” countries highlighted and labeled

library(gghighlight)

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point() +
  gghighlight(Happiness.Score < 3 | Happiness.Score > 7.55, label_key = Country, label_params = list(size = 5))

ggpubr

simplifies the production of publication-ready plots
example: happiness by continent boxplot, with raw data as jittered points, and manually added significance indicators

library(ggpubr)

ggboxplot(data = WHR, x = 'Continent', y = 'Happiness.Score', color = 'Continent', add = 'jitter',
          palette = unname(palette.colors(4)), 
          ylab = 'Happiness Score', ylim = c(2.5, 8.5)) +
  geom_bracket(xmin = 'Europe', xmax = 'Africa', label = '*', y.position = 8.3, label.size = 8) +
  geom_bracket(xmin = 'Americas', xmax = 'Africa', label = '*', y.position = 7.7, label.size = 8)

The list goes on …

official ggplot2 extensions page

Conclusion

What did we just learn? Let’s revisit the learning objectives!

what the “grammar of graphics” is and how ggplot2 uses it
how to map a variable in a data frame to a graphical element in a plot using aes
how to use different geoms to make scatterplots, boxplots, histograms, density plots, and barplots
how to add trendlines to a plot
how to compute summary statistics and plot them with stat functions
how to make plots for different subsets of your data using facets
how to change the style and appearance of your plots

Now get out there and visualize some data!