Introduction

This is a lesson introducing you to making plots with the R package ggplot2. The ggplot2 package was originally developed by Hadley Wickham and is now developed and maintained by a huge team of data visualization experts. It’s an elegant and powerful way of visualizing your data and works great for everything from quick exploratory plots to carefully formatted publication-quality graphics.

Students should already have a beginner-level knowledge of R, including basic knowledge of functions and syntax, and awareness of how data frames in R work.

Download the worksheet for this lesson here.

Learning objectives

At the end of this lesson, you will know …

  • what the “grammar of graphics” is and how ggplot2 uses it
  • how to map a variable in a data frame to a graphical element in a plot using aes
  • how to use different geoms to make scatterplots, boxplots, histograms, density plots, and barplots
  • how to add trendlines to a plot
  • how to compute summary statistics and plot them with stat functions
  • how to make plots for different subsets of your data using facets
  • how to change the style and appearance of your plots

What is the grammar of graphics?

The theory underlying ggplot2 is the “grammar of graphics.” This concept was originally introduced by Leland Wilkinson in a landmark book. It’s a formal way of mapping variables in a dataset to graphical elements of a plot. For example, you might have a dataset with age, weight, and sex of many individuals. You could make a scatterplot where the age variable in the data maps to the x axis of the plot, the weight variable maps to the y axis, and the sex variable maps to the color of the points.

In the grammar of graphics, a plot is built in a modular way. We start with data, map variables to visual properties of the geometric objects being drawn (called geoms), and then optionally modify the coordinate system and scales like axes and color gradients. We can also modify the visual appearance of the plot in ways that don’t map back to the data, but just make the plot look better.

If that doesn’t make sense to you, read on to see how this is implemented in ggplot2.

How does a ggplot work?

These images are taken from the ggplot2 cheatsheet. I recommend downloading this cheatsheet and keeping it handy – it’s a great reference!

ggplot2 uses the grammar of graphics to build up all plots from the same set of building blocks. You specify which variables in the data correspond to which visual properties (aesthetics) of the things that are being plotted (geoms).

Image excerpted from ggplot2 cheatsheet, describing grammar of graphics

In practice, that looks like this:

Image excerpted from ggplot2 cheatsheet, with function template

As you can see, we need at least data, a mapping of variables to visual properties (called aes), and one or more geom layers. Optionally, we can add coordinate system transformations, scale transformations, facets to split the plot into groups, and themes to change the plot appearance. We will cover all of this (other than coordinate transformation) in this intro lesson.
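As a minimal sketch of that template, the code below builds a plot from a small made-up data frame. The people data frame and its values are invented for illustration, echoing the age, weight, and sex example from earlier. Each piece is added with +.

```r
library(ggplot2)

# Made-up data for illustration only
people <- data.frame(
  age    = c(23, 35, 41, 52, 60, 29),
  weight = c(61, 80, 72, 90, 68, 75),
  sex    = c('F', 'M', 'F', 'M', 'F', 'M')
)

template_plot <- ggplot(data = people, aes(x = age, y = weight, color = sex)) +  # data and mapping
  geom_point() +             # at least one geom layer
  scale_color_viridis_d() +  # optional: a scale for the color aesthetic
  facet_wrap(~ sex) +        # optional: facets
  theme_bw()                 # optional: a theme

template_plot
```

Only the first two lines of the plotting code are required; the scale, facet, and theme lines can be dropped or swapped independently.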

The data

In this lesson we’ll use three fun datasets from Kaggle, a data science competition site where users upload public domain datasets. Click on each link if you want to learn more about each dataset, including descriptions of each column.

The code

Loading packages and data

We will use only the ggplot2 package in this tutorial.

library(ggplot2)

If you are using the Posit Cloud server, the example datasets are loaded onto the server as CSV files. Use the read.csv() function from base R to import them into your R environment as data frames.

WHR <- read.csv('datasets/WHR_2015.csv')
cereal <- read.csv('datasets/cereal.csv')
olympics <- read.csv('datasets/olympics.csv')

If you are not using the Posit Cloud server, you may read in each of the three datasets from the URL where they are hosted on GitHub.

WHR <- read.csv('https://usda-ree-ars.github.io/SEAStats/data_viz_basics/datasets/WHR_2015.csv')
cereal <- read.csv('https://usda-ree-ars.github.io/SEAStats/data_viz_basics/datasets/cereal.csv')
olympics <- read.csv('https://usda-ree-ars.github.io/SEAStats/data_viz_basics/datasets/olympics.csv')

You can use the head(), summary(), or str() functions to examine each dataset.
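For example, on a small stand-in data frame, those functions could be used as sketched below. The values are made up; only the column names follow the WHR dataset used in this lesson.

```r
# Stand-in data frame with made-up values; the real WHR data has more rows and columns
toy_whr <- data.frame(
  Country = c('A', 'B', 'C', 'D'),
  GDP.per.Capita = c(0.3, 0.8, 1.1, 1.4),
  Happiness.Score = c(3.9, 5.2, 6.1, 7.3)
)

head(toy_whr, n = 2)   # first two rows
str(toy_whr)           # column names and types
summary(toy_whr)       # per-column summaries
```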

Our first ggplot

Let’s start with a simple scatterplot. Does money buy happiness? We will find out. The WHR dataset has a row for each country. We will make a scatterplot with GDP per capita on the x axis and happiness score on the y axis.

Start by calling the ggplot() function with the data argument saying which data frame contains the plotting data.

ggplot(data = WHR)

empty ggplot

Add aesthetic mapping

That didn’t do anything so far. We’ve only specified the dataset to get the plotting data from, without saying which columns of the dataset will be mapped to which graphical elements.

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score))

ggplot with aes but no geom

Add a geom layer

Once we add the x and y mappings, we now can see the two axes and coordinate system, which is already set to the range of each variable, but no data yet. We haven’t added any geom layers.

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + geom_point()

scatter plot of happiness versus GDP

By adding a geom_point() layer to the plotting code, we have now made a scatterplot!

Notice we use + to add each new piece of the plotting code.

Modifying the plot: changing the geom

We can modify the plot in many ways. One way is by changing the geom. This will plot the same data but using a different type of plot. For example we might want to connect data points with lines instead of drawing them as separate points. For that we will replace geom_point() with geom_line().

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + geom_line()

line plot of happiness versus GDP

The geom_line doesn’t make a lot of sense in this case but it is great for time series data.

We can add multiple geoms if we want. For instance, we can plot a smoothing trendline (geom_smooth()) overlaid on the scatterplot.

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

scatter plot of happiness versus GDP with locally weighted trend

Now we have a point plot with a trendline overlaid on top. The geoms are drawn in the order they are added to the plot.

Notice I have put each piece on its own line. This makes the code much easier to read especially if you are making a complex plot with dozens of lines of code.

By default, for a dataset of this size, geom_smooth() plots a locally weighted regression (loess) with a shaded band showing the 95% confidence interval around the trend. We can change the type of trend to a linear trend by specifying method = lm as an argument, and get rid of the shaded band by specifying se = FALSE.

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

scatter plot of happiness versus GDP with linear trend
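To see what the linear trendline represents, the sketch below compares it with a model fitted directly with lm(). It uses a small made-up data frame (toy, with columns x and y) rather than the WHR data, and relies on layer_data(), a ggplot2 helper that returns the values computed for a layer.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9))

trend_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = lm, se = FALSE)

# Values of the fitted line as drawn by the second layer (geom_smooth)
trend_line <- layer_data(trend_plot, 2)

# The same linear model fitted directly
fit <- lm(y ~ x, data = toy)
direct <- predict(fit, newdata = data.frame(x = trend_line$x))

# The two sets of fitted values should agree
all.equal(trend_line$y, unname(direct))
```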

Modifying the plot: changing the aes

If we add to or change the aes arguments, we will modify or change what data are used to plot. For example let’s add a color aesthetic to the point plot to color each country’s point by continent.

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score, color = Continent)) + 
  geom_point() +
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

scatter plot of happiness versus GDP colored by continent with separate trends

We automatically get a legend. However, this also groups the trendline by continent.

If we have multiple geoms, we can add an aes argument to a single geom. For example if we want the points to be colored by continent but a single overall trendline, we add aes(color = Continent) inside geom_point().

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point(aes(color = Continent)) +
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

scatter plot of happiness versus GDP colored by continent with single trend

If we add arguments to the geoms outside aes(), it will modify their appearance without mapping back to the data. For example we might want bigger points, and we might want the trendline to be black. That is not mapping back to any part of the original data; it is just a modification to the look of the plot.

ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point(aes(color = Continent), size = 2) +
  geom_smooth(method = lm, se = FALSE, color = 'black')
## `geom_smooth()` using formula = 'y ~ x'

scatter plot of happiness versus GDP colored by continent with trend and modified look

The default size for geom_point() is 1.5, so size = 2 makes the points slightly bigger.
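A common slip is to put a fixed value inside aes(). The sketch below, on made-up data, contrasts setting a color outside aes() with (mistakenly) mapping the constant 'blue' inside aes(). In the second case ggplot2 treats 'blue' as a data value with a single category and colors it using the default palette.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(x = 1:4, y = c(2, 4, 3, 5))

# Setting: every point is drawn in blue
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(color = 'blue', size = 2)

# Mapping: 'blue' is treated as a one-level categorical variable,
# so the points get the first default palette color and a legend appears
p_mapped <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(aes(color = 'blue'), size = 2)

unique(layer_data(p_set)$colour)     # the color actually used for the points
unique(layer_data(p_mapped)$colour)  # not 'blue'
```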

Modifying the plot: changing the theme

We can add a theme to the plot. Themes change the overall look of the plot.

happy_gdp_plot <- ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point(aes(color = Continent), size = 2) +
  geom_smooth(method = lm, se = FALSE, color = 'black')

happy_gdp_plot + theme_bw()
## `geom_smooth()` using formula = 'y ~ x'

scatter plot of happiness versus GDP colored by continent with trend and different theme

The theme_bw() function changes the default theme to a black-and-white theme.

Note that here I demonstrate that we can assign the result of ggplot() to an object, in this case happy_gdp_plot. It does not print the plot yet. Then when I type happy_gdp_plot again and add a theme to it, without assigning it to an object, it prints.

You can set a global theme for all plots created in your current R session by using theme_set(). I will do this now so we don’t have to look at the ugly gray default theme!

theme_set(theme_bw())

We can use the theme() function to add specific theme arguments to a plot. For instance we can move the legend to the bottom of the plot, remove the gridlines, and change the appearance of different text elements.

happy_gdp_plot <- happy_gdp_plot +
  theme(panel.grid = element_blank(),
        legend.position = 'bottom',
        axis.text = element_text(color = 'black', size = 14),
        axis.title = element_text(face = 'bold', size = 14))

happy_gdp_plot
## `geom_smooth()` using formula = 'y ~ x'

scatter plot of happiness versus GDP colored by continent with additional theme arguments set

Modifying the plot: changing scales

Scales can be used to modify the ways in which the data are mapped to the graphics appearing on the plot. For example, you can add scales to a plot to modify the range, breaks, and labels of the x and y axes, or the color palette used to color the points. You can optionally add a scale for each aes mapping in your plot.

Here I will add scales for the x, y, and color aesthetics one at a time.

happy_gdp_plot +
  scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5))
## `geom_smooth()` using formula = 'y ~ x'

scatter plot of happiness versus GDP colored by continent with x axis scale set

happy_gdp_plot +
  scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
  scale_y_continuous(name = 'happiness score', limits = c(2, 8))
## `geom_smooth()` using formula = 'y ~ x'

scatter plot of happiness versus GDP colored by continent with x and y scales set

happy_gdp_plot +
  scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
  scale_y_continuous(name = 'happiness score', limits = c(2, 8)) +
  scale_color_viridis_d()
## `geom_smooth()` using formula = 'y ~ x'

scatter plot of happiness versus GDP colored by continent with x, y, and color scales set

scale_color_viridis_d() applies the discrete version of the viridis color palette, which is designed so that most colorblind people can distinguish the colors.
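One thing to keep in mind with limits: by default, values outside the limits of a continuous scale are treated as missing, so they are dropped from the plot (usually with a warning) rather than drawn at the edge. A minimal sketch on made-up data:

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(x = 1:4, y = c(1, 3, 5, 9))

p_limited <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  scale_y_continuous(name = 'score', limits = c(2, 8), breaks = c(2, 4, 6, 8))

# Count the y values that survive the limits; 1 and 9 fall outside c(2, 8)
sum(!is.na(layer_data(p_limited)$y))
```

If a trendline or other statistic is part of the plot, it is computed only from the values that remain, so it is worth checking that the limits cover all the data you intend to show.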

Boxplots

Another common plot type is the boxplot. We can plot the happiness score data on the y-axis again, but instead of a continuous x-axis, we will use continent as a discrete or categorical x-axis. A boxplot is a useful way to visualize distributions grouped by discrete categories. The function we need is geom_boxplot().

ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
  geom_boxplot()

default boxplot of happiness by continent

We can modify the boxplots in different ways. First let’s fill them with a nice color, set the outlines to a dark gray, and make all the lines thicker (linewidth = 1.5 where the default is 0.5).

ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
  geom_boxplot(fill = 'forestgreen', color = 'gray25', linewidth = 1.5)

boxplot of happiness by continent with modified look of boxes

We can also modify the outlier points. Here I make them bigger and 30% transparent (alpha ranges between 0 for fully transparent and 1 for fully opaque, so alpha = 0.7 is 70% opaque).

ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
  geom_boxplot(fill = 'forestgreen', color = 'gray25', linewidth = 1.5,
               outlier.size = 2, outlier.alpha = 0.7)

boxplot of happiness by continent with modified look of boxes and outlier points
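The box and whiskers are computed from the data by the boxplot statistic. As a sketch, layer_data() (a ggplot2 helper that returns the values computed for a layer) can be used to look at those numbers. The data frame below is made up for illustration.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(group = rep(c('a', 'b'), each = 9),
                  value = c(1:9, seq(10, 90, by = 10)))

box_plot <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot(fill = 'forestgreen', color = 'gray25')

# One row per box: whisker ends (ymin, ymax), hinges (lower, upper), and median (middle)
layer_data(box_plot)[, c('ymin', 'lower', 'middle', 'upper', 'ymax')]
```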

Histograms and density plots

Let’s switch to a different dataset: the cereal dataset. Let’s look at the distribution of grams of sugar per serving in each type of breakfast cereal (the sugars column). We only have one variable now so we only need to map an x aesthetic. The y value will be computed internally by ggplot().

ggplot(data = cereal, aes(x = sugars)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

default histogram of sugar content in cereals

We can modify this histogram in many ways. For instance let’s reduce the number of bins to 10 from the default value of 30 to get a histogram with fewer gaps.

ggplot(data = cereal, aes(x = sugars)) +
  geom_histogram(bins = 10)

histogram of sugar content in cereals with fewer bins
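The message printed by the default histogram suggests binwidth as an alternative to bins. A minimal sketch, on made-up values, that sets the bin width in data units instead of the number of bins:

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(sugars = c(0, 1, 3, 3, 5, 6, 6, 7, 9, 10, 12, 13, 15))

hist_plot <- ggplot(toy, aes(x = sugars)) +
  geom_histogram(binwidth = 3, boundary = 0)  # bins 3 units wide, with edges at multiples of 3

# Every observation falls in exactly one bin, so the counts sum to the number of rows
sum(layer_data(hist_plot)$count)
```

Choosing binwidth can be easier to reason about than bins when the variable has natural units, such as grams.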

By default, the y axis has a small gap between the lowest and highest values and the edges of the plot. That is great for scatterplots but doesn’t look good for histograms. We can change this by adding a y axis scale with an expand argument.

ggplot(data = cereal, aes(x = sugars)) +
  geom_histogram(bins = 10) +
  scale_y_continuous(expand = expansion(add = c(0, 1)))

histogram of sugar content in cereals with fewer bins and no gap below bars

I used expansion(add = c(0, 1)) to indicate I wanted to add 0 units of padding at the low end of the axis, and 1 unit at the high end.
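expansion() also accepts a mult argument, which pads each end by a fraction of the axis range instead of a fixed number of units. The sketch below uses made-up data.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(sugars = c(0, 1, 3, 3, 5, 6, 6, 7, 9, 10, 12, 13, 15))

ggplot(toy, aes(x = sugars)) +
  geom_histogram(bins = 5) +
  # no padding below the bars, 5% of the y range above them
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))

# expansion() itself just returns the four padding values:
# lower mult, lower add, upper mult, upper add
expansion(add = c(0, 1))
expansion(mult = c(0, 0.05))
```

Multiplicative padding scales with the data, so it tends to carry over better when the same plotting code is reused on data with a different range.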

Now I am going to change the fill color of the histogram bars, and add a title and subtitle to the plot with the ggtitle() function.

sugar_hist <- ggplot(data = cereal, aes(x = sugars)) +
  geom_histogram(bins = 10, fill = 'slateblue') +
  scale_y_continuous(expand = expansion(add = c(0, 1))) +
  ggtitle('Distribution of grams of sugar per serving', 'for 77 brands of breakfast cereal')

sugar_hist

histogram of sugar content in cereals with blue bars and titles

Finally we can use the labs() function to change the label on one or more axes without having to specify the entire scale.

sugar_hist +
  labs(x = 'sugar (g/serving)')

histogram of sugar content in cereals with x axis label set

An alternative to the histogram is a smoothed kernel density plot. Let’s use the calories variable for that.

ggplot(data = cereal, aes(x = calories)) +
  stat_density()

default kernel density plot of calorie content of cereals

Notice we used stat_density(). In general, functions beginning stat_*() will compute some summary statistic or function based on the data and plot it.

To get a less squiggly density plot, we can tweak the bandwidth of the density smoothing algorithm with the adjust argument, which multiplies the default bandwidth. The higher the adjust argument is set (the default is 1), the smoother the curve.

ggplot(data = cereal, aes(x = calories)) +
  stat_density(adjust = 2)

kernel density plot of calorie content of cereals, bandwidth adjusted
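To check the effect of adjust, the sketch below (made-up calorie values) compares the height of the density peak for two settings. A smoother curve spreads the same total area more evenly, so its peak is lower.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(calories = c(50, 70, 90, 100, 100, 110, 110, 110, 120, 120, 130, 150))

dens_default <- ggplot(toy, aes(x = calories)) + stat_density(adjust = 1)
dens_smooth  <- ggplot(toy, aes(x = calories)) + stat_density(adjust = 2)

# Peak height of each curve, taken from the values computed for the layer
max(layer_data(dens_default)$density)
max(layer_data(dens_smooth)$density)
```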

If we just want a line instead of a filled shape we can specify the geom to be used for stat_density(). The default is 'area' but we can change it to 'line'.

ggplot(data = cereal, aes(x = calories)) +
  stat_density(adjust = 2, geom = 'line')

kernel density plot of calorie content of cereals, using line geom

Let’s improve the look of this plot by getting rid of the gaps on either end of the x-axis and below the 0 line on the y-axis, changing the width and color of the line, and adding a title for the plot and x-axis.

ggplot(data = cereal, aes(x = calories)) +
  stat_density(adjust = 2, geom = 'line', linewidth = 1.2, color = 'forestgreen') +
  scale_y_continuous(expand = expansion(mult = c(0, 0.02))) +
  scale_x_continuous(expand = expansion(mult = c(0, 0)), name = 'calories per serving') +
  ggtitle('Distribution of calories per serving', 'for 77 brands of breakfast cereal')

kernel density plot of calorie content of cereals, with modified look and labels

Facets

One of the most useful features of ggplot2 is the ability to easily create multiple subplots, one for each subgroup of a dataset. Let’s take a subset of the olympics dataset to compare the athletics performance of some of the best track-and-field nations over the years: the USA, Great Britain (GBR), the Soviet Union (URS), and Jamaica (JAM).

olympics_best <- subset(olympics, Country %in% c('USA', 'GBR', 'URS', 'JAM'))

Make a bar plot of the three types of medals awarded. The geom_bar() function can be used to compute counts within each category.

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count')

default bar plot of Olympic medal counts

To make this plot more informative, let’s split it up by country. We use the facet_wrap() function with a one-sided formula ~ Country to make a separate plot for each country.

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count') +
  facet_wrap(~ Country)

default bar plot of Olympic medal counts by country

By default, all the facets have the same y-axis limits. We can make each facet have its own limits by setting scales = 'free_y'.

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count') +
  facet_wrap(~ Country, scales = 'free_y')

bar plot of Olympic medal counts by country with variable y axes
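As a small self-contained sketch of faceting, the code below uses a made-up medal table (not the olympics data). geom_bar() uses stat = 'count' by default, so it is left out here. layer_data() is a ggplot2 helper that returns the values computed for a layer, including the panel each bar belongs to.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(
  Country = c('USA', 'USA', 'USA', 'GBR', 'GBR', 'JAM'),
  Medal   = c('Gold', 'Gold', 'Silver', 'Bronze', 'Gold', 'Silver')
)

facet_plot <- ggplot(toy, aes(x = Medal)) +
  geom_bar() +
  facet_wrap(~ Country, scales = 'free_y')

# Counts are computed separately within each facet (PANEL)
bars <- layer_data(facet_plot)
bars[, c('PANEL', 'x', 'count')]
```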

We can facet in both directions using facet_grid() with a two-sided formula. For instance we can show the medal tally for each country by gender in a 4x2 plot.

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count') +
  facet_grid(Country ~ Gender, scales = 'free_y')

bar plot of Olympic medal counts by country and gender

Let’s improve the appearance of this plot. First, you can see that the order of the medals is alphabetical (Bronze, Gold, Silver) instead of the logical Bronze, Silver, Gold. The best way to change this is by changing the underlying data before you create the plot. We will change the Medal column to a factor and specify the order of the factor levels. Then, when the plot is created, the bars will appear in that order.

olympics_best$Medal <- factor(olympics_best$Medal, levels = c('Bronze', 'Silver', 'Gold'))

ggplot(olympics_best, aes(x = Medal)) +
  geom_bar(stat = 'count') +
  facet_grid(Country ~ Gender, scales = 'free_y')

bar plot of Olympic medal counts ordered by medal type
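The effect of specifying factor levels can be checked without making a plot. This is a minimal base R sketch on a made-up vector of medal names:

```r
# Made-up vector for illustration only
medals <- c('Gold', 'Bronze', 'Silver', 'Gold')

# Default: levels are sorted alphabetically
levels(factor(medals))

# Explicit levels: the order given here is the order ggplot2 uses on a discrete axis
levels(factor(medals, levels = c('Bronze', 'Silver', 'Gold')))
```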

Now I’ll add some theme elements, a title and subtitle, and a custom fill scale to fill each medal bar with the appropriate color.

ggplot(olympics_best, aes(x = Medal, fill = Medal)) +
  geom_bar(stat = 'count') +
  facet_grid(Country ~ Gender, scales = 'free_y') +
  ggtitle('Track and field medal tallies for selected countries, 1896-2014', subtitle = 'data source: The Guardian') +
  scale_y_continuous(expand = expansion(add = c(0, 2))) +
  scale_fill_manual(values = c(Bronze = 'chocolate4', Silver = 'gray60', Gold = 'goldenrod')) +
  theme(legend.position = 'none',
        strip.background = element_blank(),
        panel.grid = element_blank())

fully customized bar plot of Olympic medal counts

Notice I had to add fill = Medal to the aes() mapping so that the custom scale I made would be applied to fill the bars by color. The Medal variable is mapped to both the x and fill graphical elements. Strictly speaking this is redundant but it makes the plot look cooler!

Writing your ggplot to a file

You can use the ggsave() function to write your plot to a file. ggsave() will automatically determine the file type based on the extension on the file name. If you don’t specify them, the size defaults to the size of the current graphics device and the resolution to 300 dpi. (Note this example assumes there is a folder called plots in your home directory.)

# Not run
ggsave('~/plots/sugar_histogram.png', sugar_hist)

We can specify the resolution in dpi (dots per inch) and the size of the image. By default the units are inches but we can specify other units if needed.

# Not run
ggsave('~/plots/sugar_histogram.png', sugar_hist, dpi = 400, height = 4, width = 4)
ggsave('~/plots/sugar_histogram.png', sugar_hist, dpi = 400, height = 9, width = 9, units = 'cm')

We can use other file types such as PDF (vector graphics, so you don’t supply a resolution).

# Not run
ggsave('~/plots/sugar_histogram.pdf', sugar_hist, height = 4, width = 4)

If you don’t specify a plot object, it will write the last ggplot you plotted in the plotting window to the file. (This will be the last Olympics plot we made.)

ggsave('~/plots/my_plot.pdf', height = 4, width = 4)
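If you want to try ggsave() without creating a plots folder first, one option is to write to a temporary directory. The sketch below uses a made-up data frame and base R’s tempdir(); the file name toy_plot.pdf is arbitrary.

```r
library(ggplot2)

# Made-up data for illustration only
toy <- data.frame(x = 1:4, y = c(2, 4, 3, 5))
toy_plot <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# Write to a temporary file so the example does not depend on an existing folder
out_file <- file.path(tempdir(), 'toy_plot.pdf')
ggsave(out_file, toy_plot, height = 4, width = 4)

# Check that the file was written
file.exists(out_file)
```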

Going further

We’ve only scratched the surface of ggplot2 in this lesson.

The ggplot2 package itself doesn’t support every possible kind of data visualization. But because R packages are all open-source and anyone can contribute their own package, there is a very diverse “ecosystem” of extension packages people have written as add-ons to ggplot2. We don’t have time to get into any of them in detail right now. Here are a few examples of extension packages and some bits of code showing them in action. I encourage you to explore them more!

ggthemes

This is a package that includes a bunch of themes you can add to your plot to spiff up its appearance. Here I use it to draw our sugar histogram plot in the style of The Economist, FiveThirtyEight.com, and the personal style of data visualization expert Edward Tufte.

library(ggthemes)
sugar_hist + theme_economist()

sugar content histograms with different ggthemes

sugar_hist + theme_fivethirtyeight()

sugar content histograms with different ggthemes

sugar_hist + theme_tufte()

sugar content histograms with different ggthemes

gghighlight

This package has functions that allow us to highlight outliers or other specific bits of our data automatically. For example let’s go back to our happiness versus GDP plot and highlight the “happiest” and “saddest” countries, complete with labels.

library(gghighlight)
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + 
  geom_point() +
  gghighlight(Happiness.Score < 3 | Happiness.Score > 7.55, label_key = Country, label_params = list(size = 5))

happiness versus GDP scatterplot with extreme data points highlighted

ggpubr

This package tries to streamline the process of formatting plots to make them publication-ready with simplified functions. For example, we can make a pretty boxplot of happiness score by continent with the raw data added as jittered points, without having to specify all the geoms. We can even add specific “significant” comparisons to the plot with connecting lines. Let’s assume in this case that the Americas and Europe had significantly greater happiness scores than Africa, but no other pairwise comparisons showed significant differences.

library(ggpubr)

ggboxplot(data = WHR, x = 'Continent', y = 'Happiness.Score', color = 'Continent', add = 'jitter',
          palette = unname(palette.colors(4)), 
          ylab = 'Happiness Score', ylim = c(2.5, 8.5)) +
  geom_bracket(xmin = 'Europe', xmax = 'Africa', label = '*', y.position = 8.3, label.size = 8) +
  geom_bracket(xmin = 'Americas', xmax = 'Africa', label = '*', y.position = 7.7, label.size = 8)

happiness by continent boxplots formatted with ggpubr

The list goes on …

Check out the official ggplot2 extensions page to browse a full list.

Conclusion

What did we just learn? Let’s revisit the learning objectives!

  • what the “grammar of graphics” is and how ggplot2 uses it
  • how to map a variable in a data frame to a graphical element in a plot using aes
  • how to use different geoms to make scatterplots, boxplots, histograms, density plots, and barplots
  • how to add trendlines to a plot
  • how to compute summary statistics and plot them with stat functions
  • how to make plots for different subsets of your data using facets
  • how to change the style and appearance of your plots

Take the time to pat yourself on the back for making it this far. Now get out there and visualize some data!

Further reading

Exercises

The example datasets we worked with in this lesson have a lot more columns that we haven’t worked with yet. In these exercises, we will explore the cereal dataset further. The dataset has a column called rating which is a score on a scale from 0-100 representing how healthy the cereal is (All-Bran with Extra Fiber is the highest at 94 and Cap’n Crunch the lowest at 18).

Exercise 1

1a

Make a density plot showing the distribution of ratings across all 77 cereals.

1b

Change the density plot from the previous exercise to a histogram.

1c

Fill the histogram bars with the color of your choice. Change the y-axis of the histogram so that there is no gap between the bars and the bottom of the plot.

  • Hint: Use scale_y_continuous(expand = expansion(add = ...)).

Exercise 2

2a

Make a boxplot grouped by the column manufacturer to show which manufacturers produce healthier cereals.

2b

Fill the boxplots with the color of your choice. Make the outliers into hollow circles.

  • Hint: Use outlier.shape = 1.

Exercise 3

3a

Make a scatterplot with sugars on the x-axis and rating on the y-axis to look at whether healthier-rated cereals have more or less sugar per serving.

3b

Include a linear trend on the scatterplot with standard error.

  • Hint: Use geom_smooth() and modify the argument method.

3c

Change the y-axis range on the scatterplot so that it goes from 0 to 100, with labels at 0, 25, 50, 75, and 100.

  • Hint: Use scale_y_continuous(), and modify the arguments limits and breaks.

3d

Split the scatterplot with trendline up by the type column to produce one scatterplot for cold cereals and one scatterplot for hot cereals.

  • Hint: Use facet_wrap(~ ...).

Click here for answers