This is a lesson introducing you to making plots with the R package ggplot2. The ggplot2 package was originally developed by Hadley Wickham and is now developed and maintained by a huge team of data visualization experts. It’s an elegant and powerful way of visualizing your data and works great for everything from quick exploratory plots to carefully formatted publication-quality graphics.
Students should already have a beginner-level knowledge of R, including basic knowledge of functions and syntax, and awareness of how data frames in R work.
Download the worksheet for this lesson here.
At the end of this course, you will know …
aes
geoms to make scatterplots, boxplots, histograms, density plots, and barplots
stat functions
facets
The theory underlying ggplot2 is the “grammar of graphics.” This concept was originally introduced by Leland Wilkinson in a landmark book. It’s a formal way of mapping variables in a dataset to graphical elements of a plot. For example, you might have a dataset with age, weight, and sex of many individuals. You could make a scatterplot where the age variable in the data maps to the x axis of the plot, the weight variable maps to the y axis, and the sex variable maps to the color of the points.
In the grammar of graphics, a plot is built in a modular way. We
start with data, map variables to visual elements called
geoms, and then optionally modify the coordinate system and
scales like axes and color gradients. We can also modify the visual
appearance of the plot in ways that don’t map back to the data, but just
make the plot look better.
If that doesn’t make sense to you, read on to see how this is implemented in ggplot2.
These images are taken from the ggplot2 cheatsheet. I recommend downloading this cheatsheet and keeping it handy – it’s a great reference!
ggplot2 uses the grammar of graphics to build up all plots from the same set of building blocks. You specify which variables in the data correspond to which visual properties (aesthetics) of the things that are being plotted (geoms).
In practice, that looks like this:
As you can see, we at least need data, a mapping of variables to
visual properties (called aes), and one or more geom layers.
Optionally, we can add coordinate system
transformations, scale transformations, facets to split the plot into
groups, and themes to change the plot appearance. We will cover all of
this (other than coordinate transformation) in this intro lesson.
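As a rough sketch of how those building blocks stack up in code, the example below combines data, an aes mapping, a geom layer, facets, a scale, and a theme in one plot. It uses mtcars, a data frame that ships with R, rather than the lesson datasets, so the columns wt, mpg, and cyl are stand-ins chosen only for illustration.

```r
library(ggplot2)

# mtcars is a built-in R data frame, used here only as a stand-in dataset
template_plot <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +  # data + aesthetic mapping
  geom_point() +                                    # geom layer: one point per row
  facet_wrap(~ cyl) +                               # facets: one panel per value of cyl
  scale_x_continuous(name = 'weight (1000 lbs)') +  # scale: relabel the x axis
  theme_bw()                                        # theme: overall appearance

template_plot
```

Each piece is optional except the data, the mapping, and at least one geom, so you can delete any of the last three lines and still get a valid plot.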
In this lesson we’ll use three fun datasets from Kaggle, a data science competition site where users upload public domain datasets. Click on each link if you want to learn more about each dataset, including descriptions of each column.
We will use only the ggplot2 package in this tutorial.
library(ggplot2)
If you are using the Posit Cloud server, the example datasets are
loaded onto the server as CSV files. Use the read.csv()
function from base R to import them into your R environment as data
frames.
WHR <- read.csv('datasets/WHR_2015.csv')
cereal <- read.csv('datasets/cereal.csv')
olympics <- read.csv('datasets/olympics.csv')
If you are not using the Posit Cloud server, you may read in each of the three datasets from the URL where they are hosted on GitHub.
WHR <- read.csv('https://usda-ree-ars.github.io/SEAStats/data_viz_basics/datasets/WHR_2015.csv')
cereal <- read.csv('https://usda-ree-ars.github.io/SEAStats/data_viz_basics/datasets/cereal.csv')
olympics <- read.csv('https://usda-ree-ars.github.io/SEAStats/data_viz_basics/datasets/olympics.csv')
You can use the head(), summary(), or str() functions to examine each dataset.
Let’s start with a simple scatterplot. Does money buy happiness? We
will find out. The WHR
dataset has a row for each country.
We will make a scatterplot with GDP per capita on the
x axis and happiness score on the y axis.
Start by calling the ggplot()
function with the
data
argument saying which data frame contains the plotting
data.
ggplot(data = WHR)
That didn’t do anything so far. We’ve only specified the dataset to get the plotting data from, without saying which columns of the dataset will be mapped to which graphical elements.
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score))
Once we add the x and y mappings, we now can see
the two axes and coordinate system, which is already set to the range of
each variable, but no data yet. We haven’t added any geom
layers.
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + geom_point()
By adding a geom_point()
layer to the plotting code, we
have now made a scatterplot!
Notice we use
+
to add each new piece of the plotting code.
We can modify the plot in many ways. One way is by changing the
geom. This will plot the same data but using a different
type of plot. For example, we might want to connect data points with
lines instead of drawing them as separate points. For that we will
replace geom_point() with geom_line().
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) + geom_line()
The geom_line
doesn’t make a lot of sense in this case
but it is great for time series data.
We can add multiple geoms if we want. For instance, we
can plot a smoothing trendline (geom_smooth()) overlaid on
the scatterplot.
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Now we have a scatterplot with a trendline overlaid on top. The
geoms are drawn in the order they are added to the
plot.
Notice I have put each piece on its own line. This makes the code much easier to read especially if you are making a complex plot with dozens of lines of code.
By default, geom_smooth() plots a locally weighted
regression (loess) with a 95% confidence band as a shaded area. We can change the type
of trend to a linear trend by specifying method = lm as an
argument, and get rid of the shaded confidence band by specifying
se = FALSE.
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
If we add to or change the aes arguments, we change
which variables in the data are mapped to the plot. For example, let’s add a
color aesthetic to the scatterplot to color each country’s
point by continent.
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score, color = Continent)) +
geom_point() +
geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
We automatically get a legend. However, this also groups the trendline by continent, so each continent gets its own line.
If we have multiple geoms, we can add an
aes argument to a single geom. For example, if
we want the points to be colored by continent but a single overall
trendline, we add aes(color = Continent) inside geom_point().
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point(aes(color = Continent)) +
geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
If we add arguments to the geoms outside aes(), they will modify the appearance of the geoms without mapping back
to the data. For example, we might want bigger points, and we might want
the trendline to be black. That is not mapping back to any part of the
original data; it is just a modification to the look of the plot.
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point(aes(color = Continent), size = 2) +
geom_smooth(method = lm, se = FALSE, color = 'black')
## `geom_smooth()` using formula = 'y ~ x'
The default for geom_point() is size = 1.5.
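A common stumbling block is putting a fixed value inside aes(). The sketch below, which uses the built-in mtcars data frame as a stand-in dataset, contrasts the two cases. Outside aes(), the value 'blue' is taken literally as a color. Inside aes(), it is treated as data (a category with one level), so ggplot2 picks a color from its default palette and adds a legend.

```r
library(ggplot2)

# Setting: 'blue' is outside aes(), so every point is drawn in blue
p_set <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(color = 'blue', size = 2)

# Mapping: 'blue' is inside aes(), so it is treated as a data value,
# not as a color name; the points get a default palette color and a legend
p_mapped <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(color = 'blue'), size = 2)

# layer_data() returns the values ggplot2 computed for a layer
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```

If your points come out in an unexpected color with a one-entry legend, check whether a fixed value ended up inside aes() by mistake.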
We can add a theme to the plot. Themes change the overall look of the plot.
happy_gdp_plot <- ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point(aes(color = Continent), size = 2) +
geom_smooth(method = lm, se = FALSE, color = 'black')
happy_gdp_plot + theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
The theme_bw()
function changes the default theme to a
black-and-white theme.
Note that here I demonstrate that we can assign the result of
ggplot() to an object, in this case happy_gdp_plot. It does not print the plot yet. Then when I type happy_gdp_plot
again and add a theme to it, without assigning it to an object, it prints.
You can set a global theme for all plots created in your current R
session by using theme_set(). I will do this now so we
don’t have to look at the ugly gray default theme!
theme_set(theme_bw())
We can use the theme()
function to add specific theme
arguments to a plot. For instance we can move the legend to the bottom
of the plot, remove the gridlines, and change the appearance of
different text elements.
happy_gdp_plot <- happy_gdp_plot +
theme(panel.grid = element_blank(),
legend.position = 'bottom',
axis.text = element_text(color = 'black', size = 14),
axis.title = element_text(face = 'bold', size = 14))
happy_gdp_plot
## `geom_smooth()` using formula = 'y ~ x'
Scales can be used to modify the ways in which the data are mapped to
the graphics appearing on the plot. For example, you can add scales to a
plot to modify the range, breaks, and labels of the x and
y axes, or the color palette used to color the points. You can
optionally add a scale for each aes
mapping in your
plot.
Here I will add scales for the x, y, and color aesthetics one at a time.
happy_gdp_plot +
scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5))
## `geom_smooth()` using formula = 'y ~ x'
happy_gdp_plot +
scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
scale_y_continuous(name = 'happiness score', limits = c(2, 8))
## `geom_smooth()` using formula = 'y ~ x'
happy_gdp_plot +
scale_x_continuous(name = 'GDP per capita', breaks = c(0, 0.25, 0.5, 0.75, 1, 1.25, 1.5)) +
scale_y_continuous(name = 'happiness score', limits = c(2, 8)) +
scale_color_viridis_d()
## `geom_smooth()` using formula = 'y ~ x'
scale_color_viridis_d() applies the viridis color palette for discrete variables, which can be distinguished by most colorblind people.
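The scales above set names, breaks, and limits, but not custom labels. As a small sketch of that remaining option, the example below uses the built-in mtcars data frame as a stand-in dataset. The label text is made up for illustration, and cyl is wrapped in factor() so that it counts as a discrete variable for the color scale.

```r
library(ggplot2)

# mtcars is a built-in data frame used as a stand-in dataset
scale_demo <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point(size = 2) +
  # breaks says where the tick marks go; labels says what text to show there
  scale_x_continuous(name = 'weight',
                     breaks = c(2, 3, 4, 5),
                     labels = c('2,000 lb', '3,000 lb', '4,000 lb', '5,000 lb')) +
  scale_y_continuous(name = 'miles per gallon', limits = c(10, 35)) +
  scale_color_viridis_d(name = 'cylinders')

scale_demo
```

Keep in mind that values falling outside the limits of a continuous scale are treated as missing, so choose limits that contain all the data you want drawn.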
Another common plot type is the boxplot. We can plot the happiness
score data on the y-axis again, but instead of a continuous x-axis, we
will use continent as a discrete or categorical x-axis. A boxplot is a
useful way to visualize distributions grouped by discrete categories.
The function we need is geom_boxplot().
ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
geom_boxplot()
We can modify the boxplots in different ways. First let’s fill them
with a nice color, change the black outlines to a dark gray, and make
all the lines thicker (linewidth = 1.5 where the default is 0.5).
As of ggplot2 3.4.0, line thickness is set with linewidth; the older
size argument still works for lines but is deprecated.
ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
geom_boxplot(fill = 'forestgreen', color = 'gray25', linewidth = 1.5)
We can also modify the outlier points. Here I make them bigger and
30% transparent (alpha ranges between 0 for fully
transparent and 1 for fully opaque, so 0.7 is 70% opaque).
ggplot(data = WHR, aes(x = Continent, y = Happiness.Score)) +
geom_boxplot(fill = 'forestgreen', color = 'gray25', linewidth = 1.5,
outlier.size = 2, outlier.alpha = 0.7)
Let’s switch to a different dataset: the cereal
dataset.
Let’s look at the distribution of grams of sugar per serving in each
type of breakfast cereal (the sugars
column). We only have
one variable now so we only need to map an x
aesthetic. The
y
value will be computed internally by
ggplot().
ggplot(data = cereal, aes(x = sugars)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can modify this histogram in many ways. For instance let’s reduce the number of bins to 10 from the default value of 30 to get a histogram with fewer gaps.
ggplot(data = cereal, aes(x = sugars)) +
geom_histogram(bins = 10)
By default, the y axis has a small gap between the highest and lowest
value and the edge of the plot. That is great for scatterplots but
doesn’t look good for histograms. We can change this by adding a y axis
scale with an expand
argument.
ggplot(data = cereal, aes(x = sugars)) +
geom_histogram(bins = 10) +
scale_y_continuous(expand = expansion(add = c(0, 1)))
I used expansion(add = c(0, 1))
to indicate I wanted to
add 0 units of padding at the low end of the axis, and 1 unit at the
high end.
Now I am going to change the fill color of the histogram bars, and
add a title and subtitle to the plot with the ggtitle()
function.
sugar_hist <- ggplot(data = cereal, aes(x = sugars)) +
geom_histogram(bins = 10, fill = 'slateblue') +
scale_y_continuous(expand = expansion(add = c(0, 1))) +
ggtitle('Distribution of grams of sugar per serving', 'for 77 brands of breakfast cereal')
sugar_hist
Finally we can use the labs()
function to change the
label on one or more axes without having to specify the entire
scale.
sugar_hist +
labs(x = 'sugar (g/serving)')
An alternative to the histogram is a smoothed kernel density plot.
Let’s use the calories
variable for that.
ggplot(data = cereal, aes(x = calories)) +
stat_density()
Notice we used stat_density(). In general, functions beginning with stat_ will compute some summary statistic or function based on the data and plot it.
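Many stat functions have a matching geom function that runs the same computation by default. As a small sketch using the built-in faithful data frame (a stand-in, not one of the lesson datasets), geom_density() draws the same kernel density estimate that stat_density() computes. Only the default appearance differs: an outlined curve rather than a filled shape.

```r
library(ggplot2)

# Both layers run the same density computation on the eruptions column
dens_stat <- ggplot(faithful, aes(x = eruptions)) +
  stat_density()

dens_geom <- ggplot(faithful, aes(x = eruptions)) +
  geom_density()

# Compare the density values computed for each plot
all.equal(layer_data(dens_stat)$density, layer_data(dens_geom)$density)
```

Which one you use is mostly a matter of taste; this lesson sticks with stat_density().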
To get a less squiggly density plot, we can tweak the bandwidth
parameter (the adjust
argument) of the density smoothing
algorithm. The higher the adjust
argument is set (the
default is 1), the smoother the curve.
ggplot(data = cereal, aes(x = calories)) +
stat_density(adjust = 2)
If we just want a line instead of a filled shape we can specify the
geom to be used for stat_density(). The
default is 'area' but we can change it to 'line'.
ggplot(data = cereal, aes(x = calories)) +
stat_density(adjust = 2, geom = 'line')
Let’s improve the look of this plot by getting rid of the gaps on either end of the x-axis and below the 0 line on the y-axis, changing the width and color of the line, and adding a title for the plot and x-axis.
ggplot(data = cereal, aes(x = calories)) +
stat_density(adjust = 2, geom = 'line', linewidth = 1.2, color = 'forestgreen') +
scale_y_continuous(expand = expansion(mult = c(0, 0.02))) +
scale_x_continuous(expand = expansion(mult = c(0, 0)), name = 'calories per serving') +
ggtitle('Distribution of calories per serving', 'for 77 brands of breakfast cereal')
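The code above switches from expansion(add = ...) to expansion(mult = ...). As a brief illustration of the difference, here is a sketch using the built-in faithful data frame as a stand-in dataset: add pads the axis by a fixed amount in data units, while mult pads it by a fraction of the axis range.

```r
library(ggplot2)

# A stand-in histogram built from the built-in faithful data frame
base_hist <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 10)

# add: no padding below the bars, 1 count of padding above the tallest bar
hist_add <- base_hist +
  scale_y_continuous(expand = expansion(add = c(0, 1)))

# mult: no padding below, padding above equal to 5% of the y-axis range
hist_mult <- base_hist +
  scale_y_continuous(expand = expansion(mult = c(0, 0.05)))

hist_add
hist_mult
```

A multiplier is often the more convenient choice when you don’t know in advance how large the values on the axis will be.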
One of the most useful features of ggplot2 is the
ability to easily create multiple subplots, one for each subgroup of a
dataset. Let’s take a subset of the olympics
dataset to
compare the athletics performance of some of the best track-and-field
nations over the years: the USA, Great Britain (GBR), the Soviet Union
(URS), and Jamaica (JAM).
olympics_best <- subset(olympics, Country %in% c('USA', 'GBR', 'URS', 'JAM'))
Make a bar plot of the three types of medals awarded. The
geom_bar()
function can be used to compute counts within
each category.
ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count')
To make this plot more informative, let’s split it up by country. We
use the facet_wrap()
function with a one-sided formula
~ Country
to make a separate plot for each country.
ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count') +
facet_wrap(~ Country)
By default, all the facets have the same y-axis limits. We can make
each facet have its own limits by setting scales = 'free_y'.
ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count') +
facet_wrap(~ Country, scales = 'free_y')
We can facet in both directions using facet_grid()
with
a two-sided formula. For instance we can show the medal tally for each
country by gender in a 4x2 plot.
ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count') +
facet_grid(Country ~ Gender, scales = 'free_y')
Let’s improve the appearance of this plot. First, you can see that
the order of the medals is alphabetical (Bronze, Gold, Silver) instead
of the logical Bronze, Silver, Gold. The best way to change this is by
changing the underlying data before you create the plot. We will change
the Medal
column to a factor and specify the order of the
factor levels. Then, when the plot is created, the bars will appear in
that order.
olympics_best$Medal <- factor(olympics_best$Medal, levels = c('Bronze', 'Silver', 'Gold'))
ggplot(olympics_best, aes(x = Medal)) +
geom_bar(stat = 'count') +
facet_grid(Country ~ Gender, scales = 'free_y')
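To see what the reordering does to the data itself, here is a minimal sketch on a small made-up vector (medals is hypothetical, not the lesson data). The levels() function shows the order that ggplot2 will use along a discrete axis.

```r
# A small made-up character vector standing in for the Medal column
medals <- c('Gold', 'Bronze', 'Silver', 'Gold', 'Bronze')

# By default, factor levels are sorted alphabetically
default_levels <- levels(factor(medals))

# Supplying levels explicitly fixes the order
ordered_levels <- levels(factor(medals, levels = c('Bronze', 'Silver', 'Gold')))

default_levels
ordered_levels
```

Any value that is not listed in levels becomes NA, so it is worth checking the spelling of each level against the data.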
Now I’ll add some theme elements, a title and subtitle, and a custom fill scale to fill each medal bar with the appropriate color.
ggplot(olympics_best, aes(x = Medal, fill = Medal)) +
geom_bar(stat = 'count') +
facet_grid(Country ~ Gender, scales = 'free_y') +
ggtitle('Track and field medal tallies for selected countries, 1896-2014', subtitle = 'data source: The Guardian') +
scale_y_continuous(expand = expansion(add = c(0, 2))) +
scale_fill_manual(values = c(Bronze = 'chocolate4', Silver = 'gray60', Gold = 'goldenrod')) +
theme(legend.position = 'none',
strip.background = element_blank(),
panel.grid = element_blank())
Notice I had to add fill = Medal to the aes() mapping so that the custom scale I made would be applied to fill the bars by color. The Medal variable is mapped to both the x and fill graphical elements. Strictly speaking this is redundant but it makes the plot look cooler!
You can use the ggsave()
function to write your plot to
a file. ggsave()
will automatically determine the file type
based on the extension on the file name. It will also set the size and
resolution of the image by default. (Note this example assumes there is
a folder called plots
in your home directory.)
# Not run
ggsave('~/plots/sugar_histogram.png', sugar_hist)
We can specify the resolution in dpi (dots per inch) and the size of the image. By default the units are inches but we can specify other units if needed.
# Not run
ggsave('~/plots/sugar_histogram.png', sugar_hist, dpi = 400, height = 4, width = 4)
ggsave('~/plots/sugar_histogram.png', sugar_hist, dpi = 400, height = 9, width = 9, units = 'cm')
We can use other file types such as PDF (vector graphics, so you don’t supply a resolution).
# Not run
ggsave('~/plots/sugar_histogram.pdf', sugar_hist, height = 4, width = 4)
If you don’t specify a plot object, it will write the last ggplot you plotted in the plotting window to the file. (This will be the last Olympics plot we made).
ggsave('~/plots/my_plot.pdf', height = 4, width = 4)
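The examples above assume the output folder already exists, and ggsave() may stop with an error if it does not. One way around that, sketched below, is to create the folder from R first. The plot, the folder inside R’s temporary directory, and the file name are all made up for illustration; swap in your own path, such as '~/plots'.

```r
library(ggplot2)

# A stand-in plot built from the built-in faithful data frame
demo_plot <- ggplot(faithful, aes(x = waiting)) +
  geom_histogram(bins = 10)

# Hypothetical output folder inside R's temporary directory
plot_dir <- file.path(tempdir(), 'plots')

# Create the folder only if it is not already there
if (!dir.exists(plot_dir)) {
  dir.create(plot_dir, recursive = TRUE)
}

# Build the full file path and save the plot as a 4 x 4 inch PDF
out_file <- file.path(plot_dir, 'demo_histogram.pdf')
ggsave(out_file, demo_plot, width = 4, height = 4)

# Confirm that the file was written
file.exists(out_file)
```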
We’ve only scratched the surface of ggplot2 in this lesson.
The ggplot2 package itself doesn’t support every possible kind of data visualization. But because R packages are all open-source and anyone can contribute his or her own package, there is a very diverse “ecosystem” of extension packages people have written as add-ons to ggplot2. We don’t have time to get into any of them in detail right now. Here are a few examples of extension packages and some bits of code showing them in action. I encourage you to explore them more!
The ggthemes package includes a bunch of themes you can add to your plot to spiff up its appearance. Here I use it to draw our sugar histogram plot in the style of The Economist, FiveThirtyEight.com, and the personal style of data visualization expert Edward Tufte.
library(ggthemes)
sugar_hist + theme_economist()
sugar_hist + theme_fivethirtyeight()
sugar_hist + theme_tufte()
The gghighlight package has functions that allow us to highlight outliers or other specific bits of our data automatically. For example, let’s go back to our happiness versus GDP plot and highlight the “happiest” and “saddest” countries, complete with labels.
library(gghighlight)
ggplot(data = WHR, aes(x = GDP.per.Capita, y = Happiness.Score)) +
geom_point() +
gghighlight(Happiness.Score < 3 | Happiness.Score > 7.55, label_key = Country, label_params = list(size = 5))
The ggpubr package tries to streamline the process of formatting plots to
make them publication-ready with simplified functions. For example, we
can make a pretty boxplot of happiness score by continent with the raw
data added as jittered points, without having to specify all the
geoms. We can even add specific “significant” comparisons
to the plot with connecting lines. Let’s assume in this case that the
Americas and Europe had significantly greater happiness scores than
Africa, but no other pairwise comparisons showed significant
differences.
library(ggpubr)
ggboxplot(data = WHR, x = 'Continent', y = 'Happiness.Score', color = 'Continent', add = 'jitter',
palette = unname(palette.colors(4)),
ylab = 'Happiness Score', ylim = c(2.5, 8.5)) +
geom_bracket(xmin = 'Europe', xmax = 'Africa', label = '*', y.position = 8.3, label.size = 8) +
geom_bracket(xmin = 'Americas', xmax = 'Africa', label = '*', y.position = 7.7, label.size = 8)
Check out the official ggplot2 extensions page to browse a full list.
What did we just learn? Let’s revisit the learning objectives!
aes
geoms to make scatterplots, boxplots, histograms, density plots, and barplots
stat functions
facets
Take the time to pat yourself on the back for making it this far. Now get out there and visualize some data!
The example datasets we worked with in this lesson have a lot more
columns that we haven’t worked with yet. In these exercises, we will
explore the cereal
dataset further. The dataset has a
column called rating
which is a score on a scale from 0-100
representing how healthy the cereal is (All-Bran with Extra Fiber is the
highest at 94 and Cap’n Crunch the lowest at 18).
Make a density plot showing the distribution of ratings across all 77 cereals.
Change the density plot from the previous exercise to a histogram.
Fill the histogram bars with the color of your choice. Change the y-axis of the histogram so that there is no gap between the bars and the bottom of the plot.
Hint: Use scale_y_continuous(expand = expansion(add = ...)).
Make a boxplot grouped by the column manufacturer to
show which manufacturers produce healthier cereals.
Fill the boxplots with the color of your choice. Make the outliers into hollow circles.
Hint: Use outlier.shape = 1.
Make a scatterplot with sugars on the x-axis and
rating on the y-axis to look at whether healthier-rated
cereals have more or less sugar per serving.
Include a linear trend on the scatterplot with standard error.
Hint: Use geom_smooth() and modify the argument method.
Change the y-axis range on the scatterplot so that it goes from 0 to 100, with labels at 0, 25, 50, 75, and 100.
Hint: Use scale_y_continuous(), and modify the arguments limits and breaks.
Split the scatterplot with trendline up by the type
column to produce one scatterplot for cold cereals and one scatterplot
for hot cereals.
Hint: Use facet_wrap(~ ...).