At the end of this lesson, students will …
In this course, we’re going to learn about mixed models in R.
The course is divided up into lessons. For each lesson, I’m going to
give you a worksheet. The worksheet will have a bunch of code in it, but
parts of the code will be missing and replaced with a ...
.
Your job will be to fill in the ...
with the code. Don’t
worry, you will have access to the full code on the lesson webpage as
well as being able to see it when I present it on the screen. I strongly
recommend TYPING IN THE CODE yourself instead of
copy+pasting it. Research shows that this helps you learn it better.
Also, if you make a mistake and get an error or different output than
you expected, you might learn something about how the code works (or
doesn’t work).
Each lesson has some exercises at the end. They are an optional way to get more practice, but don’t feel any pressure to do them. We will have “office hours” with open Q and A time for two hours each afternoon, which would be a good time to go over the exercises.
This course is intended to teach some practical skills. It is NOT intended to teach elementary principles of statistics. There’s no way to do that in two days! We may gloss over some technical details of the models. I’m happy to answer questions about them later.
R and RStudio are software tools to help you work with and analyze your data.
R is a statistical programming language, created by two statisticians from New Zealand in 1993. It is free and open-source. Users contribute code in the form of packages that anyone can download from the central R package repository, CRAN (Comprehensive R Archive Network). Thanks to the broad and diverse base of users there are R packages for all kinds of applications: stats, data visualization, GIS, text analysis, machine learning, phylogenetics, the list goes on! In this workshop, we are going to learn a few basics of data manipulation and explore some of the simpler statistical models you can fit with R.
RStudio is a tool to help you write and run code in R.
There are four panes that you see when you open RStudio:
The two main types of objects you need to know about in R (which is true for many other programming languages as well) are variables and functions.
c(1, 2, 3)
"USDA"
log(10)
: The function log()
takes a
numeric value as input and returns a numeric value as output.c(1, 5, 6)
: The function c()
takes
multiple values as input and returns a vector as output.read.csv('myfile.csv')
: The function
read.csv()
takes a character string with a the file name of
a CSV file as input, reads that CSV file, and returns a data frame as
output.Let’s start writing our first R code!
We are going to start by entering commands into the R console (the terminal where you can enter individual lines of code and run them).
Here are some simple things you can do with R.
+
and -
.2 + 3
## [1] 5
<-
is used to create a new
variable and give it a value. The syntax is
variable <- <value>
..
or _
but
can’t contain spaces or start with a number.=
as an assignment operator but we
will use <-
in this workshop. Consistent code is
readable code!x <- 2 + 3
y = 3.5
x
## [1] 5
x + y
## [1] 8.5
x * 4
## [1] 20
x <- x + 1
z <- x * 4
z
## [1] 24
()
,
like function(<value>)
, will input a value to a
function and return some output.log(1000)
## [1] 6.907755
sin(pi)
## [1] 1.224606e-16
#
is a comment and will not be
evaluated.# This is a comment
,
.'single quotes'
or "double quotes"
. Which one
you choose is up to your personal style preferences, but again remember
to be consistent!my_name <- "Quentin"
paste('Hello,', my_name)
## [1] "Hello, Quentin"
?
to get help about a function.?paste
??
to search all help documentation for a
term.??sequence
When you execute some R code, there may be some output. As we just
mentioned, it will print to the console unless you assign the output to
a variable. Some code may produce other output as a “side effect” other
than what is printed to the console. For instance, plot()
produces a plot image in a different window.
As an example here is some plotting code using a built-in R dataset
called mtcars
which will plot gas mileage mpg
as a function of horsepower hp
.
plot(mpg ~ hp, data = mtcars)
There are also three types of messages that code can produce in addition to its output: errors, warnings, and notes.
An error indicates that something is wrong with the code so that it cannot produce any output. For instance if we use an uneven number of parentheses:
sin(pi))
## Error: <text>:1:8: unexpected ')'
## 1: sin(pi))
## ^
Both warnings and notes mean that the code ran and produced output,
but they are R trying to tell you something potentially important. For
instance if we try to take the logarithm of a negative number, the
result will be returned as NaN
(not a number), and R will
issue a warning.
log(-5)
## Warning in log(-5): NaNs produced
## [1] NaN
A note is just that, a note. Everything is still fine! For example if we try to print an extremely long sequence of 0s to the console, it will only print a limited number and then give us a message of how many were omitted.
rep(0, 100000)
You might have noticed the [1]
before some of the output
we made. Why is that [1]
there? It means that the variable
is a vector of length 1. We can make longer vectors as
well. A vector is one or more elements of the same data type.
For example, here are two ways to create a numeric
vector with the sequence of integers from 1 to 100. This code also
demonstrates how to use named arguments to a function. The way with
seq()
is the “longhand” way, but you can also use a special
shorthand notation with :
to produce a sequence of
integers.
seq(from = 1, to = 100, by = 1)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
1:100
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
You can see that the output provides a few indexes along the left side.
Those are integer vectors. Numeric data can be integer or double (floating point numbers with a decimal point). There are other data types, including character or text data. We put character data in quotes (single or double).
For example, here are a few ways to make vectors of character values.
This code also demonstrates how to index vectors, using square brackets
[]
to extract one or more values from a vector. R has a
“built in” character vector called letters
which includes
all the lowercase letters. You can supply one or more integer values to
the index to extract individual letters from this vector and create a
new one.
This code also demonstrates the use of the c()
function.
c()
, or “combine”, takes one or more arguments separated
with commas (,
) and puts them together into a vector of the
same data type.
c('a', 'b', 'c', 'd', 'e', 'f', 'g')
## [1] "a" "b" "c" "d" "e" "f" "g"
letters[1:7]
## [1] "a" "b" "c" "d" "e" "f" "g"
letters[c(1, 18, 19)]
## [1] "a" "r" "s"
c('USDA', 'ARS', 'SEA')
## [1] "USDA" "ARS" "SEA"
If you pass the wrong data type to a function, you usually get an error.
log('hello')
## Error in log("hello"): non-numeric argument to mathematical function
If you try to combine numeric and character data in one vector, R will usually interpret the whole thing as character data. This is a common issue when loading data.
c(100, 5.323, 'missing value', 12)
## [1] "100" "5.323" "missing value" "12"
Factor is another important data type. It appears like a character but it can only contain predefined values. These values, called “levels,” can be sorted in any order. This is useful for model fitting if you have a categorical predictor or response variable. If you specify the order of the levels, the first one is usually treated as the control or reference level in models. Factors are a little bit confusing but they are important to be aware of.
Here is an example factor.
treatment <- factor(c('low', 'low', 'medium', 'medium', 'high', 'high'))
treatment
## [1] low low medium medium high high
## Levels: high low medium
Re-sort the levels so that they are sorted in a logical order instead of alphabetical.
treatment <- factor(treatment, levels = c('low', 'medium', 'high'))
treatment
## [1] low low medium medium high high
## Levels: low medium high
Logical is another important data type. It can take
two values, TRUE
and FALSE
. We usually get a
logical vector when we do a comparison. The following examples
illustrate how the logical operators work:
x == y
: is x
equal to y
?x != y
: is x
not equal to
y
?x > y
: is x
greater than
y
?x >= y
: is x
greater than or equal to
y
?x < y
: is x
less than
y
?x <= y
: is x
less than or equal to
y
?x > y & x < z
: is x
greater than
y
and less than z
?x > y | x < z
: is x
greater than
y
or less than z
?x <- 1:5
x > 4
## [1] FALSE FALSE FALSE FALSE TRUE
x <= 2
## [1] TRUE TRUE FALSE FALSE FALSE
x == 3
## [1] FALSE FALSE TRUE FALSE FALSE
x != 2
## [1] TRUE FALSE TRUE TRUE TRUE
x > 1 & x < 5
## [1] FALSE TRUE TRUE TRUE FALSE
x <= 1 | x >= 5
## [1] TRUE FALSE FALSE FALSE TRUE
Other important logical operators are !
and
%in%
.
!
is the negation operator. It converts all
TRUE
values to FALSE
and vice versa.
!(x == 3)
## [1] TRUE TRUE FALSE TRUE TRUE
%in%
is an operator that goes through the vector on the
left-hand side and returns TRUE
for the values that appear
anywhere in the vector on the right-hand side, and FALSE
otherwise.
c(1, 5, 6, 7) %in% x
## [1] TRUE TRUE FALSE FALSE
x %in% c(1, 5, 6, 7)
## [1] TRUE FALSE FALSE FALSE TRUE
The %in%
operator is often used to subset data frames.
We will see an example of this in the next lesson.
Some functions work on vectors and return one value for each element
in the vector. Here are some examples. The function exp()
takes a vector as input and returns a vector of the same length, the
exponential of each element in the vector. Here we create a vector of
1000 draws from a standard normal distribution with the
rnorm()
function.
set.seed(123)
random_numbers <- rnorm(n = 1000, mean = 0, sd = 1)
head(exp(random_numbers))
## [1] 0.5709374 0.7943926 4.7526783 1.0730536 1.1380175 5.5570366
PROTIP: We use
set.seed()
to ensure the code produces the same result each time we generate random numbers.
PROTIP 2: Using
head()
only prints the first few values of a vector, preventing us having to scroll through 1000 values.
Other functions with vector inputs return only one or a few values
regardless of the length of the vector input to them. The functions
length()
, mean()
, median()
, and
sd()
take a vector as input and return a single value. The
function range()
returns a vector of two values, the
minimum and maximum of the vector. The function quantile()
is a little bit more complicated. It takes two vectors as input. The
second vector, probs
, contains the probabilities we want to
calculate the quantiles for. The function returns a vector with the same
length as probs
containing the percentiles.
length(random_numbers)
## [1] 1000
mean(random_numbers)
## [1] 0.01612787
median(random_numbers)
## [1] 0.009209639
sd(random_numbers)
## [1] 0.991695
range(random_numbers)
## [1] -2.809775 3.241040
quantile(random_numbers, probs = c(0.025, 0.5, 0.975))
## 2.5% 50% 97.5%
## -1.941554327 0.009209639 2.037886785
In the code just above, we saw rnorm()
, which generates
random draws from a normal distribution. R has a lot of built-in
statistical distributions. All of them have four functions beginning
with r
, d
, p
, and q
,
and followed by the (abbreviated) name of the distribution.
The four functions are:
r
: random draws from the distributiond
: probability density function (what is the y-value of
the function given x?)p
: cumulative density function: (what is the cumulative
probability given x?)q
: quantile (what is the x-value given the cumulative
probability?); q
is the inverse of p
.For example, the functions for the normal distribution are
rnorm()
, dnorm()
, pnorm()
, and
qnorm()
. All of these default to the standard normal
distribution with mean = 0
and sd = 1
, but you
can change those parameters by modifying the mean
and
sd
arguments.
We’ve already seen rnorm()
above. These figures
illustrate what dnorm()
, pnorm()
, and
qnorm()
do.
## Warning: package 'ggplot2' was built under R version 4.2.3
Other distributions you might work with:
rbinom()
, dbinom()
,
pbinom()
, qbinom()
)runif()
, dunif()
,
punif()
, qunif()
)rt()
, dt()
,
pt()
, qt()
)Type ?Distributions
in your console to see help
documentation about all the built-in distributions. Many R packages have
implemented other less common distributions.
If you get an error or your code doesn’t work, here are some things to check.
(5+3))/2 # Nope
(5+3)/2 # Yep
## Error: <text>:1:6: unexpected ')'
## 1: (5+3))
## ^
my_variable <- 100000
myvariable
## Error in eval(expr, envir, enclos): object 'myvariable' not found
x<-log(500,base=2)
is the same as
x <- log(500, base = 2)
but the latter is much easier to
read especially when you have long and complicated expressions. But you
can’t put spaces in the middle of the name of a function or variable,
otherwise you will get an error.some_numbers <- 1:5
( some_numbers + 3 ) ^ 2
(some_numbers+3)^2
(some numbers + 3)^2
## Error: <text>:7:7: unexpected symbol
## 6:
## 7: (some numbers
## ^
sum(1:10)
## [1] 55
Sum(1:10)
## Error in Sum(1:10): could not find function "Sum"
So far, we haven’t had to load any packages. We have only used code from “base R.” But almost any R script requires loading one or more packages. Packages are sets of functions contributed by R users that are available for download on CRAN, the online R code repository.
You install a package for the first time either via the RStudio
dialog or with the function install.packages()
. This
only needs to be done once!
install.packages('cowsay')
PROTIP: You can specify the location of the library the package will install into. This means you can specify one that doesn’t require administrator level access.
When you want to use functions from a package, you load it from the
code library where packages are installed using the function
library()
. This needs to be done every time you load a
package!
library(cowsay)
## Warning: package 'cowsay' was built under R version 4.2.3
say('USDA statisticians are the best!', by = 'cow')
##
## -----
## USDA statisticians are the best!
## ------
## \ ^__^
## \ (oo)\ ________
## (__)\ )\ /\
## ||------w|
## || ||
If you want to use a function without calling library()
,
or there are functions of the same name in different packages and you
want to be explicit about which one you are calling, you can use the
package name followed by ::
like this:
cowsay::say('Always close your parentheses!', by = 'chicken')
##
## -----
## Always close your parentheses!
## ------
## \
## \
## _
## _/ }
## `>' \
## `| \
## | /'-. .-.
## \' ';`--' .'
## \'. `'-./
## '.`-..-;`
## `;-..'
## _| _|
## /` /` [nosig]
##
To access all the help documentation for a package, use
help(package = 'packagename')
.
Google is your friend. Especially try to copy and paste your error message.
StackOverflow is your friend too (and stats.stackexchange.com if you have a question about stats that isn’t specific to R programming). As a beginner, it is almost a certainty that you will find your question already answered on there. If you can’t, read their guidelines on how to create a great reproducible example in R and post your question there!
We can simply type individual lines of code and run them one by one in the console, but that is not a good practice when you are doing complex data wrangling and analysis that you may want to save and run again later. For that we use scripts, or text files containing code.
We can run individual lines or selected blocks of code from the
script editor by pressing Ctrl+Enter
or
Cmd+Enter
while the cursor is on the line or some code is
selected.
Those are really important things but we aren’t going to cover them in this lesson. Functions and lists are really important if you start to do more complex analyses in R where you have to write your own code and can no longer use pre-packaged “off the shelf” functions to do everything you need to do. If, else, and for maybe aren’t quite as crucial but still very important. I will not cover them today but I strongly encourage you to explore the R resources I’ve provided to learn more. And maybe I’ll discuss them in a future workshop.
Use R to find the arc-sine of the square root of 0.5. (Hint: you may need to use help documentation or Google to figure out what function or functions to use).
Use R to find the sum of all positive integers from 1 to 12345.
Use R to “flip a coin” with 0.5 probability of getting heads 1000 times. You can do this by generating 1 random variable from a binomial distribution with 1000 trials and probability 0.5. How many heads were flipped in the simulation if you set the random seed to 1 beforehand?
set.seed()
to set the random seed.rbinom()
. The arguments
n
, size
, and prob
are given in
the question above in the correct order! Type ?rbinom
into
the console if you need help.Install the dadjokeapi
package, load the package, and
use the package to produce a dad joke. You may need to use the help
documentation to figure out how to get R to tell you a dad joke.
install.packages
with the package name in quotes to install, library
to
load, and help(package = ...)
to get help.