Lesson 1 learning objectives

At the end of this lesson, students will …

Introduction to this course

In this course, we’re going to learn about mixed models in R.

The course is divided up into lessons. For each lesson, I’m going to give you a worksheet. The worksheet will have a bunch of code in it, but parts of the code will be missing and replaced with a .... Your job will be to fill in the ... with the code. Don’t worry, you will have access to the full code on the lesson webpage as well as being able to see it when I present it on the screen. I strongly recommend TYPING IN THE CODE yourself instead of copy+pasting it. Research shows that this helps you learn it better. Also, if you make a mistake and get an error or different output than you expected, you might learn something about how the code works (or doesn’t work).

Each lesson has some exercises at the end. They are an optional way to get more practice, but don’t feel any pressure to do them. We will have “office hours” with open Q and A time for two hours each afternoon, which would be a good time to go over the exercises.

A practical toolkit

This course is intended to teach some practical skills. It is NOT intended to teach elementary principles of statistics. There’s no way to do that in two days! We may gloss over some technical details of the models. I’m happy to answer questions about them later.

Introduction to R and RStudio

R and RStudio are software tools to help you work with and analyze your data.

What is R?

R is a statistical programming language, created by two statisticians from New Zealand in 1993. It is free and open-source. Users contribute code in the form of packages that anyone can download from the central R package repository, CRAN (Comprehensive R Archive Network). Thanks to the broad and diverse base of users there are R packages for all kinds of applications: stats, data visualization, GIS, text analysis, machine learning, phylogenetics, the list goes on! In this workshop, we are going to learn a few basics of data manipulation and explore some of the simpler statistical models you can fit with R.

What is RStudio?

RStudio is a tool to help you write and run code in R.

There are four panes that you see when you open RStudio:

  • Console: Here you can enter R code line-by-line and run it interactively. The output of the code appears here.
  • Scripts: This lets you type code into scripts (text files). You can run code from the script pane, or save it to run later.
  • Environment: This shows what variables are currently defined in your R workspace, among other things. We won’t worry about the other tabs at the moment.
  • Files/Plots/Help: This pane has a few different tabs that you need to know about. The Files tab lets you navigate through directories on your system, the Plots tab shows any plots or images generated by your R code, the Packages tab shows you what packages are installed and lets you install other ones, and the Help tab shows documentation for functions and packages.

The basic moving parts of R

The two main types of objects you need to know about in R (which is true for many other programming languages as well) are variables and functions.

  • variable: a structure that holds data. Examples:
    • a vector of integers c(1, 2, 3)
    • a character string "USDA"
    • a data frame with 1000 rows and 10 columns… all these things are variables
  • function: something that takes arguments as input, does something, and returns output. Examples:
    • log(10): The function log() takes a numeric value as input and returns a numeric value as output.
    • c(1, 5, 6): The function c() takes multiple values as input and returns a vector as output.
    • read.csv('myfile.csv'): The function read.csv() takes a character string with a the file name of a CSV file as input, reads that CSV file, and returns a data frame as output.

How to R

Let’s start writing our first R code!

Using the R console

We are going to start by entering commands into the R console (the terminal where you can enter individual lines of code and run them).

Here are some simple things you can do with R.

  • You can use R as a calculator using operators like + and -.
2 + 3
## [1] 5
  • The assignment operator <- is used to create a new variable and give it a value. The syntax is variable <- <value>.
  • Variable names can contain . or _ but can’t contain spaces or start with a number.
  • You can also use = as an assignment operator but we will use <- in this workshop. Consistent code is readable code!
x <- 2 + 3
y = 3.5
  • Entering the name of a variable prints that variable’s value to the console.
  • If you assign a value to a new variable, nothing will print to the console. But the variable is now defined in your environment and can be used later.
x
## [1] 5
x + y
## [1] 8.5
x * 4
## [1] 20
x <- x + 1
z <- x * 4
z
## [1] 24
  • A function followed by an argument in parentheses (), like function(<value>), will input a value to a function and return some output.
log(1000)
## [1] 6.907755
sin(pi)
## [1] 1.224606e-16
  • Any line preceded by # is a comment and will not be evaluated.
# This is a comment
  • Functions can take multiple arguments separated by commas ,.
  • When you put text inside quotes, you can use either 'single quotes' or "double quotes". Which one you choose is up to your personal style preferences, but again remember to be consistent!
my_name <- "Quentin"

paste('Hello,', my_name)
## [1] "Hello, Quentin"
  • Use ? to get help about a function.
?paste
  • If you don’t even know the name of the function you’re searching for, use ?? to search all help documentation for a term.
??sequence

Types of output

When you execute some R code, there may be some output. As we just mentioned, it will print to the console unless you assign the output to a variable. Some code may produce other output as a “side effect” other than what is printed to the console. For instance, plot() produces a plot image in a different window.

As an example here is some plotting code using a built-in R dataset called mtcars which will plot gas mileage mpg as a function of horsepower hp.

plot(mpg ~ hp, data = mtcars)

example scatterplot of mileage versus horsepower

There are also three types of messages that code can produce in addition to its output: errors, warnings, and notes.

Errors

An error indicates that something is wrong with the code so that it cannot produce any output. For instance if we use an uneven number of parentheses:

sin(pi))
## Error: <text>:1:8: unexpected ')'
## 1: sin(pi))
##            ^

Warnings

Both warnings and notes mean that the code ran and produced output, but they are R trying to tell you something potentially important. For instance if we try to take the logarithm of a negative number, the result will be returned as NaN (not a number), and R will issue a warning.

log(-5)
## Warning in log(-5): NaNs produced
## [1] NaN

Notes

A note is just that, a note. Everything is still fine! For example if we try to print an extremely long sequence of 0s to the console, it will only print a limited number and then give us a message of how many were omitted.

rep(0, 100000)

Data types in R

You might have noticed the [1] before some of the output we made. Why is that [1] there? It means that the variable is a vector of length 1. We can make longer vectors as well. A vector is one or more elements of the same data type.

For example, here are two ways to create a numeric vector with the sequence of integers from 1 to 100. This code also demonstrates how to use named arguments to a function. The way with seq() is the “longhand” way, but you can also use a special shorthand notation with : to produce a sequence of integers.

seq(from = 1, to = 100, by = 1)
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100
1:100
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100

You can see that the output provides a few indexes along the left side.

Those are integer vectors. Numeric data can be integer or double (floating point numbers with a decimal point). There are other data types, including character or text data. We put character data in quotes (single or double).

For example, here are a few ways to make vectors of character values. This code also demonstrates how to index vectors, using square brackets [] to extract one or more values from a vector. R has a “built in” character vector called letters which includes all the lowercase letters. You can supply one or more integer values to the index to extract individual letters from this vector and create a new one.

This code also demonstrates the use of the c() function. c(), or “combine”, takes one or more arguments separated with commas (,) and puts them together into a vector of the same data type.

c('a', 'b', 'c', 'd', 'e', 'f', 'g')
## [1] "a" "b" "c" "d" "e" "f" "g"
letters[1:7]
## [1] "a" "b" "c" "d" "e" "f" "g"
letters[c(1, 18, 19)]
## [1] "a" "r" "s"
c('USDA', 'ARS', 'SEA')
## [1] "USDA" "ARS"  "SEA"

If you pass the wrong data type to a function, you usually get an error.

log('hello')
## Error in log("hello"): non-numeric argument to mathematical function

If you try to combine numeric and character data in one vector, R will usually interpret the whole thing as character data. This is a common issue when loading data.

c(100, 5.323, 'missing value', 12)
## [1] "100"           "5.323"         "missing value" "12"

Factor is another important data type. It appears like a character but it can only contain predefined values. These values, called “levels,” can be sorted in any order. This is useful for model fitting if you have a categorical predictor or response variable. If you specify the order of the levels, the first one is usually treated as the control or reference level in models. Factors are a little bit confusing but they are important to be aware of.

Here is an example factor.

treatment <- factor(c('low', 'low', 'medium', 'medium', 'high', 'high'))

treatment
## [1] low    low    medium medium high   high  
## Levels: high low medium

Re-sort the levels so that they are sorted in a logical order instead of alphabetical.

treatment <- factor(treatment, levels = c('low', 'medium', 'high'))

treatment
## [1] low    low    medium medium high   high  
## Levels: low medium high

Logical is another important data type. It can take two values, TRUE and FALSE. We usually get a logical vector when we do a comparison. The following examples illustrate how the logical operators work:

x <- 1:5

x > 4
## [1] FALSE FALSE FALSE FALSE  TRUE
x <= 2
## [1]  TRUE  TRUE FALSE FALSE FALSE
x == 3
## [1] FALSE FALSE  TRUE FALSE FALSE
x != 2
## [1]  TRUE FALSE  TRUE  TRUE  TRUE
x > 1 & x < 5
## [1] FALSE  TRUE  TRUE  TRUE FALSE
x <= 1 | x >= 5
## [1]  TRUE FALSE FALSE FALSE  TRUE

Other important logical operators are ! and %in%.

! is the negation operator. It converts all TRUE values to FALSE and vice versa.

!(x == 3)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE

%in% is an operator that goes through the vector on the left-hand side and returns TRUE for the values that appear anywhere in the vector on the right-hand side, and FALSE otherwise.

c(1, 5, 6, 7) %in% x
## [1]  TRUE  TRUE FALSE FALSE
x %in% c(1, 5, 6, 7)
## [1]  TRUE FALSE FALSE FALSE  TRUE

The %in% operator is often used to subset data frames. We will see an example of this in the next lesson.

Some functions work on vectors and return one value for each element in the vector. Here are some examples. The function exp() takes a vector as input and returns a vector of the same length, the exponential of each element in the vector. Here we create a vector of 1000 draws from a standard normal distribution with the rnorm() function.

set.seed(123)

random_numbers <- rnorm(n = 1000, mean = 0, sd = 1)

head(exp(random_numbers))
## [1] 0.5709374 0.7943926 4.7526783 1.0730536 1.1380175 5.5570366

PROTIP: We use set.seed() to ensure the code produces the same result each time we generate random numbers.

PROTIP 2: Using head() only prints the first few values of a vector, preventing us having to scroll through 1000 values.

Other functions with vector inputs return only one or a few values regardless of the length of the vector input to them. The functions length(), mean(), median(), and sd() take a vector as input and return a single value. The function range() returns a vector of two values, the minimum and maximum of the vector. The function quantile() is a little bit more complicated. It takes two vectors as input. The second vector, probs, contains the probabilities we want to calculate the quantiles for. The function returns a vector with the same length as probs containing the percentiles.

length(random_numbers)
## [1] 1000
mean(random_numbers)
## [1] 0.01612787
median(random_numbers)
## [1] 0.009209639
sd(random_numbers)
## [1] 0.991695
range(random_numbers)
## [1] -2.809775  3.241040
quantile(random_numbers, probs = c(0.025, 0.5, 0.975))
##         2.5%          50%        97.5% 
## -1.941554327  0.009209639  2.037886785

Statistical distributions

In the code just above, we saw rnorm(), which generates random draws from a normal distribution. R has a lot of built-in statistical distributions. All of them have four functions beginning with r, d, p, and q, and followed by the (abbreviated) name of the distribution.

The four functions are:

For example, the functions for the normal distribution are rnorm(), dnorm(), pnorm(), and qnorm(). All of these default to the standard normal distribution with mean = 0 and sd = 1, but you can change those parameters by modifying the mean and sd arguments.

We’ve already seen rnorm() above. These figures illustrate what dnorm(), pnorm(), and qnorm() do.

## Warning: package 'ggplot2' was built under R version 4.2.3

normal distribution density plot showing value of dnorm(0)normal distribution density plot showing relationship between pnorm and qnorm

Other distributions you might work with:

Type ?Distributions in your console to see help documentation about all the built-in distributions. Many R packages have implemented other less common distributions.

Common pitfalls

If you get an error or your code doesn’t work, here are some things to check.

(5+3))/2 # Nope

(5+3)/2 # Yep
## Error: <text>:1:6: unexpected ')'
## 1: (5+3))
##          ^
my_variable <- 100000

myvariable
## Error in eval(expr, envir, enclos): object 'myvariable' not found
some_numbers <- 1:5

( some_numbers + 3 ) ^ 2

(some_numbers+3)^2

(some numbers + 3)^2
## Error: <text>:7:7: unexpected symbol
## 6: 
## 7: (some numbers
##          ^
sum(1:10)
## [1] 55
Sum(1:10)
## Error in Sum(1:10): could not find function "Sum"

R packages

So far, we haven’t had to load any packages. We have only used code from “base R.” But almost any R script requires loading one or more packages. Packages are sets of functions contributed by R users that are available for download on CRAN, the online R code repository.

You install a package for the first time either via the RStudio dialog or with the function install.packages(). This only needs to be done once!

install.packages('cowsay')

PROTIP: You can specify the location of the library the package will install into. This means you can specify one that doesn’t require administrator level access.

When you want to use functions from a package, you load it from the code library where packages are installed using the function library(). This needs to be done every time you load a package!

library(cowsay)
## Warning: package 'cowsay' was built under R version 4.2.3
say('USDA statisticians are the best!', by = 'cow')
## 
##  ----- 
## USDA statisticians are the best! 
##  ------ 
##     \   ^__^ 
##      \  (oo)\ ________ 
##         (__)\         )\ /\ 
##              ||------w|
##              ||      ||

If you want to use a function without calling library(), or there are functions of the same name in different packages and you want to be explicit about which one you are calling, you can use the package name followed by :: like this:

cowsay::say('Always close your parentheses!', by = 'chicken')
## 
##  ----- 
## Always close your parentheses! 
##  ------ 
##     \   
##      \
##          _
##        _/ }
##       `>' \
##       `|   \
##        |   /'-.     .-.
##         \'     ';`--' .'
##          \'.    `'-./
##           '.`-..-;`
##             `;-..'
##             _| _|
##             /` /` [nosig]
## 

To access all the help documentation for a package, use help(package = 'packagename').

Learning R best practices

How do I get help?

Google is your friend. Especially try to copy and paste your error message.

StackOverflow is your friend too (and stats.stackexchange.com if you have a question about stats that isn’t specific to R programming). As a beginner, it is almost a certainty that you will find your question already answered on there. If you can’t, read their guidelines on how to create a great reproducible example in R and post your question there!

Console versus script editor

We can simply type individual lines of code and run them one by one in the console, but that is not a good practice when you are doing complex data wrangling and analysis that you may want to save and run again later. For that we use scripts, or text files containing code.

We can run individual lines or selected blocks of code from the script editor by pressing Ctrl+Enter or Cmd+Enter while the cursor is on the line or some code is selected.

Hey! What about … ?

Those are really important things but we aren’t going to cover them in this lesson. Functions and lists are really important if you start to do more complex analyses in R where you have to write your own code and can no longer use pre-packaged “off the shelf” functions to do everything you need to do. If, else, and for maybe aren’t quite as crucial but still very important. I will not cover them today but I strongly encourage you to explore the R resources I’ve provided to learn more. And maybe I’ll discuss them in a future workshop.

Exercises

Exercise 1

Use R to find the arc-sine of the square root of 0.5. (Hint: you may need to use help documentation or Google to figure out what function or functions to use).

Exercise 2

Use R to find the sum of all positive integers from 1 to 12345.

Exercise 3

Use R to “flip a coin” with 0.5 probability of getting heads 1000 times. You can do this by generating 1 random variable from a binomial distribution with 1000 trials and probability 0.5. How many heads were flipped in the simulation if you set the random seed to 1 beforehand?

  • Hint 1: Use set.seed() to set the random seed.
  • Hint 2: There are a few ways to do this, but one way is to use rbinom(). The arguments n, size, and prob are given in the question above in the correct order! Type ?rbinom into the console if you need help.

Exercise 4

Install the dadjokeapi package, load the package, and use the package to produce a dad joke. You may need to use the help documentation to figure out how to get R to tell you a dad joke.

  • Hint: install.packages with the package name in quotes to install, library to load, and help(package = ...) to get help.

Click here for answers