At the end of this lesson, students will …

- Know what R is and what it can do.
- Use the R console to interactively issue R commands.
- Know the most common data types in R.
- Know how statistical distributions work in R.
- Know what R packages are and how to install and load them.

In this course, we’re going to learn about mixed models in R.

The course is divided up into lessons. For each lesson, I’m going to
give you a worksheet. The worksheet will have a bunch of code in it, but
parts of the code will be missing and replaced with a `...`

.
Your job will be to fill in the `...`

with the code. Don’t
worry, you will have access to the full code on the lesson webpage as
well as being able to see it when I present it on the screen. I strongly
recommend **TYPING IN THE CODE** yourself instead of
copy+pasting it. Research shows that this helps you learn it better.
Also, if you make a mistake and get an error or different output than
you expected, you might learn something about how the code works (or
doesn’t work).

Each lesson has some exercises at the end. They are an optional way to get more practice, but don’t feel any pressure to do them. We will have “office hours” with open Q and A time for two hours each afternoon, which would be a good time to go over the exercises.

This course is intended to teach some practical skills. It is
**NOT** intended to teach elementary principles of
statistics. There’s no way to do that in two days! We may gloss over
some technical details of the models. I’m happy to answer questions
about them later.

R and RStudio are software tools to help you work with and analyze your data.

R is a statistical programming language, created by two statisticians from New Zealand in 1993. It is free and open-source. Users contribute code in the form of packages that anyone can download from the central R package repository, CRAN (Comprehensive R Archive Network). Thanks to the broad and diverse base of users there are R packages for all kinds of applications: stats, data visualization, GIS, text analysis, machine learning, phylogenetics, the list goes on! In this workshop, we are going to learn a few basics of data manipulation and explore some of the simpler statistical models you can fit with R.

RStudio is a tool to help you write and run code in R.

There are four panes that you see when you open RStudio:

**Console**: Here you can enter R code line-by-line and run it interactively. The output of the code appears here.**Scripts**: This lets you type code into scripts (text files). You can run code from the script pane, or save it to run later.**Environment**: This shows what variables are currently defined in your R workspace, among other things. We won’t worry about the other tabs at the moment.**Files/Plots/Help**: This pane has a few different tabs that you need to know about. The Files tab lets you navigate through directories on your system, the Plots tab shows any plots or images generated by your R code, the Packages tab shows you what packages are installed and lets you install other ones, and the Help tab shows documentation for functions and packages.

The two main types of objects you need to know about in R (which is true for many other programming languages as well) are variables and functions.

**variable**: a structure that holds data. Examples:- a vector of integers
`c(1, 2, 3)`

- a character string
`"USDA"`

- a data frame with 1000 rows and 10 columns… all these things are variables

- a vector of integers
**function**: something that takes arguments as input, does something, and returns output. Examples:`log(10)`

: The function`log()`

takes a numeric value as input and returns a numeric value as output.`c(1, 5, 6)`

: The function`c()`

takes multiple values as input and returns a vector as output.`read.csv('myfile.csv')`

: The function`read.csv()`

takes a character string with a the file name of a CSV file as input, reads that CSV file, and returns a data frame as output.

Let’s start writing our first R code!

We are going to start by entering commands into the R console (the terminal where you can enter individual lines of code and run them).

Here are some simple things you can do with R.

- You can use R as a calculator using operators like
`+`

and`-`

.

`2 + 3`

`## [1] 5`

- The assignment operator
`<-`

is used to create a new variable and give it a value. The syntax is`variable <- <value>`

. - Variable names can contain
`.`

or`_`

but can’t contain spaces or start with a number. - You can also use
`=`

as an assignment operator but we will use`<-`

in this workshop. Consistent code is readable code!

```
x <- 2 + 3
y = 3.5
```

- Entering the name of a variable prints that variable’s value to the console.
- If you assign a value to a new variable, nothing will print to the console. But the variable is now defined in your environment and can be used later.

`x`

`## [1] 5`

`x + y`

`## [1] 8.5`

`x * 4`

`## [1] 20`

```
x <- x + 1
z <- x * 4
z
```

`## [1] 24`

- A function followed by an argument in parentheses
`()`

, like`function(<value>)`

, will input a value to a function and return some output.

`log(1000)`

`## [1] 6.907755`

`sin(pi)`

`## [1] 1.224606e-16`

- Any line preceded by
`#`

is a comment and will not be evaluated.

`# This is a comment`

- Functions can take multiple arguments separated by commas
`,`

. - When you put text inside quotes, you can use either
`'single quotes'`

or`"double quotes"`

. Which one you choose is up to your personal style preferences, but again remember to be consistent!

```
my_name <- "Quentin"
paste('Hello,', my_name)
```

`## [1] "Hello, Quentin"`

- Use
`?`

to get help about a function.

`?paste`

- If you don’t even know the name of the function you’re searching
for, use
`??`

to search all help documentation for a term.

`??sequence`

When you execute some R code, there may be some output. As we just
mentioned, it will print to the console unless you assign the output to
a variable. Some code may produce other output as a “side effect” other
than what is printed to the console. For instance, `plot()`

produces a plot image in a different window.

As an example here is some plotting code using a built-in R dataset
called `mtcars`

which will plot gas mileage `mpg`

as a function of horsepower `hp`

.

`plot(mpg ~ hp, data = mtcars)`

There are also three types of messages that code can produce in
addition to its output: *errors*, *warnings*, and
*notes*.

An error indicates that something is wrong with the code so that it cannot produce any output. For instance if we use an uneven number of parentheses:

`sin(pi))`

```
## Error: <text>:1:8: unexpected ')'
## 1: sin(pi))
## ^
```

Both warnings and notes mean that the code ran and produced output,
but they are R trying to tell you something potentially important. For
instance if we try to take the logarithm of a negative number, the
result will be returned as `NaN`

(not a number), and R will
issue a warning.

`log(-5)`

`## Warning in log(-5): NaNs produced`

`## [1] NaN`

A note is just that, a note. Everything is still fine! For example if we try to print an extremely long sequence of 0s to the console, it will only print a limited number and then give us a message of how many were omitted.

`rep(0, 100000)`

You might have noticed the `[1]`

before some of the output
we made. Why is that `[1]`

there? It means that the variable
is a **vector** of length 1. We can make longer vectors as
well. A vector is one or more elements of the same data type.

For example, here are two ways to create a **numeric**
vector with the sequence of integers from 1 to 100. This code also
demonstrates how to use named arguments to a function. The way with
`seq()`

is the “longhand” way, but you can also use a special
shorthand notation with `:`

to produce a sequence of
integers.

`seq(from = 1, to = 100, by = 1)`

```
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
```

`1:100`

```
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
```

You can see that the output provides a few indexes along the left side.

Those are integer vectors. Numeric data can be integer or double
(floating point numbers with a decimal point). There are other data
types, including **character** or text data. We put
character data in quotes (single or double).

For example, here are a few ways to make vectors of character values.
This code also demonstrates how to index vectors, using square brackets
`[]`

to extract one or more values from a vector. R has a
“built in” character vector called `letters`

which includes
all the lowercase letters. You can supply one or more integer values to
the index to extract individual letters from this vector and create a
new one.

This code also demonstrates the use of the `c()`

function.
`c()`

, or “combine”, takes one or more arguments separated
with commas (`,`

) and puts them together into a vector of the
same data type.

`c('a', 'b', 'c', 'd', 'e', 'f', 'g')`

`## [1] "a" "b" "c" "d" "e" "f" "g"`

`letters[1:7]`

`## [1] "a" "b" "c" "d" "e" "f" "g"`

`letters[c(1, 18, 19)]`

`## [1] "a" "r" "s"`

`c('USDA', 'ARS', 'SEA')`

`## [1] "USDA" "ARS" "SEA"`

If you pass the wrong data type to a function, you usually get an error.

`log('hello')`

`## Error in log("hello"): non-numeric argument to mathematical function`

If you try to combine numeric and character data in one vector, R will usually interpret the whole thing as character data. This is a common issue when loading data.

`c(100, 5.323, 'missing value', 12)`

`## [1] "100" "5.323" "missing value" "12"`

**Factor** is another important data type. It appears
like a character but it can only contain predefined values. These
values, called “levels,” can be sorted in any order. This is useful for
model fitting if you have a categorical predictor or response variable.
If you specify the order of the levels, the first one is usually treated
as the control or reference level in models. Factors are a little bit
confusing but they are important to be aware of.

Here is an example factor.

```
treatment <- factor(c('low', 'low', 'medium', 'medium', 'high', 'high'))
treatment
```

```
## [1] low low medium medium high high
## Levels: high low medium
```

Re-sort the levels so that they are sorted in a logical order instead of alphabetical.

```
treatment <- factor(treatment, levels = c('low', 'medium', 'high'))
treatment
```

```
## [1] low low medium medium high high
## Levels: low medium high
```

**Logical** is another important data type. It can take
two values, `TRUE`

and `FALSE`

. We usually get a
logical vector when we do a comparison. The following examples
illustrate how the logical operators work:

`x == y`

: is`x`

equal to`y`

?`x != y`

: is`x`

not equal to`y`

?`x > y`

: is`x`

greater than`y`

?`x >= y`

: is`x`

greater than or equal to`y`

?`x < y`

: is`x`

less than`y`

?`x <= y`

: is`x`

less than or equal to`y`

?`x > y & x < z`

: is`x`

greater than`y`

and less than`z`

?`x > y | x < z`

: is`x`

greater than`y`

or less than`z`

?

```
x <- 1:5
x > 4
```

`## [1] FALSE FALSE FALSE FALSE TRUE`

`x <= 2`

`## [1] TRUE TRUE FALSE FALSE FALSE`

`x == 3`

`## [1] FALSE FALSE TRUE FALSE FALSE`

`x != 2`

`## [1] TRUE FALSE TRUE TRUE TRUE`

`x > 1 & x < 5`

`## [1] FALSE TRUE TRUE TRUE FALSE`

`x <= 1 | x >= 5`

`## [1] TRUE FALSE FALSE FALSE TRUE`

Other important logical operators are `!`

and
`%in%`

.

`!`

is the negation operator. It converts all
`TRUE`

values to `FALSE`

and vice versa.

`!(x == 3)`

`## [1] TRUE TRUE FALSE TRUE TRUE`

`%in%`

is an operator that goes through the vector on the
left-hand side and returns `TRUE`

for the values that appear
anywhere in the vector on the right-hand side, and `FALSE`

otherwise.

`c(1, 5, 6, 7) %in% x`

`## [1] TRUE TRUE FALSE FALSE`

`x %in% c(1, 5, 6, 7)`

`## [1] TRUE FALSE FALSE FALSE TRUE`

The `%in%`

operator is often used to subset data frames.
We will see an example of this in the next lesson.

Some functions work on vectors and return one value for each element
in the vector. Here are some examples. The function `exp()`

takes a vector as input and returns a vector of the same length, the
exponential of each element in the vector. Here we create a vector of
1000 draws from a standard normal distribution with the
`rnorm()`

function.

```
set.seed(123)
random_numbers <- rnorm(n = 1000, mean = 0, sd = 1)
head(exp(random_numbers))
```

`## [1] 0.5709374 0.7943926 4.7526783 1.0730536 1.1380175 5.5570366`

PROTIP: We use

`set.seed()`

to ensure the code produces the same result each time we generate random numbers.

PROTIP 2: Using

`head()`

only prints the first few values of a vector, preventing us having to scroll through 1000 values.

Other functions with vector inputs return only one or a few values
regardless of the length of the vector input to them. The functions
`length()`

, `mean()`

, `median()`

, and
`sd()`

take a vector as input and return a single value. The
function `range()`

returns a vector of two values, the
minimum and maximum of the vector. The function `quantile()`

is a little bit more complicated. It takes two vectors as input. The
second vector, `probs`

, contains the probabilities we want to
calculate the quantiles for. The function returns a vector with the same
length as `probs`

containing the percentiles.

`length(random_numbers)`

`## [1] 1000`

`mean(random_numbers)`

`## [1] 0.01612787`

`median(random_numbers)`

`## [1] 0.009209639`

`sd(random_numbers)`

`## [1] 0.991695`

`range(random_numbers)`

`## [1] -2.809775 3.241040`

`quantile(random_numbers, probs = c(0.025, 0.5, 0.975))`

```
## 2.5% 50% 97.5%
## -1.941554327 0.009209639 2.037886785
```

In the code just above, we saw `rnorm()`

, which generates
random draws from a normal distribution. R has a lot of built-in
statistical distributions. All of them have four functions beginning
with `r`

, `d`

, `p`

, and `q`

,
and followed by the (abbreviated) name of the distribution.

The four functions are:

`r`

: random draws from the distribution`d`

: probability density function (what is the y-value of the function given x?)`p`

: cumulative density function: (what is the cumulative probability given x?)`q`

: quantile (what is the x-value given the cumulative probability?);`q`

is the inverse of`p`

.

For example, the functions for the normal distribution are
`rnorm()`

, `dnorm()`

, `pnorm()`

, and
`qnorm()`

. All of these default to the standard normal
distribution with `mean = 0`

and `sd = 1`

, but you
can change those parameters by modifying the `mean`

and
`sd`

arguments.

We’ve already seen `rnorm()`

above. These figures
illustrate what `dnorm()`

, `pnorm()`

, and
`qnorm()`

do.

`## Warning: package 'ggplot2' was built under R version 4.2.3`

Other distributions you might work with:

- Binomial (
`rbinom()`

,`dbinom()`

,`pbinom()`

,`qbinom()`

) - Uniform (
`runif()`

,`dunif()`

,`punif()`

,`qunif()`

) - Student’s
*t*(`rt()`

,`dt()`

,`pt()`

,`qt()`

) - The list goes on …

Type `?Distributions`

in your console to see help
documentation about all the built-in distributions. Many R packages have
implemented other less common distributions.

If you get an error or your code doesn’t work, here are some things to check.

*Punctuation*: close all parentheses, brackets, and quotation marks. (Notice parentheses can be used both for function arguments and to group expressions together.)

```
(5+3))/2 # Nope
(5+3)/2 # Yep
```

```
## Error: <text>:1:6: unexpected ')'
## 1: (5+3))
## ^
```

*Spelling*: are the functions and variables spelled correctly?

```
my_variable <- 100000
myvariable
```

`## Error in eval(expr, envir, enclos): object 'myvariable' not found`

*Spaces*: You can put spaces in between most things in R to make the code more readable. I recommend using spaces in this way. For example`x<-log(500,base=2)`

is the same as`x <- log(500, base = 2)`

but the latter is much easier to read especially when you have long and complicated expressions. But you can’t put spaces in the middle of the name of a function or variable, otherwise you will get an error.

```
some_numbers <- 1:5
( some_numbers + 3 ) ^ 2
(some_numbers+3)^2
(some numbers + 3)^2
```

```
## Error: <text>:7:7: unexpected symbol
## 6:
## 7: (some numbers
## ^
```

*Case*: R is CASE-SENSITIVE.

`sum(1:10)`

`## [1] 55`

`Sum(1:10)`

`## Error in Sum(1:10): could not find function "Sum"`

So far, we haven’t had to load any packages. We have only used code from “base R.” But almost any R script requires loading one or more packages. Packages are sets of functions contributed by R users that are available for download on CRAN, the online R code repository.

You install a package for the first time either via the RStudio
dialog or with the function `install.packages()`

. *This
only needs to be done once!*

`install.packages('cowsay')`

PROTIP: You can specify the location of the library the package will install into. This means you can specify one that doesn’t require administrator level access.

When you want to use functions from a package, you load it from the
code library where packages are installed using the function
`library()`

. *This needs to be done every time you load a
package!*

```
library(cowsay)
say('USDA statisticians are the best!', by = 'cow')
```

```
##
## -----
## USDA statisticians are the best!
## ------
## \ ^__^
## \ (oo)\ ________
## (__)\ )\ /\
## ||------w|
## || ||
```

If you want to use a function without calling `library()`

,
or there are functions of the same name in different packages and you
want to be explicit about which one you are calling, you can use the
package name followed by `::`

like this:

`cowsay::say('Always close your parentheses!', by = 'chicken')`

```
##
## -----
## Always close your parentheses!
## ------
## \
## \
## _
## _/ }
## `>' \
## `| \
## | /'-. .-.
## \' ';`--' .'
## \'. `'-./
## '.`-..-;`
## `;-..'
## _| _|
## /` /` [nosig]
##
```

To access all the help documentation for a package, use
`help(package = 'packagename')`

.

Google is your friend. Especially try to copy and paste your error message.

StackOverflow is your friend too (and stats.stackexchange.com if you have a question about stats that isn’t specific to R programming). As a beginner, it is almost a certainty that you will find your question already answered on there. If you can’t, read their guidelines on how to create a great reproducible example in R and post your question there!

We can simply type individual lines of code and run them one by one in the console, but that is not a good practice when you are doing complex data wrangling and analysis that you may want to save and run again later. For that we use scripts, or text files containing code.

We can run individual lines or selected blocks of code from the
script editor by pressing `Ctrl+Enter`

or
`Cmd+Enter`

while the cursor is on the line or some code is
selected.

- Functions
- Lists
- Flow control (if, else, for)

Those are really important things but we aren’t going to cover them in this lesson. Functions and lists are really important if you start to do more complex analyses in R where you have to write your own code and can no longer use pre-packaged “off the shelf” functions to do everything you need to do. If, else, and for maybe aren’t quite as crucial but still very important. I will not cover them today but I strongly encourage you to explore the R resources I’ve provided to learn more. And maybe I’ll discuss them in a future workshop.

Use R to find the arc-sine of the square root of 0.5. (Hint: you may need to use help documentation or Google to figure out what function or functions to use).

Use R to find the sum of all positive integers from 1 to 12345.

Use R to “flip a coin” with 0.5 probability of getting heads 1000 times. You can do this by generating 1 random variable from a binomial distribution with 1000 trials and probability 0.5. How many heads were flipped in the simulation if you set the random seed to 1 beforehand?

*Hint 1*: Use`set.seed()`

to set the random seed.*Hint 2*: There are a few ways to do this, but one way is to use`rbinom()`

. The arguments`n`

,`size`

, and`prob`

are given in the question above in the correct order! Type`?rbinom`

into the console if you need help.

Install the `dadjokeapi`

package, load the package, and
use the package to produce a dad joke. You may need to use the help
documentation to figure out how to get R to tell you a dad joke.

*Hint*:`install.packages`

with the package name in quotes to install,`library`

to load, and`help(package = ...)`

to get help.