Lesson 1: R basics

Course introduction: mixed models in R

  • Follow along with lesson slides and lesson text
  • Each lesson has a worksheet
  • Fill in the ... in the worksheet with code
  • Typing in the code yourself is better than copy+paste!
  • Optional exercises at the end
  • This is a practical skills course, not “Principles of Statistics 101!”

Day 1 Schedule

Time                  Activity
9:00-9:15 AM          Introductions, troubleshooting
9:15-10:15 AM         Lesson 1: R Boot Camp: the very basics
10:15-10:45 AM        break
10:45-11:30 AM        Lesson 2: R Boot Camp: working with data frames
11:30-11:45 AM        break
11:45 AM-12:45 PM     Lesson 3: From linear model to linear mixed model
12:45-2:00 PM         lunch break
2:00-4:00 PM          office hours

(Day 2 schedule has the same format)

Lesson 1 learning objectives

At the end of this lesson, students will …

  • Know what R is and what it can do.
  • Use the R console to interactively issue R commands.
  • Know the most common data types in R.
  • Know how statistical distributions work in R.
  • Know what R packages are and how to install and load them.

Introduction to R and RStudio

R and RStudio are software tools to help you work with and analyze your data.

What is R?

  • A statistical programming language
  • Users contribute packages
  • Free and open-source

What is RStudio?

  • A tool to help you write and run code in R
  • RStudio is not R; it is an interface for R (you also need R installed to run RStudio)
  • We will access RStudio through Posit Cloud for this course
  • Or you can run RStudio locally if you prefer

RStudio panes

  • Console: Enter individual lines of code, see output
  • Scripts: Edit and run scripts (text files containing code)
  • Environment: Shows variables that you have created
  • Files/Plots/Help: Includes several tabs
    • Files: navigate your filesystem
    • Plots: display images generated by your R code
    • Packages: view and install R packages
    • Help: documentation for functions and packages

The basic moving parts of R

  • variable: a structure that holds data. Examples:
    • a numeric vector c(1, 2, 3)
    • a character string "USDA"
    • a data frame with 1000 rows and 10 columns

  • function: something that takes arguments as input, does something, and returns output.
    • log(10): takes a numeric value as input and returns a numeric value as output
    • c(1, 5, 6): The function c() takes multiple values as input and returns a vector as output.
    • read.csv('myfile.csv'): takes a character string as input and returns a data frame as output.
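
As a minimal sketch of the input/output idea above, class() reports what kind of object a function returns. The file contents and the names example_file and my_data are made up for illustration; the file is first written to a temporary folder so the example runs on its own.

```r
# log() takes a numeric value and returns a numeric value
class(log(10))

# c() combines several values into one vector
class(c(1, 5, 6))

# Write a small made-up CSV file to a temporary folder (illustration only)
example_file <- file.path(tempdir(), 'myfile.csv')
write.csv(data.frame(plot = 1:3, yield = c(2.5, 3.1, 2.8)),
          example_file, row.names = FALSE)

# read.csv() takes the file path (a character string) and returns a data frame
my_data <- read.csv(example_file)
class(my_data)
```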

How to R

  • Let’s start writing our first R code!
  • Enter the example code in the console

Using R as a calculator

  • Use the operators +, -, *, /, and ^ to do arithmetic, just like a calculator
2 + 3
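
A few more calculator lines, as a quick sketch, show that R follows the usual order of operations and that parentheses change it:

```r
2 + 3 * 4     # multiplication happens before addition: 14
(2 + 3) * 4   # parentheses change the order: 20
2 ^ 3         # exponent: 8
7 / 2         # division: 3.5
```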

The assignment operator

  • The assignment operator <- is used to create a new variable and give it a value. The syntax is variable <- <value>.
  • Variable names can contain . or _ but can’t contain spaces or start with a number.
  • You can also use = as an assignment operator but we will use <- in this workshop. Consistent code is readable code!
x <- 2 + 3
y = 3.5
  • Entering the name of a variable prints that variable’s value to the console.
  • If you assign a value to a new variable, nothing will print to the console. But the variable is now defined in your environment and can be used later.
x
x + y
x * 4

x <- x + 1
z <- x * 4
z

Comments

  • Anything that follows # on a line is a comment and will not be evaluated.
# This is a comment

Functions with arguments

  • To call a function, write its name followed by an argument in parentheses (), like function(<value>). The function takes the value as input and returns some output
log(1000)

sin(pi)
  • Functions can take multiple arguments separated by commas ,
  • Character strings can be written with either 'single quotes' or "double quotes"
my_name <- "Quentin"

paste('Hello,', my_name)
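
To illustrate multiple arguments a little further, here is a short sketch showing that arguments can be given by position or by name. The variable greeting is only an illustrative name; base, digits, and sep are existing arguments of log(), round(), and paste().

```r
my_name <- "Quentin"

# Arguments can be matched by position ...
log(1000, 10)

# ... or by name, which makes the code easier to read
log(1000, base = 10)

# round() takes the value to round and the number of digits to keep
round(3.14159, digits = 2)

# paste() joins any number of strings; sep sets what goes between them
greeting <- paste('Hello', my_name, sep = ', ')
greeting
```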

Getting help

  • Use ? to get help about a function
?paste
  • Use ?? to search all help documentation for a term
??sequence

Types of output

  • Usually output prints to the console unless assigned to a variable
  • Some code produces other output as a “side effect,” such as a plot
plot(mpg ~ hp, data = mtcars)

Errors, warnings and notes

Code can produce messages instead of or in addition to output:

  • Errors
  • Warnings
  • Notes

Errors

  • Indicates something went wrong
  • No output is produced
sin(pi))

Warnings

  • Indicates the result may not be what you expected
  • Code still runs and produces output
log(-5)

Notes

  • Just a note, for example that only part of a very long result was printed. Everything is still fine!
rep(0, 100000)
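
The difference between an error and a warning can be checked directly. This is only a sketch: suppressWarnings() and try() are base R functions used here so the example runs without interruption, and result and attempt are illustrative names.

```r
# A warning: the code still runs and a value is returned
result <- suppressWarnings(log(-5))
result
is.nan(result)   # TRUE: the value is NaN, "not a number"

# An error: no value is produced; try() captures the failure instead
attempt <- try(log('hello'), silent = TRUE)
inherits(attempt, 'try-error')   # TRUE
```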

Data types in R

  • The [1] in the output from earlier is the index of the first element printed on that line. Even a single value in R is a vector of length 1
  • Vectors are sequences of one or more elements of the same data type
    • numeric
    • character
    • factor
    • logical
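
As a quick check of the four types listed above, class() reports the type of any value:

```r
class(3.5)                        # "numeric"
class('USDA')                     # "character"
class(factor(c('low', 'high')))   # "factor"
class(TRUE)                       # "logical"

# A single value is a vector of length 1
length(3.5)
```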

Numeric

  • Here are two ways to make a numeric vector with a sequence of integers 1 to 100
  • The first way uses a function seq() with three named arguments
  • Separate arguments with ,
  • The notation with : is shorthand
seq(from = 1, to = 100, by = 1)

1:100
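
The named arguments of seq() can produce other sequences too. A brief sketch (length.out is an existing seq() argument that asks for a fixed number of values instead of a step size):

```r
# Count from 0 to 1 in steps of 0.25
seq(from = 0, to = 1, by = 0.25)

# Count down from 10 in steps of 3
seq(from = 10, to = 1, by = -3)

# Five evenly spaced values from 0 to 100
seq(from = 0, to = 100, length.out = 5)
```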

Character

  • Text values
  • Use single quotes ' or double quotes " to create character vectors
  • We can index vectors with brackets [] containing one or more integer values
c('a', 'b', 'c', 'd', 'e', 'f', 'g')

letters[1:7]

letters[c(1, 18, 19)]

c('USDA', 'ARS', 'SEA')
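
Indexing works the same way on any vector you create. In this sketch the variable name agencies and the replacement value 'NEW' are placeholders chosen for illustration:

```r
agencies <- c('USDA', 'ARS', 'SEA')

agencies[2]          # the second element: "ARS"
agencies[c(1, 3)]    # the first and third elements: "USDA" "SEA"
agencies[-1]         # a negative index drops that element: "ARS" "SEA"

# Assigning to an index replaces that element
agencies[3] <- 'NEW'
agencies
```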

Issues with numeric and character data types

  • Using the wrong data type often results in an error
log('hello')
  • Combining numeric and character values in one vector forces (coerces) every element to character
  • This is a common problem when reading data from a spreadsheet
c(100, 5.323, 'missing value', 12)
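
One possible way to recover the numbers, sketched here under the assumption that the text entry should be treated as missing, is as.numeric(). Entries that are not numbers become NA. The names raw_values and clean_values are illustrative.

```r
# Values as they might arrive from a spreadsheet column with a text entry
raw_values <- c(100, 5.323, 'missing value', 12)
class(raw_values)   # "character"

# Convert back to numeric; entries that are not numbers become NA
# (as.numeric() also gives a warning, silenced here with suppressWarnings())
clean_values <- suppressWarnings(as.numeric(raw_values))
clean_values

# Skip the NA when calculating a summary
mean(clean_values, na.rm = TRUE)
```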

Factor

  • Looks like character but can only contain predefined values (levels)
  • Levels are sorted in a specific order
  • Used for categorical variables in models
  • By default, the first level is treated as the reference (control) level in models, and it is usually the one represented by the intercept
treatment <- factor(c('low', 'low', 'medium', 'medium', 'high', 'high'))

treatment

Sorting factor levels

  • Default order is alphabetical
  • We can sort the levels in a logical order instead of alphabetical
treatment <- factor(treatment, levels = c('low', 'medium', 'high'))

treatment
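
To see what reordering the levels changes, levels() shows the level order before and after. This is a small sketch using the same treatment example:

```r
treatment <- factor(c('low', 'low', 'medium', 'medium', 'high', 'high'))
levels(treatment)   # alphabetical: "high" "low" "medium"

treatment <- factor(treatment, levels = c('low', 'medium', 'high'))
levels(treatment)   # now "low" "medium" "high"

# Each value is stored as the integer position of its level
as.integer(treatment)

# Count how many values fall in each level
table(treatment)
```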

Logical

  • Can take two values, TRUE and FALSE
  • The result of a comparison is a logical vector
  • Comparison and logical operators in R:
    • x == y: is x equal to y?
    • x != y: is x not equal to y?
    • x > y: is x greater than y?
    • x >= y: is x greater than or equal to y?
    • x < y: is x less than y?
    • x <= y: is x less than or equal to y?
    • x > y & x < z: is x greater than y and less than z?
    • x > y | x < z: is x greater than y or less than z?

Examples of comparisons with logical operators

x <- 1:5

x > 4

x <= 2

x == 3

x != 2

x > 1 & x < 5

x <= 1 | x >= 5
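
Logical vectors are useful for more than printing TRUE and FALSE. As a short sketch, they can pick out elements of a vector or count how many elements meet a condition:

```r
x <- 1:5

# Keep only the elements where the comparison is TRUE
x[x > 3]

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(x > 3)

# mean() gives the proportion of TRUE values
mean(x > 3)
```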

The ! operator

  • ! is the negation operator
  • Converts all TRUE values to FALSE and vice versa.
!(x == 3)

The %in% operator

  • %in% is an operator comparing two vectors
  • Goes through the vector on the left-hand side and returns TRUE for the values that appear anywhere in the vector on the right-hand side, and FALSE otherwise
c(1, 5, 6, 7) %in% x

x %in% c(1, 5, 6, 7)
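
A common use of %in%, sketched below, is to keep or drop the elements of a vector that appear in a list of values:

```r
x <- 1:5

# Keep the elements of x that appear in the right-hand vector
x[x %in% c(1, 5, 6, 7)]      # 1 5

# Combine with ! to keep the elements that do not appear
x[!(x %in% c(1, 5, 6, 7))]   # 2 3 4
```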

Functions that take vectors as input

  • Some functions take a vector as input and return a vector of the same length.
    • exp(): the exponential of each element in the vector
set.seed(123)

random_numbers <- rnorm(n = 1000, mean = 0, sd = 1)

head(exp(random_numbers))

PROTIP: set.seed() makes the random numbers reproducible, so the code produces the same result each time. head() prints only the first few values of a result.

  • Other functions take a vector as input and return only one or a few values
  • length(), mean(), median(), and sd() return a single value.
length(random_numbers)
mean(random_numbers)
median(random_numbers)
sd(random_numbers)
  • range() returns a vector of two values, the minimum and maximum of the vector
  • quantile() takes two vectors as input.
    • First argument is the vector we want the quantiles from
    • The second vector, probs, contains the probabilities we want to calculate the quantiles for
    • The function returns a vector with the same length as probs containing the quantiles
range(random_numbers)
quantile(random_numbers, probs = c(0.025, 0.5, 0.975))
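
The same functions are easier to check by hand on a tiny vector. The values in small_vector are made up for illustration:

```r
small_vector <- c(2, 4, 4, 5, 10)

length(small_vector)   # 5 values
mean(small_vector)     # (2 + 4 + 4 + 5 + 10) / 5 = 5
median(small_vector)   # the middle value when sorted: 4
sd(small_vector)       # 3
range(small_vector)    # 2 10

# The 0%, 50%, and 100% quantiles are the minimum, median, and maximum
quantile(small_vector, probs = c(0, 0.5, 1))
```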

Statistical distributions

  • R has a lot of built-in statistical distributions
  • All of them have four functions beginning with r, d, p, and q and followed by the (abbreviated) name of the distribution.
    • r: random draws from the distribution
    • d: probability density function (what is the y-value of the function given x?)
    • p: cumulative distribution function (what is the cumulative probability given x?)
    • q: quantile (what is the x-value given the cumulative probability?); q is the inverse of p.
  • For example, the functions for the normal distribution are rnorm(), dnorm(), pnorm(), and qnorm()
  • These functions default to the standard normal distribution with mean = 0 and sd = 1
  • You can change those parameters by modifying the mean and sd arguments
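
A brief sketch of the d, p, and q functions for the normal distribution follows. The mean of 100 and sd of 10 in the last line are arbitrary values chosen for illustration.

```r
dnorm(0)       # height of the standard normal density curve at x = 0
pnorm(0)       # probability of a value below 0: 0.5
qnorm(0.5)     # the x-value with half the probability below it: 0

# q is the inverse of p
qnorm(0.975)            # about 1.96
pnorm(qnorm(0.975))     # back to 0.975

# Change the mean and sd arguments for a different normal distribution
pnorm(110, mean = 100, sd = 10)   # same as pnorm(1)
```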

What does dnorm do?

[Figure: normal distribution density plot showing the value of dnorm(0)]

What does qnorm do?

[Figure: normal distribution density plot showing the relationship between pnorm and qnorm]

Other distributions you might work with

  • Binomial (rbinom(), dbinom(), pbinom(), qbinom())
  • Uniform (runif(), dunif(), punif(), qunif())
  • Student’s t (rt(), dt(), pt(), qt())
  • The list goes on …

Type ?Distributions in your console to see help documentation about all the built-in distributions.
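
The same r, d, p, q pattern applies to the other distributions. Here is a small sketch with parameter values picked only for illustration (draws is an illustrative variable name):

```r
# Binomial: probability of exactly 3 successes in 10 trials with prob = 0.5
dbinom(3, size = 10, prob = 0.5)   # 120/1024, about 0.117

# Uniform between 0 and 1: probability of a value below 0.3
punif(0.3, min = 0, max = 1)

# Student's t with 10 degrees of freedom: the 97.5% quantile
qt(0.975, df = 10)

# Five random draws from a binomial distribution
set.seed(123)
draws <- rbinom(n = 5, size = 10, prob = 0.5)
draws
```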

Common pitfalls

If you get an error or your code doesn’t work, here are some things to check.

  • Punctuation: close all parentheses, brackets, and quotation marks.
(5+3))/2 # Nope

(5+3)/2 # Yep
  • Spelling: are the functions and variables spelled correctly?
my_variable <- 100000

myvariable
  • Spaces
    • Spaces are good for making code more readable
    • Compare x<-log(500,base=2) and x <- log(500, base = 2)
    • But you can’t put spaces in the middle of the name of a function or variable
some_numbers <- 1:5

( some_numbers + 3 ) ^ 2

(some_numbers+3)^2

(some numbers + 3)^2
  • Case: R is CASE-SENSITIVE (unlike SAS)
sum(1:10)
Sum(1:10)

R packages

  • So far we have only used code from “base R.”
  • But almost any R script requires one or more packages
  • Packages are sets of functions contributed by R users that are available for download on CRAN

Installing a package

  • Install a package for the first time either via the RStudio dialog or with the function install.packages()
  • This only needs to be done once!
install.packages('cowsay')

PROTIP: You can specify the location of the library the package will install into. This means you can specify one that doesn’t require administrator level access.
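
One way to install into a personal library is sketched below. The folder my_lib is a hypothetical location (a temporary folder is used here so the example is harmless to run). The install and load lines are commented out because they need an internet connection; lib and lib.loc are existing arguments of install.packages() and library().

```r
# Hypothetical personal library folder; any folder you can write to works
my_lib <- file.path(tempdir(), 'R-library')
dir.create(my_lib, showWarnings = FALSE, recursive = TRUE)

# Show the library locations R currently searches
.libPaths()

# Install into the personal library (requires an internet connection):
# install.packages('cowsay', lib = my_lib)

# Load the package from that library:
# library(cowsay, lib.loc = my_lib)
```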

Loading and using a package

  • Load a package from the code library where packages are installed using the function library()
  • This needs to be done every time you start a new R session!
library(cowsay)
say('USDA statisticians are the best!', by = 'cow')

 ----- 
USDA statisticians are the best! 
 ------ 
    \   ^__^ 
     \  (oo)\ ________ 
        (__)\         )\ /\ 
             ||------w|
             ||      ||
  • You can also use the package name followed by :: to be explicit
cowsay::say("Always close your parentheses!", by = 'chicken')

 ----- 
Always close your parentheses! 
 ------ 
    \   
     \
         _
       _/ }
      `>' \
      `|   \
       |   /'-.     .-.
        \'     ';`--' .'
         \'.    `'-./
          '.`-..-;`
            `;-..'
            _| _|
            /` /` [nosig]
  
  • To access all the help documentation for a package, use help(package = 'packagename').
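
The :: notation also works with the packages that come with R, so it can be tried without installing anything. A minimal sketch using the stats and utils packages:

```r
# sd() lives in the stats package, which comes with R
stats::sd(c(2, 4, 4, 5, 10))

# head() lives in the utils package
utils::head(letters, n = 3)
```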

Learning R best practices

How do I get help?

  • Google is your friend (copy and paste your error message)
  • StackOverflow is your friend too
  • stats.stackexchange.com if you have a question about stats that isn’t specific to R programming

Console versus script editor

  • Typing and running individual lines of code is great for exploring
  • It is not as good when you are doing complex data wrangling and analysis
  • You can save scripts (text files of code) to run again later
  • Run individual lines or selected blocks of code from the script editor by pressing Ctrl+Enter (Win) or Cmd+Enter (Mac)

Hey! What about … ?

  • Functions
  • Lists
  • Flow control (if, else, for)

Those are really important things but we aren’t going to cover them in this lesson. I strongly encourage you to explore the R resources I’ve provided to learn more. And maybe I’ll discuss them in a future workshop.

Exercises

Go to the lesson page and try out the exercises!