Introduction

What is this workshop?

This workshop is intended for SAS users who want to learn R. The people who will get the most out of this course are practicing researchers who have a decent working knowledge of SAS, and of basic statistical analysis (descriptive stats and regression models) as it applies to their field.

This is lesson 1 in a series. Lessons 2 and 3 are also available on the SEAStats page.

Download the worksheet for this lesson here.

A word of warning: I have much more experience working with R than with SAS. So if you notice any issues or glaring flaws in the SAS code, chalk that up to my poor SAS skills! The SAS code is provided here mainly for comparison purposes, so that you will be able to better understand what the R code is trying to do by comparing it to a familiar bit of SAS code.

What will you learn from this workshop?

Conceptual learning objectives

During this workshop, you will …

  • Learn the similarities and differences between SAS and R, two different tools for the same job
  • Get introduced to what R packages are, in particular the “tidyverse”
  • Learn which common SAS statistical procedures correspond to which R packages/functions

Practical skills

In this workshop, participants will go through a “data to doc” pipeline in R, comparing R code to SAS code each step of the way. As you go through the pipeline, you will …

  • Import the data from a CSV file
  • Clean and reshape the data
  • Calculate some summary statistics and make some exploratory plots
  • Fit a linear model and a linear mixed-effects model
  • Make plots and tables of results

How to follow along with this workshop

  • Slides and text version of lessons are online
  • Fill in R code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind
  • Notes on best practices will be marked with PROTIP as we go along!

Background

R versus SAS

Before we get into the code, let’s talk a little bit about R and SAS. SAS has been around quite a bit longer than R. It was developed in the late 1960s as a “statistical analysis system” for agricultural researchers at North Carolina State University, and was spun off as an independent business in 1976. In contrast, R first appeared in 1993, created by statisticians Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand. All that is to say that SAS has been in use longer, especially in agricultural research in government and academia in the United States. So, many ARS long-timers cut their teeth on SAS.

photo of SAS Hall at NC State University
SAS Hall at NC State University. There is no R Hall … yet!

While SAS is a powerful and useful tool, R has a lot of comparative advantages. For one thing, it is free and open-source. This is a huge advantage because any user can contribute a “package” to R’s online repository. A package is a collection of functions that go together and are used for a common purpose. For example, there are packages for making graphics, fitting specialized statistical models, calculating effect sizes, and that kind of thing. R has a huge diversity of packages that are about more than just stats. There are packages for doing GIS and geospatial data science, working with genomic data, running code in parallel on high-performance computing systems, and making websites. Different research communities have all worked together to contribute to R’s code base: there are R user communities in linguistics, ecology, social science, economics, pharmacology, and many others.
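
For example, using a contributed package is a two-step process. Here is a minimal sketch, with ggplot2 as the example package:

install.packages('ggplot2')   # download and install from CRAN (only needed once per machine)
library(ggplot2)              # load the package in each new R session to use its functions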

In contrast to R, SAS is more top-down and centralized. Yes, people do write SAS programs and macros that others can find and use, but the community of people doing that is much smaller and less active. So R tends to develop more quickly and be closer to the cutting edge of new methods. Usually this is an advantage but there are some downsides. For instance, SAS has basically one way to fit any given type of regression model. You can be confident that it was verified by qualified statisticians. But with R, there may be dozens of packages that fit that kind of model, all with slightly different implementations. It is hard for beginners to figure out which one is best to use. However, the longer R has been around, the more the good packages rise to the top.

Image of the “hex sticker” logos of many R packages
Just a few of the R packages out there

Currently, I think it is more important than ever to transition away from SAS and towards R. There are indications that SAS is winding down support for its desktop software and is moving toward a cloud-only model. This is great for their corporate customers but I don’t think it is as good for researchers in government and academia. In my opinion, R is a more reliable option for the future. Another alternative is Python, which is probably more common in industry versus government and academia, and has a lot of users in the data science community rather than the stats community.

The bottom line is that they are both tools that can be used to do a job. Neither is perfect, and each has its advantages and disadvantages. Some people prefer one to the other, and some people have more experience with one than the other. But I strongly recommend that new ARS researchers just starting their “stats and data science journey” begin with R. R is superior for open and reproducible science; it is free, so you can keep using it even if you move to an institution that doesn’t have a SAS subscription; it has a greater diversity of tools in its bag of tricks; and it simply does some things flat-out better.

All of that said, the point of this workshop is not to convert people from SAS to R. It is simply to give SAS users some exposure to R functionality and demonstrate how you can translate familiar statistical procedures in SAS into R.

SAS procedures vs. R packages/functions: an overview table

In many cases, a package in R is roughly equivalent to a procedure or proc in SAS. Here’s a quick reference table showing which SAS procedure corresponds to which package or function in R. There is a diversity of R packages and functions for each of these tasks, but I am showing you my “favorite” or recommended alternatives.

| task | SAS | R |
| --- | --- | --- |
| import data | proc import, data step | read.csv (base R), read_csv (readr package in tidyverse), read_xlsx (readxl package) |
| clean, manipulate, and sort data | proc transpose, data step | dplyr and tidyr packages in tidyverse |
| make graphs | proc sgplot, etc. | plot (base R), ggplot (ggplot2 package in tidyverse) |
| calculate summary statistics | proc means, proc freq | table (base R), group_by and summarize (dplyr) |
| fit statistical models (frequentist) | proc reg, proc glm, proc mixed, proc glimmix, etc. | lm, glm (base R), lme4 package, nlme package, etc. |
| fit statistical models (Bayesian) | proc bglimm | brms package |
| look at model diagnostics | diagnostics produced within model fitting procedures | some diagnostics produced by model fitting packages; also see easystats package |
| extract predictions from models | lsmeans statement; also see proc plm | emmeans and marginaleffects packages |

Okay, that’s enough lecturing. Let’s actually do some coding.

Data to doc pipeline

A typical “pipeline” to get from raw data at one end, to a finished document (report or presentation) at the other, includes the following steps, whether you are working in SAS or R.

  • Import the data from a file
  • Clean, manipulate, and sort the data
  • Make exploratory graphs and tables
  • Fit statistical models
  • Look at model diagnostics to make sure our models are valid
  • Extract predictions from the statistical models
  • Make graphs and tables of model predictions
  • Put results into a document

We will go through much of this pipeline today. In each step, first we’ll look at some SAS code that does a particular thing (without going into detailed explanation of how the SAS code works), then we’ll co-create some R code that does that same thing while explaining in detail how each bit of the R code works.

We will use example datasets for this lesson from the R package agridat, which is a curated collection of datasets from agricultural sciences. (You can learn a lot from it whether or not you are an “R person!”)
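
The data for this lesson are provided as CSV files, but if you want to explore agridat further, you can also load datasets directly from the package; this sketch assumes nass.soybean is the agridat name for the soybean dataset we use below:

# install.packages('agridat')   # install once, if needed
library(agridat)
data(nass.soybean)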

Load R packages

Here we’ll load the R packages we are going to work with today. This includes the tidyverse package for reading, manipulating, and plotting data, the lme4 package for fitting linear mixed models, the lmerTest package for doing an ANOVA on the mixed model, and the easystats package which has some good model diagnostic plots.

The “tidyverse” is actually a set of related packages used for common data science tasks. It is pretty easy for beginners to learn so I usually teach introductory R lessons using it. But it’s not the only way to do that kind of thing. In fact, all of the data wrangling part of this lesson could be done with base R functions that don’t require a package.
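
For instance, here is the same kind of row subsetting written both ways, using a hypothetical data frame dat:

# tidyverse way
filter(dat, state == 'Alabama')
# base R way, no packages required
dat[dat$state == 'Alabama', ]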

PROTIP: It’s good practice to load all necessary packages at the beginning of a script. Among other benefits, this ensures that anyone running the script who needs to install one or more packages can do so all at once, instead of having to interrupt the workflow later to install packages.

library(tidyverse)
library(lme4)
library(lmerTest)
library(easystats)

Import the data from a file

If you are running the code for this lesson on the cloud server, the data file is located in the data directory. If you are running the code on your own machine, you can import the dataset directly from the URL where it is hosted on GitHub.

PROTIP: Notice the data are in CSV (comma-separated values) format. Although both SAS and R can read data directly from Microsoft Excel files (.xlsx), I recommend using CSV wherever possible. This is a good idea for a couple of reasons: first, CSV is a plain-text format that works on any operating system and is not a proprietary Microsoft format. Second, it does not allow formatting, so you can’t be tempted to encode important information in formatting, such as highlighting cells to flag them. A better approach is to add an additional column with a text label for the flag, which can be read into SAS or R much more easily.
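
As a sketch of that idea (the file name and flag column here are hypothetical), a text flag reads straight into R and can be used in code, which cell highlighting cannot:

dat <- read_csv('mydata.csv')   # hypothetical file with a 'flag' text column
filter(dat, is.na(flag))        # keep only the rows that were not flagged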

SAS

Some people like to import data in the data step in SAS, or use datalines;, but I recommend using proc import.

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/NASS_soybean.csv";
proc import datafile = csvFile out = nass_soybeans replace dbms = csv; guessingrows = 2000; run;

R

We are using the read_csv() function to read in the data.

This shows how a variable is created in R. The syntax is variable <- value meaning take the value and give it a name, variable, which creates an object that you can use later. It also shows how you call a function in R. The syntax is function(argument). Here the argument is a character string with the URL of the file, and we pass that URL as input to the read_csv() function, resulting in a “data frame” which we call nass_soybeans.

Import the data from the file in the data directory if you are on the cloud server.

nass_soybeans <- read_csv('data/NASS_soybean.csv')
Rows: 2528 Columns: 4
── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (1): state
dbl (3): year, acres, yield

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Import the data from the file on the web if you are not on the cloud server.

nass_soybeans <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/NASS_soybean.csv')

Clean, manipulate, and sort the data

Examining the data

First, let’s look at what we are working with. The SAS dataset nass_soybeans and the R data frame nass_soybeans are very similar objects. They both have ~2500 rows of data and 4 named columns.

Here we will print the first 10 rows of the data in both SAS and R, and show some options for displaying summaries of the data.

SAS

proc print data = nass_soybeans(obs = 10); run;

ods select variables;
proc contents data = nass_soybeans; run;
Obs year state acres yield
1 1924 Alabama 3000 6.5
2 1925 Alabama 3000 7
3 1926 Alabama 4000 4.5
4 1927 Alabama 5000 6
5 1928 Alabama 4000 6
6 1929 Alabama 7000 6
7 1930 Alabama 8000 5
8 1931 Alabama 9000 4.8
9 1932 Alabama 7000 6
10 1933 Alabama 9000 5.5

Alphabetic List of Variables and Attributes
# Variable Type Len Format Informat
3 acres Num 8 BEST12. BEST32.
2 state Char 16 $16. $16.
1 year Num 8 BEST12. BEST32.
4 yield Num 8 BEST12. BEST32.

R

head(nass_soybeans, 10)
# A tibble: 10 × 4
    year state   acres yield
   <dbl> <chr>   <dbl> <dbl>
 1  1924 Alabama  3000   6.5
 2  1925 Alabama  3000   7  
 3  1926 Alabama  4000   4.5
 4  1927 Alabama  5000   6  
 5  1928 Alabama  4000   6  
 6  1929 Alabama  7000   6  
 7  1930 Alabama  8000   5  
 8  1931 Alabama  9000   4.8
 9  1932 Alabama  7000   6  
10  1933 Alabama  9000   5.5
summary(nass_soybeans)
      year         state               acres              yield     
 Min.   :1924   Length:2528        Min.   :    1000   Min.   : 3.0  
 1st Qu.:1948   Class :character   1st Qu.:   60000   1st Qu.:15.5  
 Median :1969   Mode  :character   Median :  337000   Median :23.0  
 Mean   :1969                      Mean   : 1302321   Mean   :23.4  
 3rd Qu.:1991                      3rd Qu.: 1632500   3rd Qu.:30.0  
 Max.   :2011                      Max.   :10920000   Max.   :54.5  
glimpse(nass_soybeans)
Rows: 2,528
Columns: 4
$ year  <dbl> 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 19…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "…
$ acres <dbl> 3000, 3000, 4000, 5000, 4000, 7000, 8000, 9000, 7000, 9000, 10000, 11000, 14000, 11000, 10000, 10000, 11000, 19000, 24000,…
$ yield <dbl> 6.5, 7.0, 4.5, 6.0, 6.0, 6.0, 5.0, 4.8, 6.0, 5.5, 6.0, 5.5, 6.0, 7.0, 5.5, 6.2, 6.5, 8.0, 8.5, 10.5, 10.0, 12.0, 17.5, 18.…

Subsetting the data

Let’s say we wanted to work with only the data from the states in the Southeast region. In SAS we usually use the data step to create subsets of the data, whereas in R tidyverse, we use the filter() function from the dplyr package.

SAS

data se_soybeans; set nass_soybeans;
    where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
run;

R

We create a character vector (a vector whose elements are character strings), then use the tidyverse filter() function, which takes a data frame as its first argument and a condition as its second argument. It gets rid of all rows that do not match the condition. Then, we assign the result to a new data frame, se_soybeans.

se_states <- c('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee')

se_soybeans <- filter(nass_soybeans, state %in% se_states)

Doing calculations on the data

Next, let’s calculate a new column from existing columns. For example we could multiply acreage by yield (bushels per acre) to get total yield in bushels.

SAS

In SAS, this is typically done in the data step.

data se_soybeans; set se_soybeans;
    total_yield = acres * yield;
run;

R

In R tidyverse, we use the mutate() function. We assign the result back to se_soybeans, which “overwrites” the data frame with a new one that has an additional column.

se_soybeans <- mutate(se_soybeans, total_yield = acres * yield)

Combining data-wrangling operations

We can make the above code more concise in both SAS and R. In SAS, we would combine the row subsetting and the operation to create a new calculated column into the same data step.

SAS

data se_soybeans; set nass_soybeans;
  where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
    total_yield = acres * yield;
run;

R

In R, we can put them into one statement by using the “pipe” operator: %>%. This operator passes the result of one line to the next line. Here we have two %>%: the first passes nass_soybeans to the filter() function, which subsets the rows, and the second passes the result to the mutate() function, which creates a new column.

se_soybeans <- nass_soybeans %>%
  filter(state %in% se_states) %>%
  mutate(total_yield = acres * yield)

Sorting the data

Finally, our data is sorted first by state and then by year. Let’s say we wanted to sort the rows first by year and then by state.

SAS

We can sort the data in SAS using proc sort.

proc sort data = se_soybeans;
    by year state;
run;

R

In R tidyverse, we can use the arrange() function from dplyr.

se_soybeans <- arrange(se_soybeans, year, state)

We can even put the filter(), mutate(), and arrange() operations into the same pipe statement.

se_soybeans <- nass_soybeans %>%
  filter(state %in% se_states) %>%
  mutate(total_yield = acres * yield) %>%
  arrange(year, state)

Reshaping the data

This dataset is in a format where each row is a unique observation: measurements of soybean harvested acreage and yield in bushels per acre, identified by a unique combination of year and state. The R tidyverse packages call this format “tidy” data.

However, the long and skinny format of most “tidy” datasets is not ideal for visualizing the data in a table. We might want to reshape the data into wide format. Let’s do this for only the total_yield column with the years going down the rows and the states each getting a separate column.

animation of pivoting a data frame between wide and long formats
animation by Garrick Aden-Buie. See more tidyexplain animations here.

SAS

We use proc transpose specifying by, id, and var and creating a new reshaped dataset.

proc transpose data = se_soybeans out = total_yield_wide;
    by year; id state; var total_yield;
run;

R

We use pivot_wider() from tidyverse, with named arguments.

total_yield_wide <- se_soybeans %>%
  pivot_wider(id_cols = year, names_from = state, values_from = total_yield)

Notice how the SAS and R statements are really pretty similar; only the syntax and argument names differ slightly. But beware: in proc transpose, id specifies the variable whose values identify the new columns, whereas in pivot_wider(), id_cols specifies the column that identifies the rows!

PROTIP: With proc transpose, we had to make sure the data was sorted by year and then state to transpose it in this way, so that the by groups are in contiguous blocks. R’s pivot_wider() has no such requirement.

We can pivot from a wide to a long format as well. In SAS we’d use proc transpose again, versus pivot_longer() in R tidyverse.

SAS

proc transpose data = total_yield_wide out = total_yield_long;
    by year;
run;

R

total_yield_long <- total_yield_wide %>%
  pivot_longer(-year, names_to = 'state', values_to = 'total_yield')

Notice in both cases we get more rows than we started with because Florida is missing data for some of the years. Those rows with missing data are now explicitly included in the new long-format dataset/data frame. Also note that in SAS you would have to do an additional data step to rename the columns to what they were before; in R we can do this more easily inside the pivot_longer() statement.
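
If you would rather not keep those explicitly missing rows, pivot_longer() can drop them as it reshapes; this is a small variation on the call above:

total_yield_long <- total_yield_wide %>%
  pivot_longer(-year, names_to = 'state', values_to = 'total_yield', values_drop_na = TRUE)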

Make exploratory graphs

There are a lot of books and online tutorials out there about how to make great data visualizations in R, so here I’ll just provide you a couple of examples. Let’s say we wanted to plot the time series of yield per acre for four of the states as different colored lines on the same plot.

SAS

proc sgplot data = se_soybeans description = 'SAS PROC SGPLOT line plot of soybean yield over time for each state';
    where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
    series x = year y = yield / group = state; 
    yaxis label='yield (bu/ac)';
run;
[SAS output: line plot of soybean yield (bu/ac) over time, with a line for each of the four states]

R

We use ggplot() from the ggplot2 package, which is loaded when you load tidyverse. I will not go through all the code in great detail; check out my ggplot2 lesson or any of the many online tutorials on ggplot2.

fourstates <- filter(se_soybeans, state %in% c('Arkansas', 'Tennessee', 'North Carolina', 'Georgia'))

ggplot(fourstates, aes(x = year, y = yield, color = state)) +
  geom_line(linewidth = 1) +
  theme_bw() +
  scale_y_continuous(name = 'yield (bu/ac)') +
  theme(legend.position = c(0.2, 0.8))

line plot of soybean yield over time, colored by state

Alternatively, we could plot each of the time series on a separate panel. In SAS this requires a different procedure (proc sgpanel) but in R the existing ggplot() statement can be modified.

SAS

proc sgpanel data = se_soybeans description = 'SAS PROC SGPANEL line plot of soybean yield over time with panel for each state';
    where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
    panelby state;
    series x = year y = yield;
    rowaxis label = 'yield (bu/ac)';
run;
[SAS output: line plot of soybean yield (bu/ac) over time, with a separate panel for each of the four states]

R

ggplot(fourstates, aes(x = year, y = yield)) +
  facet_wrap(~ state) +
  geom_line(linewidth = 1) +
  theme_bw() +
  scale_y_continuous(name = 'yield (bu/ac)')

line plot of soybean yield over time with panel for each state

Make tables of summary statistics

Let’s say we want to know the total acreage harvested per year, the total yield in bushels, and the weighted mean of yield per acre across states (weighted by acreage harvested), for every 10th year.

SAS

In SAS we can use proc sql to do this – there are other ways but that’s probably the most concise.

proc sql;
    select 
        year,
        sum(acres) as grand_total_acres,
        sum(total_yield) as grand_total_yield,
        sum(yield * acres) / sum(acres) as mean_yield
    from se_soybeans
    where mod(year, 10) = 0
    group by year;
quit;
year grand_total_acres grand_total_yield mean_yield
1930 212000 1961000 9.25
1940 361000 4024000 11.14681
1950 1628000 33272500 20.43765
1960 5250000 1.125E8 21.42943
1970 12954000 2.949E8 22.76494
1980 22140000 3.6031E8 16.27416
1990 11565000 2.7108E8 23.43969
2000 8835000 2.2571E8 25.54726
2010 10188000 3.3907E8 33.28082

R

In R tidyverse, we can calculate those different summary stats with a piped statement: filter() to retain only the rows where year is divisible by 10, group_by() to identify the column by which we’re grouping the data, and summarize() which contains a comma-separated list of summary statistics to calculate.

se_soybeans %>%
  filter(year %% 10 == 0) %>%
  group_by(year) %>%
  summarize(
    grand_total_acres = sum(acres),
    grand_total_yield = sum(total_yield),
    mean_yield = weighted.mean(yield, acres)
  )
# A tibble: 9 × 4
   year grand_total_acres grand_total_yield mean_yield
  <dbl>             <dbl>             <dbl>      <dbl>
1  1930            212000           1961000       9.25
2  1940            361000           4024000      11.1 
3  1950           1628000          33272500      20.4 
4  1960           5250000         112504500      21.4 
5  1970          12954000         294897000      22.8 
6  1980          22140000         360310000      16.3 
7  1990          11565000         271080000      23.4 
8  2000           8835000         225710000      25.5 
9  2010          10188000         339065000      33.3 

PROTIP: Notice the similarities between proc sql and the R tidyverse code. That’s because the tidyverse syntax was partially inspired by SQL.

PROTIP 2: Notice that in SAS, a single equal sign = is used to test whether two values are equal, while in R (and in many other languages such as C and Python) you use the double equal sign ==.
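
For example, compare two calls we have already seen; the == inside filter() tests a condition, while the = inside mutate() assigns a value:

filter(se_soybeans, year == 2000)                  # == asks whether year equals 2000
mutate(se_soybeans, total_yield = acres * yield)   # = assigns the result to a new column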

Fit statistical models

Statistical models are the bread and butter of SAS and R. We can fit all kinds of different models in SAS and R: traditional parametric regression models and machine learning models, frequentist and Bayesian models, the list goes on. In this section, I am going to focus on a few flavors of linear regression model because they’re most likely the type of model that people taking this lesson commonly work with. We’ll compare how to fit them in both SAS and R.

You will probably notice that the SAS code can be quite a bit more concise than the R code for fitting these models and examining the results. That’s because when you invoke a SAS procedure to fit a statistical model, it usually does a bunch of stuff automatically, like showing some diagnostic statistics, summaries of the effects, and that kind of thing. Usually (but not always) in R, you need to explicitly write the code to produce those diagnostic plots and summaries. This may be annoying at first if you are transitioning from SAS to R. Even so, I believe that it ultimately fosters a deeper understanding of what you are doing (and why) when you are fitting a statistical model.

Simple linear regression

Let’s say we want to fit a linear model to the yield data, to determine if there is a trend over time in yield per acre.

SAS

In SAS we could use proc reg.

proc reg data = se_soybeans;
    model yield = year;
run;    

Model: MODEL1

Dependent Variable: yield

Number of Observations Read 767
Number of Observations Used 767
Analysis of Variance

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1            39861         39861   2099.92   <.0001
Error            765            14521      18.98200
Corrected Total  766            54382

Root MSE        4.35683    R-Square   0.7330
Dependent Mean  20.20913   Adj R-Sq   0.7326
Coeff Var       21.55874

Parameter Estimates

Variable   DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   1           -544.64441         12.32735    -44.18     <.0001
year        1              0.28694          0.00626     45.82     <.0001

[Panel of fit diagnostics for yield: residual plots, Q-Q plot, Cook's D, leverage, and summary fit statistics]
[Scatter plot of residuals by year for yield]
[Scatterplot of yield by year overlaid with the fit line, 95% confidence band, and 95% prediction limits]

R

In R, we use the lm() function. The model formula takes the form y ~ x. We also provide a data argument to tell R which data frame the variables are found in.

yield_fit <- lm(yield ~ year, data = se_soybeans)

We use functions on the fitted model object: summary() gives a summary of the model fit statistics and parameters, anova() shows the F-tests, and check_model() from the performance package (part of easystats) gives us nice regression diagnostic plots.

summary(yield_fit)

Call:
lm(formula = yield ~ year, data = se_soybeans)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.9532  -3.0494   0.3767   3.0941  13.1774 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.446e+02  1.233e+01  -44.18   <2e-16 ***
year         2.869e-01  6.262e-03   45.83   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.357 on 765 degrees of freedom
Multiple R-squared:  0.733, Adjusted R-squared:  0.7326 
F-statistic:  2100 on 1 and 765 DF,  p-value: < 2.2e-16
anova(yield_fit)
Analysis of Variance Table

Response: yield
           Df Sum Sq Mean Sq F value    Pr(>F)    
year        1  39861   39861  2099.9 < 2.2e-16 ***
Residuals 765  14521      19                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
check_model(yield_fit)

regression diagnostic plots for linear model

In both the R and SAS output, the residual diagnostic plots show that the assumptions of normally distributed residuals with homogeneous error variance are reasonably well met.

Mixed models

The models we fit above may not be the most appropriate. That’s because linear regression assumes that all data points are independent of one another. Clearly that is not the case for the time series of yield for each state, because yield values from the same state in successive years are not independent of one another. Our dataset has a “multilevel” or “nested” structure. In particular, there are repeated measures across time within each individual state. So we will fit a linear mixed model (LMM).

SAS

In SAS, we can use proc glimmix with the appropriate random statement to fit random intercepts by state. More complex models are possible that explicitly model the temporal autocorrelation within states. We will cover some of that in a later lesson. For now, we will just consider the case where the mean yield may differ by state, but the trend over time is the same for all states (random intercepts).

proc glimmix data = se_soybeans plots = residualpanel;
    class state;
    model yield = year / solution;
    random intercept / subject = state;
run;
Model Information
Data Set WORK.SE_SOYBEANS
Response Variable yield
Response Distribution Gaussian
Link Function Identity
Variance Function Default
Variance Matrix Blocked By state
Estimation Technique Restricted Maximum Likelihood
Degrees of Freedom Method Containment
Class Level Information
Class Levels Values
state 9 Alabama Arkansas Florida Georgia Louisiana Mississippi North Carolina South Carolina Tennessee
Number of Observations Read 767
Number of Observations Used 767
Dimensions
G-side Cov. Parameters 1
R-side Cov. Parameters 1
Columns in X 2
Columns in Z per Subject 1
Subjects (Blocks in V) 9
Max Obs per Subject 88
Optimization Information
Optimization Technique Dual Quasi-Newton
Parameters in Optimization 1
Lower Boundaries 1
Upper Boundaries 0
Fixed Effects Profiled
Residual Variance Profiled
Starting From Data
Iteration History

Iteration   Restarts   Evaluations   Objective Function   Change       Max Gradient
0           0          4             4300.0547387         .            0.186885
1           0          5             4300.0545574         0.00018124   0.001746
2           0          2             4300.0545574         0.00000002   0.000023
Convergence criterion (GCONV=1E-8) satisfied.
Fit Statistics
-2 Res Log Likelihood 4300.05
AIC (smaller is better) 4304.05
AICC (smaller is better) 4304.07
BIC (smaller is better) 4304.45
CAIC (smaller is better) 4306.45
HQIC (smaller is better) 4303.20
Generalized Chi-Square 11658.18
Gener. Chi-Square / DF 15.24
Covariance Parameter Estimates

Cov Parm    Subject   Estimate   Standard Error
Intercept   state     4.2257     2.2039
Residual              15.2394    0.7833

Solutions for Fixed Effects

Effect      Estimate   Standard Error   DF    t Value   Pr > |t|
Intercept   -538.04    11.1674          8     -48.18    <.0001
year        0.2836     0.005661         757   50.10     <.0001
Type III Tests of Fixed Effects
Effect Num DF Den DF F Value Pr > F
year 1 757 2510.22 <.0001
[Panel of conditional residuals for yield: scatterplot of residuals vs. linear predictor, histogram with normal density, Q-Q plot, and box plot]

R

The equivalent in R uses lmer() from the lme4 package.

yield_fit_lmm <- lmer(yield ~ year + (1 | state), data = se_soybeans)

As you can see, the call to lmer() includes the same components as the SAS proc glimmix statement, but with different syntax. We specify a model formula including fixed and random components to the first argument. The first part of the formula, yield ~ year, is the same as in the linear model above. Then comes the random component, + (1 | state). This includes a random intercept (1) grouped by state. The | separates the terms that are random, in this case only the intercept 1, from the grouping factor state. Finally, both the SAS and R code include a data argument to say which dataset or data frame contains the variables to be used for model fitting.
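
As an aside, the same formula syntax extends to more complex random-effects structures. For example, a random-slope model, shown here only as a sketch and not fit in this lesson, would allow each state its own trend over time:

# Hypothetical extension: random intercepts and random slopes for year, grouped by state.
# Centering year (e.g., year - mean(year)) often helps such models converge.
yield_fit_rslope <- lmer(yield ~ year + (year | state), data = se_soybeans)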

The regression diagnostics, shown below, look good. The Q-Q plot of the residuals shows near-perfect agreement with the normal distribution quantile line. There is a slight positive trend in the homogeneity of variance plot. If this trend were more pronounced, we would have to consider transforming the yield data or fitting a generalized linear mixed model. But in this case it is not severe enough to cause concern.
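
If the trend were bad enough to need a remedy, one option, sketched here though we do not need it for these data, would be a log transformation of the response fit with the same lmer() syntax:

# Hypothetical remedy for increasing variance: model the log of yield instead
yield_fit_log <- lmer(log(yield) ~ year + (1 | state), data = se_soybeans)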

PROTIP: You may find this too subjective and want to do formal tests of normality and homogeneity of variance. I do not usually recommend doing that. With linear regression, you may slightly deviate from normality and homogeneity of variance and still get good results. As you do more stats, you will get better at glancing at diagnostic plots and telling whether the model is “good enough” or needs revising.

check_model(yield_fit_lmm)

regression diagnostic plots for linear mixed model

The summary output for the fitted model object shows us the fixed-effect intercept and slope, and the variance of the random intercepts.

summary(yield_fit_lmm)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: yield ~ year + (1 | state)
   Data: se_soybeans

REML criterion at convergence: 4300.1

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.4750 -0.6300  0.0015  0.6958  3.0399 

Random effects:
 Groups   Name        Variance Std.Dev.
 state    (Intercept)  4.226   2.056   
 Residual             15.239   3.904   
Number of obs: 767, groups:  state, 9

Fixed effects:
              Estimate Std. Error         df t value Pr(>|t|)    
(Intercept) -5.380e+02  1.117e+01  7.627e+02  -48.18   <2e-16 ***
year         2.836e-01  5.661e-03  7.584e+02   50.10   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
     (Intr)
year -0.998

The anova() function shows F-tests for the overall slope.

anova(yield_fit_lmm)
Type III Analysis of Variance Table with Satterthwaite's method
     Sum Sq Mean Sq NumDF  DenDF F value    Pr(>F)    
year  38254   38254     1 758.42  2510.2 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The coef() function gives us a table of intercepts and slopes for each of the states. Because we fit a random intercept model, the slope by year is the same for each state. However we see that our model picks up some random variation by state in the intercept, on the order of a few bushels per acre.

coef(yield_fit_lmm)
$state
               (Intercept)      year
Alabama          -538.7362 0.2836196
Arkansas         -536.7887 0.2836196
Florida          -536.0610 0.2836196
Georgia          -540.9151 0.2836196
Louisiana        -536.5377 0.2836196
Mississippi      -537.7909 0.2836196
North Carolina   -537.1872 0.2836196
South Carolina   -541.6771 0.2836196
Tennessee        -536.6588 0.2836196

attr(,"class")
[1] "coef.mer"

Between 1924 and 2011, the linear trend fit to these data shows soybean yield increasing by about 0.28 bushels per acre per year.
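
As a quick back-of-the-envelope check on that number, we can extract the fixed-effect slope with fixef() from lme4 and multiply it by the length of the time series:

fixef(yield_fit_lmm)['year'] * (2011 - 1924)
# roughly 0.2836 * 87, or about a 24.7 bu/ac total increase over the period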

Make graph of model predictions

Here we generate predictions from the model for the population-level expectation (overall trend across states) and the state-level trends. We will plot the observed and predicted yields on the same plot.

I have included code for making this plot with SAS in the .sas file available for download accompanying this lesson. I didn’t include it here in the interest of space.

# State-level predictions: these include the random intercept for each state
yield_pred_bystate <- expand_grid(year = c(1924, 2011), state = se_states) %>%
  mutate(yield = as.numeric(predict(yield_fit_lmm, newdata = .)))

# Population-level prediction: re.form = NA tells predict() to ignore the random effects
yield_pred_overall <- data.frame(state = 'overall', year = c(1924, 2011)) %>% 
  mutate(yield = as.numeric(predict(yield_fit_lmm, newdata = ., re.form = NA)))

# Plot observed time series (faint), state-level trends, and the overall trend (black)
ggplot(mapping = aes(x = year, y = yield, color = state, group = state)) +
  geom_line(data = se_soybeans, alpha = 0.3) +
  geom_line(data = yield_pred_bystate, linewidth = 0.7) +
  geom_line(data = yield_pred_overall, color = 'black', linewidth = 1.2) +
  theme_bw() +
  ggtitle('soybean yields by state, observations and modeled trends, 1924-2011',
          'black line is overall modeled trend')

plots of observed soybean yield trends and modeled trends for each state

The linear trend is a reasonably good model. The straight colored lines illustrate how our model allows the intercept to vary by state, but the lines are parallel because the slope is modeled as being the same for all states. If I were really analyzing these data, I would probably think about fitting a time series model that could account for the different slopes we observe in different decades, as well as the slight increase in variance over time as the yield increases. But this isn’t bad for starters.

Conclusion

What have you learned so far?

  • The pros and cons of R and SAS
  • How to create a data frame in R by reading data from an external file
  • How to clean, manipulate, sort, and reshape your data frame
  • How to calculate summary statistics from a data frame
  • How to fit a linear model and a linear mixed model

Impressive!

This was Lesson 1 in a series. Future lessons will cover:

  • generalized linear mixed models with different error distributions
  • fitting models for different experimental designs
  • post hoc comparisons (lsmeans statement in SAS, emmeans() function in R)
  • generation of reports and documents using RMarkdown

Exercises

For some additional practice, here are a few exercises. These exercises ask you to “translate” some simple SAS code into the R equivalent. This will take you through the pipeline steps of importing a file, manipulating the data, calculating summary statistics, making a plot, and fitting a model. Each question builds off the previous one, so I recommend running each step in R sequentially. The exercises use data from an oat breeding program originally published in Crop Science by Edwards & Jannink.

If you are starting a fresh R session for these exercises, you will need to run library(tidyverse) and library(easystats) again first.

Exercise 1

Write the R code that does the same thing as this SAS code (import a CSV file from a URL to a dataset called oats). Note that if you are on the cloud server you may import the data from data/Edwards_oats.csv instead.

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/Edwards_oats.csv";
proc import datafile = csvFile out = oats replace dbms = csv; run;

Exercise 2

Write the R code that does the same thing as this SAS code (create a subset of the oats data with only data from year 2001 and from the genotypes gen Belle, Blaze, Brawn, and Chaps).

data oats_subset; set oats;
    where year = 2001 & gen in ('Belle', 'Blaze', 'Brawn', 'Chaps');
run;
  • Hint: Use filter() with two conditions, one for year and one for gen, separated by ,.

Exercise 3

Write the R code that does the same thing as this SAS code (find the mean and standard deviation of yield for each genotype gen in the data we subset in Exercise 2).

proc sql;
    select 
        gen,
        avg(yield) as mean_yield,
        std(yield) as stdev_yield
    from oats_subset
    group by gen;
quit;
  • Hint: Use group_by() and summarize(), with mean() for the mean and sd() for the standard deviation.

Exercise 4

Write the R code that does the same thing as this SAS code (make a boxplot of yield for each location loc, in the data we subset in Exercise 2). You will need to use geom_boxplot() to create this plot.

proc sgplot data = oats_subset;
    vbox yield / category = loc;
run;
  • Hint: Use aes(x = loc, y = yield).

Exercise 5

Write the R code that does the same thing as this SAS code (fit a linear model to yield as a function of genotype gen, using the data we subset in Exercise 2, output regression diagnostics, examine model coefficients, and generate ANOVA table).

proc glm data = oats_subset plots(unpack) = diagnostics;
    class gen;
    model yield = gen;
run;    
  • Hint: You can still use lm() to fit the model, with gen as the predictor variable.

Exercise 6 (bonus exercise!)

Include a term for environment loc and genotype by environment interaction gen:loc in the R linear model above, similar to the SAS code shown here.

proc glm data = oats_subset plots(unpack) = diagnostics;
    class gen loc;
    model yield = gen loc gen*loc;
run;    
  • Hint: The right-hand side of the model formula will now be gen + loc + gen:loc.

Click here for answers