This workshop is intended for SAS users who want to learn R. The people who will get the most out of this course are practicing researchers who have a decent working knowledge of SAS, and of basic statistical analysis (descriptive stats and regression models) as it applies to their field.
This is lesson 1 in a series. Lessons 2 and 3 are also available on the SEAStats page.
Download the worksheet for this lesson here.
A word of warning: I have much more experience working with R compared to SAS. So if you notice any issues or glaring flaws in the SAS code, chalk that up to my poor SAS skills! The SAS code is provided here mainly for comparison purposes, so that you will be able to better understand what the R code is trying to do by comparing it to a familiar bit of SAS code.
In this workshop, participants will go through a “data to doc” pipeline in R, comparing R code to SAS code each step of the way.

Before we get into the code, let’s talk a little bit about R and SAS. SAS has been around quite a bit longer than R. It was developed in the late 1960s as a “statistical analysis system” for agricultural researchers at North Carolina State University, and was spun off as an independent business in 1976. In contrast, R is much younger: it was created in 1993 by statisticians Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand. All that is to say that SAS has been in use longer, especially in agricultural research in government and academia in the United States. So, many ARS long-timers cut their teeth on SAS.
SAS Hall at NC State University. There is no R Hall … yet!
While SAS is a powerful and useful tool, R has a lot of comparative advantages. For one thing, it is free and open-source. This is a huge advantage because any user can contribute a “package” to R’s online repository. A package is a collection of functions that go together and are used for a common purpose. For example, there are packages for making graphics, fitting specialized statistical models, calculating effect sizes, and that kind of thing. R has a huge diversity of packages that are about more than just stats. There are packages for doing GIS and geospatial data science, working with genomic data, running code in parallel on high-performance computing systems, and making websites. Different research communities have all worked together to contribute to R’s code base: there are R user communities in linguistics, ecology, social science, economics, pharmacology, and many others.
In contrast to R, SAS is more top-down and centralized. Yes, people do write SAS programs and macros that others can find and use, but the community of people doing that is much smaller and less active. So R tends to develop more quickly and be closer to the cutting edge of new methods. Usually this is an advantage but there are some downsides. For instance, SAS has basically one way to fit any given type of regression model. You can be confident that it was verified by qualified statisticians. But with R, there may be dozens of packages that fit that kind of model, all with slightly different implementations. It is hard for beginners to figure out which one is best to use. However, the longer R has been around, the more the good packages rise to the top.
Just a few of the R packages out there
Currently, I think it is more important than ever to transition away from SAS and toward R. There are indications that SAS is winding down support for its desktop software and moving toward a cloud-only model. This is great for their corporate customers, but I don’t think it is as good for researchers in government and academia. In my opinion, R is a more reliable option for the future. Another alternative is Python, which is more common in industry than in government and academia, and has more users in the data science community than in the stats community.
The bottom line is that they are both tools that can be used to do a job. Neither is perfect, and each has advantages and disadvantages. Some people prefer one to the other, and some people have more experience with one than the other. But I strongly recommend that new ARS researchers just starting their “stats and data science journey” begin with R: it is better suited to open and reproducible science; it is free, so you can keep using it even if you move to an institution without a SAS subscription; it has a greater diversity of tools in its bag of tricks; and it just does some things flat out better.
All of that said, the point of this workshop is not to convert people from SAS to R. It is simply to give SAS users some exposure to R functionality and demonstrate how you can translate familiar statistical procedures in SAS into R.
In many cases, a package in R is roughly equivalent to a procedure or `proc` in SAS. Here’s a quick reference table that shows which SAS procedure corresponds to which package or function in R. There are many R packages and functions for each of these tasks, but I am showing you my “favorite” or recommended alternatives.
task | SAS | R |
---|---|---|
import data | `proc import`, `data` step | `read.csv` (base R), `read_csv` (readr package in tidyverse), `read_xlsx` (readxl package) |
clean, manipulate, and sort data | `proc transpose`, `data` step | dplyr and tidyr packages in tidyverse |
make graphs | `proc sgplot`, etc. | `plot` (base R), `ggplot` (ggplot2 package in tidyverse) |
calculate summary statistics | `proc means`, `proc freq` | `table` (base R), `group_by` and `summarize` (dplyr) |
fit statistical models (frequentist) | `proc reg`, `proc glm`, `proc mixed`, `proc glimmix`, etc. | `lm`, `glm` (base R), lme4 package, nlme package, etc. |
fit statistical models (Bayesian) | `proc bglimm` | brms package |
look at model diagnostics | diagnostics produced within model fitting procedures | some diagnostics produced by model fitting packages; also see easystats package |
extract predictions from models | `lsmeans` statement, also see `proc plm` | emmeans and marginaleffects packages |
Okay, that’s enough lecturing. Let’s actually do some coding.
A typical “pipeline” to get from raw data at one end to a finished document (report or presentation) at the other includes the following steps, whether you are working in SAS or R: import the data, clean and manipulate it, explore it with graphs and summary statistics, fit statistical models, examine model diagnostics, extract predictions from the models, and put the results together into a document.
We will go through much of this pipeline today. In each step, first we’ll look at some SAS code that does a particular thing (without going into detailed explanation of how the SAS code works), then we’ll co-create some R code that does that same thing while explaining in detail how each bit of the R code works.
We will use example datasets for this lesson from the R package agridat, which is a curated collection of datasets from agricultural sciences. (You can learn a lot from it whether or not you are an “R person!”)
Here we’ll load the R packages we are going to work with today. This includes the tidyverse package for reading, manipulating, and plotting data, the lme4 package for fitting linear mixed models, the lmerTest package for doing an ANOVA on the mixed model, and the easystats package which has some good model diagnostic plots.
The “tidyverse” is actually a set of related packages used for common data science tasks. It is pretty easy for beginners to learn so I usually teach introductory R lessons using it. But it’s not the only way to do that kind of thing. In fact, all of the data wrangling part of this lesson could be done with base R functions that don’t require a package.
PROTIP: It’s good practice to load all necessary packages for a script at the beginning. Among other benefits, this ensures that anyone running your script who needs to install one or more packages can do so all at once, instead of having to interrupt the workflow later to install packages.
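If any of these packages are missing on your machine, you can install them all at once from CRAN before running the script. A quick sketch (a one-time step per machine):

# Install the packages used in this lesson, if they are not already installed.
install.packages(c('tidyverse', 'lme4', 'lmerTest', 'easystats'))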
library(tidyverse)
library(lme4)
library(lmerTest)
library(easystats)
If you are running the code for this lesson on the cloud server, the data file is located in the `data` directory. If you are running the code on your own machine, you can import the dataset directly from the URL where it is hosted on GitHub.
PROTIP: Notice the data are in CSV (comma-separated values) format. Although both SAS and R can read data directly from Microsoft Excel files (.xlsx), I recommend using CSV wherever possible. This is a good idea for a couple of reasons: first, it is a plain-text format that works on any operating system, not a proprietary Microsoft format. Also, it does not allow formatting, so you can’t be tempted to encode important information in formatting, for example highlighting cells to flag them. A better way is to add an additional column with a text label for the flag, which can be read into SAS or R more easily.
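That said, if you do receive data as an Excel file, the readxl package mentioned in the reference table above can read it directly. A minimal sketch, where the file name is a hypothetical placeholder:

# Hypothetical example: read the first sheet of an Excel workbook with readxl.
library(readxl)
my_data <- read_xlsx('my_spreadsheet.xlsx', sheet = 1)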
Some people like to import data in the `data` step in SAS, or use `datalines;`, but I recommend using `proc import`.
filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/NASS_soybean.csv";
proc import datafile = csvFile out = nass_soybeans replace dbms = csv; guessingrows = 2000; run;
We are using the `read_csv()` function to read in the data.
This shows how a variable is created in R. The syntax is `variable <- value`, meaning take the value and give it a name, `variable`, which creates an object that you can use later. It also shows how you call a function in R. The syntax is `function(argument)`. Here the argument is a character string with the URL of the file, and we pass that URL as input to the `read_csv()` function, resulting in a “data frame” which we call `nass_soybeans`.
Import the data from the file in the `data` directory if you are on the cloud server.
nass_soybeans <- read_csv('data/NASS_soybean.csv')
Rows: 2528 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): state
dbl (3): year, acres, yield
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
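That column specification message is informational, not an error. As the message itself suggests, you can silence it with an extra argument:

# Optional: read the data without printing the column specification message.
nass_soybeans <- read_csv('data/NASS_soybean.csv', show_col_types = FALSE)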
Import the data from the file on the web if you are not on the cloud server.
nass_soybeans <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/NASS_soybean.csv')
First, let’s look at what we are working with. The SAS dataset `nass_soybeans` and the R data frame `nass_soybeans` are very similar objects. They both have ~2500 rows of data and 4 named columns.
Here we will print the first 10 rows of the data in both SAS and R, and show some options for displaying summaries of the data.
proc print data = nass_soybeans(obs = 10); run;
ods select variables;
proc contents data = nass_soybeans; run;
Obs | year | state | acres | yield |
---|---|---|---|---|
1 | 1924 | Alabama | 3000 | 6.5 |
2 | 1925 | Alabama | 3000 | 7 |
3 | 1926 | Alabama | 4000 | 4.5 |
4 | 1927 | Alabama | 5000 | 6 |
5 | 1928 | Alabama | 4000 | 6 |
6 | 1929 | Alabama | 7000 | 6 |
7 | 1930 | Alabama | 8000 | 5 |
8 | 1931 | Alabama | 9000 | 4.8 |
9 | 1932 | Alabama | 7000 | 6 |
10 | 1933 | Alabama | 9000 | 5.5 |
Alphabetic List of Variables and Attributes

# | Variable | Type | Len | Format | Informat |
---|---|---|---|---|---|
3 | acres | Num | 8 | BEST12. | BEST32. |
2 | state | Char | 16 | $16. | $16. |
1 | year | Num | 8 | BEST12. | BEST32. |
4 | yield | Num | 8 | BEST12. | BEST32. |
head(nass_soybeans, 10)
# A tibble: 10 × 4
year state acres yield
<dbl> <chr> <dbl> <dbl>
1 1924 Alabama 3000 6.5
2 1925 Alabama 3000 7
3 1926 Alabama 4000 4.5
4 1927 Alabama 5000 6
5 1928 Alabama 4000 6
6 1929 Alabama 7000 6
7 1930 Alabama 8000 5
8 1931 Alabama 9000 4.8
9 1932 Alabama 7000 6
10 1933 Alabama 9000 5.5
summary(nass_soybeans)
year state acres yield
Min. :1924 Length:2528 Min. : 1000 Min. : 3.0
1st Qu.:1948 Class :character 1st Qu.: 60000 1st Qu.:15.5
Median :1969 Mode :character Median : 337000 Median :23.0
Mean :1969 Mean : 1302321 Mean :23.4
3rd Qu.:1991 3rd Qu.: 1632500 3rd Qu.:30.0
Max. :2011 Max. :10920000 Max. :54.5
glimpse(nass_soybeans)
Rows: 2,528
Columns: 4
$ year <dbl> 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934…
$ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
$ acres <dbl> 3000, 3000, 4000, 5000, 4000, 7000, 8000, 9000, 7000, 9000, 1000…
$ yield <dbl> 6.5, 7.0, 4.5, 6.0, 6.0, 6.0, 5.0, 4.8, 6.0, 5.5, 6.0, 5.5, 6.0,…
Let’s say we wanted to work with only the data from the states in the Southeast region. In SAS we usually use the `data` step to create subsets of the data, whereas in R tidyverse, we use the `filter()` function from the dplyr package.
data se_soybeans; set nass_soybeans;
where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
run;
We create a character vector (a list of elements that are each character strings), then use the tidyverse `filter()` function, which takes a data frame as its first argument and a condition as its second argument. It gets rid of all rows that do not match the condition. Then, we assign the result to a new data frame, `se_soybeans`.
se_states <- c('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee')
se_soybeans <- filter(nass_soybeans, state %in% se_states)
Next, let’s calculate a new column from existing columns. For example we could multiply acreage by yield (bushels per acre) to get total yield in bushels.
In SAS, this is typically done in the `data` step.
data se_soybeans; set se_soybeans;
total_yield = acres * yield;
run;
In R tidyverse, we use the `mutate()` function. We assign the result back to `se_soybeans`, which “overwrites” the data frame with a new one that has an additional column.
se_soybeans <- mutate(se_soybeans, total_yield = acres * yield)
We can make the above code more concise in both SAS and R. In SAS, we would combine the row subsetting and the operation to create a new calculated column into the same `data` step.
data se_soybeans; set nass_soybeans;
where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
total_yield = acres * yield;
run;
In R, we can put them into one statement by using the “pipe” operator: `%>%`. This operator passes the result of one line to the next line. So here we have two `%>%`. The first `%>%` passes `nass_soybeans` to the `filter()` function, where we subset the rows; then another `%>%` passes the result to the `mutate()` function, which creates a new column.
se_soybeans <- nass_soybeans %>%
filter(state %in% se_states) %>%
mutate(total_yield = acres * yield)
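As an aside, R 4.1 and later also include a built-in “native” pipe operator, `|>`, which works the same way for this purpose. The tidyverse `%>%` is used throughout this lesson; this equivalent is just for reference.

# The same pipeline written with the native pipe operator (requires R >= 4.1).
se_soybeans <- nass_soybeans |>
  filter(state %in% se_states) |>
  mutate(total_yield = acres * yield)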
Finally, our data is sorted first by state and then by year. Let’s say we wanted to sort the rows first by year and then by state.
We can sort the data in SAS using `proc sort`.
proc sort data = se_soybeans;
by year state;
run;
In R tidyverse, we can use the `arrange()` function from dplyr.
se_soybeans <- arrange(se_soybeans, year, state)
We can even put the `filter()`, `mutate()`, and `arrange()` operations into the same pipe statement.
se_soybeans <- nass_soybeans %>%
filter(state %in% se_states) %>%
mutate(total_yield = acres * yield) %>%
arrange(year, state)
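As mentioned earlier, none of this data wrangling strictly requires the tidyverse. Here is a minimal sketch of the same subset-calculate-sort pipeline using only base R functions:

# Base R equivalent: subset rows, add a calculated column, then sort.
se_soybeans_base <- subset(nass_soybeans, state %in% se_states)
se_soybeans_base$total_yield <- se_soybeans_base$acres * se_soybeans_base$yield
se_soybeans_base <- se_soybeans_base[order(se_soybeans_base$year, se_soybeans_base$state), ]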
This dataset is in a format where each row is a unique observation: measurements of soybean harvested acreage and yield in bushels per acre, identified by a unique combination of year and state. The R tidyverse packages call this format “tidy” data.
However, the long and skinny format of most “tidy” datasets is not ideal for visualizing the data in a table. We might want to reshape the data into wide format. Let’s do this for only the `total_yield` column, with the years going down the rows and the states each getting a separate column.
Animation by Garrick Aden-Buie. See more tidyexplain animations here.
We use `proc transpose`, specifying `by`, `id`, and `var`, and creating a new reshaped dataset.
proc transpose data = se_soybeans out = total_yield_wide;
by year; id state; var total_yield;
run;
We use `pivot_wider()` from tidyverse, with named arguments.
total_yield_wide <- se_soybeans %>%
pivot_wider(id_cols = year, names_from = state, values_from = total_yield)
Notice how the SAS and R statements are really pretty similar; there is just slightly different syntax and different names for the arguments. For example, in `proc transpose` the `id` statement names the column that will identify the new columns, whereas in `pivot_wider()`, `id_cols` names the column that will identify the rows!
PROTIP: With `proc transpose`, we had to make sure the data was sorted by year and then state to transpose it in this way, so that the `by` groups are in contiguous blocks. R’s `pivot_wider()` has no such requirement.
We can pivot from a wide to a long format as well. In SAS we’d use `proc transpose` again, versus `pivot_longer()` in R tidyverse.
proc transpose data = total_yield_wide out = total_yield_long;
by year;
run;
total_yield_long <- total_yield_wide %>%
pivot_longer(-year, names_to = 'state', values_to = 'total_yield')
Notice in both cases we get more rows than we started with, because Florida is missing data for some of the years. Those rows with missing data are now explicitly included in the new long-format dataset/data frame. Also note that in SAS you would have to do an additional `data` step to rename the columns to what they were before; in R we can do this more easily inside the `pivot_longer()` statement.
There are a lot of books and online tutorials out there about how to make great data visualizations in R, so here I’ll just provide you a couple of examples. Let’s say we wanted to plot the time series of yield per acre for four of the states as different colored lines on the same plot.
proc sgplot data = se_soybeans description = 'SAS PROC SGPLOT line plot of soybean yield over time for each state';
where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
series x = year y = yield / group = state;
yaxis label='yield (bu/ac)';
run;
We use `ggplot()` from the ggplot2 package, which is loaded when you load tidyverse. I will not go through all the code in great detail; check out my ggplot2 lesson or any of the many online tutorials on ggplot2.
fourstates <- filter(se_soybeans, state %in% c('Arkansas', 'Tennessee', 'North Carolina', 'Georgia'))
ggplot(fourstates, aes(x = year, y = yield, color = state)) +
geom_line(linewidth = 1) +
theme_bw() +
scale_y_continuous(name = 'yield (bu/ac)') +
theme(legend.position = c(0.2, 0.8))
Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
3.5.0.
ℹ Please use the `legend.position.inside` argument of `theme()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
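The plot still renders despite this warning. If you want to avoid it on ggplot2 3.5.0 or later, the same inside-the-panel legend can be requested with the newer arguments; a sketch of the equivalent plot:

# Same plot, using the legend placement arguments introduced in ggplot2 3.5.0.
ggplot(fourstates, aes(x = year, y = yield, color = state)) +
  geom_line(linewidth = 1) +
  theme_bw() +
  scale_y_continuous(name = 'yield (bu/ac)') +
  theme(legend.position = 'inside', legend.position.inside = c(0.2, 0.8))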
Alternatively, we could plot each of the time series on a separate panel. In SAS this requires a different procedure (`proc sgpanel`), but in R the existing `ggplot()` statement can be modified.
proc sgpanel data = se_soybeans description = 'SAS PROC SGPANEL line plot of soybean yield over time with panel for each state';
where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
panelby state;
series x = year y = yield;
rowaxis label = 'yield (bu/ac)';
run;
ggplot(fourstates, aes(x = year, y = yield)) +
facet_wrap(~ state) +
geom_line(linewidth = 1) +
theme_bw() +
scale_y_continuous(name = 'yield (bu/ac)')
Let’s say we want to know the total acreage harvested per year, the total yield in bushels, and the weighted mean of yield per acre across states (weighted by acreage harvested), for every 10th year.
In SAS we can use `proc sql` to do this; there are other ways, but that’s probably the most concise.
proc sql;
select
year,
sum(acres) as grand_total_acres,
sum(total_yield) as grand_total_yield,
sum(yield * acres) / sum(acres) as mean_yield
from se_soybeans
where mod(year, 10) = 0
group by year;
quit;
year | grand_total_acres | grand_total_yield | mean_yield |
---|---|---|---|
1930 | 212000 | 1961000 | 9.25 |
1940 | 361000 | 4024000 | 11.14681 |
1950 | 1628000 | 33272500 | 20.43765 |
1960 | 5250000 | 1.125E8 | 21.42943 |
1970 | 12954000 | 2.949E8 | 22.76494 |
1980 | 22140000 | 3.6031E8 | 16.27416 |
1990 | 11565000 | 2.7108E8 | 23.43969 |
2000 | 8835000 | 2.2571E8 | 25.54726 |
2010 | 10188000 | 3.3907E8 | 33.28082 |
In R tidyverse, we can calculate those different summary stats with a piped statement: `filter()` to retain only the rows where year is divisible by 10, `group_by()` to identify the column by which we’re grouping the data, and `summarize()`, which contains a comma-separated list of summary statistics to calculate.
se_soybeans %>%
filter(year %% 10 == 0) %>%
group_by(year) %>%
summarize(
grand_total_acres = sum(acres),
grand_total_yield = sum(total_yield),
mean_yield = weighted.mean(yield, acres)
)
# A tibble: 9 × 4
year grand_total_acres grand_total_yield mean_yield
<dbl> <dbl> <dbl> <dbl>
1 1930 212000 1961000 9.25
2 1940 361000 4024000 11.1
3 1950 1628000 33272500 20.4
4 1960 5250000 112504500 21.4
5 1970 12954000 294897000 22.8
6 1980 22140000 360310000 16.3
7 1990 11565000 271080000 23.4
8 2000 8835000 225710000 25.5
9 2010 10188000 339065000 33.3
PROTIP: Notice the similarities between `proc sql` and the R tidyverse code. That’s because the tidyverse syntax was partially inspired by SQL.
PROTIP 2: Notice that in SAS, a single equal sign `=` is used to test whether two values are equal, while in R (and in many other languages such as C and Python) you use the double equal sign `==`.
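To illustrate, a quick sketch using the data frame we already created:

# `==` compares values and returns TRUE/FALSE; a single `=` in R is used for
# naming function arguments (and assignment), never for comparison.
se_soybeans$year == 2000              # logical vector, one element per row
filter(se_soybeans, year == 2000)     # keep only the rows where it is TRUE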
Statistical models are the bread and butter of SAS and R. We can fit all kinds of different models in SAS and R: traditional parametric regression models and machine learning models, frequentist and Bayesian models, the list goes on. In this section, I am going to focus on a few flavors of linear regression model because they’re most likely the type of model that people taking this lesson commonly work with. We’ll compare how to fit them in both SAS and R.
You will probably notice that the SAS code can be quite a bit more concise than the R code for fitting these models and examining the results. That’s because when you invoke a SAS procedure to fit a statistical model, it usually does a bunch of stuff automatically, like showing some diagnostic statistics, summaries of the effects, and that kind of thing. Usually (but not always) in R, you need to explicitly write the code to produce those diagnostic plots and summaries. This may be annoying at first if you are transitioning from SAS to R. Even so, I believe that it ultimately fosters a deeper understanding of what you are doing (and why) when you are fitting a statistical model.
Let’s say we want to fit a linear model to the yield data, to determine if there is a trend over time in yield per acre.
In SAS we could use `proc reg`.
proc reg data = se_soybeans;
model yield = year;
run;
Model: MODEL1
Dependent Variable: yield
Number of Observations Read | 767 |
---|---|
Number of Observations Used | 767 |
Analysis of Variance

Source | DF | Sum of Squares | Mean Square | F Value | Pr > F |
---|---|---|---|---|---|
Model | 1 | 39861 | 39861 | 2099.92 | <.0001 |
Error | 765 | 14521 | 18.98200 | | |
Corrected Total | 766 | 54382 | | | |
Root MSE | 4.35683 | R-Square | 0.7330 |
---|---|---|---|
Dependent Mean | 20.20913 | Adj R-Sq | 0.7326 |
Coeff Var | 21.55874 |
Parameter Estimates

Variable | DF | Parameter Estimate | Standard Error | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | 1 | -544.64441 | 12.32735 | -44.18 | <.0001 |
year | 1 | 0.28694 | 0.00626 | 45.82 | <.0001 |
In R, we use the `lm()` function. The model formula takes the form `y ~ x`. We also provide a `data` argument to tell R which data frame the variables are found in.
yield_fit <- lm(yield ~ year, data = se_soybeans)
We use functions on the fitted model object: `summary()` to get a summary of the model fit statistics and parameters, `anova()` to see the F-tests, and `check_model()` from the performance package (part of easystats) to get nice regression diagnostic plots.
summary(yield_fit)
Call:
lm(formula = yield ~ year, data = se_soybeans)
Residuals:
Min 1Q Median 3Q Max
-12.9532 -3.0494 0.3767 3.0941 13.1774
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.446e+02 1.233e+01 -44.18 <2e-16 ***
year 2.869e-01 6.262e-03 45.83 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.357 on 765 degrees of freedom
Multiple R-squared: 0.733, Adjusted R-squared: 0.7326
F-statistic: 2100 on 1 and 765 DF, p-value: < 2.2e-16
anova(yield_fit)
Analysis of Variance Table
Response: yield
Df Sum Sq Mean Sq F value Pr(>F)
year 1 39861 39861 2099.9 < 2.2e-16 ***
Residuals 765 14521 19
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
check_model(yield_fit)
In both the R and SAS output, the residual diagnostic plots show that the assumptions of normally distributed residuals with homogeneous error variance are reasonably well met.
The models we fit above may not be the most appropriate. That’s because linear regression assumes that all data points are independent of one another. Clearly that is not the case for the time series of yield for each state, because yield values from the same state in successive years are not independent of one another. Our dataset has a “multilevel” or “nested” structure. In particular, there are repeated measures across time within each individual state. So we will fit a linear mixed model (LMM).
In SAS, we can use `proc glimmix` with the appropriate `random` statement to fit random intercepts by state. More complex models are possible that explicitly model the temporal autocorrelation within states; we will cover some of that in a later lesson. For now, we will just consider the case where the mean yield may differ by state, but the trend over time is the same for all states (random intercepts).
proc glimmix data = se_soybeans plots = residualpanel;
class state;
model yield = year / solution;
random intercept / subject = state;
run;
Model Information | |
---|---|
Data Set | WORK.SE_SOYBEANS |
Response Variable | yield |
Response Distribution | Gaussian |
Link Function | Identity |
Variance Function | Default |
Variance Matrix Blocked By | state |
Estimation Technique | Restricted Maximum Likelihood |
Degrees of Freedom Method | Containment |

Class Level Information

Class | Levels | Values |
---|---|---|
state | 9 | Alabama Arkansas Florida Georgia Louisiana Mississippi North Carolina South Carolina Tennessee |

Number of Observations Read | 767 |
---|---|
Number of Observations Used | 767 |

Dimensions | |
---|---|
G-side Cov. Parameters | 1 |
R-side Cov. Parameters | 1 |
Columns in X | 2 |
Columns in Z per Subject | 1 |
Subjects (Blocks in V) | 9 |
Max Obs per Subject | 88 |

Optimization Information | |
---|---|
Optimization Technique | Dual Quasi-Newton |
Parameters in Optimization | 1 |
Lower Boundaries | 1 |
Upper Boundaries | 0 |
Fixed Effects | Profiled |
Residual Variance | Profiled |
Starting From | Data |

Iteration History

Iteration | Restarts | Evaluations | Objective Function | Change | Max Gradient |
---|---|---|---|---|---|
0 | 0 | 4 | 4300.0547387 | . | 0.186885 |
1 | 0 | 5 | 4300.0545574 | 0.00018124 | 0.001746 |
2 | 0 | 2 | 4300.0545574 | 0.00000002 | 0.000023 |

Convergence criterion (GCONV=1E-8) satisfied.

Fit Statistics | |
---|---|
-2 Res Log Likelihood | 4300.05 |
AIC (smaller is better) | 4304.05 |
AICC (smaller is better) | 4304.07 |
BIC (smaller is better) | 4304.45 |
CAIC (smaller is better) | 4306.45 |
HQIC (smaller is better) | 4303.20 |
Generalized Chi-Square | 11658.18 |
Gener. Chi-Square / DF | 15.24 |

Covariance Parameter Estimates

Cov Parm | Subject | Estimate | Standard Error |
---|---|---|---|
Intercept | state | 4.2257 | 2.2039 |
Residual | | 15.2394 | 0.7833 |

Solutions for Fixed Effects

Effect | Estimate | Standard Error | DF | t Value | Pr > \|t\| |
---|---|---|---|---|---|
Intercept | -538.04 | 11.1674 | 8 | -48.18 | <.0001 |
year | 0.2836 | 0.005661 | 757 | 50.10 | <.0001 |

Type III Tests of Fixed Effects

Effect | Num DF | Den DF | F Value | Pr > F |
---|---|---|---|---|
year | 1 | 757 | 2510.22 | <.0001 |
The equivalent in R uses `lmer()` from the lme4 package.
yield_fit_lmm <- lmer(yield ~ year + (1 | state), data = se_soybeans)
As you can see, the call to `lmer()` includes the same components as the SAS `proc glimmix` statement, but with different syntax. We specify a model formula, including fixed and random components, as the first argument. The first part of the formula, `yield ~ year`, is the same as in the linear model above. Then comes the random component, `+ (1 | state)`. This includes a random intercept (`1`) grouped by `state`. The `|` separates the terms that are random, in this case only the intercept `1`, from the grouping factor `state`. Finally, both the SAS and R code include a `data` argument to say which dataset or data frame contains the variables to be used for model fitting.
The regression diagnostics look good. The Q-Q plot of the residuals shows near-perfect agreement with the normal distribution quantile line. There is a slight positive trend in the homogeneity of variance plot. If this trend were more pronounced we would have to consider transforming the yield data or fitting a generalized linear mixed model. But in this case it is not severe enough to cause concern.
PROTIP: You may find this too subjective and want to do formal tests of normality and homogeneity of variance. I do not usually recommend doing that. With linear regression, you may slightly deviate from normality and homogeneity of variance and still get good results. As you do more stats, you will get better at glancing at diagnostic plots and telling whether the model is “good enough” or needs revising.
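If you do want formal tests, here is a quick sketch using functions from the performance package (part of easystats, which we already loaded); treat these as optional checks rather than the recommended workflow:

# Formal tests corresponding to the visual diagnostic checks.
check_normality(yield_fit_lmm)           # tests normality of residuals
check_heteroscedasticity(yield_fit_lmm)  # tests for non-constant error variance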
check_model(yield_fit_lmm)
The summary output for the fitted model object shows us the fixed-effect intercept and slope, and the variance of the random intercepts.
summary(yield_fit_lmm)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: yield ~ year + (1 | state)
Data: se_soybeans
REML criterion at convergence: 4300.1
Scaled residuals:
Min 1Q Median 3Q Max
-3.4750 -0.6300 0.0015 0.6958 3.0399
Random effects:
Groups Name Variance Std.Dev.
state (Intercept) 4.226 2.056
Residual 15.239 3.904
Number of obs: 767, groups: state, 9
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) -5.380e+02 1.117e+01 7.627e+02 -48.18 <2e-16 ***
year 2.836e-01 5.661e-03 7.584e+02 50.10 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr)
year -0.998
The `anova()` function shows F-tests for the overall slope.
anova(yield_fit_lmm)
Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
year 38254 38254 1 758.42 2510.2 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The `coef()` function gives us a table of intercepts and slopes for each of the states; we will interpret it below the output.
coef(yield_fit_lmm)
$state
(Intercept) year
Alabama -538.7362 0.2836196
Arkansas -536.7887 0.2836196
Florida -536.0610 0.2836196
Georgia -540.9151 0.2836196
Louisiana -536.5377 0.2836196
Mississippi -537.7909 0.2836196
North Carolina -537.1872 0.2836196
South Carolina -541.6771 0.2836196
Tennessee -536.6588 0.2836196
attr(,"class")
[1] "coef.mer"
Because we fit a random intercept model, the slope by `year` in the above table is the same for each state. However, we see that our model picks up some random variation by state in the intercept, on the order of a few bushels per acre. Between 1924 and 2011, a linear trend fit to soybean yield per acre shows an increase of 0.28 bushels per acre per year.
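Relatedly, lme4 provides accessors for the separate pieces: `fixef()` returns just the fixed-effect coefficients, `ranef()` returns the state-level deviations from them, and `coef()` is their sum.

# Fixed-effect estimates: the overall intercept and year slope.
fixef(yield_fit_lmm)
# Random-effect estimates: each state's deviation from the overall intercept.
ranef(yield_fit_lmm)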
Here we generate predictions from the model for the population-level expectation (overall trend across states) and the state-level trends. We will plot the observed and predicted yields on the same plot.
I have included code for making this plot with SAS in the `.sas` file available for download accompanying this lesson; I didn’t include it here in the interest of space.
yield_pred_bystate <- expand_grid(year = c(1924, 2011), state = se_states) %>%
mutate(yield = as.numeric(predict(yield_fit_lmm, newdata = .)))
yield_pred_overall <- data.frame(state = 'overall', year = c(1924, 2011)) %>%
mutate(yield = as.numeric(predict(yield_fit_lmm, newdata = ., re.form = NA)))
ggplot(mapping = aes(x = year, y = yield, color = state, group = state)) +
geom_line(data = se_soybeans, alpha = 0.3) +
geom_line(data = yield_pred_bystate, linewidth = 0.7) +
geom_line(data = yield_pred_overall, color = 'black', linewidth = 1.2) +
theme_bw() +
ggtitle('soybean yields by state, observations and modeled trends, 1924-2011',
'black line is overall modeled trend')
The linear trend is a reasonably good model. The straight colored lines illustrate how our model allows the intercept to vary by state, but the lines are parallel because the slope is modeled as being the same for all states. If I were really analyzing these data, I would probably think about fitting a time series model that could account for the different slopes we observe in different decades, as well as the slight increase in variance over time as the yield increases. But this isn’t bad for starters.
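For instance, one way to relax the parallel-slopes restriction, short of a full time series model, is to add a random slope so the year trend can vary by state. A minimal sketch (centering year first, which helps the model converge; this is an illustration, not part of the lesson):

# Random intercepts and random slopes for year, grouped by state.
se_soybeans_c <- mutate(se_soybeans, year_c = year - 1968)  # center year
yield_fit_rs <- lmer(yield ~ year_c + (year_c | state), data = se_soybeans_c)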
What have you learned so far? How to import data into R, manipulate and reshape it, make graphs, calculate summary statistics, and fit and interpret linear models and linear mixed models. Impressive!
This was Lesson 1 in a series. Future lessons will cover topics including estimated marginal means (the `lsmeans` statement in SAS, the `emmeans()` function in R).

For some additional practice, here are a few exercises. These exercises ask you to “translate” some simple SAS code into the R equivalent. This will take you through the pipeline steps of importing a file, manipulating the data, calculating summary statistics, making a plot, and fitting a model. Each question builds off the previous one, so I recommend running each step in R sequentially. The exercises use data from an oat breeding program originally published in Crop Science by Edwards & Jannink.
If you are starting a fresh R session for these exercises, you will need to load `library(tidyverse)` and `library(easystats)` again first.
Exercise 1. Write the R code that does the same thing as this SAS code (import a CSV file from a URL to a dataset called `oats`). Note that if you are on the cloud server you may import the data from `data/Edwards_oats.csv` instead.
filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/Edwards_oats.csv";
proc import datafile = csvFile out = oats replace dbms = csv; run;
Exercise 2. Write the R code that does the same thing as this SAS code (create a subset of the `oats` data with only data from year 2001 and from the genotypes `gen` Belle, Blaze, Brawn, and Chaps).
data oats_subset; set oats;
where year = 2001 & gen in ('Belle', 'Blaze', 'Brawn', 'Chaps');
run;
Hint: use `filter()` with two conditions, one for `year` and one for `gen`, separated by `,`.
Exercise 3. Write the R code that does the same thing as this SAS code (find the mean and standard deviation of `yield` for each genotype `gen` in the data we subset in Exercise 2).
proc sql;
select
gen,
avg(yield) as mean_yield,
std(yield) as stdev_yield
from oats_subset
group by gen;
quit;
Hint: use `group_by()` and `summarize()`, with `mean()` for the mean and `sd()` for the standard deviation.
for the standard deviation.Write the R code that does the same thing as this SAS code (make a
boxplot of yield
for each location loc
, in the
data we subset in Exercise 2). You will need to use
geom_boxplot()
to create this plot.
proc sgplot data = oats_subset;
vbox yield / category = loc;
run;
Hint: you will need `aes(x = loc, y = yield)`.
Exercise 5. Write the R code that does the same thing as this SAS code (fit a linear model to `yield` as a function of genotype `gen`, using the data we subset in Exercise 2; output regression diagnostics, examine model coefficients, and generate the ANOVA table).
proc glm data = oats_subset plots(unpack) = diagnostics;
class gen;
model yield = gen;
run;
Hint: use `lm()` to fit the model, with `gen` as the predictor variable.

Exercise 6. Include a term for environment `loc` and the genotype by environment interaction `gen:loc` in the R linear model above, similar to the SAS code shown here.
proc glm data = oats_subset plots(unpack) = diagnostics;
class gen loc;
model yield = gen loc gen*loc;
run;
Hint: the model formula will include `gen + loc + gen:loc`.