R for SAS users

What is this workshop?

For SAS users who want to learn R
Practicing researchers who have a decent working knowledge of SAS and of basic statistical analysis
SAS code is provided for comparison purposes
Lesson 1 in a series

Conceptual learning objectives

During this workshop, you will …

Learn the similarities and differences between SAS and R, both different tools for the job
Get introduced to what R packages are, in particular the “tidyverse”
Learn which common SAS statistical procedures correspond to which R packages/functions

Practical skills

“Data to doc” pipeline, including …

Import the data from a CSV file
Clean and reshape the data
Calculate some summary statistics and make some exploratory plots
Fit a linear model and a linear mixed-effects model
Make plots and tables of results

How to follow along with this workshop

Slides and text version of lessons are online
- usda-ree-ars.github.io/SEAStats
Fill in R code in the worksheet (replace ... with code)
You can always copy and paste code from text version of lesson if you fall behind

Background: R and SAS

SAS created in the late 1960s as a “statistical analysis system” for agricultural researchers at North Carolina State University
Spun off as an independent business in 1976

photo of SAS Hall at NC State University
SAS Hall at NC State University. There is no R Hall … yet!

R first released in 1993, created by statisticians Ross Ihaka and Robert Gentleman, University of Auckland, New Zealand

Pros and cons of R

R is free and open-source
Anyone can contribute a package (set of functions serving a common purpose)
Lots of different capabilities
- GIS and geospatial analysis
- High-performance computing
- Machine learning

Lots of different user communities
- Social sciences
- Ecology
- Economics
- Pharmacology
- Linguistics
- and more …

Image of the “hex sticker” logos of many R packages
Just a few of the R packages out there

Top-down versus crowdsourced

SAS is more top-down and centralized than R
Usually in SAS, one kind of statistical model = one procedure
R may have dozens of ways to fit a particular stat model
This is both good and bad

Transition from SAS to R

SAS may be winding down support for its desktop products, moving to cloud-based services
This is good for big corporations but not great for government and academic researchers
R is more reliable for the future
Python is another option

Tools for a job

R and SAS are both good tools, neither is perfect
I still recommend R for stats beginners
I’m not trying to convert you from SAS to R, just expose SAS users to a few of R’s capabilities

Data to doc pipeline

Import the data from a file
Clean, manipulate, and sort the data
Make exploratory graphs and tables
Fit statistical models
Look at model diagnostics to make sure our models are valid
Extract predictions from the statistical models
Make graphs and tables of model predictions
Put results into a document

Load R packages

tidyverse for reading, manipulating, and plotting data
lme4 for fitting linear mixed models
easystats for making model diagnostic plots
- (tidyverse and easystats are actually collections of several packages)

library(tidyverse)
library(lme4)
library(easystats)

Import the data from a file

The example data for this lesson are hosted on GitHub
We are importing the data directly from a URL (usually you read data you have saved locally)

PROTIP: Use CSV not XLSX!

Import the data: SAS

I recommend using proc import but some use the data step

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/NASS_soybean.csv";
proc import datafile = csvFile out = nass_soybeans replace dbms = csv; guessingrows = 2000; run;

Import the data: R

If you are using cloud server:

nass_soybeans <- read_csv('data/NASS_soybean.csv')

If you are running code locally:

nass_soybeans <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/NASS_soybean.csv')

Use read_csv() function
variable is created using the syntax variable <- value: “variable gets value”
functions are called using the syntax function(argument)

Examining the data: SAS and R

SAS

proc print data = nass_soybeans(obs = 10); run;

ods select variables;
proc contents data = nass_soybeans; run;

R

head(nass_soybeans, 10)
summary(nass_soybeans)
glimpse(nass_soybeans)

Subsetting the data: SAS

Create a new dataset or data frame with only the states in the Southeast region
In SAS we use the data step

data se_soybeans; set nass_soybeans;
    where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
run;

Subsetting the data: R

se_states <- c('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee')

se_soybeans <- filter(nass_soybeans, state %in% se_states)

In R tidyverse, we use filter()
Create character vector of state names separately
filter() function has data frame as first argument, condition as second argument
Keep only rows that match the condition
Assign the result to a new data frame

Doing calculations on the data: SAS

Calculate a new column from existing columns
Multiply acreage by yield (bushels per acre) to get total yield in bushels
SAS uses data step

data se_soybeans; set se_soybeans;
    total_yield = acres * yield;
run;

Doing calculations on the data: R

se_soybeans <- mutate(se_soybeans, total_yield = acres * yield)

R tidyverse: mutate()
Data frame is first argument, computation is the second argument
Assign the result back to the original data frame, overwriting it

Combining data-wrangling operations: SAS

data se_soybeans; set nass_soybeans;
  where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
    total_yield = acres * yield;
run;

In SAS the row subsetting and new column calculation can be part of the same data step

Combining data-wrangling operations: R

se_soybeans <- nass_soybeans %>%
  filter(state %in% se_states) %>%
  mutate(total_yield = acres * yield)

Use the “pipe” operator %>% to “chain” operations
The %>% passes the result of one line to the next line
Here we first pass nass_soybeans unmodified to filter()
Then pass the result of filter() to mutate()

Sorting the data: SAS

proc sort data = se_soybeans;
    by year state;
run;

Sort by year and then state using proc sort

Sorting the data: R

se_soybeans <- arrange(se_soybeans, year, state)

Use arrange() in R tidyverse
First argument is data frame, following arguments are the column names to sort on

Even longer pipes

filter(), mutate(), and arrange() in the same pipe statement

se_soybeans <- nass_soybeans %>%
  filter(state %in% se_states) %>%
  mutate(total_yield = acres * yield) %>%
  arrange(year, state)

Reshaping the data

Each row is a unique observation
- Identifier columns are year and state
- Data columns are acres and yield
This is sometimes called “tidy” data
For visualization, we might want to reshape to a wide format instead of long and skinny

animation of pivoting a data frame between wide and long formats
animation by Garrick Aden-Buie

Reshaping the data: SAS

proc transpose data = se_soybeans out = total_yield_wide;
    by year; id state; var total_yield;
run;

We get wide form data, row for each year and column for each state
proc transpose specifying by, id, and var
Create a new dataset with the reshaped result

Reshaping the data: R

total_yield_wide <- se_soybeans %>%
  pivot_wider(id_cols = year, names_from = state, values_from = total_yield)

pivot_wider() from tidyverse
Pass the column names as arguments
id_cols is for row IDs, confusingly in SAS id is for column IDs

Reshaping wide to long

SAS

proc transpose data = total_yield_wide out = total_yield_long;
    by year;
run;

R

total_yield_long <- total_yield_wide %>%
  pivot_longer(-year, names_to = 'state', values_to = 'total_yield')

We end up with more rows than we started with (missing cells now have rows)
R is better at renaming the columns sensibly than SAS is

Make exploratory graphs: SAS

proc sgplot data = se_soybeans description = 'SAS PROC SGPLOT line plot of soybean yield over time for each state';
    where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
    series x = year y = yield / group = state; 
    yaxis label='yield (bu/ac)';
run;

Time series for four states as different lines on the same plot

Make exploratory graphs: R

fourstates <- filter(se_soybeans, state %in% c('Arkansas', 'Tennessee', 'North Carolina', 'Georgia'))

ggplot(fourstates, aes(x = year, y = yield, color = state)) +
  geom_line(linewidth = 1) +
  theme_bw() +
  scale_y_continuous(name = 'yield (bu/ac)') +
  theme(legend.position = c(0.2, 0.8))

Check out my ggplot2 lesson to learn more

Make multi-panel version of the plot: SAS

proc sgpanel data = se_soybeans description = 'SAS PROC SGPANEL line plot of soybean yield over time with panel for each state';
    where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
    panelby state;
    series x = year y = yield;
    rowaxis label = 'yield (bu/ac)';
run;

A separate procedure is needed (sgpanel)

Make multi-panel version of the plot: R

ggplot(fourstates, aes(x = year, y = yield)) +
  facet_wrap(~ state) +
  geom_line(linewidth = 1) +
  theme_bw() +
  scale_y_continuous(name = 'yield (bu/ac)')

Same code as before except we removed color = state and added facet_wrap(~ state)

Make tables of summary statistics

Many ways to do it in SAS but I like proc sql
In R we will use a combination of group_by() and summarize() with pipes
Summary statistics to calculate for all states for every 10th year and put into a table:
- total acreage harvested
- total yield in bushels
- weighted mean of yield per acre

Table of summary statistics: SAS

proc sql;
    select 
        year,
        sum(acres) as grand_total_acres,
        sum(total_yield) as grand_total_yield,
        sum(yield * acres) / sum(acres) as mean_yield
    from se_soybeans
    where mod(year, 10) = 0
    group by year;
quit;

Table of summary statistics: R

se_soybeans %>%
  filter(year %% 10 == 0) %>%
  group_by(year) %>%
  summarize(
    grand_total_acres = sum(acres),
    grand_total_yield = sum(total_yield),
    mean_yield = weighted.mean(yield, acres)
  )

Piped statement with filter(), group_by(), and summarize()
group_by() identifies column or columns by which to split the data into groups of rows
summarize() includes a list of summary statistics separated by commas

PROTIP: Notice the similarities between proc sql and the R tidyverse code. That’s because the tidyverse syntax was partially inspired by SQL.

PROTIP 2: In SAS, a single equal sign = is used to test whether two values are equal. In R (and in many other languages such as C and Python) you use the double equal sign ==.

Fit statistical models

Both SAS and R have tons of different options for model fitting
SAS code is sometimes more concise (spits out lots of output automagically)
R code usually needs to be explicit about what output you want from the model
- This has both pros and cons

Simple linear regression: SAS

proc reg data = se_soybeans;
    model yield = year;
run;

Is there a linear trend over time in yield per acre?
SAS uses proc reg for this

Simple linear regression: R

yield_fit <- lm(yield ~ year, data = se_soybeans)

Use lm()
Model formula y ~ x1 + x2 …
data argument says which data frame the variables come from

summary(yield_fit)
anova(yield_fit)
check_model(yield_fit)

summary() gives us model fit statistics and parameter estimates
anova() shows us F-test for the slope
check_model() gives us nice regression diagnostics (from easystats)

Mixed models

Simple linear regression assumes all data points are independent
But yield values from the same state in different years aren’t independent
We have “multilevel” or “nested” data (repeated measures)

Mixed models: SAS

proc glimmix data = se_soybeans plots = residualpanel;
    class state;
    model yield = year / solution;
    random intercept / subject = state;
run;

proc glimmix with appropriate random statement
Random intercepts: “baseline” yield may vary by state
Same slope (time trend) for all states

Mixed models: R

yield_fit_lmm <- lmer(yield ~ year + (1 | state), data = se_soybeans)

Use lmer(): same components as proc glimmix but different syntax
(1 | state) is random component of model formula:
- model terms on left-hand side: 1 is intercept only
- grouping factor on right-hand side of |

check_model(yield_fit_lmm)
summary(yield_fit_lmm)
anova(yield_fit_lmm)
coef(yield_fit_lmm)

Similar to the lm() output functions
coef() gives us a table of intercepts and slopes for each state
Common slope: ~0.28 more bushels per acre per year

Make graph of model predictions

Model predictions for individual states and population-level expectation

yield_pred_bystate <- expand_grid(year = c(1924, 2011), state = se_states) %>%
  mutate(yield = as.numeric(predict(yield_fit_lmm, newdata = .)))

yield_pred_overall <- data.frame(state = 'overall', year = c(1924, 2011)) %>% 
  mutate(yield = as.numeric(predict(yield_fit_lmm, newdata = ., re.form = NA)))

ggplot(mapping = aes(x = year, y = yield, color = state, group = state)) +
  geom_line(data = se_soybeans, alpha = 0.3) +
  geom_line(data = yield_pred_bystate, linewidth = 0.7) +
  geom_line(data = yield_pred_overall, color = 'black', linewidth = 1.2) +
  theme_bw() +
  ggtitle('soybean yields by state, observations and modeled trends, 1924-2011',
          'black line is overall modeled trend')

Observed and modeled on the same plot
Linear trend is reasonably good model
Code for a similar plot made with SAS is in lesson materials

What did you learn today?

The pros and cons of R and SAS
How to create a data frame in R by reading data from an external file
How to clean, manipulate, sort, and reshape your data frame
How to calculate summary statistics from a data frame
How to fit a linear model and a linear mixed model

Impressive!

See text version of lesson for further reading and useful resources
Fill out MS Forms survey
Contact me at quentin.read@usda.gov!