R for SAS users

What is this workshop?

  • For SAS users who want to learn R
  • Practicing researchers who have a decent working knowledge of SAS and of basic statistical analysis
  • SAS code is provided for comparison purposes
  • Lesson 1 in a series

Conceptual learning objectives

During this workshop, you will …

  • Learn the similarities and differences between SAS and R, both different tools for the job
  • Get introduced to what R packages are, in particular the “tidyverse”
  • Learn which common SAS statistical procedures correspond to which R packages/functions

Practical skills

“Data to doc” pipeline, including …

  • Import the data from a CSV file
  • Clean and reshape the data
  • Calculate some summary statistics and make some exploratory plots
  • Fit a linear model and a linear mixed-effects model
  • Make plots and tables of results

How to follow along with this workshop

  • Slides and text version of lessons are online
    • usda-ree-ars.github.io/SEAStats
  • Fill in R code in the worksheet (replace ... with code)
  • You can always copy and paste code from text version of lesson if you fall behind

Background: R and SAS

  • SAS created in the late 1960s as a “statistical analysis system” for agricultural researchers at North Carolina State University
  • Spun off as an independent business in 1976

photo of SAS Hall at NC State University
SAS Hall at NC State University. There is no R Hall … yet!

  • R first released in 1993, created by statisticians Ross Ihaka and Robert Gentleman, University of Auckland, New Zealand

Pros and cons of R

  • R is free and open-source
  • Anyone can contribute a package (set of functions serving a common purpose)
  • Lots of different capabilities
    • GIS and geospatial analysis
    • High-performance computing
    • Machine learning
  • Lots of different user communities
    • Social sciences
    • Ecology
    • Economics
    • Pharmacology
    • Linguistics
    • and more …

Image of the “hex sticker” logos of many R packages
Just a few of the R packages out there

Top-down versus crowdsourced

  • SAS is more top-down and centralized than R
  • Usually in SAS, one kind of statistical model = one procedure
  • R may have dozens of ways to fit a particular stat model
  • This is both good and bad

Transition from SAS to R

  • SAS may be winding down support for its desktop products, moving to cloud-based services
  • This is good for big corporations but not great for government and academic researchers
  • R is more reliable for the future
  • Python is another option

Tools for a job

  • R and SAS are both good tools, neither is perfect
  • I still recommend R for stats beginners
  • I’m not trying to convert you from SAS to R, just expose SAS users to a few of R’s capabilities

Data to doc pipeline

  • Import the data from a file
  • Clean, manipulate, and sort the data
  • Make exploratory graphs and tables
  • Fit statistical models
  • Look at model diagnostics to make sure our models are valid
  • Extract predictions from the statistical models
  • Make graphs and tables of model predictions
  • Put results into a document

Load R packages

  • tidyverse for reading, manipulating, and plotting data
  • lme4 for fitting linear mixed models
  • easystats for making model diagnostic plots
    • (tidyverse and easystats are actually collections of several packages)
library(tidyverse)
library(lme4)
library(easystats)

Import the data from a file

  • The example data for this lesson are hosted on GitHub
  • We are importing the data directly from a URL (usually you read data you have saved locally)

PROTIP: Use CSV not XLSX!

Import the data: SAS

I recommend using proc import but some use the data step

filename csvFile url "https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/NASS_soybean.csv";
proc import datafile = csvFile out = nass_soybeans replace dbms = csv; guessingrows = 2000; run;

Import the data: R

If you are using cloud server:

nass_soybeans <- read_csv('data/NASS_soybean.csv')

If you are running code locally:

nass_soybeans <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_for_SAS_users/datasets/NASS_soybean.csv')
  • Use read_csv() function
  • variable is created using the syntax variable <- value: “variable gets value”
  • functions are called using the syntax function(argument)

Examining the data: SAS and R

SAS

proc print data = nass_soybeans(obs = 10); run;

ods select variables;
proc contents data = nass_soybeans; run;

R

head(nass_soybeans, 10)
summary(nass_soybeans)
glimpse(nass_soybeans)

Subsetting the data: SAS

  • Create a new dataset or data frame with only the states in the Southeast region
  • In SAS we use the data step
data se_soybeans; set nass_soybeans;
    where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
run;

Subsetting the data: R

se_states <- c('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee')

se_soybeans <- filter(nass_soybeans, state %in% se_states)
  • In R tidyverse, we use filter()
  • Create character vector of state names separately
  • filter() function has data frame as first argument, condition as second argument
  • Keep only rows that match the condition
  • Assign the result to a new data frame

Doing calculations on the data: SAS

  • Calculate a new column from existing columns
  • Multiply acreage by yield (bushels per acre) to get total yield in bushels
  • SAS uses data step
data se_soybeans; set se_soybeans;
    total_yield = acres * yield;
run;

Doing calculations on the data: R

se_soybeans <- mutate(se_soybeans, total_yield = acres * yield)
  • R tidyverse: mutate()
  • Data frame is first argument, computation is the second argument
  • Assign the result back to the original data frame, overwriting it

Combining data-wrangling operations: SAS

data se_soybeans; set nass_soybeans;
  where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
    total_yield = acres * yield;
run;
  • In SAS the row subsetting and new column calculation can be part of the same data step

Combining data-wrangling operations: R

se_soybeans <- nass_soybeans %>%
  filter(state %in% se_states) %>%
  mutate(total_yield = acres * yield)
  • Use the “pipe” operator %>% to “chain” operations
  • The %>% passes the result of one line to the next line
  • Here we first pass nass_soybeans unmodified to filter()
  • Then pass the result of filter() to mutate()

Sorting the data: SAS

proc sort data = se_soybeans;
    by year state;
run;
  • Sort by year and then state using proc sort

Sorting the data: R

se_soybeans <- arrange(se_soybeans, year, state)
  • Use arrange() in R tidyverse
  • First argument is data frame, following arguments are the column names to sort on

Even longer pipes

  • filter(), mutate(), and arrange() in the same pipe statement
se_soybeans <- nass_soybeans %>%
  filter(state %in% se_states) %>%
  mutate(total_yield = acres * yield) %>%
  arrange(year, state)

Reshaping the data

  • Each row is a unique observation
    • Identifier columns are year and state
    • Data columns are acres and yield
  • This is sometimes called “tidy” data
  • For visualization, we might want to reshape to a wide format instead of long and skinny

animation of pivoting a data frame between wide and long formats
animation by Garrick Aden-Buie

Reshaping the data: SAS

proc transpose data = se_soybeans out = total_yield_wide;
    by year; id state; var total_yield;
run;
  • We get wide form data, row for each year and column for each state
  • proc transpose specifying by, id, and var
  • Create a new dataset with the reshaped result

Reshaping the data: R

total_yield_wide <- se_soybeans %>%
  pivot_wider(id_cols = year, names_from = state, values_from = total_yield)
  • pivot_wider() from tidyverse
  • Pass the column names as arguments
  • id_cols is for row IDs, confusingly in SAS id is for column IDs

Reshaping wide to long

SAS

proc transpose data = total_yield_wide out = total_yield_long;
    by year;
run;

R

total_yield_long <- total_yield_wide %>%
  pivot_longer(-year, names_to = 'state', values_to = 'total_yield')
  • We end up with more rows than we started with (missing cells now have rows)
  • R is better at renaming the columns sensibly than SAS is

Make exploratory graphs: SAS

proc sgplot data = se_soybeans description = 'SAS PROC SGPLOT line plot of soybean yield over time for each state';
    where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
    series x = year y = yield / group = state; 
    yaxis label='yield (bu/ac)';
run;
  • Time series for four states as different lines on the same plot

Make exploratory graphs: R

fourstates <- filter(se_soybeans, state %in% c('Arkansas', 'Tennessee', 'North Carolina', 'Georgia'))

ggplot(fourstates, aes(x = year, y = yield, color = state)) +
  geom_line(linewidth = 1) +
  theme_bw() +
  scale_y_continuous(name = 'yield (bu/ac)') +
  theme(legend.position = c(0.2, 0.8))
  • Check out my ggplot2 lesson to learn more

Make multi-panel version of the plot: SAS

proc sgpanel data = se_soybeans description = 'SAS PROC SGPANEL line plot of soybean yield over time with panel for each state';
    where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
    panelby state;
    series x = year y = yield;
    rowaxis label = 'yield (bu/ac)';
run;
  • A separate procedure is needed (sgpanel)

Make multi-panel version of the plot: R

ggplot(fourstates, aes(x = year, y = yield)) +
  facet_wrap(~ state) +
  geom_line(linewidth = 1) +
  theme_bw() +
  scale_y_continuous(name = 'yield (bu/ac)')
  • Same code as before except we removed color = state and added facet_wrap(~ state)

Make tables of summary statistics

  • Many ways to do it in SAS but I like proc sql
  • In R we will use a combination of group_by() and summarize() with pipes
  • Summary statistics to calculate for all states for every 10th year and put into a table:
    • total acreage harvested
    • total yield in bushels
    • weighted mean of yield per acre

Table of summary statistics: SAS

proc sql;
    select 
        year,
        sum(acres) as grand_total_acres,
        sum(total_yield) as grand_total_yield,
        sum(yield * acres) / sum(acres) as mean_yield
    from se_soybeans
    where mod(year, 10) = 0
    group by year;
quit;

Table of summary statistics: R

se_soybeans %>%
  filter(year %% 10 == 0) %>%
  group_by(year) %>%
  summarize(
    grand_total_acres = sum(acres),
    grand_total_yield = sum(total_yield),
    mean_yield = weighted.mean(yield, acres)
  )
  • Piped statement with filter(), group_by(), and summarize()
  • group_by() identifies column or columns by which to split the data into groups of rows
  • summarize() includes a list of summary statistics separated by commas

PROTIP: Notice the similarities between proc sql and the R tidyverse code. That’s because the tidyverse syntax was partially inspired by SQL.

PROTIP 2: In SAS, a single equal sign = is used to test whether two values are equal. In R (and in many other languages such as C and Python) you use the double equal sign ==.

Fit statistical models

  • Both SAS and R have tons of different options for model fitting
  • SAS code is sometimes more concise (spits out lots of output automagically)
  • R code usually needs to be explicit about what output you want from the model
    • This has both pros and cons

Simple linear regression: SAS

proc reg data = se_soybeans;
    model yield = year;
run;    
  • Is there a linear trend over time in yield per acre?
  • SAS uses proc reg for this

Simple linear regression: R

yield_fit <- lm(yield ~ year, data = se_soybeans)
  • Use lm()
  • Model formula y ~ x1 + x2
  • data argument says which data frame the variables come from
summary(yield_fit)
anova(yield_fit)
check_model(yield_fit)
  • summary() gives us model fit statistics and parameter estimates
  • anova() shows us F-test for the slope
  • check_model() gives us nice regression diagnostics (from easystats)

Mixed models

  • Simple linear regression assumes all data points are independent
  • But yield values from the same state in different years aren’t independent
  • We have “multilevel” or “nested” data (repeated measures)

Mixed models: SAS

proc glimmix data = se_soybeans plots = residualpanel;
    class state;
    model yield = year / solution;
    random intercept / subject = state;
run;
  • proc glimmix with appropriate random statement
  • Random intercepts: “baseline” yield may vary by state
  • Same slope (time trend) for all states

Mixed models: R

yield_fit_lmm <- lmer(yield ~ year + (1 | state), data = se_soybeans)
  • Use lmer(): same components as proc glimmix but different syntax
  • (1 | state) is random component of model formula:
    • model terms on left-hand side: 1 is intercept only
    • grouping factor on right-hand side of |
check_model(yield_fit_lmm)
summary(yield_fit_lmm)
anova(yield_fit_lmm)
coef(yield_fit_lmm)
  • Similar to the lm() output functions
  • coef() gives us a table of intercepts and slopes for each state
  • Common slope: ~0.28 more bushels per acre per year

Make graph of model predictions

  • Model predictions for individual states and population-level expectation
yield_pred_bystate <- expand_grid(year = c(1924, 2011), state = se_states) %>%
  mutate(yield = as.numeric(predict(yield_fit_lmm, newdata = .)))

yield_pred_overall <- data.frame(state = 'overall', year = c(1924, 2011)) %>% 
  mutate(yield = as.numeric(predict(yield_fit_lmm, newdata = ., re.form = NA)))
ggplot(mapping = aes(x = year, y = yield, color = state, group = state)) +
  geom_line(data = se_soybeans, alpha = 0.3) +
  geom_line(data = yield_pred_bystate, linewidth = 0.7) +
  geom_line(data = yield_pred_overall, color = 'black', linewidth = 1.2) +
  theme_bw() +
  ggtitle('soybean yields by state, observations and modeled trends, 1924-2011',
          'black line is overall modeled trend')
  • Observed and modeled on the same plot
  • Linear trend is reasonably good model
  • Code for a similar plot made with SAS is in lesson materials

What did you learn today?

  • The pros and cons of R and SAS
  • How to create a data frame in R by reading data from an external file
  • How to clean, manipulate, sort, and reshape your data frame
  • How to calculate summary statistics from a data frame
  • How to fit a linear model and a linear mixed model

Impressive!

  • See text version of lesson for further reading and useful resources
  • Fill out MS Forms survey
  • Contact me at quentin.read@usda.gov!