Create a new dataset or data frame with only the states in the Southeast region
In SAS we use the data step
data se_soybeans; set nass_soybeans;
where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
run;
Data frame is first argument, computation is the second argument
Assign the result back to the original data frame, overwriting it
Combining data-wrangling operations: SAS
data se_soybeans; set nass_soybeans;
where state in ('Alabama', 'Arkansas', 'Florida', 'Georgia', 'Louisiana', 'Mississippi', 'North Carolina', 'South Carolina', 'Tennessee');
total_yield = acres * yield;
run;
In SAS the row subsetting and new column calculation can be part of the same data step
We end up with more rows than we started with (missing cells now have rows)
R is better at renaming the columns sensibly than SAS is
Make exploratory graphs: SAS
proc sgplot data = se_soybeans description = 'SAS PROC SGPLOT line plot of soybean yield over time for each state';
where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
series x = year y = yield / group = state;
yaxis label='yield (bu/ac)';
run;
Time series for four states as different lines on the same plot
Make exploratory graphs: R
fourstates <-filter(se_soybeans, state %in%c('Arkansas', 'Tennessee', 'North Carolina', 'Georgia'))ggplot(fourstates, aes(x = year, y = yield, color = state)) +geom_line(linewidth =1) +theme_bw() +scale_y_continuous(name ='yield (bu/ac)') +theme(legend.position =c(0.2, 0.8))
Check out my ggplot2 lesson to learn more
Make multi-panel version of the plot: SAS
proc sgpanel data = se_soybeans description = 'SAS PROC SGPANEL line plot of soybean yield over time with panel for each state';
where state in ('Arkansas', 'Tennessee', 'North Carolina', 'Georgia');
panelby state;
series x = year y = yield;
rowaxis label = 'yield (bu/ac)';
run;
Same code as before except we removed color = state and added facet_wrap(~ state)
Make tables of summary statistics
Many ways to do it in SAS but I like proc sql
In R we will use a combination of group_by() and summarize() with pipes
Summary statistics to calculate for all states for every 10th year and put into a table:
total acreage harvested
total yield in bushels
weighted mean of yield per acre
Table of summary statistics: SAS
proc sql;
select
year,
sum(acres) as grand_total_acres,
sum(total_yield) as grand_total_yield,
sum(yield * acres) / sum(acres) as mean_yield
from se_soybeans
where mod(year, 10) = 0
group by year;
quit;
Piped statement with filter(), group_by(), and summarize()
group_by() identifies column or columns by which to split the data into groups of rows
summarize() includes a list of summary statistics separated by commas
PROTIP: Notice the similarities between proc sql and the R tidyverse code. That’s because the tidyverse syntax was partially inspired by SQL.
PROTIP 2: In SAS, a single equal sign = is used to test whether two values are equal. In R (and in many other languages such as C and Python) you use the double equal sign ==.
Fit statistical models
Both SAS and R have tons of different options for model fitting
SAS code is sometimes more concise (spits out lots of output automagically)
R code usually needs to be explicit about what output you want from the model
This has both pros and cons
Simple linear regression: SAS
proc reg data = se_soybeans;
model yield = year;
run;
Is there a linear trend over time in yield per acre?
SAS uses proc reg for this
Simple linear regression: R
yield_fit <-lm(yield ~ year, data = se_soybeans)
Use lm()
Model formula y ~ x1 + x2 …
data argument says which data frame the variables come from