At the end of this lesson, students will …
We will not go into every possible R data type for this lesson. The one you really need to know about for model fitting is the data frame.
A data frame corresponds to a table or spreadsheet of data. Each column has its own data type. You might have numeric, character, and factor columns all in the same data frame. Ideally, each row will correspond to an individual observation.
Data frames in R are the workhorses for data manipulation and analysis. There are a lot of things you can do with data frames; we will go over some of the basics in this lesson.
In this lesson, I am going to teach you functions from the tidyverse group of packages. These are a set of packages, designed to work together, that make it easy to wrangle and manipulate data in R. All of the stuff I am going to teach can be done with base R code without any packages. There are also alternatives such as data.table. Personally I use a combination of all three. However for this lesson I chose to teach tidyverse for a couple of reasons. For one thing, it’s the easiest for beginners to grasp. Also, it has become really popular and widespread in the last few years. So a lot of the other R resources you will find are now done with the tidyverse. So why swim against the tide? There are pros and cons to the tidyverse, but it is great for starting out with R. If you become a big time R programmer I would recommend learning all the base R code and other packages like data.table in the future.
Load the tidyverse package which is a shortcut to loading the multiple related tidyverse packages we will be using in this lesson.
library(tidyverse)Read the file small_mammals.csv from the URL where it is
hosted online. (Typically, you will import data from files on your local
hard drive, but for these lessons we will work with data hosted on the
course website.)
Read in an external file using the read_csv() function
from the tidyverse package readr. There is a base R
function for this as well but the readr function has
better default values and reads large files more quickly. Also,
read_csv returns a special kind of data frame called a
“tibble” that works better with the other tidyverse functions.
smammals <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_boot_camp/datasets/small_mammals.csv')## Rows: 1629 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): siteName, taxonID, family, sex, lifeStage
## dbl (2): hindfootLength, weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.The data we are reading in here is a set of body measurements of small mammals from the NEON ecological observatory network. We will go on to examine this dataset in detail.
There are a lot of R functions and operators for looking at your data frame. We can get summary information on a data frame or pull out individual bits of data from it. We’ll go through a few ways of doing that.
Let’s look at the object which is now loaded into our workspace that
we called smammals. If we just type smammals
into our console we see the first few rows of data and some information
about its dimensions.
smammals## # A tibble: 1,629 × 7
##    siteName       taxonID family     sex   lifeStage hindfootLength weight
##    <chr>          <chr>   <chr>      <chr> <chr>              <dbl>  <dbl>
##  1 Harvard Forest MYGA    Cricetidae M     adult               18       19
##  2 Harvard Forest MYGA    Cricetidae M     adult               17       22
##  3 Harvard Forest MYGA    Cricetidae M     adult               18       23
##  4 Harvard Forest MYGA    Cricetidae F     adult               18       21
##  5 Harvard Forest MYGA    Cricetidae F     juvenile            17       11
##  6 Harvard Forest MYGA    Cricetidae F     adult               17.5     17
##  7 Harvard Forest MYGA    Cricetidae M     adult               17       22
##  8 Harvard Forest PELE    Cricetidae F     adult               21       36
##  9 Harvard Forest PELE    Cricetidae M     subadult            20       20
## 10 Harvard Forest PELE    Cricetidae M     subadult            20       19
## # ℹ 1,619 more rowsAs the informational text says, we can print all the values of
smammals if we explicitly tell R to do this. It’s nice that
it does not do this by default especially if we have data frames with
thousands or millions of rows.
print(smammals, n = 50)## # A tibble: 1,629 × 7
##    siteName       taxonID family     sex   lifeStage hindfootLength weight
##    <chr>          <chr>   <chr>      <chr> <chr>              <dbl>  <dbl>
##  1 Harvard Forest MYGA    Cricetidae M     adult               18     19  
##  2 Harvard Forest MYGA    Cricetidae M     adult               17     22  
##  3 Harvard Forest MYGA    Cricetidae M     adult               18     23  
##  4 Harvard Forest MYGA    Cricetidae F     adult               18     21  
##  5 Harvard Forest MYGA    Cricetidae F     juvenile            17     11  
##  6 Harvard Forest MYGA    Cricetidae F     adult               17.5   17  
##  7 Harvard Forest MYGA    Cricetidae M     adult               17     22  
##  8 Harvard Forest PELE    Cricetidae F     adult               21     36  
##  9 Harvard Forest PELE    Cricetidae M     subadult            20     20  
## 10 Harvard Forest PELE    Cricetidae M     subadult            20     19  
## 11 Harvard Forest PELE    Cricetidae M     subadult            19     22  
## 12 Harvard Forest PELE    Cricetidae F     adult               20     26  
## 13 Harvard Forest PELE    Cricetidae F     adult               20     22  
## 14 Harvard Forest PELE    Cricetidae F     adult               20.5   22.5
## 15 Harvard Forest PELE    Cricetidae F     adult               20     29  
## 16 Harvard Forest PELE    Cricetidae F     adult               19     22  
## 17 Harvard Forest PELE    Cricetidae M     adult               21     32  
## 18 Harvard Forest PELE    Cricetidae F     subadult            20     17  
## 19 Harvard Forest PELE    Cricetidae F     adult               19.5   23  
## 20 Harvard Forest PELE    Cricetidae F     adult               20     23  
## 21 Harvard Forest PELE    Cricetidae F     adult               21     27  
## 22 Harvard Forest PELE    Cricetidae F     adult               20     23  
## 23 Harvard Forest PELE    Cricetidae F     adult               20.5   22  
## 24 Harvard Forest PELE    Cricetidae F     adult               20     24  
## 25 Harvard Forest PELE    Cricetidae M     adult               20.5   22.5
## 26 Harvard Forest PELE    Cricetidae M     adult               20.5   18.5
## 27 Harvard Forest PELE    Cricetidae F     subadult            20     19  
## 28 Harvard Forest PELE    Cricetidae M     subadult            19     15  
## 29 Harvard Forest PELE    Cricetidae M     subadult            20     21  
## 30 Harvard Forest PELE    Cricetidae M     subadult            18     14  
## 31 Harvard Forest PELE    Cricetidae M     adult               19     19  
## 32 Harvard Forest PEMA    Cricetidae F     adult               20     23.5
## 33 Harvard Forest PEMA    Cricetidae M     subadult            21     20  
## 34 Harvard Forest PEMA    Cricetidae M     adult               20     21  
## 35 Harvard Forest PEMA    Cricetidae M     adult               20     21  
## 36 Harvard Forest PEMA    Cricetidae F     subadult            20     20  
## 37 Harvard Forest PEMA    Cricetidae M     subadult            20     16  
## 38 Harvard Forest PEMA    Cricetidae M     subadult            19     17  
## 39 Harvard Forest PEMA    Cricetidae M     adult               19     20  
## 40 Harvard Forest PEMA    Cricetidae F     adult               21     21.5
## 41 Harvard Forest PEMA    Cricetidae F     subadult            19     18  
## 42 Harvard Forest PEMA    Cricetidae F     subadult            18     17  
## 43 Harvard Forest PEMA    Cricetidae F     adult               19     22  
## 44 Harvard Forest PEMA    Cricetidae F     adult               20     27.5
## 45 Harvard Forest PEMA    Cricetidae M     adult               20     18  
## 46 Harvard Forest PEMA    Cricetidae M     adult               20.5   21  
## 47 Harvard Forest PEMA    Cricetidae F     adult               20     25  
## 48 Harvard Forest PEMA    Cricetidae F     adult               21     30  
## 49 Harvard Forest PEMA    Cricetidae F     adult               20     22  
## 50 Harvard Forest PEMA    Cricetidae F     adult               21     20  
## # ℹ 1,579 more rowsSome other useful functions to examine a data frame are
summary() and str() from base R, and
glimpse() from tidyverse. Try calling them and see what
output you get.
We can see from the output of those summary functions that we have
1629 rows and 7 columns. There is a mixture of character columns
(chr) and numeric columns (dbl refers to
double-precision, meaning a numeric value that has a decimal point).
Calling summary(smammals) also gives us some quick summary
statistics for the mean and range of the numeric columns, as well as the
number of missing values (NA values). Both numeric columns
have several dozen missing values.
If we want to extract only specific subsets of the data, we can do
that by indexing rows and columns. The square brackets []
are used for this. You can get individual elements of a data frame with
brackets. The syntax is dataframe[row, column]. For example
this will return the value in row 2, column 3 of the data frame. It
actually returns a 1x1 tibble.
smammals[2, 3]## # A tibble: 1 × 1
##   family    
##   <chr>     
## 1 CricetidaeWe can also get an entire row by leaving the column part
blank, and an entire column by leaving the row part blank.
This gives us the 5th row:
smammals[5, ]## # A tibble: 1 × 7
##   siteName       taxonID family     sex   lifeStage hindfootLength weight
##   <chr>          <chr>   <chr>      <chr> <chr>              <dbl>  <dbl>
## 1 Harvard Forest MYGA    Cricetidae F     juvenile              17     11and this gives us the 2nd column:
smammals[, 2]## # A tibble: 1,629 × 1
##    taxonID
##    <chr>  
##  1 MYGA   
##  2 MYGA   
##  3 MYGA   
##  4 MYGA   
##  5 MYGA   
##  6 MYGA   
##  7 MYGA   
##  8 PELE   
##  9 PELE   
## 10 PELE   
## # ℹ 1,619 more rowsWe can also subset with ranges, for example this gives us the 6th through 10th rows, including all columns:
smammals[6:10, ]## # A tibble: 5 × 7
##   siteName       taxonID family     sex   lifeStage hindfootLength weight
##   <chr>          <chr>   <chr>      <chr> <chr>              <dbl>  <dbl>
## 1 Harvard Forest MYGA    Cricetidae F     adult               17.5     17
## 2 Harvard Forest MYGA    Cricetidae M     adult               17       22
## 3 Harvard Forest PELE    Cricetidae F     adult               21       36
## 4 Harvard Forest PELE    Cricetidae M     subadult            20       20
## 5 Harvard Forest PELE    Cricetidae M     subadult            20       19and this gives us the first 5 rows, but only columns 1 and 2.
smammals[1:5, 1:2]## # A tibble: 5 × 2
##   siteName       taxonID
##   <chr>          <chr>  
## 1 Harvard Forest MYGA   
## 2 Harvard Forest MYGA   
## 3 Harvard Forest MYGA   
## 4 Harvard Forest MYGA   
## 5 Harvard Forest MYGAYou can also get columns by name. Just make sure to pass the vector
of column names using the c() function, and put quotes
around the names. This gives us the 20th row and columns called
taxonID and sex.
smammals[20, c('taxonID', 'sex')]## # A tibble: 1 × 2
##   taxonID sex  
##   <chr>   <chr>
## 1 PELE    FTo extract a single column from a data frame as a vector, you can use
the $ operator.
head(smammals$siteName)## [1] "Harvard Forest" "Harvard Forest" "Harvard Forest" "Harvard Forest"
## [5] "Harvard Forest" "Harvard Forest"If the data frame column name contains special characters, surround
the name with backticks like smammals$`column 10`.
For example, let’s look at all unique values in the
siteName column using the unique()
function.
unique(smammals$siteName)## [1] "Harvard Forest"                            
## [2] "Konza Prairie Biological Station"          
## [3] "Oak Ridge"                                 
## [4] "Smithsonian Conservation Biology Institute"Another function that you often use on single data frame columns is
table(), which gives you the number of values of each
category.
table(smammals$sex)## 
##   F   M   U 
## 724 881   8The tidyverse functions filter() and
select() are used to give you subsets of rows and columns,
respectively. A common use of filter() is to subset rows by
a certain condition. For example we can use filter() to get
all rows where weight is less than 10.
filter(smammals, weight < 10)## # A tibble: 14 × 7
##    siteName                 taxonID family sex   lifeStage hindfootLength weight
##    <chr>                    <chr>   <chr>  <chr> <chr>              <dbl>  <dbl>
##  1 Smithsonian Conservatio… PELE    Crice… M     juvenile              19      9
##  2 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      8
##  3 Smithsonian Conservatio… MUMU    Murid… F     juvenile              17      9
##  4 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      8
##  5 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      9
##  6 Smithsonian Conservatio… MUMU    Murid… F     juvenile              18      7
##  7 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      8
##  8 Smithsonian Conservatio… MUMU    Murid… M     juvenile              18      7
##  9 Smithsonian Conservatio… MUMU    Murid… M     juvenile              18      7
## 10 Smithsonian Conservatio… MUMU    Murid… F     juvenile              18      8
## 11 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      7
## 12 Smithsonian Conservatio… MUMU    Murid… M     juvenile              18      8
## 13 Smithsonian Conservatio… MUMU    Murid… F     juvenile              16      9
## 14 Harvard Forest           MYGA    Crice… M     juvenile              NA      9And we can use select() to get columns we want. We do
not need to use quotes on column names within select():
select(smammals, taxonID, weight)## # A tibble: 1,629 × 2
##    taxonID weight
##    <chr>    <dbl>
##  1 MYGA        19
##  2 MYGA        22
##  3 MYGA        23
##  4 MYGA        21
##  5 MYGA        11
##  6 MYGA        17
##  7 MYGA        22
##  8 PELE        36
##  9 PELE        20
## 10 PELE        19
## # ℹ 1,619 more rowsPROTIP: A common “gotcha” is that a function called
select()also exists in the widely used statistics package MASS which is loaded by default when you load many other common modeling packages. You can make sure you are using the correct version of theselect()function by specifyingdplyr::select().
We can calculate summary statistics on the entire data frame.
Summary methods for data frames tell you how many missing values there are.
Many R functions that calculate summary statistics return a
NA value if there are any missing values in the dataset.
Let’s look at max():
x <- c(5.3, 12.2, -8.6)
x## [1]  5.3 12.2 -8.6max(x)## [1] 12.2x[2] <- NA
x## [1]  5.3   NA -8.6max(x)## [1] NAWhere there’s data, there’s missing data. There are lots of ways to deal with missing data. For now we will do the simplest thing, ignore it.
The R functions mean(), min(),
max(), and sum() have an argument
na.rm = TRUE which removes NA values before
computing the summary statistic. Its default value is FALSE
so you need to explicitly set it to TRUE if you want to
calculate the summary statistics without the missing values.
max(x, na.rm = TRUE)## [1] 5.3Note na.rm serves a useful purpose and I think it’s a
good idea to have FALSE be the default.
Before we start to do more complex operations on data frames, here’s
a little trick to make your code more readable: the pipe operator, or
%>%. If you look at any example code using tidyverse
packages that you find online, you will see this operator everywhere. It
is used to compose complex expressions that include “chains” of multiple
functions. The pipe is used to “pipe” the output of one function into
the input of the next function.
For instance take this nested function call:
sqrt(log(abs(x)))We start with a variable called x, take the absolute
value with abs(), take the natural log with
log(), then take the square root with sqrt().
But the order in which we do those operations is actually the opposite
of what it appears in the code. We can use the %>%
operator to make this code more readable. Not only are the operations
printed in a logical order, but we can also easily split the code into
multiple lines which is often easier for anyone who is reading your code
to decipher.
x %>%
  abs() %>%
  log() %>%
  sqrt()PROTIP: It’s really “future you” who is going to be the one reading the code 99% of the time, so any time you spend a little time and effort making your code more legible, you are actually saving yourself time in the long run … even if you’re the only one who ever looks at the code!
Let’s calculate summary statistics for entire columns of a data
frame. We’re going to put the %>% pipe operator into
practice now, to pass the output of one line of code to the next
line.
smammals %>%
  summarize(mean_weight = mean(weight), 
            sd_weight = sd(weight))## # A tibble: 1 × 2
##   mean_weight sd_weight
##         <dbl>     <dbl>
## 1          NA        NAAs you can see this output is not helpful. All it tells us is that
there is at least one missing (NA) value in the
weight column. There are two ways we can deal with
this.
One alternative is to subset the data frame using
filter(), keeping only rows where weight is not missing.
The function is.na(x) returns a vector of logical values,
TRUE if x is missing and FALSE if
it is not. We negate this with the ! operator to get a
vector that is TRUE if weight is not missing.
Then we can calculate the summary statistics. This shows you how to use
the %>% pipe to do two operations on a data frame
sequentially.
smammals %>%
  filter(!is.na(weight)) %>%
  summarize(mean_weight = mean(weight), 
            sd_weight = sd(weight))## # A tibble: 1 × 2
##   mean_weight sd_weight
##         <dbl>     <dbl>
## 1        30.5      33.7Or we can use na.rm as we demonstrated above. The result
should be identical.
smammals %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE), 
            sd_weight = sd(weight, na.rm = TRUE))## # A tibble: 1 × 2
##   mean_weight sd_weight
##         <dbl>     <dbl>
## 1        30.5      33.7You often want to calculate summary statistics for each group. We
will use the group_by() function here to get the mean and
standard deviation of body weight for each taxon ID.
smammals %>%
  filter(!is.na(weight)) %>%
  group_by(taxonID) %>%
  summarize(mean_weight = mean(weight), 
            sd_weight = sd(weight))## # A tibble: 17 × 3
##    taxonID mean_weight sd_weight
##    <chr>         <dbl>     <dbl>
##  1 CHHI           39.2    13.3  
##  2 MIOC           37.4     7.44 
##  3 MIPE           39.7    12.8  
##  4 MIPI           27       4.06 
##  5 MUMU           16.8     6.55 
##  6 MYGA           22.5     4.70 
##  7 NAIN           23.7     4.27 
##  8 NEFL          198.     63.5  
##  9 OCNU           20.5     0.707
## 10 PELE           21.3     4.93 
## 11 PEMA           21.1     4.54 
## 12 PEME           23      NA    
## 13 RARA          180      NA    
## 14 REHU           13      NA    
## 15 REME           14       1.73 
## 16 SIHI           95.6    49.6  
## 17 ZAHU           15.7     3.23We can filter() by more than one condition, and
group_by() more than one column. We can also put as many
summary statistics in the call to summarize() as we want.
For example we could get the number of values (using the
n() function), the mean, and the standard deviation of all
non-missing values of body weight for each combination of sex and life
stage, for one particular species. (PELE = Peromyscus
leucopus, the white-footed mouse).
smammals %>%
  filter(!is.na(weight), taxonID == 'PELE') %>%
  group_by(sex, lifeStage) %>%
  summarize(n_individuals = n(),
            mean_weight = mean(weight), 
            sd_weight = sd(weight))## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.## # A tibble: 8 × 5
## # Groups:   sex [3]
##   sex   lifeStage n_individuals mean_weight sd_weight
##   <chr> <chr>             <int>       <dbl>     <dbl>
## 1 F     adult               267        22.6      4.71
## 2 F     juvenile             16        13.8      2.02
## 3 F     subadult             35        16.6      3.56
## 4 F     <NA>                  2        18        7.07
## 5 M     adult               345        22.3      4.17
## 6 M     juvenile             19        14.0      2.31
## 7 M     subadult             54        16.5      3.20
## 8 <NA>  adult                 1        34       NANotice there are some NA values in the grouping columns.
The group_by() function considers those to be separate
groups. The sd_weight column also returns a NA
value if you try to compute the sample standard deviation of a single
value, which is undefined.
We can use the pipe %>% operator to chain an
arbitrarily long sequence of operations together. This makes for more
readable code.
smammals %>%
  filter(!is.na(weight), taxonID == 'PELE') %>%
  mutate(weight_mg = weight * 1000) %>%
  group_by(siteName) %>%
  summarize(mean_weight = mean(weight_mg)) %>%
  arrange(mean_weight)## # A tibble: 4 × 2
##   siteName                                   mean_weight
##   <chr>                                            <dbl>
## 1 Oak Ridge                                       18435.
## 2 Smithsonian Conservation Biology Institute      20608.
## 3 Harvard Forest                                  21742.
## 4 Konza Prairie Biological Station                26638.This is equivalent to
arrange(summarize(group_by(mutate(filter(smammals, !is.na(weight), taxonID == 'PELE'), weight_mg = weight * 1000), siteName), mean_weight = mean(weight_mg)), mean_weight)but in my opinion much more readable.
I introduced two new functions here. One is the mutate()
function, which can be used to change the value of a column or create a
new column. For instance we can use it to convert the units of weight
from g to mg, making a new column, weight_mg.
Here I’m also introducing the arrange() function, which
sorts the data frame in ascending order by a given column or columns.
arrange(-mean_weight) could be used to sort in descending
order instead.
We are almost done with this crash course in working with R data frames. We have two more functions to learn.
pivot_longer()pivot_wider()This set of examples is borrowed from Hadley Wickham’s book R for Data Science (Chapter 12: Tidy data).
Tidy data is a dataset that is organized so that the following are true:
Most of the model fitting functions in R require the data to be in this format (as well as many statistical procedures in SAS and other software).
All of the following data frames contain the same data: number of cases of a disease and total population, for three different countries (Afghanistan, Brazil, and China), in each of two years (1999 and 2000). Which of these data frames contain tidy data?
These are example datasets that come packaged with the tidyr package in R, part of the tidyverse.
table1## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583table2## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583table3## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583Table 1 is the only “tidy” data frame out of the three. Each row includes all the variables from one observation: a count of disease cases and total population in a country in one specific year constitutes an observation here.
There are two main functions for reshaping data. One is
pivot_wider() and the other is pivot_longer().
If we have data that looks like table2, and we want to make
it into tidy form like table1, we will need to use
pivot_wider(). We need to identify three components:
id_cols column(s) which provide the identifying
information for each observation, in this case the country and the
year.names_from column(s) which are the column or
columns that contains the labels that will become names of individual
columns in the wider version of the data framevalues_from column(s) which are the column or
columns that contain the data that will be spread across multiple
columns.table2_tidied <- pivot_wider(table2, id_cols = c(country, year), names_from = type, values_from = count)Use the function all.equal() to confirm that the two are
the same.
all.equal(table2_tidied, table1)## [1] TRUEIt’s also possible to reshape table3 to
table1 but this requires more complex operations that we
will not cover today.
A common situation is that we will get data in a wide form from an Excel spreadsheet. When entering data into a spreadsheet manually, it’s much easier to use a wide format because a long skinny table of data is very annoying to scroll through. Wide-form data is often how paper data sheets are formatted. So we often will need to reshape the data to a “longer” format to get an analysis-ready tidy dataset.
Here is an example: the Billboard top 100 song rankings by week for the year 2000. This is another example dataset from the tidyr package . . . read this list of top albums from 2000 if you are a ’90s kid and you want to feel “old as hell”! Use the data frame summary tools you now know to take a look at this dataset.
billboard## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
This CD cover may look familiar to the thirty-somethings reading
this
We want tidy data where each row is an observation: a song identified by its artist name, track name, and the date it entered the top 100 ranking, the week at which the ranking was observed, and the ranking. We have to identify
cols: the columns that need to be pivoted to a longer
format. In this example we can use a shorthand wk1:wk76 to
refer to all columns between wk1 and wk76
inclusive. You can also use the - sign to specify which
columns will not be pivoted (the identifying columns for each
observation).names_to: the name of the column into which the column
names will be put, as a quoted character string. (If you do not supply
this, it will default to name.)values_to: the name of the column into which the data
values in the cells will be put, as a quoted character string. (If you
do not supply this, it will default to value.)pivot_longer(billboard, cols = c(wk1:wk76), names_to = 'week', values_to = 'ranking')## # A tibble: 24,092 × 5
##    artist track                   date.entered week  ranking
##    <chr>  <chr>                   <date>       <chr>   <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1        87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2        82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3        72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4        77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5        87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6        94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7        99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8        NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9        NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10       NA
## # ℹ 24,082 more rowsThis is identical to the code above, instead we are specifying the columns not to pivot.
pivot_longer(billboard, cols = -c(artist, track, date.entered), names_to = 'week', values_to = 'ranking')## # A tibble: 24,092 × 5
##    artist track                   date.entered week  ranking
##    <chr>  <chr>                   <date>       <chr>   <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1        87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2        82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3        72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4        77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5        87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6        94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7        99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8        NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9        NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10       NA
## # ℹ 24,082 more rowsThose are very important topics that you will definitely need to learn about if you want to do serious data analysis with R. We don’t have time to cover them today but I would recommend using the R resources I posted on the workshop page to learn more about them. What we did just now only scratches the surface of how to wrangle data in R.
The dataset used for these exercises comes from an experimental study measuring greenhouse gas emissions from experimentally constructed barnyards. The dataset is hosted on Ag Data Commons.
You may combine the results from steps 3-6 into a single “piped”
statement using the %>% operator.
Read the CSV file at the URL
https://usda-ree-ars.github.io/SEAStats/R_boot_camp/datasets/barnyard_ghg.csv
into R as a data frame.
Examine the data frame. What are the data types of the different
columns? How many different values are there in the Surface
column?
str(),
summary(), and/or glimpse() to examine the
data frame. Use table() or unique() for the
Surface column.Create a subset of the data frame containing only the observations
where the temperature (Temp column) is greater than 10.
Remember R is case-sensitive!
Group the data frame by surface type.
Calculate the mean of the CO2_mgC, CH4_mgC,
NH3_ugN, and N2O_ugN columns.
Sort the result in increasing order of the mean value of
CO2_mgC.
Load the relig_income example dataset which has survey
data on the number of individuals of different religions by reported
yearly income. (Use data(relig_income) to load the
dataset.) Convert this dataset into “tidy” form so that each row has a
religion, income level, and number of individuals.
pivot_longer().religion.