Lesson 2 learning objectives
What is a data frame?
Note about tidyverse
Reading in data
Examining contents of a data frame
- Summary information about a data frame
- Subsetting values from a data frame
Calculating summary statistics
Tidy data and reshaping data frames
- Reshaping data
Hey! What about …
Exercises

Lesson 2 learning objectives

At the end of this lesson, students will …

Read a CSV file of data into the R workspace.
Know what a data frame is and understand what constitutes “tidy” data.
Do basic manipulations and reshaping of data frames.
Calculate summary statistics by group.

Download the worksheet for this lesson here.

What is a data frame?

We will not go into every possible R data type for this lesson. The one you really need to know about for model fitting is the data frame.

A data frame corresponds to a table or spreadsheet of data. Each column has its own data type. You might have numeric, character, and factor columns all in the same data frame. Ideally, each row will correspond to an individual observation.

Data frames in R are the workhorses for data manipulation and analysis. There are a lot of things you can do with data frames; we will go over some of the basics in this lesson.

Note about tidyverse

In this lesson, I am going to teach you functions from the tidyverse group of packages. These are a set of packages, designed to work together, that make it easy to wrangle and manipulate data in R. All of the stuff I am going to teach can be done with base R code without any packages. There are also alternatives such as data.table. Personally I use a combination of all three. However for this lesson I chose to teach tidyverse for a couple of reasons. For one thing, it’s the easiest for beginners to grasp. Also, it has become really popular and widespread in the last few years. So a lot of the other R resources you will find are now done with the tidyverse. So why swim against the tide? There are pros and cons to the tidyverse, but it is great for starting out with R. If you become a big time R programmer I would recommend learning all the base R code and other packages like data.table in the future.

Load the tidyverse package which is a shortcut to loading the multiple related tidyverse packages we will be using in this lesson.

library(tidyverse)

Reading in data

Read the file small_mammals.csv from the URL where it is hosted online. (Typically, you will import data from files on your local hard drive, but for these lessons we will work with data hosted on the course website.)

Read in an external file using the read_csv() function from the tidyverse package readr. There is a base R function for this as well but the readr function has better default values and reads large files more quickly. Also, read_csv returns a special kind of data frame called a “tibble” that works better with the other tidyverse functions.

smammals <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_boot_camp/datasets/small_mammals.csv')

## Rows: 1629 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): siteName, taxonID, family, sex, lifeStage
## dbl (2): hindfootLength, weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The data we are reading in here is a set of body measurements of small mammals from the NEON ecological observatory network. We will go on to examine this dataset in detail.

Examining contents of a data frame

There are a lot of R functions and operators for looking at your data frame. We can get summary information on a data frame or pull out individual bits of data from it. We’ll go through a few ways of doing that.

Summary information about a data frame

Let’s look at the object which is now loaded into our workspace that we called smammals. If we just type smammals into our console we see the first few rows of data and some information about its dimensions.

smammals

## # A tibble: 1,629 × 7
##    siteName       taxonID family     sex   lifeStage hindfootLength weight
##    <chr>          <chr>   <chr>      <chr> <chr>              <dbl>  <dbl>
##  1 Harvard Forest MYGA    Cricetidae M     adult               18       19
##  2 Harvard Forest MYGA    Cricetidae M     adult               17       22
##  3 Harvard Forest MYGA    Cricetidae M     adult               18       23
##  4 Harvard Forest MYGA    Cricetidae F     adult               18       21
##  5 Harvard Forest MYGA    Cricetidae F     juvenile            17       11
##  6 Harvard Forest MYGA    Cricetidae F     adult               17.5     17
##  7 Harvard Forest MYGA    Cricetidae M     adult               17       22
##  8 Harvard Forest PELE    Cricetidae F     adult               21       36
##  9 Harvard Forest PELE    Cricetidae M     subadult            20       20
## 10 Harvard Forest PELE    Cricetidae M     subadult            20       19
## # ℹ 1,619 more rows

As the informational text says, we can print all the values of smammals if we explicitly tell R to do this. It’s nice that it does not do this by default especially if we have data frames with thousands or millions of rows.

print(smammals, n = 50)

## # A tibble: 1,629 × 7
##    siteName       taxonID family     sex   lifeStage hindfootLength weight
##    <chr>          <chr>   <chr>      <chr> <chr>              <dbl>  <dbl>
##  1 Harvard Forest MYGA    Cricetidae M     adult               18     19  
##  2 Harvard Forest MYGA    Cricetidae M     adult               17     22  
##  3 Harvard Forest MYGA    Cricetidae M     adult               18     23  
##  4 Harvard Forest MYGA    Cricetidae F     adult               18     21  
##  5 Harvard Forest MYGA    Cricetidae F     juvenile            17     11  
##  6 Harvard Forest MYGA    Cricetidae F     adult               17.5   17  
##  7 Harvard Forest MYGA    Cricetidae M     adult               17     22  
##  8 Harvard Forest PELE    Cricetidae F     adult               21     36  
##  9 Harvard Forest PELE    Cricetidae M     subadult            20     20  
## 10 Harvard Forest PELE    Cricetidae M     subadult            20     19  
## 11 Harvard Forest PELE    Cricetidae M     subadult            19     22  
## 12 Harvard Forest PELE    Cricetidae F     adult               20     26  
## 13 Harvard Forest PELE    Cricetidae F     adult               20     22  
## 14 Harvard Forest PELE    Cricetidae F     adult               20.5   22.5
## 15 Harvard Forest PELE    Cricetidae F     adult               20     29  
## 16 Harvard Forest PELE    Cricetidae F     adult               19     22  
## 17 Harvard Forest PELE    Cricetidae M     adult               21     32  
## 18 Harvard Forest PELE    Cricetidae F     subadult            20     17  
## 19 Harvard Forest PELE    Cricetidae F     adult               19.5   23  
## 20 Harvard Forest PELE    Cricetidae F     adult               20     23  
## 21 Harvard Forest PELE    Cricetidae F     adult               21     27  
## 22 Harvard Forest PELE    Cricetidae F     adult               20     23  
## 23 Harvard Forest PELE    Cricetidae F     adult               20.5   22  
## 24 Harvard Forest PELE    Cricetidae F     adult               20     24  
## 25 Harvard Forest PELE    Cricetidae M     adult               20.5   22.5
## 26 Harvard Forest PELE    Cricetidae M     adult               20.5   18.5
## 27 Harvard Forest PELE    Cricetidae F     subadult            20     19  
## 28 Harvard Forest PELE    Cricetidae M     subadult            19     15  
## 29 Harvard Forest PELE    Cricetidae M     subadult            20     21  
## 30 Harvard Forest PELE    Cricetidae M     subadult            18     14  
## 31 Harvard Forest PELE    Cricetidae M     adult               19     19  
## 32 Harvard Forest PEMA    Cricetidae F     adult               20     23.5
## 33 Harvard Forest PEMA    Cricetidae M     subadult            21     20  
## 34 Harvard Forest PEMA    Cricetidae M     adult               20     21  
## 35 Harvard Forest PEMA    Cricetidae M     adult               20     21  
## 36 Harvard Forest PEMA    Cricetidae F     subadult            20     20  
## 37 Harvard Forest PEMA    Cricetidae M     subadult            20     16  
## 38 Harvard Forest PEMA    Cricetidae M     subadult            19     17  
## 39 Harvard Forest PEMA    Cricetidae M     adult               19     20  
## 40 Harvard Forest PEMA    Cricetidae F     adult               21     21.5
## 41 Harvard Forest PEMA    Cricetidae F     subadult            19     18  
## 42 Harvard Forest PEMA    Cricetidae F     subadult            18     17  
## 43 Harvard Forest PEMA    Cricetidae F     adult               19     22  
## 44 Harvard Forest PEMA    Cricetidae F     adult               20     27.5
## 45 Harvard Forest PEMA    Cricetidae M     adult               20     18  
## 46 Harvard Forest PEMA    Cricetidae M     adult               20.5   21  
## 47 Harvard Forest PEMA    Cricetidae F     adult               20     25  
## 48 Harvard Forest PEMA    Cricetidae F     adult               21     30  
## 49 Harvard Forest PEMA    Cricetidae F     adult               20     22  
## 50 Harvard Forest PEMA    Cricetidae F     adult               21     20  
## # ℹ 1,579 more rows

Some other useful functions to examine a data frame are summary() and str() from base R, and glimpse() from tidyverse. Try calling them and see what output you get.

We can see from the output of those summary functions that we have 1629 rows and 7 columns. There is a mixture of character columns (chr) and numeric columns (dbl refers to double-precision, meaning a numeric value that has a decimal point). Calling summary(smammals) also gives us some quick summary statistics for the mean and range of the numeric columns, as well as the number of missing values (NA values). Both numeric columns have several dozen missing values.

Subsetting values from a data frame

If we want to extract only specific subsets of the data, we can do that by indexing rows and columns. The square brackets [] are used for this. You can get individual elements of a data frame with brackets. The syntax is dataframe[row, column]. For example this will return the value in row 2, column 3 of the data frame. It actually returns a 1x1 tibble.

smammals[2, 3]

## # A tibble: 1 × 1
##   family    
##   <chr>     
## 1 Cricetidae

We can also get an entire row by leaving the column part blank, and an entire column by leaving the row part blank. This gives us the 5th row:

smammals[5, ]

## # A tibble: 1 × 7
##   siteName       taxonID family     sex   lifeStage hindfootLength weight
##   <chr>          <chr>   <chr>      <chr> <chr>              <dbl>  <dbl>
## 1 Harvard Forest MYGA    Cricetidae F     juvenile              17     11

and this gives us the 2nd column:

smammals[, 2]

## # A tibble: 1,629 × 1
##    taxonID
##    <chr>  
##  1 MYGA   
##  2 MYGA   
##  3 MYGA   
##  4 MYGA   
##  5 MYGA   
##  6 MYGA   
##  7 MYGA   
##  8 PELE   
##  9 PELE   
## 10 PELE   
## # ℹ 1,619 more rows

We can also subset with ranges, for example this gives us the 6th through 10th rows, including all columns:

smammals[6:10, ]

## # A tibble: 5 × 7
##   siteName       taxonID family     sex   lifeStage hindfootLength weight
##   <chr>          <chr>   <chr>      <chr> <chr>              <dbl>  <dbl>
## 1 Harvard Forest MYGA    Cricetidae F     adult               17.5     17
## 2 Harvard Forest MYGA    Cricetidae M     adult               17       22
## 3 Harvard Forest PELE    Cricetidae F     adult               21       36
## 4 Harvard Forest PELE    Cricetidae M     subadult            20       20
## 5 Harvard Forest PELE    Cricetidae M     subadult            20       19

and this gives us the first 5 rows, but only columns 1 and 2.

smammals[1:5, 1:2]

## # A tibble: 5 × 2
##   siteName       taxonID
##   <chr>          <chr>  
## 1 Harvard Forest MYGA   
## 2 Harvard Forest MYGA   
## 3 Harvard Forest MYGA   
## 4 Harvard Forest MYGA   
## 5 Harvard Forest MYGA

You can also get columns by name. Just make sure to pass the vector of column names using the c() function, and put quotes around the names. This gives us the 20th row and columns called taxonID and sex.

smammals[20, c('taxonID', 'sex')]

## # A tibble: 1 × 2
##   taxonID sex  
##   <chr>   <chr>
## 1 PELE    F

To extract a single column from a data frame as a vector, you can use the $ operator.

head(smammals$siteName)

## [1] "Harvard Forest" "Harvard Forest" "Harvard Forest" "Harvard Forest"
## [5] "Harvard Forest" "Harvard Forest"

If the data frame column name contains special characters, surround the name with backticks like smammals$`column 10`.

For example, let’s look at all unique values in the siteName column using the unique() function.

unique(smammals$siteName)

## [1] "Harvard Forest"                            
## [2] "Konza Prairie Biological Station"          
## [3] "Oak Ridge"                                 
## [4] "Smithsonian Conservation Biology Institute"

Another function that you often use on single data frame columns is table(), which gives you the number of values of each category.

table(smammals$sex)

## 
##   F   M   U 
## 724 881   8

The tidyverse functions filter() and select() are used to give you subsets of rows and columns, respectively. A common use of filter() is to subset rows by a certain condition. For example we can use filter() to get all rows where weight is less than 10.

filter(smammals, weight < 10)

## # A tibble: 14 × 7
##    siteName                 taxonID family sex   lifeStage hindfootLength weight
##    <chr>                    <chr>   <chr>  <chr> <chr>              <dbl>  <dbl>
##  1 Smithsonian Conservatio… PELE    Crice… M     juvenile              19      9
##  2 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      8
##  3 Smithsonian Conservatio… MUMU    Murid… F     juvenile              17      9
##  4 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      8
##  5 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      9
##  6 Smithsonian Conservatio… MUMU    Murid… F     juvenile              18      7
##  7 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      8
##  8 Smithsonian Conservatio… MUMU    Murid… M     juvenile              18      7
##  9 Smithsonian Conservatio… MUMU    Murid… M     juvenile              18      7
## 10 Smithsonian Conservatio… MUMU    Murid… F     juvenile              18      8
## 11 Smithsonian Conservatio… MUMU    Murid… M     juvenile              17      7
## 12 Smithsonian Conservatio… MUMU    Murid… M     juvenile              18      8
## 13 Smithsonian Conservatio… MUMU    Murid… F     juvenile              16      9
## 14 Harvard Forest           MYGA    Crice… M     juvenile              NA      9

And we can use select() to get columns we want. We do not need to use quotes on column names within select():

select(smammals, taxonID, weight)

## # A tibble: 1,629 × 2
##    taxonID weight
##    <chr>    <dbl>
##  1 MYGA        19
##  2 MYGA        22
##  3 MYGA        23
##  4 MYGA        21
##  5 MYGA        11
##  6 MYGA        17
##  7 MYGA        22
##  8 PELE        36
##  9 PELE        20
## 10 PELE        19
## # ℹ 1,619 more rows

PROTIP: A common “gotcha” is that a function called select() also exists in the widely used statistics package MASS which is loaded by default when you load many other common modeling packages. You can make sure you are using the correct version of the select() function by specifying dplyr::select().

Calculating summary statistics

We can calculate summary statistics on the entire data frame.

Dealing with missing data

Summary methods for data frames tell you how many missing values there are.

Many R functions that calculate summary statistics return a NA value if there are any missing values in the dataset. Let’s look at max():

x <- c(5.3, 12.2, -8.6)

x

## [1]  5.3 12.2 -8.6

max(x)

## [1] 12.2

x[2] <- NA

x

## [1]  5.3   NA -8.6

max(x)

## [1] NA

Where there’s data, there’s missing data. There are lots of ways to deal with missing data. For now we will do the simplest thing, ignore it.

The R functions mean(), min(), max(), and sum() have an argument na.rm = TRUE which removes NA values before computing the summary statistic. Its default value is FALSE so you need to explicitly set it to TRUE if you want to calculate the summary statistics without the missing values.

max(x, na.rm = TRUE)

## [1] 5.3

Note na.rm serves a useful purpose and I think it’s a good idea to have FALSE be the default.

The pipe operator

Before we start to do more complex operations on data frames, here’s a little trick to make your code more readable: the pipe operator, or %>%. If you look at any example code using tidyverse packages that you find online, you will see this operator everywhere. It is used to compose complex expressions that include “chains” of multiple functions. The pipe is used to “pipe” the output of one function into the input of the next function.

For instance take this nested function call:

sqrt(log(abs(x)))

We start with a variable called x, take the absolute value with abs(), take the natural log with log(), then take the square root with sqrt(). But the order in which we do those operations is actually the opposite of what it appears in the code. We can use the %>% operator to make this code more readable. Not only are the operations printed in a logical order, but we can also easily split the code into multiple lines which is often easier for anyone who is reading your code to decipher.

x %>%
  abs() %>%
  log() %>%
  sqrt()

PROTIP: It’s really “future you” who is going to be the one reading the code 99% of the time, so any time you spend a little time and effort making your code more legible, you are actually saving yourself time in the long run … even if you’re the only one who ever looks at the code!

Summary statistics on an entire data frame

Let’s calculate summary statistics for entire columns of a data frame. We’re going to put the %>% pipe operator into practice now, to pass the output of one line of code to the next line.

smammals %>%
  summarize(mean_weight = mean(weight), 
            sd_weight = sd(weight))

## # A tibble: 1 × 2
##   mean_weight sd_weight
##         <dbl>     <dbl>
## 1          NA        NA

As you can see this output is not helpful. All it tells us is that there is at least one missing (NA) value in the weight column. There are two ways we can deal with this.

One alternative is to subset the data frame using filter(), keeping only rows where weight is not missing. The function is.na(x) returns a vector of logical values, TRUE if x is missing and FALSE if it is not. We negate this with the ! operator to get a vector that is TRUE if weight is not missing. Then we can calculate the summary statistics. This shows you how to use the %>% pipe to do two operations on a data frame sequentially.

smammals %>%
  filter(!is.na(weight)) %>%
  summarize(mean_weight = mean(weight), 
            sd_weight = sd(weight))

## # A tibble: 1 × 2
##   mean_weight sd_weight
##         <dbl>     <dbl>
## 1        30.5      33.7

Or we can use na.rm as we demonstrated above. The result should be identical.

smammals %>%
  summarize(mean_weight = mean(weight, na.rm = TRUE), 
            sd_weight = sd(weight, na.rm = TRUE))

## # A tibble: 1 × 2
##   mean_weight sd_weight
##         <dbl>     <dbl>
## 1        30.5      33.7

Calculating summary statistics by group

You often want to calculate summary statistics for each group. We will use the group_by() function here to get the mean and standard deviation of body weight for each taxon ID.

smammals %>%
  filter(!is.na(weight)) %>%
  group_by(taxonID) %>%
  summarize(mean_weight = mean(weight), 
            sd_weight = sd(weight))

## # A tibble: 17 × 3
##    taxonID mean_weight sd_weight
##    <chr>         <dbl>     <dbl>
##  1 CHHI           39.2    13.3  
##  2 MIOC           37.4     7.44 
##  3 MIPE           39.7    12.8  
##  4 MIPI           27       4.06 
##  5 MUMU           16.8     6.55 
##  6 MYGA           22.5     4.70 
##  7 NAIN           23.7     4.27 
##  8 NEFL          198.     63.5  
##  9 OCNU           20.5     0.707
## 10 PELE           21.3     4.93 
## 11 PEMA           21.1     4.54 
## 12 PEME           23      NA    
## 13 RARA          180      NA    
## 14 REHU           13      NA    
## 15 REME           14       1.73 
## 16 SIHI           95.6    49.6  
## 17 ZAHU           15.7     3.23

We can filter() by more than one condition, and group_by() more than one column. We can also put as many summary statistics in the call to summarize() as we want. For example we could get the number of values (using the n() function), the mean, and the standard deviation of all non-missing values of body weight for each combination of sex and life stage, for one particular species. (PELE = Peromyscus leucopus, the white-footed mouse).

smammals %>%
  filter(!is.na(weight), taxonID == 'PELE') %>%
  group_by(sex, lifeStage) %>%
  summarize(n_individuals = n(),
            mean_weight = mean(weight), 
            sd_weight = sd(weight))

## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.

## # A tibble: 8 × 5
## # Groups:   sex [3]
##   sex   lifeStage n_individuals mean_weight sd_weight
##   <chr> <chr>             <int>       <dbl>     <dbl>
## 1 F     adult               267        22.6      4.71
## 2 F     juvenile             16        13.8      2.02
## 3 F     subadult             35        16.6      3.56
## 4 F     <NA>                  2        18        7.07
## 5 M     adult               345        22.3      4.17
## 6 M     juvenile             19        14.0      2.31
## 7 M     subadult             54        16.5      3.20
## 8 <NA>  adult                 1        34       NA

White-footed mouse, image by Peterwchen

Notice there are some NA values in the grouping columns. The group_by() function considers those to be separate groups. The sd_weight column also returns a NA value if you try to compute the sample standard deviation of a single value, which is undefined.

More pipe!

We can use the pipe %>% operator to chain an arbitrarily long sequence of operations together. This makes for more readable code.

smammals %>%
  filter(!is.na(weight), taxonID == 'PELE') %>%
  mutate(weight_mg = weight * 1000) %>%
  group_by(siteName) %>%
  summarize(mean_weight = mean(weight_mg)) %>%
  arrange(mean_weight)

## # A tibble: 4 × 2
##   siteName                                   mean_weight
##   <chr>                                            <dbl>
## 1 Oak Ridge                                       18435.
## 2 Smithsonian Conservation Biology Institute      20608.
## 3 Harvard Forest                                  21742.
## 4 Konza Prairie Biological Station                26638.

This is equivalent to arrange(summarize(group_by(mutate(filter(smammals, !is.na(weight), taxonID == 'PELE'), weight_mg = weight * 1000), siteName), mean_weight = mean(weight_mg)), mean_weight) but in my opinion much more readable.

I introduced two new functions here. One is the mutate() function, which can be used to change the value of a column or create a new column. For instance we can use it to convert the units of weight from g to mg, making a new column, weight_mg.

Here I’m also introducing the arrange() function, which sorts the data frame in ascending order by a given column or columns. arrange(-mean_weight) could be used to sort in descending order instead.

Tidy data and reshaping data frames

We are almost done with this crash course in working with R data frames. We have two more functions to learn.

pivot_longer()
pivot_wider()

This set of examples is borrowed from Hadley Wickham’s book R for Data Science (Chapter 12: Tidy data).

Tidy data is a dataset that is organized so that the following are true:

Each observation has its own row.
Each variable has its own column.
Each value has its own cell.

Most of the model fitting functions in R require the data to be in this format (as well as many statistical procedures in SAS and other software).

All of the following data frames contain the same data: number of cases of a disease and total population, for three different countries (Afghanistan, Brazil, and China), in each of two years (1999 and 2000). Which of these data frames contain tidy data?

These are example datasets that come packaged with the tidyr package in R, part of the tidyverse.

table1

## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

table2

## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

table3

## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Table 1 is the only “tidy” data frame out of the three. Each row includes all the variables from one observation: a count of disease cases and total population in a country in one specific year constitutes an observation here.

Reshaping data

There are two main functions for reshaping data. One is pivot_wider() and the other is pivot_longer(). If we have data that looks like table2, and we want to make it into tidy form like table1, we will need to use pivot_wider(). We need to identify three components:

the id_cols column(s) which provide the identifying information for each observation, in this case the country and the year.
the names_from column(s) which are the column or columns that contains the labels that will become names of individual columns in the wider version of the data frame
the values_from column(s) which are the column or columns that contain the data that will be spread across multiple columns.

table2_tidied <- pivot_wider(table2, id_cols = c(country, year), names_from = type, values_from = count)

Use the function all.equal() to confirm that the two are the same.

all.equal(table2_tidied, table1)

## [1] TRUE

It’s also possible to reshape table3 to table1 but this requires more complex operations that we will not cover today.

Wide to long reshape

A common situation is that we will get data in a wide form from an Excel spreadsheet. When entering data into a spreadsheet manually, it’s much easier to use a wide format because a long skinny table of data is very annoying to scroll through. Wide-form data is often how paper data sheets are formatted. So we often will need to reshape the data to a “longer” format to get an analysis-ready tidy dataset.

Here is an example: the Billboard top 100 song rankings by week for the year 2000. This is another example dataset from the tidyr package . . . read this list of top albums from 2000 if you are a ’90s kid and you want to feel “old as hell”! Use the data frame summary tools you now know to take a look at this dataset.

billboard

## # A tibble: 317 × 79
##    artist     track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##    <chr>      <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2 Pac      Baby… 2000-02-26      87    82    72    77    87    94    99    NA
##  2 2Ge+her    The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
##  3 3 Doors D… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
##  4 3 Doors D… Loser 2000-10-21      76    76    72    69    67    65    55    59
##  5 504 Boyz   Wobb… 2000-04-15      57    34    25    17    17    31    36    49
##  6 98^0       Give… 2000-08-19      51    39    34    26    26    19     2     2
##  7 A*Teens    Danc… 2000-07-08      97    97    96    95   100    NA    NA    NA
##  8 Aaliyah    I Do… 2000-01-29      84    62    51    41    38    35    35    38
##  9 Aaliyah    Try … 2000-03-18      59    53    38    28    21    18    16    14
## 10 Adams, Yo… Open… 2000-08-26      76    76    74    69    68    67    61    58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …

90s kids remember this
This CD cover may look familiar to the thirty-somethings reading this

We want tidy data where each row is an observation: a song identified by its artist name, track name, and the date it entered the top 100 ranking, the week at which the ranking was observed, and the ranking. We have to identify

cols: the columns that need to be pivoted to a longer format. In this example we can use a shorthand wk1:wk76 to refer to all columns between wk1 and wk76 inclusive. You can also use the - sign to specify which columns will not be pivoted (the identifying columns for each observation).
names_to: the name of the column into which the column names will be put, as a quoted character string. (If you do not supply this, it will default to name.)
values_to: the name of the column into which the data values in the cells will be put, as a quoted character string. (If you do not supply this, it will default to value.)

pivot_longer(billboard, cols = c(wk1:wk76), names_to = 'week', values_to = 'ranking')

## # A tibble: 24,092 × 5
##    artist track                   date.entered week  ranking
##    <chr>  <chr>                   <date>       <chr>   <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1        87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2        82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3        72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4        77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5        87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6        94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7        99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8        NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9        NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10       NA
## # ℹ 24,082 more rows

This is identical to the code above, instead we are specifying the columns not to pivot.

pivot_longer(billboard, cols = -c(artist, track, date.entered), names_to = 'week', values_to = 'ranking')

## # A tibble: 24,092 × 5
##    artist track                   date.entered week  ranking
##    <chr>  <chr>                   <date>       <chr>   <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1        87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2        82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3        72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4        77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5        87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6        94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7        99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8        NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9        NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10       NA
## # ℹ 24,082 more rows

Hey! What about …

Joining two or more data frames together
Operations on lists of data frames or data frames containing lists

Those are very important topics that you will definitely need to learn about if you want to do serious data analysis with R. We don’t have time to cover them today but I would recommend using the R resources I posted on the workshop page to learn more about them. What we did just now only scratches the surface of how to wrangle data in R.

Exercises

The dataset used for these exercises comes from an experimental study measuring greenhouse gas emissions from experimentally constructed barnyards. The dataset is hosted on Ag Data Commons.

You may combine the results from steps 3-6 into a single “piped” statement using the %>% operator.

Exercise 1

Read the CSV file at the URL https://usda-ree-ars.github.io/SEAStats/R_boot_camp/datasets/barnyard_ghg.csv into R as a data frame.

Exercise 2

Examine the data frame. What are the data types of the different columns? How many different values are there in the Surface column?

Hint: Use str(), summary(), and/or glimpse() to examine the data frame. Use table() or unique() for the Surface column.

Exercise 3

Create a subset of the data frame containing only the observations where the temperature (Temp column) is greater than 10. Remember R is case-sensitive!

Exercise 4

Group the data frame by surface type.

Exercise 5

Calculate the mean of the CO2_mgC, CH4_mgC, NH3_ugN, and N2O_ugN columns.

Exercise 6

Sort the result in increasing order of the mean value of CO2_mgC.

Exercise 7

Load the relig_income example dataset which has survey data on the number of individuals of different religions by reported yearly income. (Use data(relig_income) to load the dataset.) Convert this dataset into “tidy” form so that each row has a religion, income level, and number of individuals.

Hint 1: Use pivot_longer().
Hint 2: You may specify either the columns to pivot, or the columns not to pivot. There is only one column not to pivot, religion.

Click here for answers

Lesson 2: Working with data frames

Quentin D. Read

August 11, 2023