At the end of this lesson, students will …
We will not go into every possible R data type for this lesson. The one you really need to know about for model fitting is the data frame.
A data frame corresponds to a table or spreadsheet of data. Each column has its own data type. You might have numeric, character, and factor columns all in the same data frame. Ideally, each row will correspond to an individual observation.
Data frames in R are the workhorses for data manipulation and analysis. There are a lot of things you can do with data frames; we will go over some of the basics in this lesson.
In this lesson, I am going to teach you functions from the tidyverse group of packages. These are a set of packages, designed to work together, that make it easy to wrangle and manipulate data in R. Everything I am going to teach can also be done with base R, without any packages, and there are alternatives such as data.table; personally, I use a combination of all three. However, for this lesson I chose to teach the tidyverse for a couple of reasons. For one thing, it is the easiest for beginners to grasp. For another, it has become very popular and widespread in the last few years, so many of the other R resources you will find now use the tidyverse. Why swim against the tide? There are pros and cons to the tidyverse, but it is great for starting out with R. If you go on to become a serious R programmer, I would recommend also learning base R and other packages like data.table in the future.
Load the tidyverse package which is a shortcut to loading the multiple related tidyverse packages we will be using in this lesson.
library(tidyverse)
Read the file small_mammals.csv from the URL where it is hosted online. (Typically, you will import data from files on your local hard drive, but for these lessons we will work with data hosted on the course website.)
Read in an external file using the read_csv() function from the tidyverse package readr. There is a base R function for this as well, but the readr function has better default values and reads large files more quickly. Also, read_csv() returns a special kind of data frame called a “tibble” that works better with the other tidyverse functions.
smammals <- read_csv('https://usda-ree-ars.github.io/SEAStats/R_boot_camp/datasets/small_mammals.csv')
## Rows: 1629 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): siteName, taxonID, family, sex, lifeStage
## dbl (2): hindfootLength, weight
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data we are reading in here is a set of body measurements of small mammals from the NEON ecological observatory network. We will go on to examine this dataset in detail.
There are a lot of R functions and operators for looking at your data frame. We can get summary information on a data frame or pull out individual bits of data from it. We’ll go through a few ways of doing that.
Let’s look at the object we called smammals, which is now loaded into our workspace. If we just type smammals into our console, we see the first few rows of data and some information about its dimensions.
smammals
## # A tibble: 1,629 × 7
## siteName taxonID family sex lifeStage hindfootLength weight
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Harvard Forest MYGA Cricetidae M adult 18 19
## 2 Harvard Forest MYGA Cricetidae M adult 17 22
## 3 Harvard Forest MYGA Cricetidae M adult 18 23
## 4 Harvard Forest MYGA Cricetidae F adult 18 21
## 5 Harvard Forest MYGA Cricetidae F juvenile 17 11
## 6 Harvard Forest MYGA Cricetidae F adult 17.5 17
## 7 Harvard Forest MYGA Cricetidae M adult 17 22
## 8 Harvard Forest PELE Cricetidae F adult 21 36
## 9 Harvard Forest PELE Cricetidae M subadult 20 20
## 10 Harvard Forest PELE Cricetidae M subadult 20 19
## # ℹ 1,619 more rows
As the informational text says, we can print more (or all) of the rows of smammals if we explicitly tell R to do so. It’s nice that it does not print everything by default, especially for data frames with thousands or millions of rows. For example, here are the first 50 rows:
print(smammals, n = 50)
## # A tibble: 1,629 × 7
## siteName taxonID family sex lifeStage hindfootLength weight
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Harvard Forest MYGA Cricetidae M adult 18 19
## 2 Harvard Forest MYGA Cricetidae M adult 17 22
## 3 Harvard Forest MYGA Cricetidae M adult 18 23
## 4 Harvard Forest MYGA Cricetidae F adult 18 21
## 5 Harvard Forest MYGA Cricetidae F juvenile 17 11
## 6 Harvard Forest MYGA Cricetidae F adult 17.5 17
## 7 Harvard Forest MYGA Cricetidae M adult 17 22
## 8 Harvard Forest PELE Cricetidae F adult 21 36
## 9 Harvard Forest PELE Cricetidae M subadult 20 20
## 10 Harvard Forest PELE Cricetidae M subadult 20 19
## 11 Harvard Forest PELE Cricetidae M subadult 19 22
## 12 Harvard Forest PELE Cricetidae F adult 20 26
## 13 Harvard Forest PELE Cricetidae F adult 20 22
## 14 Harvard Forest PELE Cricetidae F adult 20.5 22.5
## 15 Harvard Forest PELE Cricetidae F adult 20 29
## 16 Harvard Forest PELE Cricetidae F adult 19 22
## 17 Harvard Forest PELE Cricetidae M adult 21 32
## 18 Harvard Forest PELE Cricetidae F subadult 20 17
## 19 Harvard Forest PELE Cricetidae F adult 19.5 23
## 20 Harvard Forest PELE Cricetidae F adult 20 23
## 21 Harvard Forest PELE Cricetidae F adult 21 27
## 22 Harvard Forest PELE Cricetidae F adult 20 23
## 23 Harvard Forest PELE Cricetidae F adult 20.5 22
## 24 Harvard Forest PELE Cricetidae F adult 20 24
## 25 Harvard Forest PELE Cricetidae M adult 20.5 22.5
## 26 Harvard Forest PELE Cricetidae M adult 20.5 18.5
## 27 Harvard Forest PELE Cricetidae F subadult 20 19
## 28 Harvard Forest PELE Cricetidae M subadult 19 15
## 29 Harvard Forest PELE Cricetidae M subadult 20 21
## 30 Harvard Forest PELE Cricetidae M subadult 18 14
## 31 Harvard Forest PELE Cricetidae M adult 19 19
## 32 Harvard Forest PEMA Cricetidae F adult 20 23.5
## 33 Harvard Forest PEMA Cricetidae M subadult 21 20
## 34 Harvard Forest PEMA Cricetidae M adult 20 21
## 35 Harvard Forest PEMA Cricetidae M adult 20 21
## 36 Harvard Forest PEMA Cricetidae F subadult 20 20
## 37 Harvard Forest PEMA Cricetidae M subadult 20 16
## 38 Harvard Forest PEMA Cricetidae M subadult 19 17
## 39 Harvard Forest PEMA Cricetidae M adult 19 20
## 40 Harvard Forest PEMA Cricetidae F adult 21 21.5
## 41 Harvard Forest PEMA Cricetidae F subadult 19 18
## 42 Harvard Forest PEMA Cricetidae F subadult 18 17
## 43 Harvard Forest PEMA Cricetidae F adult 19 22
## 44 Harvard Forest PEMA Cricetidae F adult 20 27.5
## 45 Harvard Forest PEMA Cricetidae M adult 20 18
## 46 Harvard Forest PEMA Cricetidae M adult 20.5 21
## 47 Harvard Forest PEMA Cricetidae F adult 20 25
## 48 Harvard Forest PEMA Cricetidae F adult 21 30
## 49 Harvard Forest PEMA Cricetidae F adult 20 22
## 50 Harvard Forest PEMA Cricetidae F adult 21 20
## # ℹ 1,579 more rows
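If you really do want to see every row, you can pass n = Inf, which the tibble print method accepts (output not shown here):
print(smammals, n = Inf) # print every row of the tibble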
Some other useful functions to examine a data frame are
summary()
and str()
from base R, and
glimpse()
from tidyverse. Try calling them and see what
output you get.
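For example, you could run them like this (output not shown here):
# Structure of the data frame: column names, types, and the first few values
str(smammals)
# Summary statistics for each column, including counts of missing values
summary(smammals)
# tidyverse-style overview, similar to printing the tibble but transposed
glimpse(smammals)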
We can see from the output of those summary functions that we have
1629 rows and 7 columns. There is a mixture of character columns
(chr
) and numeric columns (dbl
refers to
double-precision, meaning a numeric value that has a decimal point).
Calling summary(smammals)
also gives us some quick summary
statistics for the mean and range of the numeric columns, as well as the
number of missing values (NA
values). Both numeric columns
have several dozen missing values.
If we want to extract only specific subsets of the data, we can do
that by indexing rows and columns. The square brackets []
are used for this. You can get individual elements of a data frame with
brackets. The syntax is dataframe[row, column]
. For example
this will return the value in row 2, column 3 of the data frame. It
actually returns a 1x1 tibble.
smammals[2, 3]
## # A tibble: 1 × 1
## family
## <chr>
## 1 Cricetidae
We can also get an entire row by leaving the column
part
blank, and an entire column by leaving the row
part blank.
This gives us the 5th row:
smammals[5, ]
## # A tibble: 1 × 7
## siteName taxonID family sex lifeStage hindfootLength weight
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Harvard Forest MYGA Cricetidae F juvenile 17 11
and this gives us the 2nd column:
smammals[, 2]
## # A tibble: 1,629 × 1
## taxonID
## <chr>
## 1 MYGA
## 2 MYGA
## 3 MYGA
## 4 MYGA
## 5 MYGA
## 6 MYGA
## 7 MYGA
## 8 PELE
## 9 PELE
## 10 PELE
## # ℹ 1,619 more rows
We can also subset with ranges, for example this gives us the 6th through 10th rows, including all columns:
smammals[6:10, ]
## # A tibble: 5 × 7
## siteName taxonID family sex lifeStage hindfootLength weight
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Harvard Forest MYGA Cricetidae F adult 17.5 17
## 2 Harvard Forest MYGA Cricetidae M adult 17 22
## 3 Harvard Forest PELE Cricetidae F adult 21 36
## 4 Harvard Forest PELE Cricetidae M subadult 20 20
## 5 Harvard Forest PELE Cricetidae M subadult 20 19
and this gives us the first 5 rows, but only columns 1 and 2.
smammals[1:5, 1:2]
## # A tibble: 5 × 2
## siteName taxonID
## <chr> <chr>
## 1 Harvard Forest MYGA
## 2 Harvard Forest MYGA
## 3 Harvard Forest MYGA
## 4 Harvard Forest MYGA
## 5 Harvard Forest MYGA
You can also get columns by name. Just make sure to pass the vector
of column names using the c()
function, and put quotes
around the names. This gives us the 20th row and columns called
taxonID
and sex
.
smammals[20, c('taxonID', 'sex')]
## # A tibble: 1 × 2
## taxonID sex
## <chr> <chr>
## 1 PELE F
To extract a single column from a data frame as a vector, you can use
the $
operator.
head(smammals$siteName)
## [1] "Harvard Forest" "Harvard Forest" "Harvard Forest" "Harvard Forest"
## [5] "Harvard Forest" "Harvard Forest"
If the data frame column name contains special characters, surround
the name with backticks like smammals$`column 10`
.
For example, let’s look at all unique values in the
siteName
column using the unique()
function.
unique(smammals$siteName)
## [1] "Harvard Forest"
## [2] "Konza Prairie Biological Station"
## [3] "Oak Ridge"
## [4] "Smithsonian Conservation Biology Institute"
Another function that you often use on single data frame columns is
table()
, which gives you the number of values of each
category.
table(smammals$sex)
##
## F M U
## 724 881 8
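You can also pass two columns to table() to get a cross-tabulation; for example:
# Counts of each sex at each site
table(smammals$siteName, smammals$sex)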
The tidyverse functions filter()
and
select()
are used to give you subsets of rows and columns,
respectively. A common use of filter()
is to subset rows by
a certain condition. For example we can use filter()
to get
all rows where weight
is less than 10.
filter(smammals, weight < 10)
## # A tibble: 14 × 7
## siteName taxonID family sex lifeStage hindfootLength weight
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Smithsonian Conservatio… PELE Crice… M juvenile 19 9
## 2 Smithsonian Conservatio… MUMU Murid… M juvenile 17 8
## 3 Smithsonian Conservatio… MUMU Murid… F juvenile 17 9
## 4 Smithsonian Conservatio… MUMU Murid… M juvenile 17 8
## 5 Smithsonian Conservatio… MUMU Murid… M juvenile 17 9
## 6 Smithsonian Conservatio… MUMU Murid… F juvenile 18 7
## 7 Smithsonian Conservatio… MUMU Murid… M juvenile 17 8
## 8 Smithsonian Conservatio… MUMU Murid… M juvenile 18 7
## 9 Smithsonian Conservatio… MUMU Murid… M juvenile 18 7
## 10 Smithsonian Conservatio… MUMU Murid… F juvenile 18 8
## 11 Smithsonian Conservatio… MUMU Murid… M juvenile 17 7
## 12 Smithsonian Conservatio… MUMU Murid… M juvenile 18 8
## 13 Smithsonian Conservatio… MUMU Murid… F juvenile 16 9
## 14 Harvard Forest MYGA Crice… M juvenile NA 9
And we can use select()
to get columns we want. We do
not need to use quotes on column names within select()
:
select(smammals, taxonID, weight)
## # A tibble: 1,629 × 2
## taxonID weight
## <chr> <dbl>
## 1 MYGA 19
## 2 MYGA 22
## 3 MYGA 23
## 4 MYGA 21
## 5 MYGA 11
## 6 MYGA 17
## 7 MYGA 22
## 8 PELE 36
## 9 PELE 20
## 10 PELE 19
## # ℹ 1,619 more rows
PROTIP: A common “gotcha” is that a function called select() also exists in the widely used statistics package MASS, which is loaded by default when you load many other common modeling packages. You can make sure you are using the correct version of the select() function by specifying dplyr::select().
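For example, this fully qualified call does the same thing as the select() call above, even if MASS happens to be attached:
# Explicitly use the dplyr version of select()
dplyr::select(smammals, taxonID, weight)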
We can calculate summary statistics on the entire data frame.
Summary methods for data frames tell you how many missing values there are.
Many R functions that calculate summary statistics return a
NA
value if there are any missing values in the dataset.
Let’s look at max()
:
x <- c(5.3, 12.2, -8.6)
x
## [1] 5.3 12.2 -8.6
max(x)
## [1] 12.2
x[2] <- NA
x
## [1] 5.3 NA -8.6
max(x)
## [1] NA
Where there’s data, there’s missing data. There are lots of ways to deal with missing data. For now we will do the simplest thing, ignore it.
The R functions mean()
, min()
,
max()
, and sum()
have an argument
na.rm = TRUE
which removes NA
values before
computing the summary statistic. Its default value is FALSE
so you need to explicitly set it to TRUE
if you want to
calculate the summary statistics without the missing values.
max(x, na.rm = TRUE)
## [1] 5.3
Note that na.rm serves a useful purpose: the default of FALSE forces you to notice that your data contain missing values instead of silently ignoring them, so I think it’s a good idea for FALSE to be the default.
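A related aside: sum(is.na(x)) counts how many values in a vector are missing, which is worth checking before you remove them (is.na() is covered in more detail below):
# How many elements of x are missing?
sum(is.na(x))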
Before we start to do more complex operations on data frames, here’s
a little trick to make your code more readable: the pipe operator, or
%>%
. If you look at any example code using tidyverse
packages that you find online, you will see this operator everywhere. It
is used to compose complex expressions that include “chains” of multiple
functions. The pipe is used to “pipe” the output of one function into
the input of the next function.
For instance take this nested function call:
sqrt(log(abs(x)))
We start with a variable called x
, take the absolute
value with abs()
, take the natural log with
log()
, then take the square root with sqrt()
.
But the order in which we do those operations is actually the opposite
of what it appears in the code. We can use the %>%
operator to make this code more readable. Not only are the operations
printed in a logical order, but we can also easily split the code into
multiple lines which is often easier for anyone who is reading your code
to decipher.
x %>%
abs() %>%
log() %>%
sqrt()
PROTIP: It’s really “future you” who is going to be reading the code 99% of the time, so any time and effort you spend making your code more legible is actually saving yourself time in the long run … even if you’re the only one who ever looks at the code!
Let’s calculate summary statistics for entire columns of a data
frame. We’re going to put the %>%
pipe operator into
practice now, to pass the output of one line of code to the next
line.
smammals %>%
summarize(mean_weight = mean(weight),
sd_weight = sd(weight))
## # A tibble: 1 × 2
## mean_weight sd_weight
## <dbl> <dbl>
## 1 NA NA
As you can see this output is not helpful. All it tells us is that
there is at least one missing (NA
) value in the
weight
column. There are two ways we can deal with
this.
One alternative is to subset the data frame using
filter()
, keeping only rows where weight is not missing.
The function is.na(x)
returns a vector of logical values,
TRUE
if x
is missing and FALSE
if
it is not. We negate this with the !
operator to get a
vector that is TRUE
if weight
is not missing.
Then we can calculate the summary statistics. This shows you how to use
the %>%
pipe to do two operations on a data frame
sequentially.
smammals %>%
filter(!is.na(weight)) %>%
summarize(mean_weight = mean(weight),
sd_weight = sd(weight))
## # A tibble: 1 × 2
## mean_weight sd_weight
## <dbl> <dbl>
## 1 30.5 33.7
Or we can use na.rm
as we demonstrated above. The result
should be identical.
smammals %>%
summarize(mean_weight = mean(weight, na.rm = TRUE),
sd_weight = sd(weight, na.rm = TRUE))
## # A tibble: 1 × 2
## mean_weight sd_weight
## <dbl> <dbl>
## 1 30.5 33.7
You often want to calculate summary statistics for each group. We
will use the group_by()
function here to get the mean and
standard deviation of body weight for each taxon ID.
smammals %>%
filter(!is.na(weight)) %>%
group_by(taxonID) %>%
summarize(mean_weight = mean(weight),
sd_weight = sd(weight))
## # A tibble: 17 × 3
## taxonID mean_weight sd_weight
## <chr> <dbl> <dbl>
## 1 CHHI 39.2 13.3
## 2 MIOC 37.4 7.44
## 3 MIPE 39.7 12.8
## 4 MIPI 27 4.06
## 5 MUMU 16.8 6.55
## 6 MYGA 22.5 4.70
## 7 NAIN 23.7 4.27
## 8 NEFL 198. 63.5
## 9 OCNU 20.5 0.707
## 10 PELE 21.3 4.93
## 11 PEMA 21.1 4.54
## 12 PEME 23 NA
## 13 RARA 180 NA
## 14 REHU 13 NA
## 15 REME 14 1.73
## 16 SIHI 95.6 49.6
## 17 ZAHU 15.7 3.23
We can filter()
by more than one condition, and
group_by()
more than one column. We can also put as many
summary statistics in the call to summarize()
as we want.
For example we could get the number of values (using the
n()
function), the mean, and the standard deviation of all
non-missing values of body weight for each combination of sex and life
stage, for one particular species. (PELE
= Peromyscus
leucopus, the white-footed mouse).
smammals %>%
filter(!is.na(weight), taxonID == 'PELE') %>%
group_by(sex, lifeStage) %>%
summarize(n_individuals = n(),
mean_weight = mean(weight),
sd_weight = sd(weight))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
## # A tibble: 8 × 5
## # Groups: sex [3]
## sex lifeStage n_individuals mean_weight sd_weight
## <chr> <chr> <int> <dbl> <dbl>
## 1 F adult 267 22.6 4.71
## 2 F juvenile 16 13.8 2.02
## 3 F subadult 35 16.6 3.56
## 4 F <NA> 2 18 7.07
## 5 M adult 345 22.3 4.17
## 6 M juvenile 19 14.0 2.31
## 7 M subadult 54 16.5 3.20
## 8 <NA> adult 1 34 NA
Notice there are some NA values in the grouping columns. The group_by() function treats those as separate groups. The sd_weight column also contains an NA value where a group has only one individual, because the sample standard deviation of a single value is undefined.
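If you would rather exclude those rows, one option (a sketch, not the only way) is to filter out missing values in the grouping columns as well:
smammals %>%
  filter(!is.na(weight), !is.na(sex), !is.na(lifeStage), taxonID == 'PELE') %>%
  group_by(sex, lifeStage) %>%
  summarize(n_individuals = n(),
            mean_weight = mean(weight),
            sd_weight = sd(weight))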
We can use the pipe %>%
operator to chain an
arbitrarily long sequence of operations together. This makes for more
readable code.
smammals %>%
filter(!is.na(weight), taxonID == 'PELE') %>%
mutate(weight_mg = weight * 1000) %>%
group_by(siteName) %>%
summarize(mean_weight = mean(weight_mg)) %>%
arrange(mean_weight)
## # A tibble: 4 × 2
## siteName mean_weight
## <chr> <dbl>
## 1 Oak Ridge 18435.
## 2 Smithsonian Conservation Biology Institute 20608.
## 3 Harvard Forest 21742.
## 4 Konza Prairie Biological Station 26638.
This is equivalent to
arrange(summarize(group_by(mutate(filter(smammals, !is.na(weight), taxonID == 'PELE'), weight_mg = weight * 1000), siteName), mean_weight = mean(weight_mg)), mean_weight)
but in my opinion much more readable.
I introduced two new functions here. One is the mutate()
function, which can be used to change the value of a column or create a
new column. For instance we can use it to convert the units of weight
from g to mg, making a new column, weight_mg
.
Here I’m also introducing the arrange()
function, which
sorts the data frame in ascending order by a given column or columns.
arrange(-mean_weight)
could be used to sort in descending
order instead.
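For example, the pipeline above could be sorted from heaviest to lightest site by negating the sorting column:
smammals %>%
  filter(!is.na(weight), taxonID == 'PELE') %>%
  mutate(weight_mg = weight * 1000) %>%
  group_by(siteName) %>%
  summarize(mean_weight = mean(weight_mg)) %>%
  arrange(-mean_weight)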
We are almost done with this crash course in working with R data frames. We have two more functions to learn: pivot_longer() and pivot_wider().
This set of examples is borrowed from Hadley Wickham’s book R for Data Science (Chapter 12: Tidy data).
Tidy data is a dataset that is organized so that the following are true:
Each variable has its own column.
Each observation has its own row.
Each value has its own cell.
Most of the model fitting functions in R require the data to be in this format (as do many statistical procedures in SAS and other software).
All of the following data frames contain the same data: number of cases of a disease and total population, for three different countries (Afghanistan, Brazil, and China), in each of two years (1999 and 2000). Which of these data frames contain tidy data?
These are example datasets that come packaged with the tidyr package in R, part of the tidyverse.
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Table 1 is the only “tidy” data frame out of the three. Each row includes all the variables from one observation: a count of disease cases and total population in a country in one specific year constitutes an observation here.
There are two main functions for reshaping data. One is
pivot_wider()
and the other is pivot_longer()
.
If we have data that looks like table2
, and we want to make
it into tidy form like table1
, we will need to use
pivot_wider()
. We need to identify three components:
id_cols: the column(s) which provide the identifying information for each observation, in this case the country and the year.
names_from: the column(s) that contain the labels that will become the names of individual columns in the wider version of the data frame.
values_from: the column(s) that contain the data values that will be spread across multiple columns.
table2_tidied <- pivot_wider(table2, id_cols = c(country, year), names_from = type, values_from = count)
Use the function all.equal()
to confirm that the two are
the same.
all.equal(table2_tidied, table1)
## [1] TRUE
It’s also possible to reshape table3
to
table1
but this requires more complex operations that we
will not cover today.
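(If you are curious, here is a sketch of one possible approach using tidyr’s separate() function, which splits the rate column at the '/' character; we will not use it again in this lesson.)
# Split 'rate' into separate numeric 'cases' and 'population' columns
separate(table3, rate, into = c('cases', 'population'), sep = '/', convert = TRUE)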
A common situation is that we will get data in a wide form from an Excel spreadsheet. When entering data into a spreadsheet manually, it’s much easier to use a wide format because a long skinny table of data is very annoying to scroll through. Wide-form data is often how paper data sheets are formatted. So we often will need to reshape the data to a “longer” format to get an analysis-ready tidy dataset.
Here is an example: the Billboard top 100 song rankings by week for the year 2000. This is another example dataset from the tidyr package . . . read this list of top albums from 2000 if you are a ’90s kid and you want to feel “old as hell”! Use the data frame summary tools you now know to take a look at this dataset.
billboard
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
This CD cover may look familiar to the thirty-somethings reading
this
We want tidy data where each row is an observation: a song identified by its artist name, track name, and the date it entered the top 100 ranking, the week at which the ranking was observed, and the ranking. We have to identify:
cols: the columns that need to be pivoted to a longer format. In this example we can use a shorthand, wk1:wk76, to refer to all columns between wk1 and wk76 inclusive. You can also use the - sign to specify which columns will not be pivoted (the identifying columns for each observation).
names_to: the name of the column into which the column names will be put, as a quoted character string. (If you do not supply this, it will default to name.)
values_to: the name of the column into which the data values in the cells will be put, as a quoted character string. (If you do not supply this, it will default to value.)
pivot_longer(billboard, cols = c(wk1:wk76), names_to = 'week', values_to = 'ranking')
## # A tibble: 24,092 × 5
## artist track date.entered week ranking
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
## 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
## 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
## # ℹ 24,082 more rows
This is identical to the code above, except that here we specify which columns not to pivot.
pivot_longer(billboard, cols = -c(artist, track, date.entered), names_to = 'week', values_to = 'ranking')
## # A tibble: 24,092 × 5
## artist track date.entered week ranking
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
## 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
## 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
## # ℹ 24,082 more rows
Those are very important topics that you will definitely need to learn about if you want to do serious data analysis with R. We don’t have time to cover them today but I would recommend using the R resources I posted on the workshop page to learn more about them. What we did just now only scratches the surface of how to wrangle data in R.
The dataset used for these exercises comes from an experimental study measuring greenhouse gas emissions from experimentally constructed barnyards. The dataset is hosted on Ag Data Commons.
You may combine the results from steps 3-6 into a single “piped”
statement using the %>%
operator.
Read the CSV file at the URL
https://usda-ree-ars.github.io/SEAStats/R_boot_camp/datasets/barnyard_ghg.csv
into R as a data frame.
Examine the data frame. What are the data types of the different
columns? How many different values are there in the Surface
column?
Hint: Use str()
,
summary()
, and/or glimpse()
to examine the
data frame. Use table()
or unique()
for the
Surface
column.
Create a subset of the data frame containing only the observations
where the temperature (Temp
column) is greater than 10.
Remember R is case-sensitive!
Group the data frame by surface type.
Calculate the mean of the CO2_mgC
, CH4_mgC
,
NH3_ugN
, and N2O_ugN
columns.
Sort the result in increasing order of the mean value of
CO2_mgC
.
Load the relig_income
example dataset which has survey
data on the number of individuals of different religions by reported
yearly income. (Use data(relig_income)
to load the
dataset.) Convert this dataset into “tidy” form so that each row has a
religion, income level, and number of individuals.
Hint: Use pivot_longer(). The only column that should not be pivoted is religion.