Lesson 2: Working with data frames
Lesson 2 learning objectives
At the end of this lesson, students will …
Read a CSV file of data into the R workspace.
Know what a data frame is and understand what constitutes “tidy” data.
Do basic manipulations and reshaping of data frames.
Calculate summary statistics by group.
What is a data frame?
A data frame corresponds to a table or spreadsheet of data
Each column has its own data type
Ideally, each row will correspond to an individual observation
Workhorses of data manipulation and analysis
Note about tidyverse
This lesson features functions from the tidyverse group of packages
tidyverse is are a set of packages, designed to work together, that make it easy to wrangle and manipulate data in R
Alternatives would be using base R code or data.table
tidyverse is probably easiest for beginners and very widespread
I recommend starting here and learning other ways as you get better at R
Reading in data
Read small_mammals.csv
from URL where it is hosted
In practice we usually import data from our local drive
Use read_csv()
from the tidyverse package readr
This returns a “tibble,” tidyverse’s name for a special class of data frame
smammals <- read_csv ('https://usda-ree-ars.github.io/SEAStats/R_boot_camp/datasets/small_mammals.csv' )
Small mammal dataset
Body measurements of small mammals from the NEON ecological observatory network
Southern red-backed vole, image by Gordon Johnston
Other ways to summarize data frames
summary()
and str()
from base R
glimpse()
from tidyverse
Try calling them and see what output you get
We can see we have 1629 rows and 7 columns.
Some character columns (chr
) and numeric columns (dbl
for double-precision)
summary(dat)
shows counts of missing (NA
) values
Subsetting values from a data frame
Subset by indexing rows and columns
Use square brackets []
Syntax is dataframe[row, column]
This code returns the 2nd row, 3rd column (actually a 1x1 data frame with that value)
Subsetting entire rows and columns
Get an entire row by leaving the column
part blank
Get entire column by leaving the row
part blank
This gives us the 5th row:
This gives us the 2nd column:
Subsetting with ranges
We can also subset with ranges
For example this gives us the 6th through 10th rows, including all columns:
This gives us the first 5 rows, but only columns 1 and 2.
Subsetting columns by name
You can also get columns by name
Pass the vector of column names using the c()
function, and put quotes around the names
This gives us the 20th row and the columns called taxonID
and sex
.
smammals[20 , c ('taxonID' , 'sex' )]
The $ operator
To extract a single column from a data frame as a vector, you can use the $
operator.
If the data frame column name contains special characters, surround the name with backticks like smammals$`column 10`
.
Passing data frame columns to functions
Use the $
operator to pass a single data frame column to a function as input
For example, let’s look at all unique values in the siteName
column using the unique()
function
unique (smammals$ siteName)
table()
gives you the number of values of each category within a data frame column
Subsetting with tidyverse functions: filter()
The tidyverse function filter()
gives you subsets of rows
A common use of filter()
is to subset rows by a condition
For example we can use filter()
to get all rows where weight
is less than 10
filter (smammals, weight < 10 )
Subsetting with tidyverse functions: select()
select()
gives you subsets of columns
We do not need to use quotes on column names within select()
select (smammals, taxonID, weight)
Calculating summary statistics
Next we’ll see how to get summary statistics for entire data frames and subgroups of data
First we need to learn about how to deal with missing values
Dealing with missing data
Summary methods for data frames tell you how many missing values there are
Many functions that calculate summary statistics return a NA
value if there are any missing values
x <- c (5.3 , 12.2 , - 8.6 )
max (x)
x[2 ] <- NA
max (x)
Ignoring missing values
Where there’s data, there’s missing data
There are lots of ways to deal with missing data; the simplest is ignoring it
Functions mean()
, min()
, max()
, and sum()
have an argument na.rm
If na.rm = TRUE
it removes NA
values before computing the summary statistic
Default is FALSE
so you need to explicitly set it to TRUE
The pipe operator
The pipe operator, or %>%
, “pipes” the output of one function into the input of the next function
Useful for composing expressions with many functions that are complicated but still readable
Nested versus piped
Nested function call:
sqrt(log(abs(x)))
Piped version of the same function call:
x %>%
abs() %>%
log() %>%
sqrt()
Can be split across multiple lines
Order of operations read from top to bottom instead of inside to outside
A lot easier for people, especially “future you,” to read
Using pipe to calculate summary statistics
Summary statistics for entire column of data frame
The pipe %>%
passes dat
to the next line of code
smammals %>%
summarize (mean_weight = mean (weight),
sd_weight = sd (weight))
We get only NA
because there is at least one missing value in the weight
column
Remove missing values with filter()
We can use filter()
to keep only rows where weight is not missing
is.na(x)
returns a vector of logical values, TRUE
if x
is missing and FALSE
if it is not
We negate this with the !
operator to get a vector that is TRUE
if weight
is not missing
Use the %>%
pipe to do two operations on a data frame sequentially.
smammals %>%
filter (! is.na (weight)) %>%
summarize (mean_weight = mean (weight),
sd_weight = sd (weight))
Remove missing values with na.rm
na.rm
argument will also remove missing values
Result will be identical to the previous
smammals %>%
summarize (mean_weight = mean (weight, na.rm = TRUE ),
sd_weight = sd (weight, na.rm = TRUE ))
Calculating summary statistics by group
Use group_by()
before summarize()
Here we remove missing values, group by the taxonID
column, and get mean and SD of body weight for each taxon
smammals %>%
filter (! is.na (weight)) %>%
group_by (taxonID) %>%
summarize (mean_weight = mean (weight),
sd_weight = sd (weight))
More complex operations
filter()
by more than one condition and group_by()
more than one column
Put many summary statistics in summarize()
n()
function means number of values
Example: all non-missing weight values for PELE
(Peromyscus leucopus , the white-footed mouse), grouped by sex and life stage.
smammals %>%
filter (! is.na (weight), taxonID == 'PELE' ) %>%
group_by (sex, lifeStage) %>%
summarize (n_individuals = n (),
mean_weight = mean (weight),
sd_weight = sd (weight))
Notice NA
values in the grouping columns (missing value treated as group)
sd_weight
is NA
if you try to calculate SD of a single value (undefined)
White-footed mouse, image by Peterwchen
More pipe!
Pipe %>%
can chain arbitrarily many functions together
smammals %>%
filter (! is.na (weight), taxonID == 'PELE' ) %>%
mutate (weight_mg = weight * 1000 ) %>%
group_by (siteName) %>%
summarize (mean_weight = mean (weight_mg)) %>%
arrange (mean_weight)
This is equivalent to arrange(summarize(group_by(mutate(filter(smammals)))))
… etc … but much more readable
mutate()
changes the value of a column or creates a new column.
Here we convert the units of weight, making a new column, weight_mg
mutate(weight_mg = weight * 1000)
arrange()
sorts the data frame in ascending order by a given column or columns
arrange(mean_weight)
for ascending order
arrange(-mean_weight)
for descending order
Tidy data and reshaping data frames
Two more functions to learn:
What is tidy data?
If a dataset is tidy,
Each observation has its own row
Each variable has its own column
Each value has its own cell
Most model fitting functions in R require the data to be in this format (and many SAS procedures)
Which of these data frames contain tidy data?
Example datasets from the tidyr package (part of tidyverse)
All contain the same data: number of cases of a disease and total population, for three different countries (Afghanistan, Brazil, and China), in each of two years (1999 and 2000)
Answer
Table 1 is the only “tidy” data frame
Count of disease cases and total population in a country in one specific year is one observation
Each row of Table 1 includes both grouping columns and both variable columns from one observation
Reshaping data: pivot_wider()
pivot_wider()
will turn table2
into table1
Three arguments needed to reshape data from long to wide
id_cols
column(s): identifying information for each observation (country and year). Use c()
and unquoted names.
names_from
column(s): the labels that will become column names in wide data (type)
values_from
column(s): the data that will be spread across multiple columns
table2_tidied <- pivot_wider (table2, id_cols = c (country, year), names_from = type, values_from = count)
all.equal()
confirms that we have recreated table1
all.equal (table2_tidied, table1)
Reshaping table3
to table1
requires more complex operations that we will not cover today
Reshaping data: pivot_longer()
Lots of data begins as wide form in a paper data sheet or Excel spreadsheet
Easier to enter, but needs to be reshaped to long form for analysis
Example: Billboard top 100 song rankings by week for the year 2000 (also an example dataset from the tidyr package)
Use the data frame summary tools you now know to explore this dataset
Arguments of pivot_longer()
cols
: columns that need to be pivoted to a longer format
Shorthand wk1:wk76
: all columns from wk1
to wk76
Or use -
sign to specify which columns will not be pivoted (the identifying columns for each observation)
names_to
: the name of the column into which the column names will be put, as a quoted character string
Defaults to name
if not specified
values_to
: the name of the column into which the data values in the cells will be put, as a quoted character string
Defaults to value
if not specified
Putting it together, …
pivot_longer (billboard, cols = c (wk1: wk76), names_to = 'week' , values_to = 'ranking' )
Alternative, specifying columns not to pivot
pivot_longer (billboard, cols = - c (artist, track, date.entered), names_to = 'week' , values_to = 'ranking' )
Hey! What about …
Joining two or more data frames together
Operations on lists of data frames or data frames containing lists
Check out the resources I posted on the workshop site, especially R for Data Science , to learn these essential parts of data wrangling.