The Small Round Blue Cell Tumors (SRBCT) dataset from Khan et al. 2001 includes the expression levels of 2308 genes on 63 samples. The samples are distributed in four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).
The srbct dataset contains the following:
1 - $gene: data frame with 63 rows and 2,308 columns. The expression levels of 2,308 genes in 63 subjects.
2 - $class: a class vector containing the tumour class of each individual (4 classes in total).
3 - $gene.name: data frame with 2,308 rows and 2 columns containing further information on the genes.
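The structure described above can be checked directly in R. This is a minimal sketch that only assumes the mixOmics package is installed; it uses base R functions (dim, table, head) to inspect the three elements.

```r
library(mixOmics)  # provides the srbct dataset
data(srbct)

# Expression data: samples in rows, genes in columns
# (63 x 2308 according to the description above)
dim(srbct$gene)

# Number of samples in each tumour class
table(srbct$class)

# Gene annotation: one row per gene
dim(srbct$gene.name)
head(srbct$gene.name)
```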
In this practical we illustrate multivariate analysis in a supervised context, but we first carry out a preliminary investigation of the gene expression data with PCA.
The aim of the supervised analysis is to select the genes that can help predict the class of the samples.
We will give you the code to run the methods and ask you some questions about the interpretation of the graphical or numerical outputs. After the questions, we give some examples of R code to tweak to your liking. Use ?NameOfFunction to list the arguments available for a function and understand what they do. Some advanced code is also provided if you would like to go much further in the analyses.
Let us first get an initial understanding of the data with a PCA.
library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
##
## Loaded mixOmics 6.24.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us: citation('mixOmics')
data(srbct)
# The gene expression data
X <- srbct$gene
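dim(X)  # number of samples (rows) and genes (columns)
# PCA on centred and scaled data, keeping the first 10 components
pca.srbct <- pca(X, ncomp = 10, center = TRUE, scale = TRUE)
pca.srbct
# Bar plot of the variance explained by each component
plot(pca.srbct)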
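The explained variance shown in the bar plot can also be read as numbers. The sketch below is a cross-check that swaps in base R's prcomp() for the mixOmics pca object, with the same centring and scaling; the names pca.base and prop.var are illustrative only, and the values should be close to, but are not taken from, the mixOmics output.

```r
library(mixOmics)  # only needed here for the srbct dataset
data(srbct)
X <- srbct$gene

# Base R PCA with the same centring and scaling as the pca() call
pca.base <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of the total variance explained by each component
prop.var <- pca.base$sdev^2 / sum(pca.base$sdev^2)

# Individual and cumulative proportions for the first 10 components
round(prop.var[1:10], 3)
round(cumsum(prop.var)[1:10], 3)
```

A sharp drop in the individual proportions, or a plateau in the cumulative ones, is one common way to decide how many components to keep.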
Q1. How many principal components would you choose, and why?
Q2. Using the function plotIndiv() on the PCA object, represent the data in the space spanned by components 1 and 2, and also by components 1 and 3.
Q3. Going further. PCA is an unsupervised approach, but coloring the patients according to their tumour classes can help the interpretation. You can use the argument group to indicate a known grouping of the samples, and the argument ellipse to draw confidence ellipses. Comment on the plot obtained.
Q4. A cool option, but it requires the installation of the rgl library: with the argument style you can draw a 3D plot instead and rotate the box with your mouse. Use this option sparingly; cool does not necessarily mean meaningful.
Q5. Use the function plotVar() to represent the correlation circle plot. You can also use the arguments var.names, cutoff, etc. Comment on what you see, or what you don't see!
Sample plot:
plotIndiv(pca.srbct, comp = c(1,2), group = srbct$class, ind.names = FALSE,
legend = TRUE, title = 'SRBCT, PCA comp 1 - 2')
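The same call can be adapted to components 1 and 3 and to confidence ellipses. This is a sketch under the assumption that the PCA is refitted as earlier in the practical; only the comp, title and ellipse arguments differ from the sample plot above.

```r
library(mixOmics)
data(srbct)

# Same PCA as earlier in the practical
pca.srbct <- pca(srbct$gene, ncomp = 10, center = TRUE, scale = TRUE)

# Components 1 and 3, samples colored by tumour class,
# with a confidence ellipse drawn for each class
plotIndiv(pca.srbct, comp = c(1, 3), group = srbct$class, ind.names = FALSE,
          ellipse = TRUE, legend = TRUE, title = 'SRBCT, PCA comp 1 - 3')
```

Comparing this plot with the one for components 1 and 2 shows whether the third component separates classes that overlap on the first two.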
Sample plot in 3D with specific colors (optional: requires the installation of the rgl library):
# One color code per sample, derived from the tumour class
col.srbct <- as.numeric(srbct$class)
plotIndiv(pca.srbct, col = col.srbct, style = '3d', ind.names = FALSE)
Correlation circle plot:
plotVar(pca.srbct, cex = 0.9)
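With 2,308 genes the correlation circle is crowded, so the cutoff and var.names arguments can help. The sketch below assumes a cutoff of 0.5, which is an arbitrary illustrative value and not a recommendation from this practical; if no gene passes it, lower the value. The object cors is a hypothetical helper that recomputes the gene-component correlations with base R to count how many genes exceed the cutoff before plotting.

```r
library(mixOmics)
data(srbct)
pca.srbct <- pca(srbct$gene, ncomp = 10, center = TRUE, scale = TRUE)

# Correlation of each gene with components 1 and 2 (base R)
cors <- cor(srbct$gene, pca.srbct$variates$X[, 1:2])

# Number of genes whose absolute correlation with component 1 or 2
# exceeds the (arbitrary) cutoff of 0.5
sum(apply(abs(cors), 1, max) > 0.5, na.rm = TRUE)

# Correlation circle restricted to those genes, drawn as points
# rather than gene names to keep the plot readable
plotVar(pca.srbct, comp = c(1, 2), cutoff = 0.5, var.names = FALSE)
```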