The data

The Small Round Blue Cell Tumors (SRBCT) dataset from Khan et al. (2001) includes the expression levels of 2,308 genes measured on 63 samples. The samples are distributed in four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).

The srbct dataset contains the following:

1 - $gene: data frame with 63 rows and 2,308 columns containing the expression levels of 2,308 genes in 63 subjects.

2 - $class: a class vector containing the tumour class of each individual (4 classes in total).

3 - $gene.name: data frame with 2,308 rows and 2 columns containing further information on the genes.
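
The three components listed above can be checked directly in R before starting any analysis. This is a minimal sketch that only assumes the mixOmics package is installed; the object name class.counts is ours and is used for illustration only.

```r
library(mixOmics)
data(srbct)

# Names of the components stored in the srbct list
names(srbct)

# Expression data: samples in rows, genes in columns
dim(srbct$gene)

# Number of samples in each tumour class
class.counts <- table(srbct$class)
class.counts

# First rows of the gene annotation table
head(srbct$gene.name)
```

The class counts should agree with the distribution given in the description of the data.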

In this practical we illustrate multivariate analysis in a supervised context, but we will first start with a preliminary investigation of the gene expression data using PCA.

The aim of this analysis is to select the genes that can help predict the class of the samples.

How does this practical work?

We will give you the code to run the methods and ask you some questions about the interpretation of the graphical or numerical outputs. After the questions, we give some examples of R code that you can tweak to your liking. Type ?NameOfFunction to see the list of arguments available for a function and understand what they do. Some advanced code is also provided if you would like to go further in the analyses.
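
As an illustration of the ?NameOfFunction syntax, here is a small sketch using pca(), the function used in the next section. The call to args() is our addition: it is a base R shortcut that prints the arguments and their default values without opening the help page.

```r
library(mixOmics)

# Open the help page of pca(): description, arguments and examples
?pca

# Print the arguments of pca() and their default values in the console
args(pca)
```

The same approach works for the plotting functions used below, for example ?plotIndiv or ?plotVar.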

Preliminary analysis with PCA

Let us first get an initial understanding of the data with a PCA.

library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Loaded mixOmics 6.24.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us:  citation('mixOmics')
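# Load the SRBCT data shipped with mixOmics
data(srbct)
# The gene expression data (63 samples in rows, 2308 genes in columns)
X <- srbct$gene
dim(X)
# PCA with 10 components on centred and scaled genes
pca.srbct <- pca(X, ncomp = 10, center = TRUE, scale = TRUE)
# Print a summary of the PCA, including the variance explained
pca.srbct
# Bar plot of the variance explained by each component
plot(pca.srbct)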
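
To make the bar plot easier to read, the proportion of variance explained by each component can also be recomputed by hand from the component scores. This is only a sketch under our own assumptions: it relies on the scores being stored in pca.srbct$variates$X, and the names comp.var, total.var and prop.var are ours. The values should be close to those reported when printing pca.srbct.

```r
library(mixOmics)
data(srbct)
X <- srbct$gene
pca.srbct <- pca(X, ncomp = 10, center = TRUE, scale = TRUE)

# Variance of the sample scores on each of the 10 components
comp.var <- apply(pca.srbct$variates$X, 2, var)

# Total variance of the centred and scaled data (one unit per gene)
total.var <- sum(apply(scale(X, center = TRUE, scale = TRUE), 2, var))

# Proportion of variance explained per component, and cumulated
prop.var <- comp.var / total.var
round(prop.var, 3)
round(cumsum(prop.var), 3)
```

A sharp drop in prop.var after a given component (an "elbow") is one common, informal argument for keeping only the components before the drop.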

Q1. How many principal components would you choose, and why?

Q2. Using the function plotIndiv() on the PCA object, represent the samples in the space spanned by components 1 and 2, and also by components 1 and 3.

Q3. Going further. PCA is an unsupervised approach, but colouring the patients according to their tumour classes can help the interpretation. You can use the argument group to indicate a known grouping of the samples, and the argument ellipse to add confidence ellipses for each group. Comment on the plot obtained.

Q4. A cool option, but it requires the installation of the rgl library: using the argument style, we can draw a 3D plot instead and rotate the box with the mouse. Use this option sparingly, cool does not necessarily mean meaningful.

Q5. Use the function plotVar() to represent the correlation circle plot. You can also use the arguments var.names, cutoff, etc. Comment on what you see, or what you don't see!

Sample plot:

plotIndiv(pca.srbct, comp = c(1,2), group = srbct$class, ind.names = FALSE,
          legend = TRUE, title = 'SRBCT, PCA comp 1 - 2')
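
The same call can be adapted to components 1 and 3 and to confidence ellipses, as asked in Q2 and Q3. This is a sketch of one possible variant rather than the expected answer; only the values passed to comp, ellipse and title differ from the code above.

```r
library(mixOmics)
data(srbct)
pca.srbct <- pca(srbct$gene, ncomp = 10, center = TRUE, scale = TRUE)

# Components 1 and 3, samples coloured by tumour class,
# with one confidence ellipse drawn per class
plotIndiv(pca.srbct, comp = c(1, 3), group = srbct$class, ind.names = FALSE,
          ellipse = TRUE, legend = TRUE, title = 'SRBCT, PCA comp 1 - 3')
```

Since PCA does not use the class labels, overlapping ellipses simply indicate that the main sources of variation in the data are not driven by the tumour classes.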

Sample plot in 3D with specific colors (Optional: will require the installation of the rgl library):

# Convert each tumour class to an integer colour code
col.srbct <- as.numeric(srbct$class)
plotIndiv(pca.srbct, col = col.srbct, style = '3d', ind.names = FALSE)

Correlation circle plot:

plotVar(pca.srbct, cex = 0.9)
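
With 2,308 genes the correlation circle is very crowded. A possible variant, given here as a sketch, uses the arguments cutoff and var.names mentioned in Q5. The cutoff of 0.5 is an arbitrary value chosen for illustration, not a recommendation.

```r
library(mixOmics)
data(srbct)
pca.srbct <- pca(srbct$gene, ncomp = 10, center = TRUE, scale = TRUE)

# Show only the genes whose correlation with component 1 or 2
# exceeds the cutoff, drawn as points rather than names
plotVar(pca.srbct, comp = c(1, 2), cutoff = 0.5, var.names = FALSE,
        title = 'SRBCT, PCA comp 1 - 2, cutoff 0.5')
```

Raising the cutoff keeps fewer genes on the plot; if it is set too high, no gene may remain to be displayed.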