The Small Round Blue Cell Tumors (SRBCT) dataset from Khan et al. 2001 includes the expression levels of 2308 genes on 63 samples. The samples are distributed in four classes as follows: 8 Burkitt Lymphoma (BL), 23 Ewing Sarcoma (EWS), 12 neuroblastoma (NB), and 20 rhabdomyosarcoma (RMS).
The srbct dataset contains the following:
1 - $gene: a data frame with 63 rows and 2,308 columns, containing the expression levels of 2,308 genes in 63 subjects.
2 - $class: a factor indicating the tumour class of each individual (4 classes in total).
3 - $gene.name: a data frame with 2,308 rows and 2 columns containing further information on the genes.
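A quick way to check this structure in R is sketched below (the library() and data() calls are repeated later in the practical, so this preliminary check is optional):
library(mixOmics)            # loads the package and gives access to the srbct dataset
data(srbct)
str(srbct, max.level = 1)    # a list with $gene, $class and $gene.name
dim(srbct$gene)              # 63 samples x 2308 genes
summary(srbct$class)         # number of samples per tumour class
head(srbct$gene.name)        # annotation of the 2,308 genes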
In this Practical we give an illustration of multivariate analysis in a supervised context, but we first start with a preliminary investigation of the gene expression data using PCA.
The aim of this analysis is to select the genes that can help predict the class of the samples.
We will give you the code to run the methods and ask you some questions pertaining to the interpretation of the graphical or numerical outputs. We will give you some examples of R code to tweak to your liking after the questions. Use ?NameOfFunction to list the different arguments available and understand what they do. Some advanced code is also provided if you would like to go much further in the analyses.
Let us first get an initial understanding of the data with a PCA.
library(mixOmics)
## Loading required package: MASS
## Loading required package: lattice
## Loading required package: ggplot2
##
## Loaded mixOmics 6.24.0
## Thank you for using mixOmics!
## Tutorials: http://mixomics.org
## Bookdown vignette: https://mixomicsteam.github.io/Bookdown
## Questions, issues: Follow the prompts at http://mixomics.org/contact-us
## Cite us: citation('mixOmics')
data(srbct)
# The gene expression data
X <- srbct$gene
dim(X)
# run a PCA with 10 components; centre and scale the genes
pca.srbct <- pca(X, ncomp = 10, center = TRUE, scale = TRUE)
pca.srbct        # summary of the PCA fit
plot(pca.srbct)  # screeplot of the explained variance per component
Q1. How many principal components would you choose, and why?
Q2. Using the function plotIndiv() on the PCA object, represent the data in the space spanned by components 1 and 2, and also components 1 and 3.
Q3. Going further. PCA is an unsupervised approach, but coloring the patients according to their tumour classes can help the interpretation. You can use the argument group to indicate a known grouping of the samples, and ellipse to plot confidence ellipses. Comment on the plot obtained.
Q4. A cool option, but it requires the installation of the library rgl. We can use a 3D plot instead and rotate the box with the mouse, using the argument style. Use this option sparingly, cool does not necessarily mean meaningful…
Q5. Use the function plotVar() to represent the correlation circle plot. You can also use the arguments var.names, cutoff, etc. Comment on what you see, or what you don't see!
Sample plot:
plotIndiv(pca.srbct, comp = c(1,2), group = srbct$class, ind.names = FALSE,
legend = TRUE, title = 'SRBCT, PCA comp 1 - 2')
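For Q2 and Q3, the same call can be adapted to display components 1 and 3 and to add confidence ellipses; a sketch to tweak:
plotIndiv(pca.srbct, comp = c(1,3), group = srbct$class, ind.names = FALSE,
ellipse = TRUE, legend = TRUE, title = 'SRBCT, PCA comp 1 - 3')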
Sample plot in 3D with specific colors (optional: this will require the installation of the rgl library):
col.srbct = as.numeric(srbct$class)
plotIndiv(pca.srbct, col = col.srbct, style = '3d', ind.names = FALSE)
Correlation circle plot:
plotVar(pca.srbct, cex = 0.9)
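For Q5, a possible variant uses the cutoff argument so that only the variables most correlated with the components are displayed (the 0.7 threshold below is an arbitrary choice to adjust):
plotVar(pca.srbct, cutoff = 0.7, cex = 0.9)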
The PLS-DA and sPLS-DA analyses below will help refine the clusters of samples in a supervised fashion.
We first set up the data and make sure that the \(Y\) response is a factor. \(X\) is the gene expression data set.
X <- srbct$gene
Y <- srbct$class
summary(Y)
## EWS BL NB RMS
## 23 8 12 20
dim(X)
## [1] 63 2308
length(Y) # always check you have the right dimensions
## [1] 63
srbct.plsda <- plsda(X, Y, ncomp = 3)
Q6. Plot the samples in the space spanned by components 1 and 2 using the plotIndiv() function. Compare to the sample plot obtained with PCA and comment. Note that by default the function will automatically assign samples to colored groups; can you work out why?
Q7. Is the third component informative?
Sample plot:
plotIndiv(srbct.plsda, comp = c(1,2),
ind.names = FALSE,
legend = TRUE, title = 'SRBCT, PLSDA comp 1 - 2')
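To help answer Q7, the same plot can be drawn for components 1 and 3 (a sketch):
plotIndiv(srbct.plsda, comp = c(1,3),
ind.names = FALSE,
legend = TRUE, title = 'SRBCT, PLSDA comp 1 - 3')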
We decide to run an sPLS-DA with a specific number of variables to select on each component, for example (50, 40, 30) on components 1 to 3 respectively. Note that this is an arbitrary choice; we usually use performance evaluation and cross-validation to choose the optimal hyperparameters. See the section below for more advanced analyses.
select.keepX = c(50, 40, 30)
splsda.srbct <- splsda(X, Y, ncomp = 3, keepX = select.keepX)
Q8. Compare and comment on the sample plot obtained between a PLS-DA (above) and a sPLS-DA.
Q9. Is the third component essential in sPLS-DA to explain one sample group?
Q10. Look at the correlation circle plot using plotVar() and comment briefly.
Q11. The function plotLoadings() displays the genes selected on each component and colors them according to the sample group in which, on average, their expression values are highest. Use this plot and specify the arguments comp, contrib = 'max' and method = 'mean' (method = 'median' is also available for skewed data). Interpret this plot together with the sample plot obtained earlier. Does the plotLoadings output make sense? Interpret the plotLoadings on component 1.
Sample plot:
plotIndiv(splsda.srbct, comp = c(1,2),
ind.names = FALSE,
legend = TRUE,
title = 'SRBCT, sPLSDA comp 1 - 2')
Correlation circle plot (we specify the argument var.names as the first 5 characters of the gene names for improved visibility):
plotVar(splsda.srbct, comp = c(1,2),
var.names = list(substr(srbct$gene.name[, 2], 1, 5)), cex = 3)
Loading plot (you can display a smaller number of genes with the argument ndisplay):
plotLoadings(splsda.srbct, comp = 2, method = 'mean', contrib = 'max')
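Since Q11 asks for an interpretation on component 1, the same call can be run with comp = 1 (ndisplay is optional and only limits the number of genes shown):
plotLoadings(splsda.srbct, comp = 1, method = 'mean', contrib = 'max', ndisplay = 25)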
We can assess the predictive performance of PLS-DA and sPLS-DA with the perf() function applied to a PLS-DA or a sPLS-DA object.
Input arguments. You will need to specify the arguments for cross-validation: here we propose 5-fold validation (folds = 5) and 10 CV repeats (nrepeat = 10).
Outputs. The classification error rates that are output include the overall error rate, as well as the balanced error rate (BER) that weights up the groups of samples that are in the minority (see lecture notes). You can also examine the error rate per class. We can use different types of prediction distances (maximum, centroid, Mahalanobis), and they vary in their performance.
Evaluate and compare the performance of your PLS-DA and sPLS-DA models:
Q12. Which performance measure would you choose for this specific data set, the overall error rate or the BER, and why? Also look at the error rate per class.
Q13. If you rerun the perf() function a second time, you may obtain different results. Can you explain why?
Q14. Comment on the choice of the number of folds and repeats. Do you think these are appropriate for these data?
Example of performance evaluation code on sPLS-DA:
# example of performance evaluation code on sPLS-DA
set.seed(34) # remove the seed for your own study!
perf.splsda <- perf(splsda.srbct, folds = 5, validation = "Mfold",
progressBar = FALSE, nrepeat = 10)
# perf.srbct # lists the different outputs
perf.splsda$error.rate
## $overall
## max.dist centroids.dist mahalanobis.dist
## comp1 0.414285714 0.431746032 0.43174603
## comp2 0.204761905 0.166666667 0.17936508
## comp3 0.001587302 0.004761905 0.02063492
##
## $BER
## max.dist centroids.dist mahalanobis.dist
## comp1 0.519927536 0.42473732 0.42473732
## comp2 0.284945652 0.15894928 0.17115036
## comp3 0.001086957 0.00326087 0.01628623
# error rate per class
perf.splsda$error.rate.class
## $max.dist
## comp1 comp2 comp3
## EWS 0.01304348 0.03478261 0.004347826
## BL 0.65000000 0.45000000 0.000000000
## NB 0.96666667 0.57500000 0.000000000
## RMS 0.45000000 0.08000000 0.000000000
##
## $centroids.dist
## comp1 comp2 comp3
## EWS 0.4347826 0.1391304 0.01304348
## BL 0.2625000 0.1250000 0.00000000
## NB 0.6166667 0.1416667 0.00000000
## RMS 0.3850000 0.2300000 0.00000000
##
## $mahalanobis.dist
## comp1 comp2 comp3
## EWS 0.4347826 0.1304348 0.04347826
## BL 0.2625000 0.0875000 0.00000000
## NB 0.6166667 0.2166667 0.01666667
## RMS 0.3850000 0.2500000 0.00500000
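For the PLS-DA versus sPLS-DA comparison requested above, an analogous call can be made on the PLS-DA object fitted earlier; a minimal sketch (perf.plsda is just a name chosen here):
set.seed(34) # remove the seed for your own study!
perf.plsda <- perf(srbct.plsda, folds = 5, validation = "Mfold",
progressBar = FALSE, nrepeat = 10)
perf.plsda$error.rate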
We can have a look at how often the variables are repeatedly selected across the different CV folds, for example on the first two components. This information is important to answer the question: 'How much should I trust the reproducibility of my signature?'.
par(mfrow=c(1,2))
#this is for comp 1
plot(perf.splsda$features$stable[[1]], type = 'h',
xlab = 'variables selected across CV folds', ylab = 'Stability frequency')
title(main = 'Feature stability for comp = 1')
# for comp2
plot(perf.splsda$features$stable[[2]], type = 'h',
xlab = 'variables selected across CV folds', ylab = 'Stability frequency')
title(main = 'Feature stability for comp = 2')
par(mfrow=c(1,1))
The function selectVar() outputs the variables selected for a given component and their loading values (ranked in decreasing absolute value). We can concatenate those results with the feature stability. Interestingly, on the first component, the genes that are selected in the final model are not necessarily the most stable when we subsample the data! Try it out by changing the code below.
# just the head of the selectVar output:
head(selectVar(splsda.srbct, comp = 2)$value)
## value.var
## g1389 -0.4483063
## g1954 -0.3623933
## g246 -0.3615745
## g545 -0.3399198
## g1319 -0.2542295
## g2050 0.2413160
# save the name of selected var + stability from perf:
select.name <- selectVar(splsda.srbct, comp = 2)$name
stability <- perf.splsda$features$stable[[2]][select.name]
# just the head of the stability of the selected var:
head(cbind(selectVar(splsda.srbct, comp = 2)$value, stability))
## value.var Var1 Freq
## g1389 -0.4483063 g1389 0.46
## g1954 -0.3623933 g1954 0.44
## g246 -0.3615745 g246 0.44
## g545 -0.3399198 g545 0.44
## g1319 -0.2542295 g1319 0.44
## g2050 0.2413160 g2050 0.40
Q15. Comment on those outputs.
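To go further with the interpretation, the selected variable names can also be matched to the gene descriptions stored in srbct$gene.name. The sketch below assumes, as in the plotVar() call above, that the rows of srbct$gene.name are in the same order as the columns of X:
# gene descriptions of the variables selected on component 2
select.name <- selectVar(splsda.srbct, comp = 2)$name
gene.info <- srbct$gene.name[match(select.name, colnames(X)), ]
head(data.frame(gene = select.name, gene.info))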
There are different parameters to choose in PLS-DA and sPLS-DA. One is the number of components ncomp, which can be chosen based on the predictive performance (there is no need to add more components if they do not improve classification). The other, for sPLS-DA only, is the number of variables to select per component, keepX, which is also chosen by evaluating the prediction performance.
One practical way to choose the number of components is to first run a PLS-DA model with a large number of components (e.g. ncomp = 10) using repeated cross-validation (here folds = 5 and nrepeat = 10), then use the function perf(), which evaluates the performance of the model per component. This step allows us to:
1 - choose the prediction distance that minimises the classification error rate, and
2 - choose the optimal number of components. Our experience has shown that ncomp = K-1, where K is the number of classes, is often optimal, but this is highly dependent on the data.
Details about the code below. To speed up the computations we have set folds = 5 in the cross-validation (10 would probably be best) and we have set the seed to obtain the same results from one computer/analysis to another. The argument nrepeat indicates the number of cross-validation repeats, and the performances are averaged across those repeats. Ideally you should choose the folds value so that the learning set is large enough and the test set includes \(\sim\) 5 samples or more. Also consider increasing nrepeat when folds is small. Alternatively, use leave-one-out cross-validation with validation = 'loo' (nrepeat is then not needed).
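A sketch of the leave-one-out alternative on the current PLS-DA model (each sample is left out in turn, hence no nrepeat; perf.loo is just a name chosen here):
perf.loo <- perf(srbct.plsda, validation = 'loo', progressBar = FALSE)
perf.loo$error.rate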
# this chunk takes ~ 4-5 min to run, you can skip that part and
# load the .RData provided instead in RData/
srbct.plsda <- plsda(X, Y, ncomp = 10)
perf.srbct.plsda <- perf(srbct.plsda, validation = "Mfold", folds = 5,
progressBar = FALSE, nrepeat = 10)
# perf.srbct.plsda #will list the different outputs
# perf.srbct.plsda$error.rate #outputs the error rate for each comp and each distance
#save(perf.srbct.plsda, file = 'data/SRBCT-perf-PLSDA.RData')
# to gain some computing time on the tuning, directly load the data
load('data/SRBCT-perf-PLSDA.RData')
Below is the plot of the classification error rate averaged across the 5 folds and the 10 repeated CV runs for all prediction distances. BER stands for balanced error rate, which accounts for an unbalanced number of samples per class. We can choose ncomp = 3 or 4 (depending on the standard deviation error bars) and the Mahalanobis distance.
# sd to show the error bars across the repeats
plot(perf.srbct.plsda, overlay = 'measure', sd = TRUE)
We estimate the classification error rate with respect to the number of selected variables in the model with the function tune.splsda(). The tuning is performed one component at a time inside the function and the optimal number of variables to select is automatically retrieved for each component. We set ncomp = 4 and use 5-fold cross-validation (folds = 5) repeated 10 times.
The following code might take a few minutes to run; you can skip it and load the provided .RData instead. Note that for a thorough tuning step, the following code should be repeated 10 to 50 times and the error rate averaged across the runs.
# this chunk takes ~ 6 min to run, load the .RData provided instead.
# Some convergence issues may arise but it is ok (run on CV)
# grid of possible keepX values that will be tested for comp 1 and comp 2
list.keepX <- c(1:10, seq(20, 100, 10))
tune.splsda.srbct <- tune.splsda(X, Y, ncomp = 4, validation = 'Mfold', folds = 5,
progressBar = FALSE, dist = 'mahalanobis.dist',
test.keepX = list.keepX, nrepeat = 10)
# mean error rate across all CV folds and nrepeats
error <- tune.splsda.srbct$error.rate
# optimal keepX achieving lowest error rate
select.keepX <- tune.splsda.srbct$choice.keepX
#save(error, tune.splsda.srbct, select.keepX, list.keepX, file = 'data/SRBCT-tune-sPLSDA.RData')
# to gain some computing time on the tuning, directly load the data
load('data/SRBCT-tune-sPLSDA.RData')
The following command line outputs the mean error rate for each component given the past (tuned) components. This output globally shows that 3 components are sufficient to achieve the lowest classification error rate in the sparse model:
# just a head of the classif error rate per keepX and comp
head(tune.splsda.srbct$error.rate)
## comp1 comp2 comp3 comp4
## 1 0.5070924 0.2456793 0.07234601 0.01601449
## 2 0.4557609 0.2236866 0.05755435 0.01592391
## 3 0.4345562 0.2138768 0.04454710 0.01592391
## 4 0.4290942 0.1927264 0.03482790 0.01592391
## 5 0.4016304 0.1865942 0.02673007 0.01384058
## 6 0.3920290 0.1840036 0.03048007 0.01384058
We display the mean classification error rate on each component; bear in mind that each component is conditional on the previous components, calculated with the optimal number of selected variables. The circle indicates the best keepX value to achieve the lowest error rate per component. Given those outputs, do you think we should include the 4th component in the model?
# the optimal number of components according to our one-sided t-tests
tune.splsda.srbct$choice.ncomp
## $ncomp
## [1] 3
##
## $values
## comp1 comp2 comp3 comp4
## nrep.1 0.4054348 0.06548913 0.00000000 0.01086957
## nrep.2 0.4068841 0.09420290 0.00000000 0.00000000
## nrep.3 0.4073370 0.16757246 0.02173913 0.01086957
## nrep.4 0.3978261 0.10923913 0.03333333 0.00000000
## nrep.5 0.3497283 0.13007246 0.02083333 0.00000000
## nrep.6 0.3543478 0.14719203 0.04211957 0.01086957
## nrep.7 0.3800725 0.12545290 0.00000000 0.02336957
## nrep.8 0.2509058 0.14094203 0.01250000 0.00000000
## nrep.9 0.4272645 0.12219203 0.01086957 0.02173913
## nrep.10 0.2596920 0.18559783 0.00000000 0.00000000
#the optimal keepX parameter according to minimal error rate
tune.splsda.srbct$choice.keepX
## comp1 comp2 comp3 comp4
## 20 100 40 70
# argument option to show the optimal keepX values
# sd to show the error bars across the repeats
plot(tune.splsda.srbct, optimal = TRUE, sd = TRUE)
Of course this result will depend on how fine your tuning grid list.keepX is, as well as on the values chosen for folds and nrepeat. Therefore, it is interesting to assess the performance of the final model, as well as to examine the stability of the selected variables across the different folds.
The graphic above shows that the error rate decreases when more components are included in sPLS-DA. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 and 100). This type of graph helps choose not only the 'optimal' number of variables to select but also to confirm the number of components ncomp. Indeed, when a sufficient number of components have been added, the error rate stops decreasing. In this specific example, the addition of a fourth component is probably not necessary.
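To assess the performance of the final model mentioned above, one possibility is to re-fit an sPLS-DA with the tuned ncomp and keepX values and run perf() on it; a sketch (object names are ours):
ncomp.opt <- tune.splsda.srbct$choice.ncomp$ncomp   # 3 components here
splsda.final <- splsda(X, Y, ncomp = ncomp.opt,
keepX = select.keepX[1:ncomp.opt])
perf.final <- perf(splsda.final, folds = 5, validation = "Mfold",
progressBar = FALSE, nrepeat = 10)
perf.final$error.rate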
We artificially create an ‘external’ test set on which we want to predict the class membership:
# set the seed for reproducibility purposes during this workshop
set.seed(33)
train <- sample(1:nrow(X), 50) # randomly select 50 samples in the training set
test <- setdiff(1:nrow(X), train) # rest is part of the test set
# store matrices into training and test set:
X.train <- X[train, ]
X.test <- X[test,]
Y.train <- Y[train]
Y.test <- Y[test]
# check dimensions are OK:
dim(X.train)
## [1] 50 2308
dim(X.test)
## [1] 13 2308
We assume here that the tuning step was performed on the training set only¹. We also assume that, based on a thorough tuning, we would need ncomp = 3 and keepX = c(20,30,40):
splsda.srbct.train <- splsda(X.train, Y.train, ncomp = 3, keepX = c(20,30,40))
We now apply the trained model on the test set and we specify the
prediction distance, for example mahalanobis.dist
(see also
?predict.splsda
):
splsda.srbct.predict <- predict(splsda.srbct.train, X.test,
dist = "mahalanobis.dist")
The object $class outputs the predicted classes of the test individuals for each of the 3 components, conditionally on the previous components. We can compare the predictions to the real classes (note: in a real application case you may never know the true class). What do you think about the predictive ability of the model? Does the addition of components increase the prediction accuracy?
# just the head:
head(data.frame(splsda.srbct.predict$class, Truth = Y.test))
## mahalanobis.dist.comp1 mahalanobis.dist.comp2 mahalanobis.dist.comp3
## EWS.T7 EWS EWS EWS
## EWS.T15 EWS EWS EWS
## EWS.C8 EWS EWS EWS
## EWS.C10 EWS EWS EWS
## BL.C8 BL BL BL
## NB.C6 NB NB NB
## Truth
## EWS.T7 EWS
## EWS.T15 EWS
## EWS.C8 EWS
## EWS.C10 EWS
## BL.C8 BL
## NB.C6 NB
# compare prediction on the third component
table(splsda.srbct.predict$class$mahalanobis.dist[,3], Y.test)
## Y.test
## EWS BL NB RMS
## BL 0 1 0 0
## EWS 4 0 0 0
## NB 0 0 1 0
## RMS 0 0 0 7
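From this confusion table we can also compute the overall prediction accuracy by hand; a small sketch (we re-level the predictions so that the rows and columns of the table are aligned before taking the diagonal):
pred.comp3 <- factor(splsda.srbct.predict$class$mahalanobis.dist[, 3],
levels = levels(Y.test))
conf.mat <- table(Prediction = pred.comp3, Truth = Y.test)
sum(diag(conf.mat)) / sum(conf.mat)   # proportion of correctly classified test samples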
The object $predict outputs the predicted dummy scores assigned to each test observation for each class level and for a given component (as explained in detail in Rohart et al. 2017). The final prediction call is made based on this matrix and the distance that is specified. You may now understand why the outcome \(Y\) is coded as a dummy matrix in PLS-DA!
We output the confidence measure (predicted scores) for the 3rd component. The columns represent the different class labels. Compare this output with the final prediction call obtained earlier with $class.
# on component 3, just the head:
head(splsda.srbct.predict$predict[, , 3])
## EWS BL NB RMS
## EWS.T7 1.26848551 -0.05273773 -0.24070902 0.024961232
## EWS.T15 1.15058424 -0.02222145 -0.11877994 -0.009582845
## EWS.C8 1.25628411 0.05481026 -0.16500118 -0.146093198
## EWS.C10 0.83995956 0.10871106 0.16452934 -0.113199949
## BL.C8 0.02431262 0.90877176 0.01775304 0.049162580
## NB.C6 0.06738230 0.05086884 0.86247360 0.019275265
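To see the link between these scores and the final call, note that the maximum distance simply assigns each test sample to the class with the highest predicted dummy score. The sketch below reproduces what predict() with dist = 'max.dist' would return on component 3 (here we requested the Mahalanobis distance, so the calls may occasionally differ):
scores.comp3 <- splsda.srbct.predict$predict[, , 3]
manual.call <- colnames(scores.comp3)[apply(scores.comp3, 1, which.max)]
head(data.frame(manual.call, Truth = Y.test, row.names = rownames(scores.comp3)))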
sessionInfo()
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] mixOmics_6.24.0 ggplot2_3.4.4 lattice_0.21-8 MASS_7.3-60
##
## loaded via a namespace (and not attached):
## [1] tidyr_1.3.0 sass_0.4.7 utf8_1.2.4
## [4] generics_0.1.3 stringi_1.7.12 digest_0.6.33
## [7] magrittr_2.0.3 evaluate_0.22 grid_4.3.1
## [10] RColorBrewer_1.1-3 fastmap_1.1.1 plyr_1.8.9
## [13] jsonlite_1.8.7 Matrix_1.6-1.1 ggrepel_0.9.3
## [16] RSpectra_0.16-1 gridExtra_2.3 purrr_1.0.2
## [19] fansi_1.0.5 scales_1.2.1 codetools_0.2-19
## [22] jquerylib_0.1.4 cli_3.6.1 rlang_1.1.1
## [25] munsell_0.5.0 withr_2.5.1 cachem_1.0.8
## [28] yaml_2.3.7 ellipse_0.5.0 tools_4.3.1
## [31] parallel_4.3.1 reshape2_1.4.4 BiocParallel_1.34.2
## [34] dplyr_1.1.3 colorspace_2.1-0 corpcor_1.6.10
## [37] vctrs_0.6.4 R6_2.5.1 matrixStats_1.0.0
## [40] lifecycle_1.0.3 stringr_1.5.0 pkgconfig_2.0.3
## [43] pillar_1.9.0 bslib_0.5.1 gtable_0.3.4
## [46] glue_1.6.2 rARPACK_0.11-0 Rcpp_1.0.11
## [49] xfun_0.40 tibble_3.2.1 tidyselect_1.2.0
## [52] rstudioapi_0.15.0 knitr_1.44 farver_2.1.1
## [55] htmltools_0.5.6.1 igraph_1.5.1 labeling_0.4.3
## [58] rmarkdown_2.25 compiler_4.3.1
It is really important to tune only on the training set to avoid overly optimistic performance results!↩︎