This lesson is a modified version of a vignette from the R package mixOmics by Le Cao et al. The full series of vignettes can be found here in ebook format; this lesson is based on Chapter 6: N-integration.

N-Integration

N-Integration is the framework of having multiple datasets which measure different aspects of the same samples. For example, you may have transcriptomic, genetic and proteomic data for the same set of cells. N-integrative methods are built to use the information in all three of these dataframes simultaneously.

DIABLO is a novel mixOmics framework for the integration of multiple data sets while explaining their relationship with a categorical outcome variable. DIABLO stands for Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies. It can also be referred to as Multiblock (s)PLS-DA.

Block sPLS-DA on the TCGA case study

Human breast cancer is a heterogeneous disease in terms of molecular alterations, cellular composition, and clinical outcome. Breast tumours can be classified into several subtypes, according to their levels of mRNA expression. Here we consider a subset of data generated by The Cancer Genome Atlas Network. For the package, data were normalised, and then drastically prefiltered for illustrative purposes.

The data were divided into a training set with a subset of 150 samples from the mRNA, miRNA and proteomics data, and a test set including 70 samples, but only with mRNA and miRNA data (the proteomics data are missing). The aim of this integrative analysis is to identify a highly correlated multi-omics signature discriminating the breast cancer subtypes Basal, Her2 and LumA.

The breast.TCGA (more details can be found in ?breast.TCGA) is a list containing training and test sets of omics data data.train and data.test which include:

$miRNA: A data frame with 150 (70) rows and 184 columns in the training (test) data set for the miRNA expression levels,
$mRNA: A data frame with 150 (70) rows and 520 columns in the training (test) data set for the mRNA expression levels,
$protein: A data frame with 150 rows and 142 columns in the training data set for the protein abundance (there are no proteomics in the test set),
$subtype: A factor indicating the breast cancer subtypes in the training (for 150 samples) and test sets (for 70 samples).

This case study covers an interesting scenario where one omic data set is missing in the test set, but because the method generates a set of components per training data set, we can still assess the prediction or performance evaluation using majority or weighted prediction vote.

Load the data

To illustrate the multiblock sPLS-DA approach, we will integrate the expression levels of miRNA, mRNA and the abundance of proteins while discriminating the subtypes of breast cancer, then predict the subtypes of the samples in the test set.

The input data is first set up as a list of $Q$ matrices $\boldsymbol X_1, \dots, \boldsymbol X_Q$ and a factor indicating the class membership of each sample $\boldsymbol Y$. Each data frame in $\boldsymbol X$ should be named as we will match these names with the keepX parameter for the sparse method.

library(mixOmics)
data(breast.TCGA)

# Extract training data and name each data frame
# Store as list
X <- list(mRNA = breast.TCGA$data.train$mrna, 
          miRNA = breast.TCGA$data.train$mirna, 
          protein = breast.TCGA$data.train$protein)

# Outcome
Y <- breast.TCGA$data.train$subtype
summary(Y)

## Basal  Her2  LumA 
##    45    30    75

Parameter choice

Design matrix

The choice of the design can be motivated by different aspects, including:

Biological apriori knowledge: Should we expect mRNA and miRNA to be highly correlated?
Analytical aims: A compromise needs to be achieved between a classification and prediction task, and extracting the correlation structure of the data sets. A full design with weights = 1 will favour the latter, but at the expense of classification accuracy, whereas a design with small weights will lead to a highly predictive signature. This pertains to the complexity of the analytical task involved as several constraints are included in the optimisation procedure. For example, here we choose a 0.1 weighted model as we are interested in predicting test samples later in this case study.

design <- matrix(0.1, ncol = length(X), nrow = length(X), 
                dimnames = list(names(X), names(X)))
diag(design) <- 0
design

##         mRNA miRNA protein
## mRNA     0.0   0.1     0.1
## miRNA    0.1   0.0     0.1
## protein  0.1   0.1     0.0

Note however that even with this design, we will still unravel a correlated signature as we require all data sets to explain the same outcome $\boldsymbol y$, as well as maximising pairs of covariances between data sets.

Data-driven option: we could perform regression analyses with PLS to further understand the correlation between data sets. Here for example, we run PLS with one component and calculate the cross-correlations between components associated to each data set:

res1.pls.tcga <- pls(X$mRNA, X$protein, ncomp = 1)
cor(res1.pls.tcga$variates$X, res1.pls.tcga$variates$Y)

res2.pls.tcga <- pls(X$mRNA, X$miRNA, ncomp = 1)
cor(res2.pls.tcga$variates$X, res2.pls.tcga$variates$Y)

res3.pls.tcga <- pls(X$protein, X$miRNA, ncomp = 1)
cor(res3.pls.tcga$variates$X, res3.pls.tcga$variates$Y)

##           comp1
## comp1 0.9031761
##           comp1
## comp1 0.8456299
##           comp1
## comp1 0.7982008

The data sets taken in a pairwise manner are highly correlated, indicating that a design with weights $\sim 0.8 - 0.9$ could be chosen.

Number of components

As in the PLS-DA framework presented in Module 3, we first fit a block.plsda model without variable selection to assess the global performance of the model and choose the number of components. We run perf() with 10-fold cross validation repeated 10 times for up to 5 components and with our specified design matrix. Similar to PLS-DA, we obtain the performance of the model with respect to the different prediction distances:

diablo.tcga <- block.plsda(X, Y, ncomp = 5, design = design)

perf.diablo.tcga = perf(diablo.tcga, validation = 'Mfold', folds = 10, nrepeat = 10, seed = 123)

# Plot of the error rates based on weighted vote
plot(perf.diablo.tcga)

Choosing the number of components in block.plsda using perf() with 10 x 10-fold CV function in the breast.TCGA study. Classification error rates (overall and balanced, see Module 2) are represented on the y-axis with respect to the number of components on the x-axis for each prediction distance presented in PLS-DA in Section 3.4 and detailed in Extra reading material 3 from Module 3. Bars show the standard deviation across the 10 repeated folds. The plot shows that the error rate reaches a minimum from 2 to 3 dimensions.

The performance plot indicates that two components should be sufficient in the final model, and that the centroids distance might lead to better prediction. A balanced error rate (BER) should be considered for further analysis.

The following outputs the optimal number of components according to the prediction distance and type of error rate (overall or balanced), as well as a prediction weighting scheme illustrated further below.

perf.diablo.tcga$choice.ncomp$WeightedVote

##             max.dist centroids.dist mahalanobis.dist
## Overall.ER         4              3                3
## Overall.BER        4              3                3

We will use ncomp = 2 moving forward although the above indicates 3 is optimal, because 3 is only very marginally superior to 2.

Number of variables to select

We then choose the optimal number of variables to select in each data set using the tune.block.splsda function. The function tune() is run with 10-fold cross validation, but repeated only once (nrepeat = 1) for illustrative and computational reasons here. For a thorough tuning process, we advise increasing the nrepeat argument to 10-50, or more.

We choose a keepX grid that is relatively fine at the start, then coarse. If the data sets are easy to classify, the tuning step may indicate the smallest number of variables to separate the sample groups. Hence, we start our grid at the value 5 to avoid a too small signature that may preclude biological interpretation.

# chunk takes about 2 min to run
test.keepX <- list(mRNA = c(5:9, seq(10, 25, 5)),
                   miRNA = c(5:9, seq(10, 20, 2)),
                   proteomics = c(seq(5, 25, 5)))

tune.diablo.tcga <- tune.block.splsda(X, Y, ncomp = 2, 
                              test.keepX = test.keepX, design = design,
                              validation = 'Mfold', folds = 10, nrepeat = 1, 
                              dist = "centroids.dist", seed = 123)

list.keepX <- tune.diablo.tcga$choice.keepX

Note:

For fast computation, we can use parallel computing here - this option is also enabled on a laptop or workstation, see ?tune.block.splsda.

The number of features to select on each component is returned and stored for the final model:

list.keepX

## $mRNA
## [1]  8 25
## 
## $miRNA
## [1] 14  5
## 
## $protein
## [1] 10  5

Note:

You can skip any of the tuning steps above, and hard code your chosen ncomp and keepX parameters (as a list for the latter, as shown below).

list.keepX <- list( mRNA = c(8, 25), miRNA = c(14,5), protein = c(10, 5))

Final model

The final multiblock sPLS-DA model includes the tuned parameters and is run as:

diablo.tcga <- block.splsda(X, Y, ncomp = 2, 
                            keepX = list.keepX, design = design)

## Design matrix has changed to include Y; each block will be
##             linked to Y.

A warning message informs us that the outcome $\boldsymbol Y$ has been included automatically in the design, so that the covariance between each block’s component and the outcome is maximised, as shown in the final design output:

diablo.tcga$design

##         mRNA miRNA protein Y
## mRNA     0.0   0.1     0.1 1
## miRNA    0.1   0.0     0.1 1
## protein  0.1   0.1     0.0 1
## Y        1.0   1.0     1.0 0

The selected variables can be extracted with the function selectVar(), for example in the mRNA block, along with their loading weights (not output here):

# mRNA variables selected on component 1
selectVar(diablo.tcga, block = 'mRNA', comp = 1)

Note:

The stability of the selected variables can be extracted from the perf() function, similar to the example given in the PLS-DA analysis (Module 3).

Sample plots

`plotDiablo`

plotDiablo() is a diagnostic plot to check whether the correlations between components from each data set were maximised as specified in the design matrix. We specify the dimension to be assessed with the ncomp argument.

plotDiablo(diablo.tcga, ncomp = 1)

Diagnostic plot from multiblock sPLS-DA applied on the breast.TCGA study. Samples are represented based on the specified component (here ncomp = 1) for each data set (mRNA, miRNA and protein). Samples are coloured by breast cancer subtype (Basal, Her2 and LumA) and 95% confidence ellipse plots are represented. The bottom left numbers indicate the correlation coefficients between the first components from each data set. In this example, mRNA expression and protein concentration are highly correlated on the first dimension.

The plot indicates that the first components from all data sets are highly correlated. The colours and ellipses represent the sample subtypes and indicate the discriminative power of each component to separate the different tumour subtypes. Thus, multiblock sPLS-DA is able to extract a strong correlation structure between data sets, as well as discriminate the breast cancer subtypes on the first component.

`plotIndiv`

The sample plot with the plotIndiv() function projects each sample into the space spanned by the components from each block, resulting in a series of graphs corresponding to each data set. The optional argument blocks can output a specific data set. Ellipse plots are also available (argument ellipse = TRUE).

plotIndiv(diablo.tcga, ind.names = FALSE, legend = TRUE, 
          title = 'TCGA, DIABLO comp 1 - 2')

Sample plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are plotted according to their scores on the first 2 components for each data set. Samples are coloured by cancer subtype and are classified into three classes: Basal, Her2 and LumA. The plot shows the degree of agreement between the different data sets and the discriminative ability of each data set.

This type of graphic allows us to better understand the information extracted from each data set and its discriminative ability. Here we can see that the LumA group can be difficult to classify in the miRNA data.

Note:

Additional variants include the argument block = 'average' that averages the components from all blocks to produce a single plot. The argument block='weighted.average' is a weighted average of the components according to their correlation with the components associated with the outcome.

`plotArrow`

In the arrow plot, the start of the arrow indicates the centroid between all data sets for a given sample and the tip of the arrow the location of that same sample but in each block. Such graphics highlight the agreement between all data sets at the sample level when modelled with multiblock sPLS-DA.

plotArrow(diablo.tcga, ind.names = FALSE, legend = TRUE, 
          title = 'TCGA, DIABLO comp 1 - 2')

Arrow plot from multiblock sPLS-DA performed on the breast.TCGA study. The samples are projected into the space spanned by the first two components for each data set then overlaid across data sets. The start of the arrow indicates the centroid between all data sets for a given sample and the tip of the arrow the location of the same sample in each block. Arrows further from their centroid indicate some disagreement between the data sets. Samples are coloured by cancer subtype (Basal, Her2 and LumA).

This plot shows that globally, the discrimination of all breast cancer subtypes can be extracted from all data sets, however, there are some dissimilarities at the samples level across data sets (the common information cannot be extracted in the same way across data sets).

Variable plots

The visualisation of the selected variables is crucial to mine their associations in multiblock sPLS-DA. Here we revisit existing outputs presented in Module 2 with further developments for multiple data set integration. All the plots presented provide complementary information for interpreting the results.

`plotVar`

The correlation circle plot highlights the contribution of each selected variable to each component. Important variables should be close to the large circle (see Module 2). Here, only the variables selected on components 1 and 2 are depicted (across all blocks). Clusters of points indicate a strong correlation between variables. For better visibility we chose to hide the variable names.

plotVar(diablo.tcga, var.names = FALSE, style = 'graphics', legend = TRUE, 
        pch = c(16, 17, 15), cex = c(2,2,2), 
        col = c('darkorchid', 'brown1', 'lightgreen'),
        title = 'TCGA, DIABLO comp 1 - 2')

Correlation circle plot from multiblock sPLS-DA performed on the breast.TCGA study. The variable coordinates are defined according to their correlation with the first and second components for each data set. Variable types are indicated with different symbols and colours, and are overlaid on the same plot. The plot highlights the potential associations within and between different variable types when they are important in defining their own component.

The correlation circle plot shows some positive correlations (between selected miRNA and proteins, between selected proteins and mRNA) and negative correlations between mRNAand miRNA on component 1. The correlation structure is less obvious on component 2, but we observe some key selected features (proteins and miRNA) that seem to highly contribute to component 2.

Note:

These results can be further investigated by showing the variable names on this plot (or extracting their coordinates available from the plot saved into an object, see ?plotVar), and looking at various outputs from selectVar() and plotLoadings().
You can choose to only show specific variable type names, e.g. var.names = c(FALSE, FALSE, TRUE) (where each argument is assigned to a data set in $\boldsymbol X$). Here for example, the protein names only would be output.

`circosPlot`

The circos plot represents the correlations between variables of different types, represented on the side quadrants. Several display options are possible, to show within and between connections between blocks, and expression levels of each variable according to each class (argument line = TRUE). The circos plot is built based on a similarity matrix. A cutoff argument can be further included to visualise correlation coefficients above this threshold in the multi-omics signature. The colours for the blocks and correlation lines can be chosen with color.blocks and color.cor respectively:

circosPlot(diablo.tcga, cutoff = 0.7, line = TRUE, 
           color.blocks = c('darkorchid', 'brown1', 'lightgreen'),
           color.cor = c("chocolate3","grey20"), size.labels = 1.5)

Circos plot from multiblock sPLS-DA performed on the breast.TCGA study. The plot represents the correlations greater than 0.7 between variables of different types, represented on the side quadrants. The internal connecting lines show the positive (negative) correlations. The outer lines show the expression levels of each variable in each sample group (Basal, Her2 and LumA).

The circos plot enables us to visualise cross-correlations between data types, and the nature of these correlations (positive or negative). Here we observe that correlations > 0.7 are between a few mRNAand some Proteins, whereas the majority of strong (negative) correlations are observed between miRNA and mRNAor Proteins. The lines indicating the average expression levels per breast cancer subtype indicate that the selected features are able to discriminate the sample groups.

`network`

Relevance networks, which are also built on the similarity matrix, can also visualise the correlations between the different types of variables. Each colour represents a type of variable. A threshold can also be set using the argument cutoff. By default the network includes only variables selected on component 1, unless specified in comp.

network(diablo.tcga, blocks = c(1,2,3), 
        cutoff = 0.4,
        color.node = c('darkorchid', 'brown1', 'lightgreen')
        )

Relevance network for the variables selected by multiblock sPLS-DA performed on the breast.TCGA study on component 1. Each node represents a selected variable with colours indicating their type. The colour of the edges represent positive or negative correlations. Further tweaking of this plot can be obtained, see the help file ?network.

The relevance network shows two groups of features of different types. Within each group we observe positive and negative correlations. The visualisation of this plot could be further improved by changing the names of the original features.

`plotLoadings`

plotLoadings() visualises the loading weights of each selected variable on each component and each data set. The colour indicates the class in which the variable has the maximum level of expression (contrib = 'max') or minimum (contrib = 'min'), on average (method = 'mean') or using the median (method = 'median').

plotLoadings(diablo.tcga, comp = 1, contrib = 'max', method = 'median')

Loading plot for the variables selected by multiblock sPLS-DA performed on the breast.TCGA study on component 1. The most important variables (according to the absolute value of their coefficients) are ordered from bottom to top. As this is a supervised analysis, colours indicate the class for which the median expression value is the highest for each feature (variables selected characterise Basal and LumA).

The loading plot shows the multi-omics signature selected on component 1, where each panel represents one data type. The importance of each variable is visualised by the length of the bar (i.e. its loading coefficient value). The combination of the sign of the coefficient (positive / negative) and the colours indicate that component 1 discriminates primarily the Basal samples vs. the LumA samples (see the sample plots also). The features selected are highly expressed in one of these two subtypes. One could also plot the second component that discriminates the Her2 samples.

`cimDiablo`

The cimDiablo() function is a clustered image map specifically implemented to represent the multi-omics molecular signature expression for each sample. It is very similar to a classical hierarchical clustering.

cimDiablo(diablo.tcga, color.blocks = c('darkorchid', 'brown1', 'lightgreen'),
          comp = 1, margin=c(8,20), legend.position = "right")

## 
## trimming values to [-3, 3] range for cim visualisation. See 'trim' arg in ?cimDiablo

Clustered Image Map for the variables selected by multiblock sPLS-DA performed on the breast.TCGA study on component 1. By default, Euclidean distance and Complete linkage methods are used. The CIM represents samples in rows (indicated by their breast cancer subtype on the left hand side of the plot) and selected features in columns (indicated by their data type at the top of the plot).

According to the CIM, component 1 seems to primarily classify the Basal samples, with a group of overexpressed miRNA and underexpressed mRNAand proteins. A group of LumA samples can also be identified due to the overexpression of the same mRNAand proteins. Her2 samples remain quite mixed with the other LumA samples.

Model performance and prediction

We assess the performance of the model using 10-fold cross-validation repeated 10 times with the function perf(). The method runs a block.splsda() model on the pre-specified arguments input from our final object diablo.tcga but on cross-validated samples. We then assess the accuracy of the prediction on the left out samples. Since the tune() function was used with the centroid.dist argument, we examine the outputs of the perf() function for that same distance:

perf.diablo.tcga <- perf(diablo.tcga,  validation = 'Mfold', folds = 10, 
                         nrepeat = 10, dist = 'centroids.dist', seed = 123)

We can extract the (balanced) classification error rates globally or overall with perf.diablo.tcga$error.rate.per.class, the predicted components associated to $\boldsymbol Y$, or the stability of the selected features with perf.diablo.tcga$features.

Here we look at the different performance assessment schemes specific to multiple data set integration.

First, we output the performance with the majority vote, that is, since the prediction is based on the components associated to their own data set, we can then weight those predictions across data sets according to a majority vote scheme. Based on the predicted classes, we then extract the classification error rate per class and per component:

# Performance with Majority vote
perf.diablo.tcga$MajorityVote.error.rate

## $centroids.dist
##                  comp1       comp2
## Basal       0.02666667 0.044444444
## Her2        0.23666667 0.140000000
## LumA        0.04933333 0.009333333
## Overall.ER  0.08000000 0.046000000
## Overall.BER 0.10422222 0.064592593

The output shows that the classification improves overall with the addition of the second component.

Another prediction scheme is to weight the classification error rate from each data set according to the correlation between the predicted components and the $\boldsymbol Y$ outcome.

# Performance with Weighted vote
perf.diablo.tcga$WeightedVote.error.rate

## $centroids.dist
##                   comp1       comp2
## Basal       0.006666667 0.044444444
## Her2        0.173333333 0.123333333
## LumA        0.049333333 0.009333333
## Overall.ER  0.061333333 0.042666667
## Overall.BER 0.076444444 0.059037037

Compared to the previous majority vote output, we can see that the classification accuracy is slightly better on component 2 for the subtype Her2.

An AUC plot per block is plotted using the function auroc(). We have already mentioned in Module 3 for PLS-DA, the interpretation of this output may not be particularly insightful in relation to the performance evaluation of our methods, but can complement the statistical analysis. For example, here for the miRNA data set once we have reached component 2:

auc.diablo.tcga <- auroc(diablo.tcga, roc.block = "miRNA", roc.comp = 2,
                         print = FALSE)

ROC and AUC based on multiblock sPLS-DA performed on the breast.TCGA study for the miRNA data set after 2 components. The function calculates the ROC curve and AUC for one class vs. the others. If we set print = TRUE, the Wilcoxon test p-value that assesses the differences between the predicted components from one class vs. the others is output.

The figure shows that the Her2 subtype is the most difficult to classify with multiblock sPLS-DA compared to the other subtypes.

The predict() function associated with a block.splsda() object predicts the class of samples from an external test set. In our specific case, one data set is missing in the test set but the method can still be applied. We need to ensure the names of the blocks correspond exactly to those from the training set:

# Prepare test set data: here one block (proteins) is missing
data.test.tcga <- list(mRNA = breast.TCGA$data.test$mrna, 
                      miRNA = breast.TCGA$data.test$mirna)

predict.diablo.tcga <- predict(diablo.tcga, newdata = data.test.tcga)
# The warning message will inform us that one block is missing

The following output is a confusion matrix that compares the real subtypes with the predicted subtypes for a 2 component model, for the distance of interest centroids.dist and the prediction scheme WeightedVote:

confusion.mat.tcga <- get.confusion_matrix(truth = breast.TCGA$data.test$subtype, 
                                           predicted = predict.diablo.tcga$WeightedVote$centroids.dist[,2])
confusion.mat.tcga

##       predicted.as.Basal predicted.as.Her2 predicted.as.LumA
## Basal                 20                 1                 0
## Her2                   0                13                 1
## LumA                   0                 3                32

From this table, we see that one Basal and one Her2 sample are wrongly predicted as Her2 and LumA respectively, and 3 LumA samples are wrongly predicted as Her2. The balanced prediction error rate can be obtained as:

get.BER(confusion.mat.tcga)

## [1] 0.06825397

It would be worthwhile at this stage to revisit the chosen design of the multiblock sPLS-DA model to assess the influence of the design on the prediction performance on this test set - even though this back and forth analysis is a biased criterion to choose the design!

Demo 5: Multiomics N-Integration with DIABLO

mixOmics Team (modified and updated by Quentin Read)

2025-06-04

N-Integration

Block sPLS-DA on the TCGA case study

Load the data

Parameter choice

Design matrix

Number of components

Number of variables to select

Final model

Sample plots

`plotDiablo`

`plotIndiv`

`plotArrow`

Variable plots

`plotVar`

`circosPlot`

`network`

`plotLoadings`

`cimDiablo`

Model performance and prediction