Introduction

The data set consists of N=200 samples from a cohort of Chronic Lymphocytic Leukemia (CLL) patients where four omics were profiled from blood samples: DNA methylation (450K Illumina microarray), bulk RNA-seq, somatic mutations and drug response data from blood for N=200 patients with. The data set was introduced in detail in this article and can be downloaded here. The MOFA analysis was originally published here

Load libraries

Make sure that you have installed the MOFA2 and the MOFAdata package.

library(data.table)
library(ggplot2)
library(tidyverse)
library(MOFA2)
library(MOFAdata)
library(randomForest)
library(survival)
library(survminer)

Load data

Data is stored as a list of matrices. Features are stored in the rows and samples in the columns.

utils::data("CLL_data")       
lapply(CLL_data,dim)
## $Drugs
## [1] 310 200
## 
## $Methylation
## [1] 4248  200
## 
## $mRNA
## [1] 5000  200
## 
## $Mutations
## [1]  69 200

mRNA expression

The mRNA expression has been normalised by library size, followed by a variance stabilizing transformation using DESeq2:

hist(CLL_data$mRNA)

DNA methylation

DNA methylation is calculated for every CpG site using the M-value, which provides a better summary statistic for downstream analysis. For the MOFA analysis we selected the top 1% (N=4248) most variable sites.

hist(CLL_data$Methylation)

Drug response

In this study the authors have measured the effect of multiple drugs ex vivo using a high-throughput platform. For each drug they have measured 5 different concentrations. The value reported is the viability score (0=all cells died, 1=no cells died).

hist(CLL_data$Drugs)

Somatic mutations

Mutations are assessed using a panel of common cancer mutations and are summarised in a binary format (0=no mutation, 1=mutation):

table(CLL_data$Mutations)
## 
##    0    1 
## 8474  667

Sample metadata

Load sample metadata as a data.frame. Important columns are:

  • Gender: m (male), f (female)
  • Age: age in years
  • TTT: time (in years) which passed from taking the sample to the next treatment
  • TTD: time (in years) which passed from taking the sample to patients’ death
  • treatedAfter: (TRUE/FALSE)
  • Died: whether the patient died (TRUE/FALSE)
CLL_metadata <- fread("data/sample_metadata.txt")

head(CLL_metadata)
##    sample Gender      age        TTT      TTD treatedAfter  died IGHV trisomy12
## 1:   H005      m 75.26575 0.57494867 2.625599         TRUE FALSE    1         0
## 2:   H006      m       NA         NA       NA           NA    NA   NA        NA
## 3:   H007      f       NA         NA       NA           NA    NA   NA        NA
## 4:   H008      m       NA         NA       NA           NA    NA   NA        NA
## 5:   H010      f 72.78082 2.93223819 2.932238        FALSE FALSE    0         0
## 6:   H011      f 72.99452 0.01916496 2.951403         TRUE FALSE    1         0

Create the MOFA object and train the model

Create the MOFA object.

MOFAobject <- create_mofa(CLL_data)
MOFAobject
## Untrained MOFA model with the following characteristics: 
##  Number of views: 4 
##  Views names: Drugs Methylation mRNA Mutations 
##  Number of features (per view): 310 4248 5000 69 
##  Number of groups: 1 
##  Groups names: group1 
##  Number of samples (per group): 200 
## 

Plot data overview

Visualise the number of views (rows) and the number of groups (columns), what are their corresponding dimensionalities, and how many missing samples they have (grey bars).

plot_data_overview(MOFAobject)

Define MOFA options

Model options

Two important options:

  • num_factors: number of factors
  • likelihoods: likelihood for each view (options are “gaussian”, “poisson”, “bernoulli”). By default the “gaussian” distribution is used. If data are binary, as is the case for Somatic mutations, change the likelihood to “bernoulli”:
model_opts <- get_default_model_options(MOFAobject)

model_opts$likelihoods["Mutations"] <- "bernoulli"

model_opts$num_factors <- 15

model_opts
## $likelihoods
##       Drugs Methylation        mRNA   Mutations 
##  "gaussian"  "gaussian"  "gaussian" "bernoulli" 
## 
## $num_factors
## [1] 15
## 
## $spikeslab_factors
## [1] FALSE
## 
## $spikeslab_weights
## [1] FALSE
## 
## $ard_factors
## [1] FALSE
## 
## $ard_weights
## [1] TRUE

Train the MOFA model

Prepare the MOFA object.

MOFAobject <- prepare_mofa(MOFAobject,
  model_options = model_opts
)

Train the model: this should take ~5min, but DO NOT RUN THIS. The software has evolved since the original publication and the results will not be 100% identical to the original publication. Please load the pretrained model below for reproducibility purposes.

MOFAobject <- run_mofa(MOFAobject)

Load precomputed model.

MOFAobject <- readRDS("data/MOFA2_CLL.rds")

Add sample metadata to the model

The sample metadata must be provided as a data.frame and it must contain a column sample with the sample IDs. Make sure that the samples in the metadata match the samples in the model.

stopifnot(CLL_metadata$sample %in% samples_metadata(MOFAobject)$sample)

samples_metadata(MOFAobject) <- CLL_metadata

Rename features

In this step, we load metadata from additional files and use it to replace the uninformative drug and gene IDs with descriptive names.

Keep the model with the original variable names for the gene set enrichment analysis section.

MOFAobject.ensembl <- MOFAobject
updated_features_names <- features_names(MOFAobject)

# Rename drug IDs (i.e. D_001) to drug names (i.e. navitoclax)
drug_metadata <- fread("data/drugs.txt.gz")
tmp <- drug_metadata$name; names(tmp) <- drug_metadata$drug_id
updated_features_names[["Drugs"]] <- stringr::str_replace_all(features_names(MOFAobject)[["Drugs"]], tmp)

# Rename mRNA from ENSEMBLE IDs (i.e. ENSG00000223972) to gene names (i.e. DDX11L1)
gene_metadata <- fread("data/Hsapiens_genes_BioMart.87.txt.gz")
gene_metadata[,symbol:=ifelse(symbol=="",ens_id,symbol)]
tmp <- gene_metadata$symbol; names(tmp) <- gene_metadata$ens_id

# avoid duplicated names with the Mutations view
tmp[tmp%in%features_names(MOFAobject)[["Mutations"]]] <- paste0(tmp[tmp%in%features_names(MOFAobject)[["Mutations"]]],"_mRNA")
updated_features_names[["mRNA"]] <- stringr::str_replace_all(features_names(MOFAobject)[["mRNA"]], tmp)

# Update features names in model
features_names(MOFAobject) <- updated_features_names

Variance decomposition analysis

Variance decomposition by Factor

The most important insight that MOFA generates is the variance decomposition analysis. This plot shows the percentage of variance explained by each factor across each data modality.

plot_variance_explained(MOFAobject, max_r2=10)

What insights from the data can we learn just from inspecting this plot?

  • Factor 1 captures a source of variability that is present across almost all data modalities. Thus, its etiology is likely to be something very important for the disease.
  • Factor 2 captures a strong source of variation that is exclusive to the drug response data.
  • Factor 3 captures a strong source of variation that is exclusive to the mRNA data.
  • Factor 4 and Factor 5 capture some co-variation between the mRNA and the drug response assay.

(Q) Based on the MOFA output, if you were to profile just one molecular layer, which one would you choose to maximise the amount of sources of variation captured?

Characterisation of Factor 1

There are a few systematic strategies to characterise the molecular signal that underlies each MOFA Factor and to relate them to existent sample covariates:

Association analysis

Let’s test the association between the MOFA factors versus Gender and age:

correlate_factors_with_covariates(MOFAobject, 
  covariates = c("Gender","age","died"), 
  plot = "log_pval"
)
## Warning in correlate_factors_with_covariates(MOFAobject, covariates =
## c("Gender", : There are non-numeric values in the covariates data.frame,
## converting to numeric...

Most Factors don’t have a clear association with any of the covariates. However Factor 1 and Factor 14 appear to have some association with survival outcome. We will explore association with clinical measurements later in the vignette.

Inspection of factor values

How do we interpret the factor values?
Each factor captures a different source of variability in the data. Mathematically, each Factor is defined by a linear combination of the input features. Each Factor ordinates cells along a one-dimensional axis that is centered at zero. Samples with different signs manifest opposite phenotypes along the inferred axis of variation, with higher absolute value indicating a stronger effect.
Note that the interpretation of MOFA factors is analogous to the interpretation of the principal components in PCA.

plot_factors(MOFAobject, 
  factors = c(1,3), 
  dot_size = 2.5
)

Inspection of feature weights

How do we interpret the feature weights?
The weights provide a score for each feature on each factor. Features with no association with the corresponding factor are expected to have values close to zero, whereas features with strong association with the factor are expected to have large absolute values. The sign of the weights indicates the direction of the effect: a positive weight indicates that the feature has higher levels in the cells with positive factor values, and vice-versa.

Plot feature weights for somatic mutations

By looking at the variance explained plot, we saw that Factor 1 captures variation in all data modalities. Out of all omics, the somatic mutation data is a good place to start, as somatic mutations are very sparse, easy to interpret and any change in the DNA is likely to have downstream consequences to all other molecular layers. Let’s plot the weights:

plot_weights(MOFAobject,
 view = "Mutations",
 factor = 1,
 nfeatures = 10,     # Top number of features to highlight
 scale = T           # Scale weights from -1 to 1
)

Notice that most features lie at zero, indicating that most features have no association with Factor 1. There is however one gene that clearly stands out: IGHV (immunoglobulin heavy chain variable region). This is the main clinical marker for CLL.

An alternative visualisation to the full distribution of weights is to do a line plot that displays only the top features with the corresponding weight sign on the right:

plot_top_weights(MOFAobject,
 view = "Mutations",
 factor = 1,
 nfeatures = 10,     # Top number of features to highlight
 scale = TRUE        # Scale weights from -1 to 1
)

IGHV has a positve weight. This means that samples with positive Factor 1 values have the IGHV mutation whereas samples with negative Factor 1 values do not have the IGHV mutation. To confirm this, let’s plot the Factor values and colour the IGHV mutation status.

plot_factor(MOFAobject, 
  factors = 1, 
  color_by = "IGHV",
  add_violin = TRUE,
  dodge = TRUE,
  show_missing = FALSE
)

We can also plot Factor values coloured by other covariates, for example Gender. As concluded from the association analysis above, this variable has no association with Factor 1:

plot_factor(MOFAobject, 
  factors = 1, 
  color_by = "Gender",
  dodge = TRUE,
  add_violin = TRUE
)

Plot gene weights for mRNA expression

From the variance explained plot we know that Factor 1 drives variation across all data modalities. Let’s visualise the mRNA expression changes that are associated with Factor 1:

plot_weights(MOFAobject, 
  view = "mRNA", 
  factor = 1, 
  nfeatures = 10
)

Plot molecular signatures in the input data

In this case we have a large amount of genes with high positive and high negative weights. Genes with large positive values will be more expressed in the samples with IGHV mutation, whereas genes with large negative values will be more expressed in the samples without the IGHV mutation. Let’s verify this. The function plot_data_scatter generates a scatterplot of Factor 1 values (x-axis) versus expression values (y-axis) for the top 4 genes with largest positive weight. Samples are coloured by IGHV status:

plot_data_scatter(MOFAobject, 
  view = "mRNA",
  factor = 1,  
  features = 4,
  sign = "positive",
  color_by = "IGHV"
) + labs(y="RNA expression")

This function generates a scatterplot of Factor 1 values (x-axis) versus expression values (y-axis) for the top 4 genes with largest negative weight. Samples are coloured by IGHV status:

plot_data_scatter(MOFAobject, 
  view = "mRNA",
  factor = 1,  
  features = 4,
  sign = "negative",
  color_by = "IGHV"
) + labs(y="RNA expression")