A smörgåsbord of options for multiomics data analysis

Quentin D. Read

Who is this talk for?

People who have multiomics data but are not experienced in analyzing it
I am not an expert in this topic, so I will talk mostly at a conceptual level
I will provide a “smörgåsbord” of options for you to consider

Session outline

Resources to learn more about multiomics analysis
Basic conceptual overview of multiomics data integration methods
Example analyses in R

Demos

Demo 1: MOFA analysis including data preprocessing and cleaning steps
Demo 2: More advanced MOFA analysis on pre-processed data
Demo 3: sPLS-DA analysis with mixOmics
Demo 4: Differential expression analysis with DESeq2
Demo 5: DIABLO omics integration with mixOmics

Resources for further learning about multiomics

ASA Short Course

Short course by the American Statistical Association Section on Statistics in Genomics and Genetics

mixOmics courses

Swedish National Bioinformatics Infrastructure courses

https://nbis.se/training
- not open to USA participants but they have great course notes
Example: Class notes from September 2021 including all R and Python notebooks
The demo analyses from today are taken directly from this course

Multiomics Integration: overview of the concepts

What do multiomics data look like?

We have more than one omics dataset
- Genome
- Methylome
- Metabolome
- Transcriptome
- Microbiome

Either organism-wide or single-cell

We have metadata about the organisms or cells
- Experimental treatment
- Individual-level variables (age, sex, body weight, etc.)
- Individual-level outcomes (disease status, mortality, etc.)
- Cell-level variables (tissue type, etc.)

What research questions can we answer?

What are the relationships between the different omics datasets?
How well do the omics datasets, individually or combined, predict outcomes?

Supervised or unsupervised?

Supervised: we have some kind of outcome variable (often disease status) and we want to find which combinations of features in the different omics datasets predict it

Unsupervised: we don’t have an outcome; we are just exploring relationships between the datasets

Descriptive or predictive? (within supervised methods)

Descriptive: we want to find weights for the variables to optimally separate the classes
- many different criteria can be used to decide what constitutes “optimal”

Predictive: we want to predict the class of new samples given their variables
- construct a rule or classifier
- diagnose predictive performance (sensitivity & specificity; AUC)
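As a rough illustration of the performance measures listed above, here is a minimal Python sketch that computes sensitivity, specificity, and AUC by hand. The function names, toy labels, scores, and the 0.5 cutoff are all hypothetical and chosen only for this example; in practice a package would do this for you.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for labels coded 1 (case) / 0 (control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cases called cases
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cases missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls called controls
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control.

    Ties count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = diseased, 0 = healthy
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]           # classifier output per sample
y_pred = [1 if s >= 0.5 else 0 for s in scores]   # arbitrary 0.5 cutoff

sens, spec = sensitivity_specificity(y_true, y_pred)
roc_auc = auc(y_true, scores)
print(sens, spec, roc_auc)
```

Sensitivity and specificity depend on the chosen cutoff, whereas AUC summarizes performance over all possible cutoffs.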

What method we use depends on the size of the ’ome

Metabolome: kinda small (~10s-100s of features)
Microbiome, transcriptome: kinda big (~1000s of features)
Genome, methylome: huuuge (>>1000s of features)

p versus n

p: number of predictors/features
- very variable

n: number of samples
- in clinical/biomedical research often ~10-100 patients or animals

p versus n

\(p \gg n\): Bayesian methods
\(p \approx n\): Frequentist methods
\(p \ll n\): Deep learning methods

Platforms

R (Bioconductor)
Python
MetaboAnalyst

Data cleaning and preprocessing

Remove features with all 0s or with little or no variance (they carry no information)
Remove samples or individuals that have too high a proportion of missing values
For individuals with a moderate number of missing values, the gaps can be filled in with one of many imputation methods
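The first two cleaning steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `clean_matrix`, the 50% missingness cutoff, the variance threshold, and the toy matrix are all hypothetical, and imputation is deliberately left out.

```python
from statistics import pvariance


def clean_matrix(data, max_missing=0.5, min_variance=1e-8):
    """Pick the samples and features worth keeping.

    data: list of sample rows (one value per feature); None marks a missing value.
    Returns (kept_sample_indices, kept_feature_indices).
    """
    n_features = len(data[0])
    # Drop samples whose proportion of missing values is too high
    kept_samples = [
        i for i, row in enumerate(data)
        if sum(v is None for v in row) / n_features <= max_missing
    ]
    # Drop features that are constant (e.g. all zeros) among the kept samples
    kept_features = []
    for j in range(n_features):
        observed = [data[i][j] for i in kept_samples if data[i][j] is not None]
        if len(observed) > 1 and pvariance(observed) > min_variance:
            kept_features.append(j)
    return kept_samples, kept_features


# Hypothetical toy matrix: 4 samples x 4 features
toy = [
    [0, 5.0, 1.2, 3.0],
    [0, 5.0, 0.7, None],
    [0, 5.0, 2.9, 4.5],
    [None, None, None, 1.0],  # mostly missing, so this sample is dropped
]
samples, features = clean_matrix(toy)
print(samples, features)
```

Here the last sample is removed for missingness, and the first two features are removed because they carry no variance.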

Exploratory data analysis

Use simple methods such as PCA as a first step, mainly to visualize the data
Understand structure of the data
Pick out any biases or errors in the data
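To make the PCA step concrete, here is a minimal pure-Python sketch that finds the first principal component by power iteration on the covariance matrix. It is meant only to show the idea; the function name `first_pc` and the toy data are hypothetical, and a real analysis would use an established PCA routine.

```python
import math


def first_pc(data, n_iter=200):
    """Return (loadings, scores) of the first principal component.

    data: list of sample rows with no missing values.
    """
    n, p = len(data), len(data[0])
    # Center each feature on its mean
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance matrix (p x p)
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Power iteration: repeatedly multiply by the covariance and renormalize
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered sample onto the component
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores


# Hypothetical toy data: features 0 and 1 move together, feature 2 is nearly flat
toy = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.4],
    [3.0, 6.2, 0.6],
    [4.0, 7.8, 0.5],
    [5.0, 10.1, 0.4],
]
loadings, scores = first_pc(toy)
print(loadings, scores)
```

Plotting the scores, colored by treatment or batch, is the usual way to spot structure, outliers, or batch effects.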

Multiomics Integration: a sampling of techniques

Methods we will demo today

MOFA (multiomics factor analysis; unsupervised)
sPLS-DA (sparse partial least squares discriminant analysis; supervised)
DIABLO (sPLS-DA for multiple omics datasets)
Differential expression analysis

Other techniques we won’t demo today

Machine learning-based dimension reduction/visualization methods
Deep learning/neural networks
Network analysis
There are more …

ML dimension reduction & visualization methods

tSNE, UMAP, and more
- See post on Towards Data Science

Like PCA on steroids: nonlinear dimension reduction using a machine learning style approach

Can be used to create a consensus mapping that integrates multiple omics datasets

Deep learning methods

from Introduction to artificial intelligence for cardiovascular clinicians, Chong & Limon

Far superior performance, but only on very large datasets
Essentially dimension reduction followed by clustering
Very good at picking out nonlinear patterns
Can be paired with ML visualization method

Network analysis methods

Price et al. 2017, Nature Biotechnology

Generate network graph (k-nearest neighbors or other methods)
Similarity network fusion to combine networks from different omics into a single fused network
Use network to identify important features and their relationships
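The first step above, building a k-nearest-neighbors graph, can be sketched as follows. This is an assumption-laden toy: the function name `knn_graph`, the Euclidean distance, and the two-dimensional points are all hypothetical, and the sketch does not implement similarity network fusion.

```python
import math


def knn_graph(points, k=2):
    """Return {point index: indices of its k nearest neighbors} (Euclidean distance)."""
    graph = {}
    for i, a in enumerate(points):
        # Distance from point i to every other point, nearest first
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Hypothetical toy data: two well-separated groups of three points each
toy = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (5.0, 5.3)]
graph = knn_graph(toy, k=2)
print(graph)
```

Each point ends up connected only to members of its own group, which is the kind of structure a network method then exploits.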

Multiomics factor analysis (MOFA)

See Argelaguet et al. 2018, Molecular Systems Biology

Hypothesis-free data exploration framework
Extracts common axes of variation from multiple omics layers
Infers low-dimensional data representation in terms of hidden factors
Can impute missing values in the process
Visualize low-dimensional factors and interpret them biologically
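The factor model behind the bullets above can be written compactly. As a sketch in the notation of the MOFA paper cited above (these symbols are not used elsewhere in the slides), for each omics layer \(m = 1, \dots, M\):

\[ \mathbf{Y}^m = \mathbf{Z}\,(\mathbf{W}^m)^\top + \boldsymbol{\varepsilon}^m \]

Here \(\mathbf{Y}^m\) is the \(N \times D_m\) data matrix for layer \(m\) (\(N\) samples, \(D_m\) features), \(\mathbf{Z}\) is the \(N \times K\) matrix of hidden factors shared by all layers, \(\mathbf{W}^m\) is the \(D_m \times K\) matrix of layer-specific weights, and \(\boldsymbol{\varepsilon}^m\) is residual noise. Because \(\mathbf{Z}\) is shared, the product \(\mathbf{Z}\,(\mathbf{W}^m)^\top\) also gives a fitted value for entries of \(\mathbf{Y}^m\) that were never observed, which is what makes imputation possible.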

Partial least squares discriminant analysis (PLS-DA)

See Lê Cao et al. 2011, BMC Bioinformatics

Supervised classification method
Choose number of groups \(k\)
Maximize variability between groups, minimize variability within groups

Sparse PLS-DA (sPLS-DA)

Combination of sparse PCA and PLS-DA
First do variable selection: a LASSO penalty shrinks some variable coefficients to exactly 0 on each component
Choose parameters with cross-validation
Train model
Predict on independent test data or use CV to assess performance
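To show how a LASSO-type penalty sets coefficients to exactly 0, here is a minimal sketch of the soft-thresholding operator applied to one component's loadings. The function name, the loadings, and the penalty value are hypothetical; in a real analysis the amount of sparsity is the parameter tuned by cross-validation.

```python
def soft_threshold(loadings, penalty):
    """Shrink each loading toward 0 by `penalty`; loadings smaller than it become 0."""
    shrunk = []
    for w in loadings:
        if abs(w) > penalty:
            sign = 1.0 if w > 0 else -1.0
            shrunk.append(sign * (abs(w) - penalty))
        else:
            shrunk.append(0.0)
    return shrunk


# Hypothetical loadings of five features on one component
loadings = [0.80, -0.05, 0.30, 0.02, -0.60]
sparse = soft_threshold(loadings, penalty=0.1)
selected = [j for j, w in enumerate(sparse) if w != 0.0]
print(sparse, selected)
```

Features 1 and 3 drop out of the component entirely, so the model is easier to interpret: only the selected features contribute to separating the classes.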

DIABLO

Data Integration Analysis for Biomarker discovery using Latent variable approaches for Omics studies
sPLS-DA analysis that integrates data from multiple omics sources

Differential expression analysis

Used to determine which genes are upregulated (more RNA transcribed) and which are downregulated (less RNA transcribed)
Separate statistical comparison for each gene, corrected for multiple comparisons
Implemented in R package DESeq2
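The multiple-comparisons step can be illustrated with the Benjamini-Hochberg adjustment, which controls the false discovery rate across the per-gene tests. This Python sketch covers only the correction, not the per-gene statistical model, and the p-values are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, one per gene
p_values = [0.001, 0.30, 0.02, 0.60, 0.008]
adjusted = benjamini_hochberg(p_values)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
print(adjusted, significant)
```

With five tests, genes 0, 2, and 4 remain significant at a 5% false discovery rate, while the adjusted values for the other two are well above the cutoff.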

Let’s analyze some multiomics data!