A smörgåsbord of options for multiomics data analysis

Quentin D. Read

Who is this talk for?

  • People who have multiomics data but are not experienced in analyzing it
  • I am not an expert in this topic so I am talking more conceptually
  • I will provide a “smörgåsbord” of options for you to consider

Talk outline

  • Resources to learn more about multiomics analysis
  • Basic conceptual overview of multiomics data integration methods
  • Example analyses in R
    • Demo 1: MOFA analysis, including data preprocessing and cleaning steps
    • Demo 2: More advanced MOFA analysis on pre-processed data
    • Demo 3: sPLS-DA analysis

Resources for further learning about multiomics

ASA Short Course

mixOmics courses

Swedish National Bioinformatics Infrastructure courses

Multiomics Integration: overview of the concepts

What do multiomics data look like?

  • We have more than one omics dataset
    • Genome
    • Methylome
    • Metabolome
    • Transcriptome
    • Microbiome
  • Either organism-wide or single-cell
  • We have metadata about the organisms or cells
    • Experimental treatment
    • Individual-level variables (age, sex, body weight, etc.)
    • Individual-level outcomes (disease status, mortality, etc.)
    • Cell-level variables (tissue type, etc.)

What research questions can we answer?

  • What are the relationships between the different omics datasets?
  • How well do the omics datasets, individually or combined, predict outcomes?

Supervised or unsupervised?

  • Supervised: we have some kind of outcome variable (often disease status) and we want to find which combinations of features in the different omics datasets predict it
  • Unsupervised: we don’t have an outcome; we are just exploring relationships between the datasets

Descriptive or predictive? (within supervised methods)

  • Descriptive: we want to find weights for the variables that optimally separate the classes
    • many different criteria can be used to decide what constitutes “optimal”
  • Predictive: we want to predict the class of a new sample from its variables
    • construct a rule or classifier
    • diagnose predictive performance (sensitivity & specificity; AUC)
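
A minimal sketch of the performance diagnostics named above, using only the Python standard library. The labels, scores, and the 0.5 cutoff are made-up illustration values, and the function names are hypothetical, not taken from any package used in the demos.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random positive sample scores higher
    than a random negative sample (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical example: 3 diseased (1) and 3 healthy (0) samples
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]           # classifier scores
y_pred = [1 if s >= 0.5 else 0 for s in scores]    # assumed cutoff of 0.5

sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec, auc(y_true, scores))
```

Sensitivity and specificity depend on the chosen cutoff, while AUC summarizes performance across all cutoffs.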

What method we use depends on the size of the ’ome

  • Metabolome: kinda small (~10s-100s of features)
  • Microbiome, transcriptome: kinda big (~1000s of features)
  • Genome, methylome: huuuge (>>1000s of features)

p versus n

  • p: number of predictors/features
    • very variable
  • n: number of samples
    • in clinical/biomedical research often ~10-100 patients or animals

p versus n

  • \(p < n\): Bayesian methods
  • \(p \approx n\): Frequentist methods
  • \(p \gg n\): Deep learning methods

Platforms

Data cleaning and preprocessing

  • Remove features with all 0s or with little or no variance (they carry no information)
  • Remove samples or individuals that have too high a proportion of missing values
  • For individuals with a moderate number of missing values, there are many imputation methods
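
The cleaning steps above can be sketched in plain Python. This is only an illustration under stated assumptions: missing values are coded as `None`, the 50% missingness cutoff is arbitrary, imputation uses the feature mean (one of many possible methods), and samples are filtered before features so that variances are computed on the retained samples. The function name `clean` is hypothetical.

```python
from statistics import mean, pvariance


def clean(data, max_missing=0.5):
    """data: list of samples (rows); each row is a list of feature
    values, with None marking a missing value."""
    # 1. Drop samples whose proportion of missing values is too high
    kept = [row for row in data
            if sum(v is None for v in row) / len(row) <= max_missing]

    # 2. Drop features with no variance among the observed values
    keep_cols = []
    for j in range(len(kept[0])):
        observed = [row[j] for row in kept if row[j] is not None]
        if len(observed) > 1 and pvariance(observed) > 0:
            keep_cols.append(j)

    # 3. Impute remaining missing values with the feature mean
    col_means = {j: mean(row[j] for row in kept if row[j] is not None)
                 for j in keep_cols}
    cleaned = [[row[j] if row[j] is not None else col_means[j]
                for j in keep_cols] for row in kept]
    return cleaned, keep_cols


# Hypothetical toy data: 4 samples x 4 features
data = [
    [1.0, 0.0, 5.0, 2.0],
    [2.0, 0.0, None, 4.0],
    [None, None, None, 1.0],   # 75% missing, so this sample is dropped
    [3.0, 0.0, 7.0, 6.0],
]
cleaned, keep_cols = clean(data)
print(keep_cols)  # indices of the features that were retained
print(cleaned)
```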

Exploratory data analysis

  • Simple methods such as PCA should be used as a first step, basically as a visualization
  • Understand structure of the data
  • Pick out any biases or errors in the data
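
To make the PCA step concrete, here is a small sketch that finds the first principal component by power iteration on the covariance matrix, using only the standard library. It is meant to show the idea, not to replace a real PCA routine; the toy data and the function name `first_pc` are assumptions for illustration.

```python
import math


def first_pc(data, n_iter=200):
    """Return (loadings, scores, proportion of variance explained)
    for the first principal component of a samples x features list."""
    n, p = len(data), len(data[0])
    # Center each feature on its mean
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in data]
    # Sample covariance matrix (p x p)
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1)
            for b in range(p)] for a in range(p)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the starting vector is not orthogonal to the first PC.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance along v, relative to the total variance
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                     for a in range(p))
    total_var = sum(cov[a][a] for a in range(p))
    # Project each centered sample onto the component
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    return v, scores, eigenvalue / total_var


# Hypothetical toy data: two features that are almost perfectly correlated
data = [[1, 2], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.9]]
loadings, scores, explained = first_pc(data)
print(loadings, explained)
```

Plotting the scores for the first two components, colored by treatment or batch, is a quick way to spot outliers and batch effects.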

Multiomics Integration: a sampling of techniques

Methods we will demo today

  • MOFA (multiomics factor analysis; unsupervised)
  • sPLS-DA (sparse partial least squares discriminant analysis; supervised)

Other techniques we won’t demo today

  • Machine learning-based dimension reduction/visualization methods
  • Deep learning/neural networks
  • Network analysis
  • There are more …

ML dimension reduction & visualization methods

  • Like PCA on steroids: dimension reduction using a machine-learning-style approach
  • Can be used to create a consensus mapping that integrates multiple omics datasets

Deep learning methods

from Introduction to artificial intelligence for cardiovascular clinicians, Chong & Limon
  • Can greatly outperform other methods, but only on very large datasets
  • Essentially dimension reduction followed by clustering
  • Very good at picking out nonlinear patterns
  • Can be paired with ML visualization method

Network analysis methods

Price et al. 2017, Nature Biotechnology
  • Generate network graph (k-nearest neighbors or other methods)
  • Similarity network fusion to combine networks from different omics into a single fused network
  • Use network to identify important features and their relationships
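
As a rough illustration of the first step only, here is a k-nearest-neighbors graph built from Euclidean distances between samples, using the standard library (`math.dist` requires Python 3.8 or later). The data and the function name `knn_graph` are hypothetical, and this sketch does not implement similarity network fusion.

```python
import math


def knn_graph(samples, k):
    """Map each sample index to the indices of its k nearest
    neighbors by Euclidean distance (ties broken by index)."""
    graph = {}
    for i, a in enumerate(samples):
        dists = sorted((math.dist(a, b), j)
                       for j, b in enumerate(samples) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Hypothetical toy data: two well-separated groups of samples
samples = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11]]
graph = knn_graph(samples, k=1)
print(graph)
```

In a multiomics setting one such graph would be built per omics dataset before the graphs are combined.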

Multiomics factor analysis (MOFA)

See Argelaguet et al. 2018, Molecular Systems Biology

  • Hypothesis-free data exploration framework
  • Extracts common axes of variation from multiple omics layers
  • Infers low-dimensional data representation in terms of hidden factors
  • Can impute missing values in the process
  • Visualize low-dimensional factors and interpret them biologically
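
In rough terms, the factor model described in the MOFA paper decomposes each omics data matrix into shared factors and dataset-specific weights. The notation below is a sketch: \(\mathbf{Y}^{m}\) is the samples-by-features matrix for omics dataset \(m\), \(\mathbf{Z}\) holds the hidden factors shared across all datasets, \(\mathbf{W}^{m}\) holds the feature weights for dataset \(m\), and \(\boldsymbol{\varepsilon}^{m}\) is residual noise.

\[
\mathbf{Y}^{m} = \mathbf{Z}\,\mathbf{W}^{m\top} + \boldsymbol{\varepsilon}^{m}, \qquad m = 1, \dots, M
\]

Because \(\mathbf{Z}\) is shared, each factor can capture variation common to several omics layers or specific to one.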

Partial least squares discriminant analysis (PLS-DA)

See Lê Cao et al. 2011, BMC Bioinformatics

  • Supervised classification method
  • Choose number of groups \(k\)
  • Maximize variability between groups, minimize variability within groups

Sparse PLS-DA (sPLS-DA)

  • Combination of sparse PCA and PLS-DA
  • First do variable selection: a LASSO penalty shrinks some variable coefficients to exactly 0 on each component
  • Choose parameters with cross-validation
  • Train model
  • Predict on independent test data or use CV to assess performance
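
The LASSO-style variable selection can be illustrated with the soft-thresholding operator, which shrinks every loading toward zero and sets small ones to exactly zero. This is a sketch of the idea only, not the full sPLS-DA algorithm; the loadings and the penalty value are made up, and in practice the penalty (or the number of variables to keep) would be chosen by cross-validation.

```python
def soft_threshold(w, lam):
    """Shrink a single loading toward zero by lam;
    loadings with |w| <= lam become exactly 0."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0


# Hypothetical loadings of 5 features on one component
loadings = [0.8, -0.05, 0.3, 0.1, -0.6]
lam = 0.2  # assumed penalty; larger values select fewer features

sparse = [soft_threshold(w, lam) for w in loadings]
selected = [i for i, w in enumerate(sparse) if w != 0.0]
print(sparse)
print(selected)  # indices of the features that remain in the model
```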

Let’s analyze some multiomics data!