A smörgåsbord of options for multiomics data analysis
Who is this talk for?
- People who have multiomics data but are not experienced in analyzing it
- I am not an expert in this topic so I am talking more conceptually
- I will provide a “smörgåsbord” of options for you to consider
Talk outline
- Resources to learn more about multiomics analysis
- Basic conceptual overview of multiomics data integration methods
- Example analyses in R
- Demo 1: MOFA analysis, including data preprocessing and cleaning steps
- Demo 2: More advanced MOFA analysis on pre-processed data
- Demo 3: sPLS-DA analysis
Resources for further learning about multiomics
ASA Short Course
- Short course by the American Statistical Association Section on Statistics in Genomics and Genetics
mixOmics courses
Multiomics Integration: overview of the concepts
What do multiomics data look like?
- We have more than one omics dataset
- Genome
- Methylome
- Metabolome
- Transcriptome
- Microbiome
- Either organism-wide or single-cell
- We have metadata about the organisms or cells
- Experimental treatment
- Individual-level variables (age, sex, body weight, etc.)
- Individual-level outcomes (disease status, mortality, etc.)
- Cell-level variables (tissue type, etc.)
What research questions can we answer?
- What are the relationships between the different omics datasets?
- How well do the omics datasets, individually or combined, predict outcomes?
Supervised or unsupervised?
- Supervised: we have some kind of outcome variable (often disease status) and we are trying to figure out which combinations of features in the different omics datasets predict it
- Unsupervised: we don’t have an outcome; we are just exploring relationships between the datasets
Descriptive or predictive? (within supervised methods)
- Descriptive: we want to find weights for the variables to optimally separate the classes
- many different criteria can be used to decide what constitutes “optimal”
- Predictive: we want to predict the class of a new sample from its variables
- construct a rule or classifier
- diagnose predictive performance (sensitivity & specificity; AUC)
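As a rough illustration of the diagnostics named above, here is a minimal sketch in plain Python (standard library only). The function names, the 0/1 label coding, and the toy labels and scores are assumptions made up for this example; they do not come from any of the packages used in the demos.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (true positive rate) and specificity (true negative rate)
    for labels coded 1 = case, 0 = control."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a randomly chosen case scores higher
    than a randomly chosen control (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true classes and classifier scores for six samples.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # classify at a 0.5 cutoff

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
```

Sensitivity and specificity depend on the chosen cutoff (0.5 here), whereas AUC summarizes performance across all cutoffs.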
What method we use depends on the size of the ’ome
- Metabolome: kinda small (~10s-100s of features)
- Microbiome, transcriptome: kinda big (~1000s of features)
- Genome, methylome: huuuge (>>1000s of features)
p versus n
- p: number of predictors/features
- n: number of samples
- in clinical/biomedical research often ~10-100 patients or animals
p versus n
- \(p < n\): Bayesian methods
- \(p \approx n\): Frequentist methods
- \(p \gg n\): Deep learning methods
Data cleaning and preprocessing
- Remove features with all 0s or with little or no variance (they carry no information)
- Remove samples or individuals that have too high a proportion of missing values
- For individuals with a moderate number of missing values, there are many imputation methods
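The three cleaning steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `clean_omics`, the thresholds, the use of `None` for missing values, and the toy matrix are all made up for this example, and mean imputation stands in for the many imputation methods available (it is the simplest, not necessarily the best). This sketch drops samples before filtering features.

```python
from statistics import mean, pvariance


def clean_omics(matrix, max_missing_frac=0.5, min_variance=1e-8):
    """matrix: list of samples (rows), each a list of feature values,
    with None marking a missing value."""
    n_features = len(matrix[0])

    # 1. Drop samples with too high a proportion of missing values.
    kept_rows = [row for row in matrix
                 if sum(v is None for v in row) / n_features <= max_missing_frac]

    # 2. Drop features that are all zero or otherwise (nearly) constant.
    kept_cols = []
    for j in range(n_features):
        observed = [row[j] for row in kept_rows if row[j] is not None]
        if len(observed) >= 2 and pvariance(observed) > min_variance:
            kept_cols.append(j)

    # 3. Impute the remaining missing values with the feature mean.
    col_means = {j: mean(row[j] for row in kept_rows if row[j] is not None)
                 for j in kept_cols}
    cleaned = [[row[j] if row[j] is not None else col_means[j]
                for j in kept_cols]
               for row in kept_rows]
    return cleaned, kept_cols


# Hypothetical toy matrix: 4 samples x 4 features.
data = [
    [1.0, 0.0, 5.0, 2.0],
    [2.0, 0.0, None, 2.0],
    [None, None, None, 2.0],  # 3 of 4 values missing: sample is dropped
    [3.0, 0.0, 7.0, 2.0],
]
cleaned, kept_cols = clean_omics(data)
```

In this toy example, the all-zero feature and the constant feature are removed, the mostly-missing sample is dropped, and the single remaining gap is filled with the feature mean.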
Exploratory data analysis
- Simple methods such as PCA should be used as a first step, basically as a visualization
- Understand structure of the data
- Pick out any biases or errors in the data
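To make the PCA step concrete, here is a minimal sketch that finds the first principal component by power iteration, using only the Python standard library. The function name and toy data are assumptions for illustration; in a real analysis you would use an established PCA routine and usually scale the features first.

```python
import math


def first_principal_component(X, n_iter=200):
    """Return (loadings, scores) for the first principal component of X,
    where X is a list of samples (rows) by features (columns)."""
    n, p = len(X), len(X[0])

    # Center each feature at zero.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]

    # Power iteration: repeatedly apply Xc^T Xc to w and renormalize.
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(n_iter):
        scores = [sum(x * wj for x, wj in zip(row, w)) for row in Xc]
        w_new = [sum(Xc[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w_new))
        w = [v / norm for v in w_new]

    # Project the centered samples onto the final loading vector.
    scores = [sum(x * wj for x, wj in zip(row, w)) for row in Xc]
    return w, scores


# Hypothetical toy data: feature 2 is roughly twice feature 1, feature 3 is noise.
X = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.6],
    [4.0, 8.2, 0.5],
    [5.0, 9.9, 0.5],
]
loadings, pc1_scores = first_principal_component(X)
```

Plotting the scores (here `pc1_scores`, ideally together with a second component) colored by batch or treatment is the kind of quick visualization that helps reveal structure, biases, or errors.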
Multiomics Integration: a sampling of techniques
Methods we will demo today
- MOFA (multiomics factor analysis; unsupervised)
- sPLS-DA (sparse partial least squares discriminant analysis; supervised)
Other techniques we won’t demo today
- Machine learning-based dimension reduction/visualization methods
- Deep learning/neural networks
- Network analysis
- There are more …
ML dimension reduction & visualization methods
- PCA on steroids: these methods use a machine-learning-style approach
- Can be used to create a consensus mapping that integrates multiple omics datasets
Deep learning methods
- Far better performance, but only on very large datasets
- Essentially dimension reduction followed by clustering
- Very good at picking out nonlinear patterns
- Can be paired with ML visualization method
Network analysis methods
- Generate network graph (k-nearest neighbors or other methods)
- Similarity network fusion to combine networks from different omics into a single fused network
- Use network to identify important features and their relationships
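The first step above, building a k-nearest-neighbors graph from one omics dataset, might look like the sketch below. It uses only the Python standard library; the function name, the choice of Euclidean distance, and the toy coordinates are assumptions for illustration. It does not implement similarity network fusion, which would combine one such network per omics layer.

```python
import math


def knn_graph(X, k=1):
    """Map each sample index to the indices of its k nearest neighbors,
    using Euclidean distance between feature vectors."""
    graph = {}
    for i, xi in enumerate(X):
        # Sort all other samples by distance (ties broken by index).
        dists = sorted((math.dist(xi, xj), j)
                       for j, xj in enumerate(X) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


# Hypothetical toy data: two well-separated groups of samples.
X = [[0, 0], [0, 1], [5, 5], [5, 6], [6, 5]]
graph = knn_graph(X, k=1)
```

In the resulting graph, samples 0 and 1 point to each other and samples 2 to 4 connect only among themselves, so the two groups show up as separate components.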
Multiomics factor analysis (MOFA)
See Argelaguet et al. 2018, Molecular Systems Biology
- Hypothesis-free data exploration framework
- Extracts common axes of variation from multiple omics layers
- Infers low-dimensional data representation in terms of hidden factors
- Can impute missing values in the process
- Visualize low-dimensional factors and interpret them biologically
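For orientation, the factor model behind these bullets can be written as a matrix decomposition. The notation here is chosen for this summary: \(m\) indexes the omics layers, \(n\) is the number of samples, \(p_m\) is the number of features in layer \(m\), and \(K\) is the number of hidden factors.

\[
\mathbf{Y}^{m} = \mathbf{Z}\,\mathbf{W}^{m\top} + \boldsymbol{\varepsilon}^{m}, \qquad m = 1, \dots, M
\]

Here \(\mathbf{Y}^{m}\) is the \(n \times p_m\) data matrix for layer \(m\), \(\mathbf{Z}\) is the \(n \times K\) matrix of factor values shared across all layers, \(\mathbf{W}^{m}\) is the \(p_m \times K\) matrix of layer-specific feature weights, and \(\boldsymbol{\varepsilon}^{m}\) is residual noise. Because \(\mathbf{Z}\) is shared, the factors capture the common axes of variation, while the weights show which features in each layer drive each factor.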
Partial least squares discriminant analysis (PLS-DA)
See Lê Cao et al. 2011, BMC Bioinformatics
- Supervised classification method
- Choose number of groups \(k\)
- Maximize variability between groups, minimize variability within groups
Sparse PLS-DA (sPLS-DA)
- Combination of sparse PCA and PLS-DA
- Variable selection: a LASSO penalty shrinks some variable coefficients to exactly 0 on each component
- Choose parameters with cross-validation
- Predict on independent test data or use CV to assess performance
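To show how a LASSO-type penalty produces sparse loadings, here is a simplified sketch of a single sparse component for a two-class outcome, in plain Python. It is a sketch of the idea under stated assumptions, not the implementation used in the demo: the function names, the 0/1 class coding, the penalty value `lam`, and the toy data are made up for illustration, and in practice the amount of sparsity would be tuned by cross-validation rather than fixed by hand.

```python
import math


def soft_threshold(v, lam):
    """LASSO soft-thresholding: shrink v toward 0 by lam, and set it to
    exactly 0 if |v| <= lam."""
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0


def sparse_plsda_component(X, y, lam):
    """First sparse loading vector for a two-class outcome y (coded 0/1).
    X is a list of samples (rows) by features (columns)."""
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n

    # Cross-product of each centered feature with the centered class indicator.
    cov = [sum((X[i][j] - x_means[j]) * (y[i] - y_mean) for i in range(n))
           for j in range(p)]

    # Penalize: weakly associated features get a loading of exactly 0.
    shrunk = [soft_threshold(c, lam) for c in cov]
    norm = math.sqrt(sum(s * s for s in shrunk))
    if norm == 0:
        raise ValueError("lam is too large: every loading was shrunk to zero")
    return [s / norm for s in shrunk]


# Hypothetical toy data: only the first feature differs between the classes.
X = [
    [1.0, 5.0, 0.2],
    [1.2, 5.1, 0.9],
    [3.0, 4.9, 0.1],
    [3.2, 5.0, 0.8],
]
y = [0, 0, 1, 1]
loadings = sparse_plsda_component(X, y, lam=0.5)
```

With this penalty, only the first feature keeps a nonzero loading; the two features that barely differ between classes are dropped from the component.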
Let’s analyze some multiomics data!