Analyzing ordered categorical phenotypes: challenges and pitfalls

Quentin D. Read

Who is this talk for?

  • Anyone who works with categorical phenotypes
  • Hopefully you have a little experience with statistics
  • It’s OK if you do more “applied” stats and don’t have a solid background in statistical theory … I don’t either!

Applied statistics

Talk outline

  • What is categorical phenotype data, and what are its advantages and disadvantages?
  • How should we analyze categorical data?
  • Practical recommendations

Image (C) Maria Gusarova

What is categorical phenotype data?

  • It may be ordered or unordered
    • Unordered: color, etc. (not the focus of this talk)
    • Ordered: low/medium/high, disease score from 1-5, etc.

Image (C) Allison Horst

Why do we collect categorical phenotype data?

  • If the “true” quantitative variable is very time-consuming to measure, categorical scoring lets you phenotype many more individuals in the same amount of time
    • Example: counting the number of hairs on a leaf vs. visually assigning low/medium/high
  • Or there is no easily determined single quantitative variable, so we have no choice

Image (C) Alina Majcen

Categorical data loses information

  • Even if the visual scoring is perfect, information is lost due to rounding
true count   category
1            ≤ 10
3            ≤ 10
12           11-20
19           11-20
24           21-50
55           51-100
369          > 100

Categorical data loses information, continued

  • Visual scoring is never perfect
  • There is always some additional error from individuals being assigned to the ‘wrong’ category
  • There may be differences among raters and rating methods in both accuracy and systematic bias
  • These two different sources of error are rarely explicitly accounted for
true count   category, estimated without error   category, estimated with error
1            ≤ 10                                ≤ 10
3            ≤ 10                                ≤ 10
12           11-20                               ≤ 10
19           11-20                               11-20
24           21-50                               21-50
55           51-100                              21-50
369          > 100                               > 100

How do we analyze categorical phenotype data?

Option 1: Treat categorical variable as continuous

  • People often treat the categorical score as a continuous numerical value from 1 ... n and fit a simple linear model
  • Or if the categories are binned values, people take the midpoints of the bins and treat them as continuous variables
bin     midpoint
1       1.0
2-5     3.5
6-10    8.0
11-20   15.5
21-50   35.5
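
To make the midpoint approach concrete, here is a minimal R sketch. It assumes a hypothetical data frame dat with a bin column holding the labels above and a genotype factor; the lookup table and model are illustrative only.

    # Map each bin label to its midpoint (hypothetical data frame 'dat'
    # with columns 'bin' and 'genotype')
    midpoints <- c("1" = 1.0, "2-5" = 3.5, "6-10" = 8.0, "11-20" = 15.5, "21-50" = 35.5)
    dat$midpoint <- midpoints[as.character(dat$bin)]

    # Then treat the midpoint as a continuous response in a simple linear model
    fit_lm <- lm(midpoint ~ genotype, data = dat)
    summary(fit_lm)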

Option 1: Treat categorical variable as continuous

  • For example, disease is scored from 1-5, where 1 = no disease and 5 = dead of the disease

“Genotype A has a mean disease score of 1.67 (95% CI [1.35, 2.02]) and Genotype B has a mean of 3.55 (95% CI [3.15, 3.96]).”

Option 1: Treat categorical variable as continuous

  • I call this the “quick and dirty way”

Quick and dirty way: the downsides

  • The estimates you get when you assume a categorical variable is continuous seem precise and quantitative
  • This masks the fact that the numbers are basically arbitrary and the interpretation may not make sense
  • Does it make sense to say that an individual with disease score 5 has “5 times as much disease” as an individual with disease score 1?
  • And what if we had decided to make a disease score classification from 1-9 instead?
  • Now a genotype with mean disease score 9 has “9 times as much disease” as one with mean 1. But it’s the same observations!

Quick and dirty way: the downsides

  • It also forces us to assume that going from 1->2 represents exactly the same amount of change as going from 2->3, 3->4, …
  • This is an especially bad assumption for things like disease scores where 1=no disease and 2 and above=disease
  • 1->2 could be vastly more important than 2->3, 3->4, …, which are just different levels of disease

Quick and dirty way: the upside

  • Continuous model is easy to fit and often gives acceptable results, even though it violates a lot of assumptions
  • Usually fine to use in preliminary screening phase
  • A simulation study showed that it usually gives predictions similar to those of the ordinal model (Azevedo et al. 2024, Theor. Appl. Gen.)
  • Acceptable as long as the categorical variable is estimating a true underlying continuous variable (e.g., disease intensity)
  • Lumping multiple things together may give strange results (e.g., a scale that mixes presence/absence of a disease with disease intensity)

Option 2: Account for the ordered categorical nature of the data

Image (C) Getty Images

Binomial distribution

Image (C) ICMA Photos
  • Many categorical phenotype datasets are binary
  • Binomial distribution has one parameter, \(\theta\), the probability of a “success”
  • Define one outcome as a success (1) and the other as a failure (0)
  • Example: the outcomes of flipping a coin many times are distributed binomially with \(\theta = 0.5\)
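
A one-line R illustration of that example (simulated data, just for intuition):

    # Simulate 1000 fair coin flips: each flip is one binomial trial with theta = 0.5
    flips <- rbinom(n = 1000, size = 1, prob = 0.5)
    mean(flips)  # should be close to 0.5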

How to analyze binary categorical data?

  • A linear model requires the error, or residuals, to be normally distributed
  • Residuals can’t be normally distributed if the data are all 0s and 1s
  • Normal distribution (bell curve) is symmetrical and can take any \(x\) value

How to analyze binary categorical data?

  • We model the probability of a success on the logit (log-odds) scale because log-odds can take any value from \(-\infty\) to \(+\infty\)
  • The relationship is approximately linear for probabilities between ~0.25 and ~0.75, and steeper closer to 0 and 1
  • \(\text{logit}(y_{ij}) = \mu + G_i + E_j + GE_{ij} + \epsilon_{ij}\)
  • Probit can be used instead of logit but the outcome is similar
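
A minimal R sketch of this kind of model, assuming a hypothetical data frame dat with a 0/1 column diseased and factors genotype and env:

    # Logistic regression: genotype, environment, and their interaction
    fit_logit <- glm(diseased ~ genotype * env, family = binomial(link = "logit"), data = dat)

    # Probit version: same model, different link function
    fit_probit <- glm(diseased ~ genotype * env, family = binomial(link = "probit"), data = dat)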

Multinomial distribution

Image (C) Diacritica
  • Now instead of flipping a coin, we’re rolling a die
  • More than two outcomes, each with a probability of occurring
  • For a fair six-sided die, the probabilities are all 1/6
  • But they don’t have to be equal, they just have to sum to 1
  • There are \(n-1\) parameters, where \(n\) is the number of outcomes (classes)
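
For intuition, rolling a fair die in R (simulated data only):

    # 600 rolls of a fair six-sided die: a multinomial with six equally likely outcomes
    rolls <- sample(1:6, size = 600, replace = TRUE)
    table(rolls)  # each face should come up roughly 100 times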

Ordered multinomial distribution

  • Response distribution for an ordered multinomial is a cumulative probability distribution
    • If we have \(n\) classes, there are \(n-1\) probabilities
  • Example: For 4 classes, the probabilities are \(P(1 \text{ or less}) \leq P(2 \text{ or less}) \leq P(3 \text{ or less})\)
    • \(P(4 \text{ or less}) = 1\) so we don’t need a parameter for that
  • The spacing between class thresholds can be constrained to be the same for each transition between classes, or allowed to differ
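
A small worked example in R with made-up probabilities for the 4-class case above:

    # Hypothetical class probabilities for 4 classes (must sum to 1)
    p_class <- c(0.40, 0.30, 0.20, 0.10)
    p_cum <- cumsum(p_class)  # P(class <= k): 0.40 0.70 0.90 1.00
    p_cum[1:3]  # only the first n - 1 = 3 cumulative probabilities need parameters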

Ordered multinomial distribution, continued

  • A “cumulative logit” or “cumulative probit” link function is used to convert from cumulative probability scale to a scale we can analyze with a linear model
  • Extends the two-category binary model to any number of ordered categories
  • It is possible to include both fixed and random effects in these models
    • CLM (cumulative logistic model)
    • CLMM (cumulative logistic mixed model)
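
A minimal sketch of both with the ordinal package, assuming a hypothetical data frame dat with a score column, a genotype factor, and a block factor:

    library(ordinal)

    # Response must be an ordered factor
    dat$score <- factor(dat$score, ordered = TRUE)

    # CLM: fixed effects only
    fit_clm <- clm(score ~ genotype, data = dat)

    # CLMM: adds a random intercept for block
    fit_clmm <- clmm(score ~ genotype + (1 | block), data = dat)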

Cumulative logistic model: the downsides

  • It is a “data hungry” model because you need to have some representation of all the categories in all the treatments
    • 2.25× greater training population size needed compared to continuous phenotypes (Kizilkaya et al. 2014, Gen. Sel. Evol.)
  • This is especially true for mixed models because you also need representation of all the categories in all the environments (blocks)
  • Convergence of the model fitting algorithm is often a problem

Quasi-complete separation

  • If you have zeros for some categories, you end up with complete or quasi-complete separation
  • Some genotypes’ coefficients in the model can’t be estimated because those genotypes have no observations for one or more categories
  • This becomes more likely as the number of categories increases (a quick check is sketched below)

Figure from Schwendinger et al. 2021, Comp. Stat.
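
A quick diagnostic in R, continuing with the hypothetical dat from earlier: a zero cell in the genotype-by-category table is a warning sign.

    # Cross-tabulate genotypes against score categories
    table(dat$genotype, dat$score)
    # Any genotype with a zero in some category was never observed there,
    # so its coefficients may not be estimable (quasi-complete separation)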

Making predictions in cumulative logistic model

  • We can make several different kinds of predictions from a CLM
  • Probabilities, or cumulative probabilities, for each class within each genotype
  • These are more honest but not as simple to interpret

Making predictions in cumulative logistic model

  • Mean class predictions (weighted average by the probability of each class for each genotype)
  • Similar-looking result as the quick and dirty way but better because the underlying model doesn’t assume it’s a continuous variable
  • Hypothesis tests can be done on either kind of comparison
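
A sketch of both kinds of predictions from the fit_clm object above, assuming a hypothetical prediction grid newdat that contains the predictors only:

    # Class probabilities for each row of the prediction grid
    probs <- predict(fit_clm, newdata = newdat, type = "prob")$fit

    # Mean class: weighted average of class indices, weighted by probability
    mean_class <- as.numeric(probs %*% seq_len(ncol(probs)))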

Bayesian CLM

  • You can incorporate prior distributions on the fixed and random effect parameters, and the threshold probabilities
  • This can allow you to get estimates even when there is complete separation and a classical statistical approach won’t converge (sketch below)

Image (C) Analytics Vidhya
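
One way to fit a Bayesian CLM is with the brms R package; a minimal sketch (hypothetical dat as before, illustrative prior):

    library(brms)

    # Cumulative logit model with a weakly informative prior on the
    # genotype coefficients; the prior regularizes the estimates even
    # under complete separation
    fit_bayes <- brm(
      score ~ genotype + (1 | block),
      data = dat,
      family = cumulative(link = "logit"),
      prior = prior(normal(0, 3), class = "b")
    )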

Machine learning approaches

  • Ordered categorical outcomes are supported by a lot of machine learning models such as random forest
  • Regularization in machine learning models works much like Bayesian priors, and can fix the quasi-complete separation issue (sketch below)

Image (C) MathWorks
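
For example, the ordinalForest R package fits a random forest that respects the category ordering; a sketch, assuming the same hypothetical dat with score as an ordered factor:

    library(ordinalForest)

    # Fit an ordinal forest; depvar names the ordered factor response
    fit_rf <- ordfor(depvar = "score", data = dat)
    preds <- predict(fit_rf, newdata = dat)  # preds$ypred holds predicted classes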

Software implementations

  • R packages
    • ordinal (includes mixed models)
    • MASS::polr (does not support mixed models)
  • SAS procedures
    • proc glimmix
    • proc logistic
  • Software for GWAS with ordinal traits is widely available
    • BGLR (R package)
    • ASReml
    • GenSel
    • OrdinalGWAS.jl (Julia)

Practical recommendations: deciding on number of categories

  • Number of categories should ideally be 3 to 5
    • Binary loses too much information, and greater than 5 is too error-prone and data-hungry
    • However: you can always lump categories together later, but you can’t split them!
  • Category thresholds should be validated to minimize disagreement between raters

Practical recommendations: analyzing the data

  • Treating it as a continuous variable may be a decent assumption in some contexts
  • Especially when it is an approximation of a true underlying continuous value
  • But it depends on your goal: prediction or inference?
    • Prediction performance is still good with the continuous model but it is not as good at recovering genetic parameters, e.g. heritability, as the ordinal regression model (Azevedo et al. 2024)

Practical recommendations: analyzing the data

  • If convergence or quasi-complete separation is an issue, collapse many categories into fewer
  • Going down to as few as 2 categories may help you get a working model
  • The best solution (I strongly recommend it!) is to move to a Bayesian model or an ML model with regularization
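
Collapsing categories, as recommended above, is a one-liner in R; a sketch assuming scores stored as integers 1-5 in a hypothetical data frame dat:

    # Collapse a 1-5 score into 3 ordered categories: 1-2, 3, 4-5
    dat$score3 <- cut(dat$score, breaks = c(0, 2, 3, 5),
                      labels = c("low", "medium", "high"), ordered_result = TRUE)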

Approach with caution

  • “All models are wrong, but some are useful” (George Box)
  • No matter what, some kind of assumption will have to be made
    • Treating categorical variable like a number
    • Lumping together categories
    • Including prior distributions on the parameters or regularization

Further reading