Analyzing ordered categorical phenotypes: challenges and pitfalls

Quentin D. Read

Who is this talk for?

  • Anyone who works with categorical phenotypes
  • Hopefully you have a little experience with statistics
  • It’s OK if you do more “applied” stats and don’t have a solid background in statistical theory … I don’t either!

Applied statistics

Talk outline

  • What is categorical phenotype data, and what are its advantages and disadvantages?
  • How should we analyze categorical data?
  • Practical recommendations

Image (C) Maria Gusarova

What is categorical phenotype data?

  • It may be ordered or unordered
    • Unordered: color, etc. (not the focus of this talk)
    • Ordered: low/medium/high, disease score from 1-5, etc.

Image (C) Allison Horst

Why do we collect categorical phenotype data?

  • If the “true” quantitative variable is very time-consuming to measure, categorical scoring lets you phenotype many more individuals in the same amount of time
    • Example: counting the number of hairs on a leaf vs. visually assigning low/medium/high
  • Or there is no easily determined single quantitative variable, so we have no choice

Image (C) Alina Majcen

Categorical data loses information

  • Even if the visual scoring is perfect, information is lost due to rounding
true count   category
1            ≤ 10
3            ≤ 10
12           11-20
19           11-20
24           21-50
55           51-100
369          > 100

Categorical data loses information, continued

  • Visual scoring is never perfect
  • There is always some additional error from individuals being assigned to the ‘wrong’ category
  • There may be differences among raters and rating methods in both accuracy and systematic bias
  • These two different sources of error are rarely explicitly accounted for
true count   category, estimated without error   category, estimated with error
1            ≤ 10                                ≤ 10
3            ≤ 10                                ≤ 10
12           11-20                               ≤ 10
19           11-20                               11-20
24           21-50                               21-50
55           51-100                              21-50
369          > 100                               > 100

How do we analyze categorical phenotype data?

Option 1: Treat categorical variable as continuous

  • People often treat the categorical score as a continuous numerical value from 1 ... n and fit a simple linear model
  • Or if the categories are binned values, people take the midpoints of the bins and treat them as continuous variables
bin     midpoint
1       1.0
2-5     3.5
6-10    8.0
11-20   15.5
21-50   35.5
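
To make the midpoint approach concrete, here is a minimal R sketch. It assumes a hypothetical data frame dat with a bin column holding the labels above and a genotype factor; the lookup table and model are illustrative only.

    # Map each bin label to its midpoint (hypothetical data frame 'dat'
    # with columns 'bin' and 'genotype')
    midpoints <- c("1" = 1.0, "2-5" = 3.5, "6-10" = 8.0, "11-20" = 15.5, "21-50" = 35.5)
    dat$midpoint <- midpoints[as.character(dat$bin)]

    # Then treat the midpoint as a continuous response in a simple linear model
    fit_lm <- lm(midpoint ~ genotype, data = dat)
    summary(fit_lm)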

Option 1: Treat categorical variable as continuous

  • For example, disease is scored from 1-5, where 1 = no disease and 5 = dead of the disease

“Genotype A has a mean disease score of 1.67 (95% CI [1.35, 2.02]) and Genotype B has a mean of 3.55 (95% CI [3.15, 3.96]).”

Option 1: Treat categorical variable as continuous

  • I call this the “quick and dirty way”

Quick and dirty way: the downsides

  • The estimates you get when you assume a categorical variable is continuous seem precise and quantitative
  • This masks the fact that the numbers are basically arbitrary and the interpretation may not make sense
  • Does it make sense to say that an individual with disease score 5 has “5 times as much disease” as an individual with disease score 1?
  • And what if we had decided to make a disease score classification from 1-9 instead?
  • Now a genotype with mean disease score 9 has “9 times as much disease” as one with mean 1. But it’s the same observations!

Quick and dirty way: the downsides

  • It also forces us to assume that going from 1->2 represents exactly the same amount of change as going from 2->3, 3->4, …
  • This is an especially bad assumption for things like disease scores where 1=no disease and 2 and above=disease
  • 1->2 could be vastly more important than 2->3, 3->4, …, which are just different levels of disease

Quick and dirty way: the upside

  • Continuous model is easy to fit and often gives acceptable results, even though it violates a lot of assumptions
  • Usually fine to use in preliminary screening phase
  • A simulation study showed that it usually gives predictions similar to those of the ordinal model (Azevedo et al. 2024, Theor. Appl. Gen.)
  • Acceptable as long as the categorical variable is estimating a true underlying continuous variable (e.g., disease intensity)
  • Lumping multiple things together may give strange results (e.g., a scale that mixes presence/absence of a disease with disease intensity)

Option 2: Account for the ordered categorical nature of the data

Image (C) Getty Images

Binomial distribution

Image (C) ICMA Photos
  • Many categorical phenotype datasets are binary
  • Binomial distribution has one parameter, \(\theta\), the probability of a “success”
  • Define one outcome as a success (1) and the other as a failure (0)
  • Example: the outcomes of flipping a coin many times are distributed binomially with \(\theta = 0.5\)
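
A one-line R illustration of that example (simulated data, just for intuition):

    # Simulate 1000 fair coin flips: each flip is one binomial trial with theta = 0.5
    flips <- rbinom(n = 1000, size = 1, prob = 0.5)
    mean(flips)  # should be close to 0.5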

How to analyze binary categorical data?

  • A linear model requires the error, or residuals, to be normally distributed
  • Residuals can’t be normally distributed if the data are all 0s and 1s
  • Normal distribution (bell curve) is symmetrical and can take any \(x\) value

How to analyze binary categorical data?

  • We model the probability of a success on the logit (log-odds) scale because log-odds can take any value from \(-\infty\) to \(+\infty\)
  • The relationship is approximately linear for probabilities between ~0.25 and ~0.75, and steeper closer to 0 and 1
  • \(\text{logit}(y_{ij}) = \mu + G_i + E_j + GE_{ij} + \epsilon_{ij}\)
  • Probit can be used instead of logit but the outcome is similar
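
A minimal R sketch of this kind of model, assuming a hypothetical data frame dat with a 0/1 column diseased and factors genotype and env:

    # Logistic regression: genotype, environment, and their interaction
    fit_logit <- glm(diseased ~ genotype * env, family = binomial(link = "logit"), data = dat)

    # Probit version: same model, different link function
    fit_probit <- glm(diseased ~ genotype * env, family = binomial(link = "probit"), data = dat)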

Multinomial distribution

Image (C) Diacritica
  • Now instead of flipping a coin, we’re rolling a die
  • More than two outcomes, each with a probability of occurring
  • For a fair six-sided die, the probabilities are all 1/6
  • But they don’t have to be equal, they just have to sum to 1
  • There are \(n-1\) parameters, where \(n\) is the number of outcomes (classes)
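
For intuition, rolling a fair die in R (simulated data only):

    # 600 rolls of a fair six-sided die: a multinomial with six equally likely outcomes
    rolls <- sample(1:6, size = 600, replace = TRUE)
    table(rolls)  # each face should come up roughly 100 times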

Ordered multinomial distribution

  • Response distribution for an ordered multinomial is a cumulative probability distribution
    • If we have \(n\) classes, there are \(n-1\) probabilities
  • Example: For 4 classes, the probabilities are \(P(1 \text{ or less}) \leq P(2 \text{ or less}) \leq P(3 \text{ or less})\)
    • \(P(4 \text{ or less}) = 1\) so we don’t need a parameter for that
  • The spacing between class thresholds can be constrained to be the same for each transition between classes, or allowed to differ
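
A small worked example in R with made-up probabilities for the 4-class case above:

    # Hypothetical class probabilities for 4 classes (must sum to 1)
    p_class <- c(0.40, 0.30, 0.20, 0.10)
    p_cum <- cumsum(p_class)  # P(class <= k): 0.40 0.70 0.90 1.00
    p_cum[1:3]  # only the first n - 1 = 3 cumulative probabilities need parameters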

Ordered multinomial distribution, continued

  • A “cumulative logit” or “cumulative probit” link function is used to convert from cumulative probability scale to a scale we can analyze with a linear model
  • Extends the two-category binary model to any number of ordered categories
  • It is possible to include both fixed and random effects in these models
    • CLM (cumulative logistic model)
    • CLMM (cumulative logistic mixed model)
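
A minimal sketch of both with the ordinal package, assuming a hypothetical data frame dat with a score column, a genotype factor, and a block factor:

    library(ordinal)

    # Response must be an ordered factor
    dat$score <- factor(dat$score, ordered = TRUE)

    # CLM: fixed effects only
    fit_clm <- clm(score ~ genotype, data = dat)

    # CLMM: adds a random intercept for block
    fit_clmm <- clmm(score ~ genotype + (1 | block), data = dat)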

Cumulative logistic model: the downsides

  • It is a “data hungry” model because you need to have some representation of all the categories in all the treatments
    • 2.25× greater training population size needed compared to continuous phenotypes (Kizilkaya et al. 2014, Gen. Sel. Evol.)
  • This is especially true for mixed models because you also need representation of all the categories in all the environments (blocks)
  • Convergence of the model fitting algorithm is often a problem

Quasi-complete separation

  • If you have zeros for some categories, you end up with complete or quasi-complete separation
  • Some genotypes’ coefficients in the model can’t be estimated because those genotypes have no observations for one or more categories
  • This becomes more likely as the number of categories increases (a quick check is sketched below)

Figure from Schwendinger et al. 2021, Comp. Stat.
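
A quick diagnostic in R, continuing with the hypothetical dat from earlier: a zero cell in the genotype-by-category table is a warning sign.

    # Cross-tabulate genotypes against score categories
    table(dat$genotype, dat$score)
    # Any genotype with a zero in some category was never observed there,
    # so its coefficients may not be estimable (quasi-complete separation)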

Making predictions in cumulative logistic model

  • We can make several different kinds of predictions from a CLM
  • Probabilities, or cumulative probabilities, for each class within each genotype
  • These are more honest but not as simple to interpret

Making predictions in cumulative logistic model

  • Mean class predictions (weighted average by the probability of each class for each genotype)
  • Similar-looking result as the quick and dirty way but better because the underlying model doesn’t assume it’s a continuous variable
  • Hypothesis tests can be done on either kind of comparison
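
A sketch of both kinds of predictions from the fit_clm object above, assuming a hypothetical prediction grid newdat that contains the predictors only:

    # Class probabilities for each row of the prediction grid
    probs <- predict(fit_clm, newdata = newdat, type = "prob")$fit

    # Mean class: weighted average of class indices, weighted by probability
    mean_class <- as.numeric(probs %*% seq_len(ncol(probs)))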

Bayesian CLM

  • You can incorporate prior distributions on the fixed and random effect parameters, and the threshold probabilities
  • This can allow you to get estimates even when there is complete separation and a classical statistical approach won’t converge (sketch below)

Image (C) Analytics Vidhya
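
One way to fit a Bayesian CLM is with the brms R package; a minimal sketch (hypothetical dat as before, illustrative prior):

    library(brms)

    # Cumulative logit model with a weakly informative prior on the
    # genotype coefficients; the prior regularizes the estimates even
    # under complete separation
    fit_bayes <- brm(
      score ~ genotype + (1 | block),
      data = dat,
      family = cumulative(link = "logit"),
      prior = prior(normal(0, 3), class = "b")
    )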

Machine learning approaches

  • Ordered categorical outcomes are supported by a lot of machine learning models such as random forest
  • Regularization in machine learning models works much like Bayesian priors, and can fix the quasi-complete separation issue (sketch below)

Image (C) MathWorks
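
For example, the ordinalForest R package fits a random forest that respects the category ordering; a sketch, assuming the same hypothetical dat with score as an ordered factor:

    library(ordinalForest)

    # Fit an ordinal forest; depvar names the ordered factor response
    fit_rf <- ordfor(depvar = "score", data = dat)
    preds <- predict(fit_rf, newdata = dat)  # preds$ypred holds predicted classes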

Software implementations

  • R packages
    • ordinal (includes mixed models)
    • MASS::polr (does not support mixed models)
  • SAS procedures
    • proc glimmix
    • proc logistic
  • Software for GWAS with ordinal traits is widely available
    • BGLR (R package)
    • ASReml
    • GenSel
    • OrdinalGWAS.jl (Julia)

Practical recommendations: deciding on number of categories

  • Number of categories should ideally be 3 to 5
    • Binary loses too much information, and greater than 5 is too error-prone and data-hungry
    • However: you can always lump categories together later, but you can’t split them!
  • Category thresholds should be validated to minimize disagreement between raters

Practical recommendations: analyzing the data

  • Treating it as a continuous variable may be a decent assumption in some contexts
  • Especially when it is an approximation of a true underlying continuous value
  • But it depends on your goal: prediction or inference?
    • Prediction performance is still good with the continuous model but it is not as good at recovering genetic parameters, e.g. heritability, as the ordinal regression model (Azevedo et al. 2024)

Practical recommendations: analyzing the data

  • If convergence or quasi-complete separation is an issue, collapse many categories into fewer
  • Going down to as few as 2 categories may help you get a working model
  • The best solution (I strongly recommend it!) is to move to a Bayesian model or an ML model with regularization
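
Collapsing categories, as recommended above, is a one-liner in R; a sketch assuming scores stored as integers 1-5 in a hypothetical data frame dat:

    # Collapse a 1-5 score into 3 ordered categories: 1-2, 3, 4-5
    dat$score3 <- cut(dat$score, breaks = c(0, 2, 3, 5),
                      labels = c("low", "medium", "high"), ordered_result = TRUE)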

Approach with caution

  • “All models are wrong, but some are useful” (George Box)
  • No matter what, some kind of assumption will have to be made
    • Treating categorical variable like a number
    • Lumping together categories
    • Including prior distributions on the parameters or regularization

Further reading