Kieran R Campbell
June 1, 2023
Teng Gao recently published a blog post called *Principles for new method development*, summarizing some guidelines on how to approach the creation and evaluation of new computational biology methods. Given the proliferation of research in this area, I think this is a really important topic and worth diving deep into. What makes a “good” methods development research project? How do we know it’s “good”? Much of this is based on mistakes I have made in the past.
A seemingly common strategy when starting a new computational genomics methods problem is to create a solution first and then go looking for a problem to apply it to.
I only have \(n=1\) piece of concrete evidence for this, but I suspect it’s very common. For example, when single-cell RNA-sequencing methods were created, there was a rush to develop single-cell-specific methods for differential expression (testing whether the expression of a gene differs between two conditions) rather than relying on existing methods for bulk data. However, later benchmarking papers hinted that bulk differential expression tools performed best on average. In other words, make sure there is a problem to be solved rather than creating a solution first or imagining there is a problem without confirming it exists. A really good place to start is by benchmarking/testing existing methods and working out where they fail or perform poorly. It’s in those gaps that the potential for important new methods lies.
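To illustrate the shape of such a benchmark (this is a deliberately toy setup of my own, not the single-cell vs. bulk comparisons from the benchmarking papers above): simulate counts with a known set of differentially expressed genes, run two off-the-shelf tests, and ask how well each one ranks the true positives.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_genes, n_cells = 2000, 200
is_de = np.zeros(n_genes, dtype=bool)
is_de[:200] = True  # first 200 genes are truly differentially expressed

# toy counts: two groups of cells, with DE genes shifted up 2-fold in group 1
base = rng.gamma(shape=2.0, scale=2.0, size=n_genes)
group = np.repeat([0, 1], n_cells // 2)
mu = np.outer(base, np.ones(n_cells))
mu[np.ix_(is_de, group == 1)] *= 2.0
counts = rng.poisson(mu)

def de_pvalues(counts, group, test):
    """One p-value per gene, comparing group 0 vs group 1 with a chosen test."""
    pvals = np.empty(counts.shape[0])
    for g in range(counts.shape[0]):
        a, b = counts[g, group == 0], counts[g, group == 1]
        if test == "wilcoxon":
            pvals[g] = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        else:  # Welch t-test
            pvals[g] = stats.ttest_ind(a, b, equal_var=False).pvalue
    return pvals

# how well does each test rank the truly DE genes? (higher AUROC = better)
for test in ["wilcoxon", "t-test"]:
    scores = -np.log10(de_pvalues(counts, group, test) + 1e-300)
    print(f"{test}: AUROC = {roc_auc_score(is_de, scores):.3f}")
```

A real benchmark would of course use more realistic simulations (or real data with orthogonal validation) and the actual tools under comparison, but the basic loop — known truth, multiple methods, a score per method — is the same.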
A related issue is to look at an existing method, think

> this doesn’t achieve 100% accuracy

(however you choose to define accuracy), then think

> I bet I can come up with a better method
There are some merits to this approach. For example, in the field of computer vision circa 2012, the top-1 accuracy on ImageNet using hand-engineered features was ~50%. The next year, along came deep learning and raised this to ~65%, a relative increase of ~30%, which is mind-blowing[^1].
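To spell out what “relative increase” means here (my aside, using the rounded numbers above): the jump from ~50% to ~65% is about 15 percentage points, but relative to the starting accuracy it is

\[
\frac{0.65 - 0.50}{0.50} = 0.30 \approx 30\%.
\]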
The catch is, this happens rarely in applied genomics research. It’s much more likely that after extensive methods development and (hyper-)parameter tuning, we see a 2-5% increase[^2] in whatever score is used to benchmark the method.
The issue here is: is any new biology discovered as a result of improving a score by 2-5%? I suspect the answer is frequently “no”, and that a lot of the frustration wet-lab biologists feel towards the computational methods community comes from chasing such benchmarks rather than trying to uncover new biology/insights from the data. I also suspect that 30% bump in accuracy on ImageNet has laid the foundation for thinking
> If only I could create the right (deep learning) method, I’ll solve this!
among many researchers (including myself), when in reality it’s uncommon for this to throw the doors open to new insights.
So what’s the alternative? The alternative is to take a fresh perspective on the data in some way that discovers something new, which is hard. One nice example of this is the work on RNA velocity, which took existing single-cell RNA-seq data and realized that by quantifying spliced vs. unspliced read counts with a new model (the fresh perspective), one could predict[^3] the future state of cells (the something new). Such new approaches are hard and consequently rare, but a good angle may be to first go really deep on the biology and then try to align that with what’s possible from the data.
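To make the RNA velocity example a little more concrete, here is a toy sketch of the steady-state idea: for each gene, estimate the ratio \(\gamma\) of unspliced to spliced counts expected at steady state, and treat the residual \(u - \gamma s\) as the “velocity”, with positive values suggesting the gene is being induced. Everything below (the quantile heuristic, the function name, the random stand-in data) is my own simplification, not the velocyto/scVelo implementation.

```python
import numpy as np

def steady_state_velocity(spliced, unspliced, quantile=0.05):
    """Toy per-gene RNA velocity estimate under the steady-state model.

    spliced, unspliced: (n_cells, n_genes) arrays of (normalized) counts.
    Returns an (n_cells, n_genes) array of v = u - gamma * s, where gamma is
    fit by regression through the origin on cells in the extreme quantiles of
    spliced expression (a rough stand-in for steady-state populations).
    """
    n_cells, n_genes = spliced.shape
    velocity = np.zeros_like(spliced, dtype=float)
    for g in range(n_genes):
        s, u = spliced[:, g], unspliced[:, g]
        # cells with very low or very high spliced counts approximate steady state
        lo, hi = np.quantile(s, [quantile, 1 - quantile])
        extreme = (s <= lo) | (s >= hi)
        # least-squares slope through the origin: gamma = <s, u> / <s, s>
        denom = np.dot(s[extreme], s[extreme])
        gamma = np.dot(s[extreme], u[extreme]) / denom if denom > 0 else 0.0
        # positive velocity: more unspliced than expected, i.e. gene being induced
        velocity[:, g] = u - gamma * s
    return velocity

# toy usage with random data standing in for real spliced/unspliced matrices
rng = np.random.default_rng(0)
s = rng.poisson(5, size=(200, 10)).astype(float)
u = rng.poisson(2, size=(200, 10)).astype(float)
v = steady_state_velocity(s, u)
print(v.shape)  # (200, 10)
```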
After successfully identifying a problem, the next important yet still hard question is: how do you know when you’ve solved the problem? In a supervised learning setup this is somewhat straightforward: a successful method will improve the accuracy on held-out test data[^4].
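For concreteness, the supervised version of “did it work?” can be as simple as the sketch below (synthetic data and scikit-learn standing in for a real task; nothing here is specific to genomics):

```python
# minimal held-out evaluation: fit on a training split, score on an untouched test split
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                          # stand-in feature matrix
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)    # stand-in labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```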
However, in computational genomics it’s often hard to apply such a concept, since (i) labelled (ground-truth) data may be expensive to obtain and/or non-existent, and (ii) many of the problems are unsupervised (e.g. clustering or dimensionality reduction), so there is no such thing as a “ground truth”.
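One common workaround when no labels exist is to simulate data where the ground truth is known by construction and score methods against it, with the obvious caveat that a simulation only reflects the assumptions baked into it. A minimal sketch using scikit-learn (my illustration, not taken from either blog post):

```python
# manufacture a ground truth by simulation so an unsupervised method can be scored
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# simulate data where the "true" cluster labels are known by construction
X, true_labels = make_blobs(n_samples=1000, centers=5, n_features=50, random_state=0)

# run the unsupervised method being evaluated
pred_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# agreement with the simulated truth (1.0 = perfect, ~0 = random)
print("adjusted Rand index:", adjusted_rand_score(true_labels, pred_labels))
```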
Given this, how do we say if one method is “better” than another? Some common approaches are:
The crucial consideration isn’t the exact evaluation strategy you use, but having one in place before you start. It’s much better to have
```mermaid
flowchart LR
  A[Identify problem] --> B[Consider\nevaluation\nstrategy]
  B --> C[Realize no\nstrategy exists]
  C --> D[Abandon\nproject]
```
than
```mermaid
flowchart LR
  A[Identify problem] --> B[Develop new\nmethod]
  B --> C[Consider\nevaluation\nstrategy]
  C --> D[Realize no\nstrategy exists]
  C --> E[Realize new\nmethod is inferior\nto existing]
  D & E --> F[Abandon\nproject]
```
The last consideration is probably the most obvious: aim for projects that are both important and feasible.
What is important? Projects that will deliver some new biological insight, unblock some impediment to progress in a sub-field, or (most importantly of all) deliver something back to the clinic that will improve patient outcomes.
What is feasible? Projects where the data to answer the question exist, to which you have access, and/or for which you have collaborators able to validate any predictions you make.
The problem is, the intersection of the two is often small.
This often turns into a chicken-and-egg problem: you’ve identified \(X\) as a really crucial issue in the field that needs to be solved urgently, but you only have the data/collaborators/skills to solve \(Y\). Consequently, iteration is needed to refine the project until you’ve identified a related problem that is still worthwhile to solve but actually feasible. This may take the form of limiting project scope (“let’s do a pan-cancer analysis” becomes “let’s do the analysis in the cancer type we have data for”) or settling on predicting a surrogate of what you actually want (e.g. predicting treatment response becomes predicting known treatment-response signatures). What’s important is to consider this up-front and not get stuck in the not-feasible / not-important trap[^5].
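As a toy illustration of that surrogate idea (the gene names and scoring scheme below are hypothetical placeholders, not a validated signature): rather than predicting treatment response directly, score each sample for a published response signature, for example as the mean z-scored expression of the signature genes.

```python
import numpy as np
import pandas as pd

# hypothetical expression matrix (samples x genes) and placeholder signature genes
rng = np.random.default_rng(0)
genes = [f"GENE{i}" for i in range(1000)]
expr = pd.DataFrame(rng.normal(size=(60, 1000)), columns=genes)
signature = ["GENE1", "GENE7", "GENE42"]  # stand-in for a published response signature

# z-score each gene across samples, then average over the signature genes
z = (expr - expr.mean(axis=0)) / expr.std(axis=0)
signature_score = z[signature].mean(axis=1)  # one surrogate score per sample
print(signature_score.head())
```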
[^1]: This is a simplified re-telling, but true to a first approximation.

[^2]: This is an educated guess, but I suspect that after we account for over-fitting this acts as an upper bound.

[^3]: To a first approximation.

[^5]: What’s important is also highly subjective and not always obvious until (many decades) after the fact.