March 3, 2017

An embargo on short-read alignment tools

Mick Watson wrote this excellent post on his own blog just before Christmas. I couldn't resist asking if I could repost it here as I agree with so much of it, and luckily he said yes. - RH

Two things happened recently that inspired this blog post. The first was an excellent review which revealed that there are now over 70 short-read alignment tools available (it should be noted that another list exists at Wikipedia, which has some entries that the EBI list does not; and let's not forget the infamous SeqAnswers thread). The second was the publication of another (probably excellent) short-read alignment tool in the journal Bioinformatics. I'd have called it YANAT, but the authors decided not to for some reason...

I can't help but say it: I'm sorry, but isn't this a waste of time, both yours and mine?

I have nothing against the authors of the new tool, who I am sure are excellent scientists. By some miracle, they might read this blog and comment, or email me, and tell me I'm being unkind. I'll feel bad. But still, rather than write another tool, why not contribute to the codebase of an existing one? If BWA is not accurate enough for you, then branch the code and make it so; if Stampy is too slow, speed it up.

Now, I'm well aware of the "bioinformatics process", and most of the time it works fine. It often starts with a new problem, or a new technology. An initial tool is published that can deal with the data. Then a raft of new tools is published, each improving on the original work or filling a slightly different niche. There is then a "survival of the fittest" process, and the best survive to form best practice for analysing that particular type of data.

I presented this paradigm at the Eagle Genomics Symposium 2012 when I introduced the "Watson Square" of bioinformatics research:

[Figure: the "Watson Square" of bioinformatics research]

Here, in red, we have the first tool published to tackle the problem; we then improve on it either by extracting more knowledge from the data (x-axis: biology) or by getting similar results more quickly or in less memory (y-axis: technology). The holy grail, of course, is to improve efficiency whilst also extracting more biological knowledge.

What I don't understand is how, with over 70 short-read aligners out there, you can publish a new one and show that it is better than all of the existing tools. And, more to the point, why would you bother?

The excuse I most often hear is that there is no incentive to contribute to an existing, already published codebase. For an academic, it's publish or perish: write a new tool and you will get a paper out of it; contribute to an existing tool and, at best, your name will be lost in a long list of authors, and at worst you won't get published at all.

However, this argument is a complete fallacy. Take Velvet as an example, one of the first de Bruijn graph assembly tools, which launched Dan Zerbino into bioinformatics superstardom back in 2008. Velvet has proven to be an excellent starting point for many others: Namiki et al. have extended the code to work on metagenomes; Torsten Seemann's group have written an essential wrapper, the Velvet Optimiser, and published Vague, a GUI for Velvet; Matthias Haimel is developing Curtain, another wrapper that allows users to add read-pair information to improve assemblies; and Daniel himself, and others, have developed and published additional algorithms, including Pebble/Rockband and Oases.

In fact, any bioinformatics codebase can be seen as a great coral reef for many others to feed off, creating an entire ecosystem of tools and extensions. Surely it's better that way than to swim off on your own and try to establish another, virtually identical reef a few miles away?

Oh, and in answer to the burning question you all have, I use Novoalign. Why? Because I value accuracy over speed, because it has some really nice little features, because I can do alignment in a single command, and because it has excellent support.
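For the curious, here is a rough sketch of what that single command looks like, wrapped in a little Python for reproducibility. This is only illustrative: it assumes novoalign and novoindex are installed and on your PATH, and every file name is a placeholder.

    import subprocess

    # One-off step: build the Novoalign index from a reference FASTA
    subprocess.run(["novoindex", "ref.nix", "reference.fasta"], check=True)

    # The single alignment command: paired-end reads in, SAM out
    with open("aligned.sam", "w") as sam:
        subprocess.run(
            ["novoalign", "-d", "ref.nix",
             "-f", "reads_1.fastq", "reads_2.fastq",
             "-o", "SAM"],
            stdout=sam, check=True,
        )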

So come on, guys: surely now we can have an embargo on the development and publication of short-read mappers? Please? We have enough. In fact, we had enough when there were 20, never mind 70+. Do yourself, and everyone else, a favour. Stop. If you're short of things to do, why not try writing something that can align/assemble 10-100kb reads instead?

Topics: Big data technology, Bioinformatics