March 3, 2017

Four things every sequence analysis pipeline should do

Titus Brown writes a good blog, so I am surprised that I didn't spot his December 2011 post, "Four reasons I won't use your sequence analysis pipeline", sooner. Somewhat belatedly, here is my response to his excellent set of arguments against commercial providers of sequence analysis pipelines and their wild claims that they can solve anything.

Let's get this clear before I start - I agree with Titus 100%. Nothing bugs me more than a sales guy pitching up and claiming he knows more about my subject than I do, or that his fantastic solutions can solve any problem I can possibly think of. Nine times out of ten, they can't. A good bioinformatics solutions provider should listen, understand, make a proposal to do what they can within their ability, and leave well alone what they can't. Telling the customer what to do won't get them anywhere, but collaborating and working together using shared knowledge and skills will move the project forward a great deal further.

Titus breaks the problems down into four clear statements. Here they are, with my response after each.

1. (Claimed) methods aren't open, or open source. It's Not Science to use methods that aren't well understood or open to examination. Period.

Absolutely agreed. Therefore the majority of what Eagle does uses third-party open-source tools and data. We use closed-source tools only where there is no viable open-source alternative. We even go so far as to use exclusively open-source pipeline platforms that fully describe the flow of data and allow complete understanding of exactly what is going on at every stage. This is Science.

2. I don't know what your pipeline actually does, either. Do you record the analysis steps and parameters for each analysis, from start to end? Do you keep track of the versions of software and scripts? Is everything under version control in the first place? And can I look at it and verify what was run? Can it be included in the published record?

Yes we can, and do. Reproducible pipelines are key to reproducible science. There's no point in claiming you've run a particular analysis for publication unless you're able to give the details to someone else so that they can independently verify what you did. So, Eagle fully documents and verifies every pipeline that we produce, and we share this with our customers.
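
To make the idea of recording steps, parameters and versions concrete, here is a minimal sketch in Python. It is an illustration only, not Eagle's actual tooling: the function names (`sha256_of`, `record_step`, `write_manifest`) and the record layout are assumptions made for this example. Each step's record captures the tool, its version, the parameters used and a checksum of every input file, so the whole run can be written out as a JSON manifest and shared alongside the results.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_step(name, tool, version, parameters, inputs):
    """Build a provenance record for one pipeline step (hypothetical layout)."""
    return {
        "step": name,
        "tool": tool,
        "tool_version": version,
        "parameters": dict(parameters),
        # Checksums tie the record to the exact input files that were used.
        "inputs": {path: sha256_of(path) for path in inputs},
        "python_version": platform.python_version(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def write_manifest(records, path):
    """Write all step records to a JSON file that can be shared or published."""
    with open(path, "w") as handle:
        json.dump(records, handle, indent=2, sort_keys=True)
```

A pipeline would call `record_step` once per stage and `write_manifest` at the end. Keeping the manifest and the pipeline scripts under version control is what lets someone else check exactly what was run.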

3. The methods and source data you're applying are not well suited to the basic problem. For example, you're using a manifestly broken transcriptome to do mRNAseq analysis; or you're using default mapping parameters to do gene expression analysis.

Tell me about it... I get so fed up with seeing companies claiming that they've got the greatest whole genome assembly pipeline ever, or whatever. Sure, they might well have, for a very specific set of parameters, but most real-life use-cases are full of variables that such a generic pipeline design cannot possibly have been intended to cope with. To get accurate and reliable results, pipelines need to be designed specifically for the requirements of the scientific experiment being carried out, not hacked about from an existing pipeline intended for something else entirely. Not only that, source data QC is a vital step before even getting started. After all, if you put rubbish in, you get rubbish out. So Eagle never comes along with prepackaged pipelines claiming they will solve the world's woes. Each project is built from the ground up, and for a very good reason, even though that costs more than the prepackaged alternatives.
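
As a rough illustration of what a first-pass source data QC check might look like, the Python sketch below counts the reads in a FASTQ stream whose mean base quality falls under a threshold. The assumptions are mine, not a description of any particular pipeline: four-line FASTQ records, Phred+33 quality encoding, and an arbitrary default threshold of 20. The function names are hypothetical. A real project would pick its checks and cut-offs to suit the experiment, which is exactly the point made above.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a trailing blank line)
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(char) - offset for char in quality) / len(quality)


def qc_summary(handle, min_mean_quality=20.0):
    """Count total reads and reads whose mean quality is below the threshold."""
    total = failed = 0
    for _name, _sequence, quality in read_fastq(handle):
        total += 1
        if mean_phred(quality) < min_mean_quality:
            failed += 1
    return {"reads": total, "low_quality_reads": failed}


# Example with two in-memory reads: one high quality, one very low quality.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
summary = qc_summary(example)
```

A summary like this is only a starting point. The decision about whether the data is fit for purpose still depends on the experiment.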

4. You don't understand the biological intricacies of my model system.

You're right, we probably don't at first, which is why Eagle's ground-up approach works so well. Before we even get coding, we explore the drafts of the pipeline design with the customer in an iterative process to ensure we have fully understood what is going on and that we are not doing something that does not make sense in the context of their specific experiment. Therefore any special knowledge about the model system that needs to be taken into account soon gets expressed by the customer and incorporated into the design by Eagle. We can't possibly know everything in the world, but we do know how to ask the right questions and listen properly to the answers.

So - Titus, if you're reading this - Eagle agrees with you. Most vendors in the sequence analysis pipeline world are just turning the handle on standardised, generic pipelines which work for a certain limited set of use-cases but will never be suited to specialist, in-depth research projects. Eagle understands this and deliberately avoids the trap. Our built-to-order projects might take a bit more time or cost a bit more than buying an off-the-shelf solution, but the extra time and expense pay for themselves many times over in significantly better-quality results.
