March 3, 2017

Native Amazon workflows for bioinformatics

Today, Amazon's Simple Workflow Service (SWF) launched in beta, marketed as "a workflow service for building scalable, resilient applications".

What does this mean for bioinformatics? Quite a lot, probably. One of the biggest headaches of any major bioinformatics task is orchestrating a workflow (or pipeline) to carry out batches of data analysis in a reproducible and consistent manner. For instance, you might have a LIMS (laboratory information management system) to manage the flow of samples through your lab from test tube to sequencing machine, but you'd also need a workflow to process the output of the sequencing machine into usable information. SWF is designed in such a way that it could easily take on both jobs - LIMS and workflow engine - as it supports human interaction as well as fully automated processes.

Existing workflow tools such as eHive, Taverna, KNIME, Pipeline Pilot, Galaxy, etc. were all initially designed long before the days of the cloud, when everyone either analysed data on standalone machines (because pre-NGS most datasets were relatively small) or had a large in-house compute cluster with suitable job management software installed (e.g. SGE, LSF, Condor) for the workflow software to interact with. With the advent of the cloud, all these systems (and their competitors) sprouted cloud-compatible versions, but in most cases the compatibility is tenuous at best (eHive being one good exception to the rule) - many just wrapped up existing tools and packaged them as cloud images with exactly the same limitations and restrictions as the originals.

Amazon's SWF is the first workflow management system designed specifically for the cloud by the very people who built the most popular cloud in the first place - people who know exactly how it works and how to take best advantage of it for this type of task.

Like most other Amazon APIs it is HTTP-based, but extended client APIs are available in Java, .NET, PHP and Ruby, plus a richly-featured development SDK in Java for those wishing to get their hands really dirty. Whilst the workflow itself is managed from Amazon's servers, the clients it co-ordinates can run in the cloud, on local hardware or on mobile phones - or can even be human beings interacting with a web interface. Importantly for life science companies dealing with commercially confidential or sensitive data, this means they can use SWF to co-ordinate their workflows without needing to upload their private data into the cloud, putting SWF into direct competition with all existing workflow technology from other vendors (Taverna, Galaxy, etc.).
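To make the client model concrete, here is a minimal sketch of an SWF "activity worker" - the process that polls Amazon for a task, does the work locally (so the data never leaves your machine), and reports back. It uses boto3, Amazon's Python SDK, which also exposes SWF; the domain name "genomics-demo", the task list "align-tasks" and the trivial `do_work` step are all hypothetical placeholders, not anything prescribed by SWF itself.

```python
def do_work(task_input):
    # Placeholder for the real analysis step - here it just upper-cases
    # the input so the round trip is visible.
    return task_input.upper()


def handle_one_task(client, domain, task_list):
    """Long-poll SWF for one activity task, run it locally, and report
    the result back. Returns the result, or None if the poll timed out
    with no work available."""
    task = client.poll_for_activity_task(
        domain=domain, taskList={"name": task_list})
    token = task.get("taskToken")
    if not token:
        # SWF long polls return an empty taskToken after ~60s of no work.
        return None
    result = do_work(task.get("input", ""))
    client.respond_activity_task_completed(taskToken=token, result=result)
    return result


if __name__ == "__main__":
    import boto3  # needs AWS credentials and a registered SWF domain
    swf = boto3.client("swf")
    while True:
        handle_one_task(swf, "genomics-demo", "align-tasks")
```

Because the worker *pulls* tasks over HTTPS rather than having Amazon push to it, it can sit safely behind a lab firewall with no inbound ports open - which is exactly what makes the "co-ordinate in the cloud, compute on-site" arrangement above possible.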

For bioinformatics specifically there is much to be excited about. A truly cloud-native workflow environment means that more efficient use can be made of the cloud than ever before, which should help speed up the more complex analyses that biologists need on a regular basis. Access costs are very low (on the order of a tiny fraction of a cent per workflow execution), which should encourage experimentation and exploration of this new technology for scientific use. The lack of a Perl API may make adapting older scripts a little harder, as Perl is very common in bioinformatics, but the problem is not insurmountable - it is easy to set up an SWF-compliant wrapper in another language that simply calls out to an existing Perl script to carry out the tasks required.
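That wrapping trick can be sketched in a few lines. Here the SWF-facing side is Python and the legacy step stays in Perl: the wrapper feeds the task's input to the Perl script on stdin and hands its stdout back as the task result. The script name `align_reads.pl` and the JSON-on-stdin convention are illustrative assumptions, not part of SWF.

```python
import json
import subprocess


def run_external_task(command, payload):
    """Run an external tool (e.g. a legacy Perl script), passing the
    task's input as JSON on stdin and returning its stdout as the
    result string to hand back to SWF."""
    proc = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,  # raise if the legacy script exits non-zero
    )
    return proc.stdout.strip()


if __name__ == "__main__":
    # Hypothetical legacy pipeline step written in Perl:
    print(run_external_task(["perl", "align_reads.pl"], {"sample": "S1"}))
```

The Perl code needs no changes at all; only the thin wrapper speaks to SWF, so decades of existing bioinformatics scripts can join a cloud-managed workflow as-is.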

All in all, SWF looks great, and we can't wait to see the first useful bioinformatics workflow implemented on it. In fact I'm pretty sure a few of us here at Eagle have already started work...

Topics: Big data technology, Bioinformatics