March 3, 2017

eHive: The smart workflow system for genomic analysis

In a recent post, our CTO Will Spooner discussed the benefits of the 'blackboard' approach to NGS software pipelines and how this has been used by the developers of eHive at the European Bioinformatics Institute (EBI). The eHive tool is used there, at the Sanger Institute, and here at Eagle, where we have added our own modifications to take advantage of Amazon cloud resources.

The eHive system offers a lot of flexibility, but this comes at a price. Workflows are built and configured by writing and adapting Perl modules, which requires, to a certain extent, a programmer's mind. There is no cosy point-and-click interface to hide behind here. However, help is at hand in a series of training workshops now being led by the core developers, the first of which took place at the Hinxton campus earlier this month. These are open to all (details are usually announced on the mailing list mentioned below), though the majority of participants tend to represent one of the two institutes on site. In six hours this could only be a whistle-stop tour of eHive, giving a taster of how various features can be used to build pipelines with high resilience and performance.

One of the features which separates this engine from others is the ability of a pipeline to adapt itself during execution. Rather than having all of the work 'baked in' at the start, an eHive workflow may be written in such a way that new work is created mid-run: the pipeline responds to the size and nature of the data flowing through it. This method of dynamic work creation can be used to parallelise parts of a workflow in a highly scalable way: a simple example is an implementation of the classic 'Map Reduce' design pattern. The first stage divides up a large task (it might be mapping a set of reads against a reference) into more manageable pieces ('fanning out', as it is called). Subsequently the output from these jobs is combined (in a 'funnel'). What is crucial is that the extent of parallelisation does not need to be known at the start: this is determined at run time by the 'Map' stage.
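To make the fan-out and funnel idea concrete, here is a minimal sketch in Python. It is not eHive code and does not use the eHive API (real eHive workflows are written as Perl modules); the function names `fan_out`, `align_chunk`, `funnel` and `run_pipeline`, and the toy "alignment" that merely measures read length, are all assumptions made for illustration. The point it shows is that the number of parallel jobs is decided by the first stage at run time, from the data itself.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(reads, chunk_size):
    """'Map' stage: split the input into chunks.

    The number of chunks (and hence jobs) depends on the input size,
    so it is only known once this stage has run.
    """
    return [reads[i:i + chunk_size] for i in range(0, len(reads), chunk_size)]


def align_chunk(chunk):
    """One fanned-out job. A hypothetical stand-in for real work such as
    mapping reads against a reference; here it just records read lengths."""
    return {read: len(read) for read in chunk}


def funnel(partial_results):
    """'Funnel' stage: combine the output of all fanned-out jobs."""
    merged = {}
    for partial in partial_results:
        merged.update(partial)
    return merged


def run_pipeline(reads, chunk_size=2, max_workers=4):
    chunks = fan_out(reads, chunk_size)           # job count decided here
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(align_chunk, chunks))
    return funnel(partials)                       # runs after all jobs finish


if __name__ == "__main__":
    reads = ["ACGT", "AC", "A", "ACGTT", "GG"]
    print(len(fan_out(reads, 2)), "jobs created at run time")
    print(run_pipeline(reads))
```

In this sketch a thread pool stands in for the compute farm; in eHive itself the equivalent jobs would be created in the pipeline database and picked up by workers, with the funnel held back until the fan has completed.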

For a system which is essentially built using a fairly low-level 'command line' approach, it is good to see that the developers have now implemented a graphical user interface: guiHive, which enables you to "Take control over your eHive production system". A way to build workflows from scratch it is not - for that you still need to get your hands dirty - but it is a convenient way to monitor a pipeline. You get a graphical view of the progress of your pipeline showing - at a glance - which parts have completed (or failed), along with runtime statistics (such as average job time and memory usage). It also provides the opportunity to 'fine tune' a workflow by modifying runtime parameters. After a fairly straightforward installation - a Go compiler, a handful of Perl modules and a little configuration - you have a locally running web service which can be used to monitor an eHive workflow running anywhere, simply by pointing it at the unique ID of the workflow.

As with much code developed at Sanger and EBI, it is open source and downloadable from Sanger CVS (one might like to see migration to a less archaic revision control system). Once you've done that, you'll want to sign up to the users' mailing list. Future workshops may be advertised there, but it's also the place to ask questions, pick up tips and learn about the latest developments.
