March 3, 2017

The secret to efficient data processing on HPC/Cloud

Nobel Prize-winning biochemist Julius Axelrod, showing how it used to be done. CC0

This post reveals the secret behind Eagle's uncanny ability to rapidly develop and deploy highly efficient, flexible data processing pipelines on virtually any HPC infrastructure, including Cloud. And the secret's simple: the secret (back to school now, everyone over 30) is a blackboard.

Component stack - traditional HPC

For the embarrassingly parallel pipelines that are common in genomics, it is typical to split application analyses into atomic jobs and have a master node submit these to a job scheduler (the scheduling layer in the component stack figure), which distributes their execution across the worker nodes of a multi-node cluster.

In traditional pipelines, the master node tells the worker nodes exactly what jobs to run. This is a push model. In a blackboard pipeline, the worker nodes ask the master node which jobs to run. This is a pull model. The advantage of the pull model is that the workers have far more control over their own destiny. If, for example, the job scheduler works best when each worker processes about one hour of work, then the worker can keep pulling jobs from the blackboard until that optimum has been reached, i.e. the pipeline becomes self-optimising, freeing the developer from the trial-and-error task of tuning batch sizes. This becomes especially important when the size of each job is highly variable. A more complex example: if a job exhausts the resources available to a worker, then the worker may resubmit the job to the blackboard with a flag requesting a higher-spec machine.
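The pull model described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not eHive's implementation: the names `Job`, `Blackboard`, `claim`, `resubmit` and `run_worker` are hypothetical, and job run times are simulated numbers rather than measured ones.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    name: str
    est_seconds: float                 # simulated run time (assumption)
    mem_gb: float                      # memory the job needs (assumption)
    needs_bigger_machine: bool = False


class Blackboard:
    """Hypothetical shared job list that workers pull from."""

    def __init__(self, jobs):
        self._jobs = deque(jobs)

    def claim(self, worker_mem_gb: float) -> Optional[Job]:
        """Return the next job this worker may take, or None."""
        for _ in range(len(self._jobs)):
            job = self._jobs.popleft()
            # Flagged jobs are left for a worker with enough memory.
            if job.needs_bigger_machine and job.mem_gb > worker_mem_gb:
                self._jobs.append(job)
                continue
            return job
        return None

    def resubmit(self, job: Job) -> None:
        """Put a job back, flagged as needing a higher-spec machine."""
        job.needs_bigger_machine = True
        self._jobs.append(job)

    def pending(self) -> int:
        return len(self._jobs)


def run_worker(board: Blackboard, worker_mem_gb: float, target_seconds: float):
    """Pull jobs until roughly target_seconds of work has been done."""
    done, elapsed = [], 0.0
    while elapsed < target_seconds:
        job = board.claim(worker_mem_gb)
        if job is None:
            break                      # nothing left that this worker can run
        if job.mem_gb > worker_mem_gb:
            board.resubmit(job)        # simulated resource exhaustion
            continue
        elapsed += job.est_seconds     # stand-in for actually running the job
        done.append(job.name)
    return done, elapsed


board = Blackboard([
    Job("a", 1200, 2), Job("b", 1800, 2), Job("big", 600, 64),
    Job("c", 900, 2), Job("d", 900, 2),
])
small_done, small_elapsed = run_worker(board, worker_mem_gb=8, target_seconds=3600)
large_done, large_elapsed = run_worker(board, worker_mem_gb=128, target_seconds=3600)
```

Nobody had to choose a batch size here: the small worker stops once it has passed its one-hour target, and the oversized job waits on the blackboard until a larger worker claims it.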

Some bees. CC0

The above scenarios (i.e. addressing the limitations of traditional job schedulers) were foremost in the mind of Eagle's COO, Abel Ureta-Vidal, when he and his team developed the eHive blackboard workflow system a few years ago.

"Eagle's pipelines are efficient and flexible because they use a blackboard (pull) model rather than traditional (push) model of job orchestration"

The disadvantage of the blackboard system is that it adds an extra layer of software to the HPC stack, and the user's application code must be specially wrapped for this layer. Interestingly, this abstraction becomes a positive advantage when porting a pipeline from one job scheduler to another; once the blackboard system has been ported, all existing pipelines written against the software (including those for genome annotation, comparative genomics, regulatory genomics etc. from Ensembl) will automagically just work. Eagle have had great success in porting eHive from its original Platform LSF scheduler to PBS, Grid Engine and Condor alternatives.
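To illustrate why porting becomes cheap, here is a rough sketch of a scheduler abstraction. It assumes the blackboard layer only needs one scheduler-specific operation, namely launching a worker. The class and function names, and the `run_worker.sh` script, are hypothetical; the command lines are simplified illustrations (`bsub -J` and `qsub -N` set a job name), not the invocations eHive actually uses.

```python
from abc import ABC, abstractmethod
from typing import List


class SchedulerBackend(ABC):
    """Hypothetical interface: the only part that changes per scheduler."""

    @abstractmethod
    def worker_command(self, name: str, script: str) -> List[str]:
        """Return the command line that submits one worker."""


class LSFBackend(SchedulerBackend):
    def worker_command(self, name: str, script: str) -> List[str]:
        return ["bsub", "-J", name, script]


class PBSBackend(SchedulerBackend):
    def worker_command(self, name: str, script: str) -> List[str]:
        return ["qsub", "-N", name, script]


def launch_workers(backend: SchedulerBackend, count: int,
                   script: str = "run_worker.sh") -> List[List[str]]:
    """Build one submission command per worker.

    Pipeline code calls this and never names a scheduler itself.
    """
    return [backend.worker_command(f"worker_{i}", script) for i in range(count)]


lsf_cmds = launch_workers(LSFBackend(), 2)
pbs_cmds = launch_workers(PBSBackend(), 2)
```

In this sketch, supporting another scheduler means writing one more small backend class, while every pipeline built on `launch_workers` stays unchanged.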

"Eagle can develop pipelines quickly because we have an extensive toolkit of components to work from"

Now for the really interesting part - the Cloud. For Cloud, there is an additional component of the HPC stack, namely the provisioning layer. This layer is responsible for e.g. starting a virtual machine instance at the request of a user. This layer is exploited by cloud-based platforms such as StarCluster and CycleServer to provision traditional Condor HPC clusters, with added auto-scaling (yes, you can run the ensembl-compara pipeline out-of-the-box on StarCluster). Very shortly after adopting the Cloud 5 years ago, Eagle realised that the provisioning layer offered all of the functionality needed by the blackboard system; the job scheduler was obsolete in a cloud environment! The ability to do this sort of stuff is exactly what makes the Cloud such a dynamic beast to work with. We rapidly adapted eHive to exploit the Amazon Web Services provisioning software, and could immediately run Ensembl pipelines natively sans scheduler on the Cloud.
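As a rough sketch of the idea, the blackboard can size the worker pool itself and ask the provisioning layer for instances directly. Everything here is an assumption made for illustration: `FakeProvisioner` stands in for a real cloud API and only counts requests, and the `jobs_per_worker` and `max_workers` values are arbitrary.

```python
import math


def workers_needed(pending_jobs: int, jobs_per_worker: int,
                   running: int, max_workers: int) -> int:
    """How many extra worker instances to request for the pending jobs."""
    if pending_jobs <= 0:
        return 0
    wanted = math.ceil(pending_jobs / jobs_per_worker)
    return max(0, min(wanted, max_workers) - running)


class FakeProvisioner:
    """Stand-in for a cloud provisioning API; starts no real machines."""

    def __init__(self) -> None:
        self.running = 0

    def start_instances(self, n: int) -> None:
        self.running += n


def scale_up(pending_jobs: int, provisioner: FakeProvisioner,
             jobs_per_worker: int = 10, max_workers: int = 20) -> int:
    """Request just enough workers for the jobs waiting on the blackboard."""
    n = workers_needed(pending_jobs, jobs_per_worker,
                       provisioner.running, max_workers)
    if n:
        provisioner.start_instances(n)
    return n


provisioner = FakeProvisioner()
first = scale_up(95, provisioner)     # 95 jobs, 10 per worker
second = scale_up(95, provisioner)    # same backlog, workers already running
third = scale_up(500, provisioner)    # large backlog, capped by max_workers
```

Once started, each worker pulls its own jobs from the blackboard, so in this sketch nothing remains for a separate job scheduler to do.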

"The blackboard system provides an abstraction layer that allows us to run our pipelines optimally and seamlessly across traditional HPC and cloud infrastructure"

A note of caution: eHive is an expert system with a steep learning curve. The situation is likely to improve as the small but vibrant developer community add improved eHive documentation. In the meantime, specialists like Eagle are, of course, happy to help.

Topics: Amazon AWS, Big data technology, Bioinformatics, Cloud compute, cluster, condor, cycleserver, Eagle secrets, ehive, Ensembl, genomics workflow, gridengine, lsf, NGS pipeline, node, Portable Batch System, queue, scheduler, sge, starcluster