March 3, 2017

A comparison of traditional vs. cloud HPC

Strait of Magellan Oil Platforms. Image credit: SpecMode, CC BY-SA 3.0

One area where the cloud is receiving a lot of attention is High Performance Computing (HPC). In terms of sheer number of cores, the scalability of some platforms, such as Amazon Web Services (AWS), has been proven: see Cycle Computing's 50,000-core virtual supercomputer, for example. But HPC is not simply about CPU cycles. This blog post takes a look at some of the arguments for and against cloud HPC from the perspective of a specialist company like Eagle. And from our perspective, without a few million dollars to spend on an on-premises solution, we can only consider hosted provision.

A detailed study comparing trad-HPC with cloud-HPC was published this February by the US DoE (see Scientific Computing's synopsis, and Eagle's earlier response). Although it enthusiastically accepts the benefits of cloud computing, the report was generally critical of the technical suitability of cloud-HPC, compared with trad-HPC, for scientific workflows. Given that Eagle has experience with both trad-HPC and cloud-HPC, the following comments are offered as a (hopefully) unbiased view:

Per-component performance of trad-HPC is undoubtedly better. Most significantly, network latency is typically an order of magnitude lower for trad-HPC than for cloud-HPC. The reason is that compute nodes in trad-HPC are physically close together and linked by high performance interconnects (e.g. InfiniBand), whereas cloud machines are generally connected by plain ol' Ethernet and may well be spread across multiple data centres.

Where latency really matters is for workloads that are memory bound and whose parallelisation relies on Message Passing Interface (MPI) libraries: once a job no longer fits in a single node's memory it has to be spread across nodes, and inter-node latency becomes the limiting factor. A great example from bioinformatics is genome assembly, which typically requires access to hundreds of gigabytes of contiguous memory. I have yet to see a quality assembly of a large genome performed on any public cloud, although I have several ideas I'd like to try myself!
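To make the latency point concrete, here is a minimal sketch (not from the DoE report; it assumes mpi4py and NumPy are available) of the classic ping-pong microbenchmark used to measure inter-node latency. Run across two nodes, it is exactly the kind of test where the order-of-magnitude gap between InfiniBand and commodity Ethernet shows up.

```python
# Minimal MPI ping-pong latency sketch. Run across two nodes with e.g.:
#   mpirun -np 2 -hostfile hosts python pingpong.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

n_iters = 1000
buf = np.zeros(1, dtype='b')   # 1-byte message: measures latency, not bandwidth

comm.Barrier()
start = MPI.Wtime()
for _ in range(n_iters):
    if rank == 0:
        comm.Send(buf, dest=1, tag=0)
        comm.Recv(buf, source=1, tag=0)
    elif rank == 1:
        comm.Recv(buf, source=0, tag=0)
        comm.Send(buf, dest=0, tag=0)
elapsed = MPI.Wtime() - start

if rank == 0:
    # Each iteration is one round trip; half of that is the one-way latency.
    print("one-way latency: %.1f microseconds" % (elapsed / n_iters / 2 * 1e6))
```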

Latency is also a significant factor in the performance of network filesystems. Existing workflows that depend on writing data to shared storage typically need some modification to run efficiently on cloud-HPC. This is not generally a show stopper, but something to bear in mind nonetheless. On the flip side, distributed storage approaches such as Hadoop that are becoming increasingly used in HPC are perfectly suited to cloud infrastructure.
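As a flavour of the distributed-storage style of working, here is a rough sketch of a Hadoop Streaming job (the choice of k-mer counting, the file names and the FASTA handling are all illustrative, not from the original post). The mapper and reducer simply read standard input and write standard output, and Hadoop moves the computation to where the data blocks live rather than funnelling everything through a shared filesystem.

```python
# --- mapper.py ---
# Emits "<k-mer>\t1" for each k-mer in raw sequence lines (illustrative only;
# a real pipeline would parse FASTA/FASTQ records properly).
import sys

K = 21  # k-mer length, an arbitrary choice for illustration

for line in sys.stdin:
    seq = line.strip().upper()
    if not seq or seq.startswith('>'):   # skip FASTA header lines
        continue
    for i in range(len(seq) - K + 1):
        print("%s\t1" % seq[i:i + K])

# --- reducer.py ---
# Sums the counts for each k-mer. Hadoop Streaming delivers the mapper output
# sorted by key, so we only need to track the current key.
import sys

current, total = None, 0
for line in sys.stdin:
    key, count = line.rstrip('\n').split('\t')
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(count)
if current is not None:
    print("%s\t%d" % (current, total))
```

A hypothetical invocation would hand both scripts to the streaming jar along the lines of `hadoop jar hadoop-streaming.jar -input reads/ -output kmer_counts/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (paths illustrative).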

As demonstrated by Cycle Computing, cloud-HPC clearly trumps trad-HPC in scalability (expansion to meet demand). A further benefit of this scalability, combined with short provisioning times, is the ability to dynamically adjust the size of the cluster pool to the size of the job queue (see StarCluster, for example, and the sketch below). This makes cloud-HPC ideal for CPU-bound "embarrassingly parallel" workloads that are very common in bioinformatics and, by some accounts, dominate HPC workloads generally.
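To sketch what queue-driven scaling can look like (this is an illustration in the spirit of StarCluster's elastic load balancer, not its actual implementation; the cluster name, polling interval and node ceiling are made up, and an SGE-based StarCluster cluster is assumed), one can simply poll the scheduler's queue and add nodes while there is a backlog. StarCluster itself ships a load balancer that automates both growing and shrinking the pool.

```python
# Rough sketch of queue-driven cluster scaling for a StarCluster-managed SGE
# cluster (illustrative only). Assumes the SGE client tools and the
# StarCluster CLI are available where this runs.
import subprocess
import time

CLUSTER = "mycluster"   # hypothetical cluster name
MAX_NODES = 20          # hard ceiling on cluster size
POLL_SECONDS = 300

def queued_jobs():
    """Count SGE jobs still waiting to run (state 'qw')."""
    out = subprocess.check_output(["qstat", "-u", "*"]).decode()
    return sum(1 for line in out.splitlines() if " qw " in line)

def node_count():
    """Count execution hosts reported by SGE (qhost prints three header lines)."""
    out = subprocess.check_output(["qhost"]).decode()
    return max(len(out.splitlines()) - 3, 0)

while True:
    if queued_jobs() > 0 and node_count() < MAX_NODES:
        # There is a backlog and headroom: ask StarCluster for another node.
        subprocess.call(["starcluster", "addnode", CLUSTER])
    time.sleep(POLL_SECONDS)
```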

What really sets cloud-HPC apart, from a workflow developer's standpoint, is self-service management and administration. This can result in huge productivity gains for individual developers, especially when combined with configuration management software such as Opscode Chef.
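To give a flavour of what "self-service" means in practice, here is a minimal sketch using boto3 (the region, AMI ID, instance type and key name are placeholders) of a developer launching, and later terminating, their own analysis node without raising a ticket with the systems team.

```python
# Minimal self-service provisioning sketch using boto3. All identifiers below
# are placeholders; the point is that the developer drives the whole lifecycle.
import boto3

ec2 = boto3.resource("ec2", region_name="eu-west-1")

# Launch a single on-demand node from a pre-baked machine image.
instance = ec2.create_instances(
    ImageId="ami-xxxxxxxx",        # placeholder: image with our tools installed
    InstanceType="c4.8xlarge",     # placeholder: compute-optimised instance
    KeyName="my-keypair",          # placeholder: SSH key pair name
    MinCount=1,
    MaxCount=1,
)[0]

instance.wait_until_running()
instance.reload()                  # refresh attributes such as the public DNS name
print("node ready:", instance.public_dns_name)

# ... run the analysis over SSH, then clean up ...
instance.terminate()
```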

One should also bear in mind that the world does not stand still. Cloud providers are enhancing their HPC offerings all the time, through the addition of GPU instances and low latency network options for example. At the same time, academic communities are embracing virtualisation and cloud technologies for their next generation HPC infrastructures.

To a large extent the choice of trad-HPC vs. cloud-HPC will depend on the exact nature of the workload. However, for the majority of CPU-bound workflows at play in bioinformatics today, the superior scalability and sheer convenience of cloud-HPC together make it a compelling choice.

Topics: AWS, Big data technology, Bioinformatics, Cloud, Cloud Computing, Cycle Computing, genome assembly, Hadoop, hpc, traditional HPC