March 3, 2017

The Magellan Report on Cloud Computing for Science

In December 2011, the US Department of Energy (DOE) published its Magellan Report on Cloud Computing for Science, comparing public cloud services to two in-house high-performance computing (HPC) data centres at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Center (NERSC). Although the title is very general, the report focuses specifically on the needs of the DOE's own scientists and research requirements, and appears not to have taken the wider community into account. Still, it makes some interesting points - so here are my reactions to its key findings:

Finding 1. Scientific applications have special requirements that require solutions that are tailored to these needs.

The claim is made that scientific applications are special because they "rely on access to large legacy data sets and pre-tuned application software libraries", needs currently met by HPC setups that have "low-latency interconnects and rely on parallel file systems" - giving a set of "unique software and specialized hardware requirements". Science needs to stop thinking of itself as special - the kinds of data processing problems faced by science are no different to many of those faced in finance or logistics in terms of their scale, complexity, and structure of data. Whilst it is true that the DOE may be dealing with particularly complex datasets related to environmental and biological research, this does not make their situation unique.

The issues perceived in moving from HPC to cloud are presented as paradigm changes to be addressed, but they largely stem from trying to make simple like-for-like comparisons. The report authors suggest that cloud's inability to exactly replicate HPC architectures and performance is more of a roadblock to the use of cloud than the reluctance of scientific software developers to optimise their algorithms for cloud environments; I would suggest the opposite is true.

Blended into this finding is a claim about incompatible business models - clouds work on a pay-per-use basis whilst scientists have an "open-ended need for resources", with an implied reluctance to have to account for the resources they use. That's not an obstacle; it's just a change needed in the way that IT budgets are allocated to science, and a change that is probably well overdue at that. If scientists paid the true cost of existing HPC resources in direct proportion to their usage, and grant providers stopped preferring the purchase of expensive (and often partially redundant) dedicated hardware over the more efficient use of shared or external resources, then this argument would no longer stand. The report says that "the cost model for scientific users is based on account allocations", but there is no reason that this couldn't be provided on the cloud as well, via some kind of dedicated institute accounts managed by the IT department.
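
To make that concrete, here is a minimal sketch (in Python, with entirely hypothetical project names and rates, not figures from the report) of how an IT department could map pay-per-use cloud billing back onto the familiar account-allocation model:

    # Minimal sketch: mapping pay-per-use cloud billing onto the familiar
    # account-allocation model. All names and rates are hypothetical.

    class ProjectAllocation:
        def __init__(self, project, core_hours_granted, rate_per_core_hour):
            self.project = project
            self.remaining = core_hours_granted   # the allocation, as per an HPC grant
            self.rate = rate_per_core_hour        # what the cloud provider charges
            self.spend = 0.0

        def record_usage(self, cores, hours):
            used = cores * hours
            self.remaining -= used
            self.spend += used * self.rate
            if self.remaining < 0:
                print(f"{self.project}: allocation exhausted, further runs need approval")

    # The institute account debits each research group exactly as an HPC centre
    # would debit an allocation, but the bill reflects real usage.
    genomics = ProjectAllocation("genomics-2012", core_hours_granted=50000,
                                 rate_per_core_hour=0.05)
    genomics.record_usage(cores=256, hours=12)
    print(f"{genomics.remaining} core-hours left, ${genomics.spend:.2f} spent so far")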

Finding 2. Scientific applications with minimal communication and I/O are best suited for clouds.

"Performance of tightly coupled applications running on virtualized clouds using commodity networks can be significantly lower than on clusters optimized for these workloads". Yes, true. But, the cloud can scale to a much larger number of nodes than most HPCs have, and definitely more nodes than most smaller research departments have access to, meaning that although each individual node is slower the total number of nodes available can help make up for this. Plus, as you only pay for the nodes whilst they are being used, they can be cheaper than keeping a set of fixed nodes up and running waiting for work to appear. So although the report is technically correct in pointing out that the test data ran "7x slower at 1024 cores on Amazon Cluster Compute instances" than it did on the DOE's HPC centres, that is not necessarily telling the whole story. The comment above about reluctance to optimise algorithms specifically for the cloud could equally well apply here.

Finding 3. Clouds require significant programming and system administration support.

It seems as though the authors are comparing direct access to cloud resources against sys-admin-mediated access to HPC resources. It is odd to suggest that a move from HPC to cloud would result in scientists having to do all the technical work themselves. HPC centres already take huge amounts of programming and sys-admin support to operate, so a move to cloud would simply see the HPC staff working on cloud resources instead of in-house resources. Scientists would not see any difference if the move were managed properly. This finding is nonsense!

Finding 4. Significant gaps and challenges exist in current open-source virtualized cloud software stacks for production science use.

This finding is true, but it applies only to the deployment of private in-house clouds, i.e. installing cloud software into existing data centres. It does not apply to public cloud services, although the report omits to mention this distinction.

Finding 5. Clouds expose a different risk model requiring different security practices and policies.

True. But... only if you permit users to create their own images. If you instead offer managed services on a private cloud, using cloud technology behind the scenes to coordinate work whilst still presenting traditional HPC-style interfaces to the end users (thus restricting their ability to run arbitrary code), then the issues in this finding are greatly reduced in severity.
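
By way of illustration, here is a minimal sketch (hypothetical application names and a placeholder provisioning step, not any real centre's system) of the kind of thin submission layer I mean - users ask for a named, centrally maintained application rather than supplying their own image or arbitrary code:

    # Minimal sketch of an HPC-style submission front-end over cloud resources.
    # Application names, image paths and the provisioning step are placeholders.

    APPROVED_APPS = {
        "blast": "images/blast-approved",   # pre-built, centrally maintained images
        "bwa":   "images/bwa-approved",
    }

    def submit_job(app, input_path, cores):
        if app not in APPROVED_APPS:
            raise ValueError(f"'{app}' is not an approved application; "
                             "arbitrary images cannot be launched")
        image = APPROVED_APPS[app]
        # Stand-in for whatever provisioning API the private cloud actually exposes.
        print(f"Launching {cores} cores from {image} to process {input_path}")

    submit_job("blast", "/data/run42/query.fa", cores=64)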

Finding 6. MapReduce shows promise in addressing scientific needs, but current implementations have gaps and challenges.

Absolutely. The issue here is that science relies a lot on scripted/interpreted languages such as Perl, Python, and Ruby, whilst cloud models (being from enterprise computing backgrounds) use more complex (semi-)compiled languages for increased efficiency, such as Java or C++. Unfortunately, ne'er the twain shall meet. Scientists and scientific software developers who wish to make use of advanced technologies such as MapReduce will have to learn the relevant programming languages, and staff at IT and HPC centres providing cloud-related services to scientists will have to develop and deliver the appropriate training courses.
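
For readers who have not met the model, here is a toy sketch of the map/shuffle/reduce flow in plain Python - counting k-mers in a handful of sequences - purely to show the shape of the paradigm; the real frameworks distribute each stage across many machines, which is the whole point:

    # Toy sketch of the MapReduce pattern: counting 3-mers in some sequences.
    # A real framework runs the map and reduce stages across many machines.
    from collections import defaultdict

    def map_phase(record):
        # Emit (key, value) pairs: every 3-mer in the sequence, with a count of 1.
        name, seq = record
        for i in range(len(seq) - 2):
            yield seq[i:i + 3], 1

    def reduce_phase(key, values):
        return key, sum(values)

    records = [("read1", "GATTACA"), ("read2", "ATTAC")]

    grouped = defaultdict(list)              # the "shuffle" groups values by key
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)

    print(sorted(reduce_phase(k, v) for k, v in grouped.items()))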

Still, it is true that current implementations of MapReduce do not support the complex, interlinked, referential and hierarchical dataset structures that are common in science. This is one area in which the technology needs to improve before it can fully realise its potential.

My earlier point comes back yet again - scientific software developers need to optimise their code for the cloud, rather than simply port existing paradigms and blame the cloud when they do not perform well. This is easier said than done, as scientists like to use tools that are well-referenced and well-established. Given the choice between old-fashioned-but-functional, cloud-ignorant Tool A, which is the industry standard, and brand-new, cloud-optimised Tool B, which has been shown to produce the same results but is too new to be widely cited in journals, Tool A will win every time. Anyone using Tool B and mentioning it in a subsequent journal paper will almost certainly be pulled up on it by reviewers, who will have strong opinions of their own about which tool is appropriate based on citations in prior publications - thus perpetuating the reign of Tool A even though Tool B may be more advanced.

Finding 7. Public clouds can be more expensive than in-house large systems.

Yes, but usually only if you attempt to replicate in-house systems like-for-like, i.e. you set up a large number of machines that run constantly regardless of workload. It is widely accepted that cloud costs for a machine running 24x7x365 are similar to, if not greater than, the total cost of ownership (TCO) of an equivalent machine in an in-house data centre. But the point of the cloud is that you shouldn't need to have machines up and running permanently in anticipation of workload - rather, you should create them just-in-time when work peaks and tear them down again as soon as they fall idle. As you pay only for the time a machine is up and running, this management technique will soon reduce costs below the level of provisioning and maintaining equivalent in-house hardware. The report does not appear to address this possibility at all.
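
To illustrate (with entirely made-up figures, not numbers from the report): suppose an in-house node costs $0.60 per hour in TCO whether busy or idle, and an equivalent on-demand instance costs $0.80 per hour but only while it exists. Then:

    # Made-up illustration of always-on TCO versus pay-per-use at different
    # utilisation levels. None of these figures come from the report.

    node_tco_per_hour = 0.60      # in-house: hardware, power, space, staff, amortised
    cloud_rate_per_hour = 0.80    # equivalent on-demand instance
    hours_per_year = 24 * 365

    for utilisation in (1.00, 0.50, 0.25):    # fraction of the year doing real work
        in_house = node_tco_per_hour * hours_per_year               # paid busy or idle
        cloud = cloud_rate_per_hour * hours_per_year * utilisation  # paid only when busy
        print(f"utilisation {utilisation:4.0%}: in-house ${in_house:,.0f}/yr, "
              f"cloud ${cloud:,.0f}/yr")

On those assumptions the cloud is more expensive only for the node that really is busy all year round; at 50% or 25% utilisation the just-in-time approach wins comfortably.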

Interestingly, it is in this finding that the second sign of the authors trying to justify the existence of their own HPC centres comes to light (the first being the earlier complaint about clouds requiring extra programming and sys-admin knowledge amongst scientists), suggesting that the report overall may be biased towards the interests of the authors. They state that the costs they use for comparing cloud with HPC "do not take into consideration the additional services such as user support and training that are provided at supercomputing centers today", which are "essential for scientific users who deal with complex software stacks and dependencies and require help with optimizing their codes to achieve high performance and scalability". A move from HPC to cloud does not necessarily mean that all that support goes away - it is highly likely that staff who used to run the HPC would now manage access to the cloud resources instead, providing all the same additional services to scientists as they did before. Does this paragraph of the report imply that the DOE is seriously considering moving its HPC function to the cloud, and that the HPC centres that authored this report are trying to prevent it from happening? Who knows.

Finding 8. DOE supercomputing centers already approach energy efficiency levels achieved in commercial cloud centers.

Great. That may well be the case for the huge HPC resources at the DOE, and if so, congratulations - it's quite an achievement. For the rest of us, though, it is highly unlikely to be true of our own data centres. For those IT managers needing to comply with green agendas, the cloud is almost always going to be more energy-efficient than in-house operations.

Finding 9. Cloud is a business model and can be applied at DOE supercomputing centers.

This is an interesting finding and is very true - the cloud is indeed a business model as well as a technical innovation. The whole way in which people interact with resources changes under the cloud, with instant access to dedicated virtual resources on demand rather than queued access to shared and traffic-managed HPC resources.

The finding says that "Rapid elasticity and on-demand self-service environments essentially require different resource allocation and scheduling policies that could also be provided through current HPC centers, albeit with an impact on resource utilization". This is really about private cloud vs. public cloud: even if converted to private cloud technology, HPC centres will still face resource utilisation issues because of the limited size of their compute capacity relative to the number of people needing to use it. Public clouds are generally larger than most in-house data centres, can grow faster through commercial investment, are quicker to respond to and manage demand, can share capacity more fairly across a greater number of diverse users and requirements, and are more highly utilised. All of which is why private cloud will only be useful as a halfway-house until systems can be established that make use of the public cloud.
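
A toy illustration of the utilisation point, with entirely made-up demand figures:

    # Made-up illustration: a fixed-size private cloud both queues work at peaks
    # and sits partly idle in troughs, whilst an elastic public pool tracks demand.

    demand = [200, 800, 1500, 400, 100, 900]   # cores wanted in successive periods
    private_capacity = 1000                    # fixed in-house cores

    queued = idle = 0
    for want in demand:
        queued += max(0, want - private_capacity)   # work that has to wait
        idle += max(0, private_capacity - want)     # capacity paid for but unused

    print(f"Private cloud: {queued} core-periods queued, {idle} core-periods idle")
    print(f"Elastic pool:  {sum(demand)} core-periods provisioned, none idle")

The fixed pool simultaneously turns users away at the peak and wastes capacity in the troughs; the elastic pool does neither, which is precisely why the private cloud only makes sense as a stepping stone.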

Overall the report makes some good points, provided it is read with a pinch of salt regarding the limits of its scope (DOE requirements only) and of its authorship (the authors' interests are tied to preserving the role of existing HPC centres). For an organisation of a similar size to the DOE that has made similar investments in HPC, it is a very relevant and salient report. For smaller organisations, and those that do not already have their own HPC resources, remember that one size does not fit all.

Richard Holland

Topics: Bioinformatics, Cloud