March 3, 2017

To cloud or not to cloud

Today was the first day of the two-day Bioinformatics Open Source Conference (BOSC) 2010, held in Boston, MA. The first day covered a huge variety of topics, but here are some highlights.

The keynote speaker was Guy Coates, manager of the data centre at the Wellcome Trust Sanger Institute, who explained in depth the different challenges of setting up a physical Ensembl mirror versus a virtual one. The physical one, in the Western US, was set up some time ago and involved shipping a complete set of preconfigured servers from the UK to the US and configuring them to be part of the Sanger Institute's internal network back in the UK. The virtual one, by contrast, was installed remotely on top of Amazon's EC2 infrastructure in the Eastern US region. Guy's main point was that despite the lack of physical infrastructure costs, depreciation, etc., the Amazon solution was only 16% cheaper than the physical solution in terms of total cost of ownership (TCO).

This is an interesting figure to consider. It should be remembered that Amazon's pricing structure is best suited to ad-hoc compute demand. Use-cases such as a permanent always-on Ensembl mirror, with a fixed number of machines sitting behind a load balancer and operating at or near capacity, will incur heavy costs on Amazon that are almost comparable to the costs of owning the hardware yourself. The main benefit in this case is that you get free hardware upgrades as time goes on, and that you don't have to worry about running the data centre. Other than that, the savings are marginal. The gains are more intangible - peace of mind more than anything else. Depending on who you are and what your use-case is, this could be just as important to you as the price.

The lesson is that Amazon is probably not best suited to an always-on fixed-size architecture unless that resource is constantly used to capacity. But when you're running a service that sees only intermittent spikes in demand, e.g. a compute cluster that is only utilised around the 75% mark, then Amazon becomes a better bet. Being able to shut down (and not pay for) idle instances is far preferable to continuing to pay for physical machines that are sitting idle. It is here that a properly configured EC2 architecture comes into its own, with dynamic scaling of resources to suit real-time peaks and troughs in demand.
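To put rough numbers on that argument, here is a minimal back-of-envelope sketch. It is not from the keynote: it assumes, purely for illustration, that the cloud bill scales linearly with the fraction of time instances are actually running (ignoring storage, data transfer and start-up overheads), and it normalises the cost of the owned hardware to 100 arbitrary units. The only figure taken from the talk is the 16% saving for an always-on deployment; the function names are hypothetical.

```python
# Hypothetical, normalised costs: owned hardware = 100 units per period,
# an always-on cloud deployment = 84 units (the 16% saving quoted above).
OWNED_COST = 100.0
CLOUD_FULL_TIME_COST = 84.0


def cloud_cost(full_time_cost, utilisation):
    """Cloud cost if instances only run (and are billed) while busy.

    Assumes cost scales linearly with utilisation, which is a simplification.
    """
    if not 0.0 <= utilisation <= 1.0:
        raise ValueError("utilisation must be between 0 and 1")
    return full_time_cost * utilisation


def cloud_saving(owned_cost, full_time_cost, utilisation):
    """Fraction saved by using the cloud instead of owned hardware."""
    return 1.0 - cloud_cost(full_time_cost, utilisation) / owned_cost


if __name__ == "__main__":
    for utilisation in (1.0, 0.75, 0.5):
        saving = cloud_saving(OWNED_COST, CLOUD_FULL_TIME_COST, utilisation)
        print(f"utilisation {utilisation:.0%}: cloud saving {saving:.0%}")
```

Under these assumptions the saving grows from 16% for a fully loaded always-on service to roughly 37% at 75% utilisation, which is the shape of the argument above even if the real numbers would differ.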

Guy's other main point was that Amazon does not cope well with existing HPC applications that expect to see a traditional job scheduler and a shared filesystem with thick network features. Amazon doesn't do that - it has a thin network (i.e. fairly slow data transfer between nodes), and barely supports shared read/write filesystems at all. Migrating existing HPC applications then becomes a matter of redesigning them for the Amazon paradigm (distributed data with code delivered to data, Hadoop-style) rather than simply trying to translate the physical cluster paradigm into the virtual world. This is often not a trivial task, and for many established cluster applications it means that migrating to the cloud will never be a realistic option.

For those who are designing new systems though, there is no reason not to consider the cloud. By designing the system to take advantage of the unique features of the particular cloud vendor you have chosen, you are making very similar decisions to those you take when you design a system to work with a particular job scheduler or operating system. You choose your tools, then you design your solution to make best use of those tools. You should never design your solution first then try to bend the tools to suit. Therefore a well designed system intended for the cloud from the outset should be perfectly capable and suitable, just as the Sanger's existing systems are perfectly suited for the cluster paradigm for which they were designed. The migration from cluster to cloud is nothing like it was when moving from mainframes to Beowulf clusters - it's more like moving from Berkeley DB to Oracle, or from Perl to Java.

Speaking of Hadoop, several speakers today espoused its virtues along with those of MapReduce and other related tools. It really does seem like it could be the way of the future for processing massive datasets in realistic timescales. What is holding it back is the totally inverted way in which you need to define your solution to the problem in hand: people are still learning to think in this new way, much as procedural programmers (e.g. those used to Perl) find it hard to deal with declarative languages such as Lisp. In the meantime, hybrid solutions such as the one outlined today by Indiana University's Judy Qiu allow people to slowly adapt portions of their systems, one piece at a time, transferring data across the bridge between the two halves as necessary.
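For anyone who hasn't seen that inverted style, here is a toy sketch of the MapReduce way of thinking, written in plain Python rather than against Hadoop itself. It counts k-mers across a set of sequences, an example chosen purely for illustration and not taken from any of today's talks. All the function names are hypothetical, and the shuffle step, which a real framework would perform across machines, is simulated in memory.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(sequence, k=3):
    """Map step: emit a (k-mer, 1) pair for every window in one sequence."""
    return [(sequence[i:i + k], 1) for i in range(len(sequence) - k + 1)]


def shuffle(pairs):
    """Shuffle step: group emitted values by key (done in memory here)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse all values seen for one key into a total."""
    return key, sum(values)


def count_kmers(sequences, k=3):
    """Run map, shuffle and reduce over every input sequence."""
    mapped = chain.from_iterable(map_kmers(seq, k) for seq in sequences)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(count_kmers(["ACGT", "CGTA"], k=3))
```

The point is that you never write the loop over the whole dataset yourself. You only say what to emit for one record and how to combine the values for one key, and the framework decides where and when each piece runs.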

Lastly for today, on a completely different subject, the ever-remarkable Kazuharu Arakawa from Keio University in Japan presented yet another of his beautifully designed and intuitive graphical interfaces for interacting with biological data. This time it was a Firefox extension, G-bookmarklet, which allows users to highlight a term or sequence of interest on any web page and 'dial up' an analysis or lookup for that term from a spinning set of circular shortcut buttons, leading either to a subset of further analyses or to a web page containing the final results of interest. Simple, but wonderfully effective.

Topics: AWS, Bioinformatics, Cloud, EC2