March 3, 2017

Storage at a glacial pace

Amazon's announcement of its new Glacier storage service last week was a great example of a company listening to the needs of its customers and then acting accordingly. I should have blogged about it already, but I was waiting to read up on the details first to see whether it could actually be applied to any real-life use cases. Unsurprisingly, it can.

Existing storage from Amazon, whether S3 or EBS, can cost around 10 US cents per gigabyte per month, or roughly US$1,200 per terabyte per year. There is no such thing as a typical pharma company when it comes to NGS, but two examples I have come across recently suggest that annual data production of 300 terabytes is not unusual. As most life sciences researchers never like to throw research data away, over five years this adds up to 1.5 petabytes, at a whopping annual cost of $1.8m or thereabouts. Even at the upper end of estimates of internal storage costs at big pharma, this is still a significant markup on what they'd pay to provision the same storage themselves.
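To make the arithmetic explicit, here's a quick back-of-the-envelope sketch in Python. The 300 TB/year growth rate and the ~10 cents/GB/month price are just the round figures quoted above, not Amazon's official price list:

```python
# Rough S3/EBS cost model using the round numbers from this post.
S3_PER_GB_MONTH = 0.10      # ~10 US cents per gigabyte per month
ANNUAL_GROWTH_TB = 300      # example pharma NGS output per year

for year in range(1, 6):
    stored_tb = ANNUAL_GROWTH_TB * year                # nothing ever gets deleted
    annual_cost = stored_tb * 1000 * S3_PER_GB_MONTH * 12
    print(f"Year {year}: {stored_tb:>5} TB stored, ~${annual_cost:,.0f}/year")

# Year 5: 1500 TB stored, ~$1,800,000/year
```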

S3 and EBS provide an instant-access service. Data is available on demand, is always online, and is stored in a highly redundant manner to reduce the risk of loss through hardware failure. But most people dealing with NGS data don't need this. Raw data needs to be instantly available for the duration of its initial analysis, so that it can be accessed by the various pipelines and applications the researchers use, but once analysed it is hardly touched again. It needs to be kept for a variety of reasons - regulatory compliance, cost of replacement, and peace of mind are just a few - but it doesn't need to be instantly accessible, as long as it can be accessed on request.

In-house solutions solve this by archiving older data onto tape or offline disks, which are then disconnected from the online systems and stored securely elsewhere. If the data is ever needed again (and in most cases it never is), the correct archived tape or disk has to be identified and reconnected to the system so that the data can be transferred back into live, online storage. There is a delay associated with this, but the cost of waiting is far lower than the cost of keeping the data online at all times, forever.

Glacier replicates this approach within the cloud. It provides a virtual offline storage method for archiving old or infrequently used data. In return for the slower access time (3-5 hours per retrieval request), the storage cost is slashed to around a tenth of the comparable S3 or EBS cost. The terabyte-year cost comes down to just $120, or, for the rapidly growing pharma example above, an annual cost of $180k after five years. That is a massive saving of around $1.6m, and it comfortably undercuts the cost of keeping the data in-house too.
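Plugging Glacier's price into the same sketch reproduces that saving; again, these are this post's round numbers rather than a formal quote:

```python
# Year-five comparison using the round numbers from this post.
STORED_TB = 1500                             # 300 TB/year for five years
S3_TB_YEAR = 1200                            # ~$0.10/GB/month
GLACIER_TB_YEAR = S3_TB_YEAR / 10            # Glacier at roughly a tenth of the price

s3_cost = STORED_TB * S3_TB_YEAR             # $1,800,000 per year
glacier_cost = STORED_TB * GLACIER_TB_YEAR   # $180,000 per year
print(f"S3/EBS:  ${s3_cost:,.0f}/year")
print(f"Glacier: ${glacier_cost:,.0f}/year")
print(f"Saving:  ${s3_cost - glacier_cost:,.0f}/year")   # ~$1.6m
```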

Of course, there are catches. Retrieving data from Glacier is free up to certain limits, but beyond those the requests are charged at the same per-terabyte fee as the monthly storage charge. So once the free allowance is used up, it will cost about $10 to recover a terabyte of data from Glacier. That is still a very small price to pay for securely archiving data that, in general, will never need to be accessed again but must be kept available just in case.
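To give a flavour of what the archive-and-retrieve workflow looks like in code, here is a minimal sketch using the AWS Python SDK (boto3) and its Glacier client. The vault and file names are made up for illustration, and the point is the call pattern - a synchronous upload, then an asynchronous retrieval job that only completes hours later - rather than the exact code:

```python
import boto3

glacier = boto3.client("glacier")
VAULT = "ngs-raw-data"   # hypothetical vault name

# Archive a finished sequencing run ("-" means the account of the current credentials).
with open("run_0123.bam", "rb") as f:
    resp = glacier.upload_archive(accountId="-", vaultName=VAULT, body=f)
archive_id = resp["archiveId"]   # keep this safe - it is the only handle on the data

# Months later: ask Glacier to stage the archive for download.
job = glacier.initiate_job(
    accountId="-",
    vaultName=VAULT,
    jobParameters={"Type": "archive-retrieval", "ArchiveId": archive_id},
)
job_id = job["jobId"]

# The job takes hours to complete; poll it (or subscribe to an SNS notification),
# then download the staged data once it reports Completed.
status = glacier.describe_job(accountId="-", vaultName=VAULT, jobId=job_id)
if status["Completed"]:
    output = glacier.get_job_output(accountId="-", vaultName=VAULT, jobId=job_id)
    with open("run_0123.restored.bam", "wb") as out:
        out.write(output["body"].read())
```

Nothing here is instantaneous by design: the retrieval job is the mechanism behind the 3-5 hour wait described above, which is exactly the trade-off that makes the storage so cheap.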

We're already looking at ways to integrate Glacier into Eagle's own services. Its promised support for direct, automated archival from S3 is also very exciting, and we can't wait to try that out when it becomes available (Amazon suggest there might be a 12-week wait on this feature - so maybe in time for Christmas?).

Topics: Amazon, AWS, Big data, Big data technology, Cloud, data, datasets, glacier, NGS, online, pharma, storage, transfer