March 3, 2017

GlusterFS vs a future Distributed Bioinformatics File System

What is GlusterFS?

The main theme of this year’s Eagle symposium was Big Data and its implications for bioinformatics budgets. With the rise of technologies such as Amazon S3 and Hadoop, we have recently been witnessing a dramatic departure from conventional back-end storage practices. Hadoop coerces us to think in terms of the “divide and conquer” and “map & reduce” paradigms, which is not a bad thing per se. However, not every problem can be expressed effectively as map & reduce, and sometimes it is simply not economical to re-engineer your entire existing architecture to fit into Hadoop. Instead, you can use more generic, traditional distributed platforms while keeping your preferred, established software approaches.

This is where GlusterFS comes in: your “generic” distributed file system does not have to be NFS. GlusterFS is an open source, distributed file system capable of scaling up to 72 brontobytes! It can handle thousands of connected clients running standard applications over any standard IP network. Its most outstanding properties are scalability, performance and high availability. Monolithic, legacy storage platforms can be expensive; GlusterFS, by contrast, provides less costly, virtualized storage that is easy to scale out.

GlusterFS is free for anyone to use; Red Hat, which sponsors the open source project, also sells a commercially supported version under the “Red Hat Storage” brand. I think GlusterFS or its derivatives could become more popular in the near future, as this kind of technology lets you use any commodity disks and machines without having to massively re-organize your existing infrastructure.

Why is GlusterFS becoming popular?

GlusterFS can be configured with different levels of redundancy and replication. You can have (see the example commands after this list): 

i)     a replicated file system in which your data files are replicated across a predetermined number of different servers,

ii)    a RAID-0-like setup involving striped bricks, in which large files are split into chunks spread across servers for better performance, and finally,

iii)   a distributed setup that takes a list of subvolumes and spreads whole files across them, presenting them as one single, larger storage volume.
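
As a rough sketch, and assuming two servers called server1 and server2 with bricks under /export/brick1 (all names are made up for illustration), these three setups could be created with the gluster command-line tool along these lines:

    # i) Replicated: every file is copied to both bricks
    gluster volume create rep-vol replica 2 server1:/export/brick1 server2:/export/brick1

    # ii) Striped: large files are split into chunks spread over the bricks
    gluster volume create stripe-vol stripe 2 server1:/export/brick1 server2:/export/brick1

    # iii) Distributed (the default): whole files are spread over the bricks
    gluster volume create dist-vol server1:/export/brick1 server2:/export/brick1

    # A volume must be started before clients can mount it
    gluster volume start dist-vol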

Support for data redundancy is the killer benefit of GlusterFS over NFS. In a distributed system with no replication, the danger is that if you lose a single server, you lose access to all the files hosted on that server. From personal experience, I highly recommend using a setup combining replicated and distributed GlusterFS.
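
Such a combined, distributed-replicated setup is simply a matter of listing more bricks: with “replica 2” and four bricks, GlusterFS pairs up consecutive bricks into mirrored sets and distributes files across the pairs (again, the hostnames and paths are illustrative):

    # Distributed-replicated: server1+server2 mirror each other,
    # as do server3+server4; files are spread across the two pairs
    gluster volume create dist-rep-vol replica 2 \
        server1:/export/brick1 server2:/export/brick1 \
        server3:/export/brick1 server4:/export/brick1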

Cloud-based storage solutions have long since become commonplace thanks to their pay-as-you-go pricing models and scale-out capabilities. But can we marry the two?

GlusterFS has a tutorial explaining how to set up a distributed, replicated GlusterFS platform on the cloud, specifically in AWS:

http://www.gluster.org/community/documentation/index.php/Getting_started_setup_aws

This setup will combine all the benefits of working on the cloud, e.g., scalability, high availability, pay-as-you-go, and so on, with all the features of a full-fledged, secure, high-performance distributed file system. There is more good news: any NFS client can mount a GlusterFS volume, although I recommend the native GlusterFS client for performance reasons.
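
To illustrate, mounting a volume with either client looks roughly like this (the volume name and mount points are made up; note that GlusterFS’s built-in NFS server speaks NFS version 3):

    # Native GlusterFS client (requires the glusterfs-fuse package)
    mount -t glusterfs server1:/dist-rep-vol /mnt/gluster

    # Plain NFS client, forcing NFSv3 over TCP
    mount -t nfs -o vers=3,mountproto=tcp server1:/dist-rep-vol /mnt/nfs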

At Eagle we have used “Gluster on the cloud” in some client projects: the installation and configuration were very straightforward. Furthermore, adding new partitions, or “bricks” in GlusterFS jargon, to an existing volume was trouble-free.
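
As an illustration, growing a volume is a two-step affair: add the new brick, then trigger a rebalance so that existing files spread onto it (server and volume names as in the earlier sketches):

    # Add a new brick to the distributed volume
    gluster volume add-brick dist-vol server3:/export/brick1

    # Spread existing files onto the new brick and check progress
    gluster volume rebalance dist-vol start
    gluster volume rebalance dist-vol status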

What do GlusterFS and its relatives need?

We bioinformaticians love moving data around, creating different versions of and modifying our “big data” along the various steps of our “analysis pipelines”. It is often the case that the same files are saved under different directories, even without modifications. Institutions routinely release new versions of their sequence databases, their “big data”, by making incremental updates. But everything has a price. When talking about storage demands we no longer mention only gigabytes, but terabytes and even petabytes. If you want to work on the data directly on your distributed file system, things can get really slow, while data transfer and storage costs can add up significantly over time. Time and space are everything (and, of course, according to Einstein they are a single entity).

I think, in an ideal world, in addition to the typical benefits of a distributed file system, a Distributed Bioinformatics File System (let’s call it DBFS) should be inspired by concepts such as “data deduplication” and “delta encoding”.

Data deduplication is also called “intelligent compression” or “single-instance storage”. As a simple analogy, think of sending an email with a huge image attachment of your cute cat to 10 different people. If the email server intelligently keeps only one copy of this file, instead of saving 10 copies of the attachment in different inbox folders, the server is using data deduplication. Of course, this model is not limited to whole files; it can be applied at the data chunk (or file block) level for more optimised storage. The caveat is that if we want to compress, delete or modify the original source in any way, the file system can get really messy after a while: all the files tied to this block will have to be updated as well!
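
File-level deduplication is easy to sketch in a few lines of shell: hash every file, and replace any file whose content has been seen before with a hard link to the first copy. This is a toy illustration only (the inbox paths are made up), and real deduplicating storage works at the block level with reference counting:

    #!/bin/bash
    # Toy file-level deduplication: hard-link files with identical content
    declare -A seen
    for f in /data/inbox/*/cat.jpg; do
        hash=$(sha256sum "$f" | cut -d' ' -f1)
        if [ -n "${seen[$hash]}" ]; then
            # Content already stored once: point this path at that copy
            ln -f "${seen[$hash]}" "$f"
        else
            seen[$hash]="$f"
        fi
    done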

For delta encoding, I will use the Git analogy. Suppose we have a Git repository and want to modify an existing file. Once we commit this change, Git can store a delta, or “diff”, describing only the updated content, while maintaining a pointer to the original that is used to construct the complete, updated version when needed. If this were all done at the file system level by our DBFS, then deleting one low-quality sequence read from a huge FastQ file and writing it back to disk would be super fast! Of course, DBFS should occasionally be able to “sync” itself to apply the latest changes, purging all the “diff” files by doing the actual copying, perhaps when the system is not heavily used.
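
The classic command-line tools already demonstrate the idea: store only the difference against the original, and reconstruct the full file on demand (the FastQ file names here are made up):

    # Remove one low-quality read, then store only the difference
    diff reads.fastq reads.edited.fastq > reads.fastq.delta

    # Later, rebuild the edited file from the original plus the delta
    patch -o reads.rebuilt.fastq reads.fastq reads.fastq.delta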

Once we have this technology at hand, I am sure we will start asking for more! As Socrates put it, “He who is not contented with what he has, would not be contented with what he would like to have.” 

Topics: Big data technology, Bioinformatics, GlusterFS