March 3, 2017

Big Data Sharing

BioTeam's Ari Berman wrote a guest blog over at Bio-IT World this month on the subject of sharing big data. Ari describes a transition over time: where a paper once represented, in itself, the complete set of data that needed to be shared, the current trend is for papers to present only the briefest of summaries and to rely on entire archives of supporting documentation to deliver the complete results for third-party analysis and confirmation.

Moving this data around is the default response to the need to share it. Hard drives shipped by FedEx, or internet transfers over high-speed technologies such as Aspera, will only get you so far, though; sometimes you have no choice but to grant access to the data at its original source so that collaborators can log in and analyse it remotely, subject of course to there being sufficient local resources to support that analysis. There is only so much mileage in moving data around, and we have probably reached the limit already.
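
To put some rough numbers on that claim, here is a back-of-the-envelope comparison. The dataset size, sustained link speed and courier time below are assumptions of mine, purely for illustration:

```python
# Back-of-the-envelope: courier a box of drives vs. push the bytes over the wire.
# All figures are illustrative assumptions, not measurements.

dataset_tb = 500        # assumed dataset size, terabytes
link_gbps = 10          # assumed *sustained* network throughput, Gbit/s
courier_days = 2.0      # assumed door-to-door time for shipped drives

dataset_bits = dataset_tb * 1e12 * 8
network_days = dataset_bits / (link_gbps * 1e9) / 86400

print("Network transfer: %.1f days (at a perfectly sustained %d Gbit/s)"
      % (network_days, link_gbps))                     # ~4.6 days
print("Courier:          %.1f days" % courier_days)
```

Whichever option wins, the recipient still ends up storing and curating a second complete copy of the data.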

Ari raises once more the alternative solution of bringing the analysis to the data, rather than the other way around, using Hadoop as a case study. Hadoop, as you will know from my colleague Mutlu Dogruel's previous posts on this blog, distributes both the data and its analysis efficiently across diverse, and potentially geographically dispersed, resources, running each piece of computation on the node that already holds the relevant block of data. You still have to break the data up and distribute it in the first instance, but you do not need to maintain multiple complete copies. A Hadoop-style approach seems appropriate and logical.
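
To make that concrete, here is a minimal Hadoop Streaming sketch in the spirit of bringing the analysis to the data: it counts alignments per reference sequence in SAM-formatted files, with each mapper running against whichever block of data is local to it. The paths and the job invocation in the comments are my own assumptions, not anything from Ari's post, and the exact streaming jar location varies by Hadoop version.

```python
#!/usr/bin/env python
# Minimal Hadoop Streaming sketch: count alignments per reference sequence
# in SAM-formatted files, so the computation runs where the data blocks live.
# Illustrative only -- the input/output paths and jar location are assumptions.
#
# Assumed invocation:
#   hadoop jar hadoop-streaming.jar \
#       -input /data/alignments -output /results/counts \
#       -mapper "python sam_counts.py map" \
#       -reducer "python sam_counts.py reduce" \
#       -file sam_counts.py

import sys

def mapper():
    # Emit "reference<TAB>1" for every mapped read; skip SAM header lines.
    for line in sys.stdin:
        if line.startswith('@'):
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) > 2 and fields[2] != '*':
            print('%s\t1' % fields[2])

def reducer():
    # Streaming delivers input sorted by key, so each reference's counts
    # arrive contiguously and can be summed in a single pass.
    current, total = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t')
        if key != current:
            if current is not None:
                print('%s\t%d' % (current, total))
            current, total = key, 0
        total += int(value)
    if current is not None:
        print('%s\t%d' % (current, total))

if __name__ == '__main__':
    mapper() if sys.argv[1] == 'map' else reducer()
```

The point is not the toy job itself, but that only the small per-reference tallies ever cross the network; the bulky alignment data stays where it is.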

The problem is that distributed systems are often disparate and lack any sophisticated queuing or prioritisation. The disparate nature means that users cannot expect their programs to run in exactly the same manner in every location, or even for the same toolkits and supplementary datasets to be available. The lack of queuing means that hardware paid for out of a local budget can end up monopolised by remote users' jobs, with no way to give local work priority.

The hole in Ari's argument is the assumption that servers are installed as traditional monoliths: non-virtual, bare-metal installations of a single OS with directly attached storage. Single-OS platforms are what make the environments disparate and the tooling inconsistent. A platform that could run any OS on demand, with the tools provided as part of a machine image delivered along with the request to process the data, would circumvent this issue neatly. Likewise, a platform that could work with data held on a fast enough local network, e.g. on a SAN connected by fibre, rather than requiring directly attached hard drives or other local storage, would allow a single machine to compute over a much larger range of data, and quickly.
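
Concretely, such a request might bundle the machine image, a pointer to the data on the SAN, and the resource caps into a single object, along the lines of the sketch below. Every field name here is a hypothetical of mine, intended only to show the shape such a request could take.

```python
# Hypothetical shape of a "bring the analysis to the data" request:
# a machine image plus a pointer to the data it should run against.
# None of these field names come from Ari's post; they are purely illustrative.
from dataclasses import dataclass

@dataclass
class AnalysisRequest:
    image_url: str         # machine image carrying the OS and toolkit
    data_path: str         # where the data lives on the host's SAN
    command: str           # entry point to run inside the booted image
    cpu_limit_mhz: int     # cap so guests cannot monopolise the host
    memory_limit_mb: int

request = AnalysisRequest(
    image_url='https://images.example.org/variant-calling-node.ova',
    data_path='/san/project42/alignments',
    command='run_pipeline.sh',
    cpu_limit_mhz=8000,
    memory_limit_mb=16384)
```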

This technology already exists, in large part. SANs are already a given in most modern data centres. VMware and other virtualisation platforms allow the installation of arbitrary machines containing arbitrary software, and can sandbox and secure them to ensure that they only access the data and local services they need. It would not be difficult for most major institutes to set up a VMware host to which clients submit machine images on demand, perhaps to act as a Hadoop node or as an execution node in a distributed cluster, and have them deployed, and later shut down, automatically. Equally, guest VMs could be restricted to a set share of host resources, so that the local researchers who paid for the hardware in the first place are never crowded out.
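
As a rough sketch of what that deploy, cap, run and tear-down cycle could look like against a VMware host, using pyVmomi (the Python SDK for the vSphere API): the vCenter address, credentials, VM name and limit values are all placeholder assumptions, and error handling is omitted.

```python
# Sketch only: cap a submitted guest VM, run it, then tear it down.
# Hostname, credentials, VM name and limits are assumptions for illustration.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

ctx = ssl._create_unverified_context()          # lab-only: skip cert checks
si = SmartConnect(host='vcenter.example.org',   # assumed vCenter endpoint
                  user='guest-submitter', pwd='********', sslContext=ctx)
content = si.RetrieveContent()

# Locate the freshly imported guest image by name (assumed to exist already).
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == 'guest-hadoop-node')

# Cap the guest so it cannot crowd out local researchers' workloads:
# limits are expressed in MHz (CPU) and MB (memory).
cap = vim.vm.ConfigSpec(
    cpuAllocation=vim.ResourceAllocationInfo(limit=8000),
    memoryAllocation=vim.ResourceAllocationInfo(limit=16384))
WaitForTask(vm.ReconfigVM_Task(spec=cap))

# Run the guest for the duration of the analysis...
WaitForTask(vm.PowerOnVM_Task())

# ...and tear it down once the job reports completion
# (job-completion detection is left out of this sketch).
WaitForTask(vm.PowerOffVM_Task())
WaitForTask(vm.Destroy_Task())

Disconnect(si)
```

The interesting step is the reconfiguration: the CPU and memory limits are what stop a guest image from crowding out the researchers who actually paid for the host.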

This can't be hard to achieve. So why hasn't it been done?

Topics: Big data technology