March 3, 2017

Biology must never develop its own big-data systems

John Boyle claimed in a Nature blog in July that biology needs to develop its own systems for managing big data, as none of the existing off-the-shelf solutions are suitable and in-house attempts are doomed to failure. As John mentions, a few companies within a stone's throw of his office are currently working on cloud-based systems to address this issue, and as Eagle is well and truly within that distance I can only assume that he includes us in that description! So, I feel I must respond...

To summarise John's lengthy discourse, he makes two key points:

1. Biological data is large, complex, heterogeneous, and does not lend itself well to being forced into a single format for storage.

2. Biologists should not have to learn how to manage their data, but computer scientists should learn to write software that thinks like biologists.

I absolutely agree with point 1. Whilst there is a case for combining similar data, it is never going to be possible to accurately and reliably transform sources as diverse as sequencing data and imaging data into a single format. These types of data are far better left in their original form and simply linked to each other by reference than converted. Any decent data management system these days - cloud-based or not - will recognise this and leave all data in its original form, centralising only the metadata that describes what is available and where. It will then provide links (or APIs) to access the original data once the metadata has been queried to identify the subset that is required. Eagle recognises this, and the cloud platform that we are working on, ElasticAP, does exactly this.
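As a minimal sketch of that idea (not ElasticAP's actual implementation - every class name, field and URI here is hypothetical and invented for illustration), a catalogue can hold only the descriptive metadata plus a pointer to each dataset, leaving the files themselves untouched in their original format:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CatalogueEntry:
    """Metadata about one dataset; the data itself stays where it is."""
    uri: str          # hypothetical location of the original, unconverted file
    data_type: str    # what the data is, e.g. "sequencing" or "imaging"
    annotations: frozenset = field(default_factory=frozenset)


class MetadataCatalogue:
    """Central index of metadata only (an assumed design, not a real API)."""

    def __init__(self):
        self._entries = []

    def register(self, uri, data_type, annotations=()):
        """Record what a dataset is and where it lives, without copying it."""
        entry = CatalogueEntry(uri, data_type, frozenset(annotations))
        self._entries.append(entry)
        return entry

    def find(self, *required_terms):
        """Return links to every dataset annotated with all the given terms."""
        required = set(required_terms)
        return [e.uri for e in self._entries if required <= e.annotations]


# Hypothetical example data: two very different sources, one catalogue.
catalogue = MetadataCatalogue()
catalogue.register("s3://lab-bucket/run42.fastq", "sequencing", {"liver", "mouse"})
catalogue.register("file:///imaging/slide7.tiff", "imaging", {"liver", "human"})

# The query touches only metadata; the caller follows the links afterwards.
liver_links = catalogue.find("liver")
```

The query never opens a FASTQ or TIFF file. It returns references, and whatever tool understands each format can take over from there.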

Of course, managing and integrating metadata is itself a challenge. Different sources are annotated with different terminology, with or without a reference ontology. Integrating the metadata for cross-source querying relies on mapping ontologies dynamically and ensuring that all data is categorised correctly. Note that I say categorised, not formatted. It doesn't matter what form the data is in as long as a correct note has been taken of what the data is.
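To make the mapping idea concrete, here is a small sketch under stated assumptions: the source names, local terms and shared concepts are all made up, and a real system would map ontologies dynamically rather than from a hard-coded table like this one. The point is only that records keep their original form and their original labels, while queries run against a shared concept:

```python
# Hypothetical mapping from each source's local vocabulary to a shared concept.
# A real system would derive this from ontologies; this static table is an
# assumption made purely for illustration.
TERM_MAP = {
    ("lab_a", "hepatic tissue"): "liver",
    ("lab_b", "Liver"): "liver",
    ("lab_b", "Renal"): "kidney",
}

# Hypothetical metadata records; the underlying files are never touched.
RECORDS = [
    {"source": "lab_a", "uri": "a/sample1.fastq", "term": "hepatic tissue"},
    {"source": "lab_b", "uri": "b/image9.tiff", "term": "Liver"},
    {"source": "lab_b", "uri": "b/image3.tiff", "term": "Renal"},
    {"source": "lab_a", "uri": "a/sample2.fastq", "term": "Other"},
]


def normalise(source, term):
    """Return the shared concept for a source-specific term, or None if unmapped."""
    return TERM_MAP.get((source, term))


def query(records, concept):
    """Find records from any source whose local term maps to the shared concept."""
    return [
        r["uri"]
        for r in records
        if normalise(r["source"], r["term"]) == concept
    ]


liver_hits = query(RECORDS, "liver")
```

Here `liver_hits` contains one sequencing file and one image, found through two different local terms. The record labelled 'Other' maps to nothing, so no query will ever return it.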

Which brings me to John's second point. All the best IT systems in the world are intuitive to the end user and need little or no training to use, whilst delivering exactly what the user needs without any special intervention. Biological IT systems must be the same, and on this point I agree with John. Biologists cannot be expected to retrain on new systems every five minutes or to do battle with overly complex (or standardised) query interfaces just to get at their own data. However, even with the best of IT systems in place, there is still a need for someone to annotate the data correctly in order to explain to the system what it is supposed to be dealing with - and to make it easier to identify and retrieve the data later on - and it is here that sadly many biologists start to have problems.

How many times have people stored data in folders called 'My Experiment', or named their genes 'Gene 1', 'Gene 2', etc., or used the ontology tagging system to label everything as 'Other' rather than spending an extra second or two to choose the correct term? Part of this is the fault of the systems for making the selection of appropriate terms and names too difficult or time-consuming, but a larger part is simply that many biologists lack any basic training in information management and do not understand why it is important that they should spend a little extra time making sure their data is annotated correctly before submitting it to whatever storage and analysis system they are using. Computers are stupid - they can only do what they are told by their users - and if their users do not select the correct ontology terms then the computer cannot do anything about it no matter how cleverly it has been programmed.

Again, ElasticAP has been designed to allow users to annotate their data using any terms or ontologies they wish, or indeed none at all. But it is not the design of the system that makes data easier to retrieve - it is the determination of the users to ensure their data is properly annotated, so that they and their colleagues stand a better chance of retrieving it and integrating it with other resources in future.

So when John says that current systems "demand that scientists change the way they work, to generate standardized sets of results" he is absolutely correct, but fails to acknowledge that this requirement comes with good reason. It is not the results themselves that most systems aim to standardise, but the way in which those results have been described and annotated. Without standardised terminologies to describe what data is, data can never be successfully integrated into transparent cross-resource queries. Biologists will have to learn how to describe their own work in order for any such system to succeed.

To close, let's reflect on the three(-and-a-half) lessons that John suggests:

1. The data is going to change. Yes, absolutely, bring it on. Just annotate it properly!

2. People are not going to change. I disagree. For science to progress to the point where it can embrace and build upon big data, people must learn to annotate their data properly.

3a. The problem is not technical. Absolutely correct. The problem is that people don't annotate their data properly.

3b. Data-management systems must be driven by requirements, not by the latest fashionable technology. On this point at least, John, I couldn't agree more!

Lastly, none of this is unique to biology. The terminologies/ontologies chosen to annotate data may be biological in nature, but at the technical level the aggregation and integration of disparate data sources from multiple repositories, by querying and integrating at the metadata level through common or inter-mapped annotation standards, is absolutely generic and applies to all data management problems, not just biology. For biology as a whole to try to build its own specialist tools to manage this is just as bad as the situation today where individual labs are building their own in-house versions. Biology must never think it is a special case - because it is not. It needs to learn from, and work with, the IT industry across all potential applications, because 80% of the answer is probably already out there. This is exactly what Eagle is doing with ElasticAP and all our other solutions.
