March 3, 2017

Who owns Big Data?

The NY Times ran a May 21 article entitled "Troves of Personal Data, Forbidden to Researchers" in which its correspondent discussed the twin thorny issues of size and ownership/privacy of data. Both factors affect whether the source data behind published research can be made accessible to third parties wishing to analyse it or replicate the results.

Big data is the easier one to discuss. The size of bioinformatic datasets, e.g. raw NGS data straight off the machine (including images), is large but not unique to the sector. The NY Times article draws comparisons with social science, where it is not uncommon to carry out research investigating communication habits based on mobile phone records. A typical mobile phone operator's call records are likely to be extremely large and best analysed in-place rather than being transferred or copied to another location. Researchers therefore have to co-operate with the holders of the data to carry out their analysis in-place rather than remotely, or ask the provider to produce aggregate summaries to the researcher's specification, which are then transferred to the researcher for processing.

In bioinformatics terms, the in-place option is equivalent to placing a server under the desk beside the sequencer and carrying out all analysis locally (not uncommon in smaller labs, but an inherently risky strategy to adopt), or doing local pre-processing and then sending only the aggregate results off for remote processing. The pre-processing step is exemplified by the simple task of base-calling: reducing a terabyte or more of image data to a few gigabytes of quality-scored base data.
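To make the idea concrete, here is a minimal Python sketch of the "pre-process locally, ship only the summary" approach: it collapses a potentially huge FASTQ file into a small table of per-position mean quality scores that could be sent to a collaborator instead of the raw reads. The file name and the Phred+33 quality encoding are assumptions for the sake of illustration, not a prescription.

# Minimal sketch: reduce a large FASTQ file to a small per-position
# quality summary that can be shipped instead of the raw reads.
# Assumes a local file named "sample.fastq" with Phred+33 quality encoding.

from collections import defaultdict

def summarise_fastq(path):
    totals = defaultdict(int)   # summed quality score per read position
    counts = defaultdict(int)   # number of reads covering each position
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 3:                 # every 4th line is the quality string
                for pos, char in enumerate(line.rstrip("\n")):
                    totals[pos] += ord(char) - 33    # Phred+33 decoding
                    counts[pos] += 1
    return {pos: totals[pos] / counts[pos] for pos in sorted(counts)}

if __name__ == "__main__":
    summary = summarise_fastq("sample.fastq")
    for pos, mean_q in summary.items():
        print(f"position {pos + 1}: mean quality {mean_q:.1f}")

The output is a few kilobytes that travel over any connection in seconds, whereas the raw reads (let alone the original images) may not.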

However, with the growing number of sequencers and sequencing projects and the limited bandwidth of most internet connections, the gigabytes of pre-processed data from multiple projects still add up to a significant management problem when working with collaborators or external service providers. Internal data centres work well, offering relatively high-speed transfer over local networks, but suffer from limited storage and compute capacity and from expensive, slow upgrade procedures when expansion becomes necessary. Internet transmission protocols have improved with the development of tools such as UDT, which make transmitting small to medium-sized sequencing datasets easier, but for any project of significant size, shipping hard drives to external collaborators is still the only viable option in many cases. In other words, we are still taking the data to the compute, and not the other way around, despite the best efforts of many to change this situation.
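A quick back-of-the-envelope calculation shows why. The figures below (the dataset sizes, a 100 Mbit/s link running at 70% efficiency, an overnight courier) are illustrative assumptions rather than measurements, but the conclusion is hard to escape once datasets reach tens of terabytes.

# Rough sketch: compare transferring a dataset over a typical internet
# link with couriering a hard drive. All numbers are illustrative.

def transfer_days(dataset_tb, link_mbps, efficiency=0.7):
    """Days needed to push dataset_tb terabytes over a link_mbps link,
    assuming the link only achieves `efficiency` of its nominal rate."""
    bits = dataset_tb * 1e12 * 8
    seconds = bits / (link_mbps * 1e6 * efficiency)
    return seconds / 86400

for size_tb in (1, 10, 50):
    print(f"{size_tb} TB over 100 Mbit/s: ~{transfer_days(size_tb, 100):.1f} days "
          f"(vs ~1 day for a couriered drive)")

At around ten terabytes the courier wins comfortably, and the gap only widens from there.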

More thorny is the issue of ownership that the NY Times article mentions. Much data in the social sciences is personally identifiable, as is clinical data in bioinformatics, and so legislation and common sense require it to be suitably anonymised before the raw data is released for scrutiny. Whilst in social science this is fairly easily achieved by randomising identifiers, phone numbers and so on, in bioinformatics there is the slightly trickier issue that it is possible, at least in theory, to identify an individual participating in a study purely from their genetic sequence. Even sequencing data with anonymised identifiers is therefore saturated with personally identifiable information, and removing or randomising the sequence itself would make it entirely useless for verifying research findings.
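For comparison, the social-science fix really is as simple as it sounds. The sketch below pseudonymises phone numbers with a salted hash so that records can still be linked across a dataset without exposing the numbers themselves; the salt and the example records are invented purely for illustration. No equivalent trick exists for sequence data, because there the identifying information is the data itself rather than a label attached to it.

# Sketch of identifier randomisation for call records: replace phone
# numbers with salted hashes so records remain linkable but anonymous.
# The salt and example records are made up for illustration.

import hashlib

SALT = b"project-specific-secret"   # hypothetical per-study salt

def pseudonymise(identifier: str) -> str:
    """Deterministically map an identifier to an opaque token."""
    return hashlib.sha256(SALT + identifier.encode()).hexdigest()[:16]

records = [("+44 7700 900123", "call, 4 min"),
           ("+44 7700 900456", "call, 12 min")]

for number, detail in records:
    print(pseudonymise(number), detail)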

Then there is the question of copyright and legal ownership of data. The article discusses Google refusing to release data relating to search statistics that was used in an associated research paper by scientists at the University of Cambridge. It really does go against the grain of science as we know it to expect people to verify or rely on your findings without ever being able to access the data that backed them. Reproducibility is key to making general findings that can be verified by other experts after publication. Without the raw data this is impossible, and independent assessments that support or challenge research findings become impossible too. An unscrupulous researcher could concoct their findings, claim the backing data is secret, and nobody would ever be able to prove them wrong.

Should Google have opened up their raw data? Probably not, given commercial sensitivity. But at the least they could have provided summary data in aggregate form, protecting whatever IP was at risk in the raw data whilst giving sufficient confidence in the provenance and accuracy of the aggregate figures. Or they could have offered interested parties the same facility that the original researchers had, subject to suitable confidentiality agreements of course. This might seem an onerous requirement to place upon a company, but if they wish to participate in the scientific community and expect related findings to be taken seriously then it is only fair to expect it.

In bioinformatics, this is similar to the controlled-access licences that TCGA and EGA offer for their more sensitive data (as opposed to the free public versions), where access has to be approved by a committee acting on behalf of the individuals who provided their data before the full dataset can be used. The procedure is not painless but it is effective. It exists to protect the privacy of the data providers (just as important as a company's commercial interests) whilst also preserving a route for third-party researchers to validate and verify published papers that made use of that data.

Maybe the data providers active in the social sciences could learn something from the current methods of bioinformatics researchers. And if, in the meantime, social science comes up with a great solution for moving big data around, I'm sure we bioinformaticians would love to hear about it.

Topics: Big data, Big data technology, Bioinformatics, Cloud