March 3, 2017

Hiding your genome in full sight

Last month, Science blogged about a new technique for encrypting human genomes in such a way that computation could be carried out against them whilst maintaining the privacy of the individual. The technique allows a query to be run against an encrypted genome without ever decrypting it, yet it still returns the same results as if the query had been run against the original data.

I'm dubious about the benefits of this approach. It appears to be based on the assumption that you can identify an individual by comparing all their specific differences against the public reference genome, and that by preventing access to the complete set of those differences you effectively anonymise the data. But when you think about it, preventing that access need have nothing to do with encryption.

Think of it this way: if I am running a clinical trial, I have a couple of very specific questions in mind when I look at patient data. I want to know which patients to enrol in the trial by selecting them on the basis of a particular genetic marker, and I want to be able to stratify the responses to the trial by reporting the markers associated with each group of patients, categorised by their response. If I only look at the markers that these specific queries return, then I will not have enough information to identify the individual patients. The encryption method described claims to work by ensuring that, even if I wanted to, I could not see the original genome sequence for each patient and therefore could not gather a complete set of markers, accidentally or otherwise.

However, this access limitation has nothing to do with the encryption. If I run enough queries of the right type against enough sets of markers of interest, there is a good chance that I could recover the full set of individual variants, which would be sufficient to identify the patient just as if I had access to the original genome sequence.
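To make that concrete, here is a minimal sketch in Python. It is not the technique from the Science post: the `OpaqueGenomeStore` class, the `has_marker` query and the marker names are all hypothetical, and the "store" simply hides its data behind a yes/no interface to stand in for an encrypted genome that can be queried but never read.

```python
# Hypothetical illustration only: these names are assumptions, not a real API.
class OpaqueGenomeStore:
    """Stands in for an encrypted store: callers can query it but never read the raw data."""

    def __init__(self, variants_by_patient):
        # Treated as inaccessible from outside the class.
        self._variants = variants_by_patient

    def has_marker(self, patient_id, marker):
        # Answers one yes/no question about one marker for one patient.
        return marker in self._variants[patient_id]


def reconstruct_profile(store, patient_id, candidate_markers):
    """Rebuild a patient's variant set purely from yes/no query answers."""
    return {m for m in candidate_markers if store.has_marker(patient_id, m)}


# Made-up marker names; a real attack would iterate over known variant sites.
store = OpaqueGenomeStore({"patient-1": {"m1", "m3", "m4"}})
candidates = ["m1", "m2", "m3", "m4", "m5"]
recovered = reconstruct_profile(store, "patient-1", candidates)
```

In this toy case `recovered` ends up holding exactly the patient's hidden variants, even though the caller never saw the underlying data. Nothing about how the data is protected at rest prevents this; only a limit on which queries may be asked would.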

The security challenge here is not one of encryption; it is one of query limitation. The aim should not be to encrypt the data, because if I am allowed to query that data in any way I like then I will eventually be able to reassemble enough of it to make it identifiable. The aim should instead be to limit my access so that I am prevented from querying the data for anything other than my originally stated purpose. What we need here is not encryption but simple access control. Encryption is the wrong solution to the wrong problem.
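As a rough sketch of what query limitation could look like, the Python below refuses any query about a marker that was not approved for the stated purpose. The `PurposeLimitedStore` class, its `approved_markers` parameter and the marker names are hypothetical, and a real system would also need authentication, auditing and limits on combined queries.

```python
# Hypothetical illustration only: these names are assumptions, not a real API.
class PurposeLimitedStore:
    """Answers marker queries only for markers approved for the stated purpose."""

    def __init__(self, variants_by_patient, approved_markers):
        self._variants = variants_by_patient
        # The set of markers agreed up front, e.g. in the trial protocol.
        self._approved = frozenset(approved_markers)

    def has_marker(self, patient_id, marker):
        # Refuse anything outside the approved purpose before touching the data.
        if marker not in self._approved:
            raise PermissionError(f"marker {marker!r} is outside the approved purpose")
        return marker in self._variants[patient_id]


# Made-up marker names; only "m1" was approved for this imaginary trial.
limited = PurposeLimitedStore(
    {"patient-1": {"m1", "m3", "m4"}},
    approved_markers={"m1"},
)
eligible = limited.has_marker("patient-1", "m1")  # allowed: part of the stated purpose

try:
    limited.has_marker("patient-1", "m3")  # refused: not part of the stated purpose
    refused = False
except PermissionError:
    refused = True
```

With a gate like this, the number of queries I run no longer matters: I can only ever learn about the markers I declared an interest in at the start.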

The closing comments in the blog may be more insightful than they appear at first glance. Sage Bionetworks' John Wilbanks says that at some point it may be easier simply to resequence an individual from a stolen fragment of physical DNA than to attempt to break the encryption on a copy of their genome. This is very true, and echoes the knowledge in the IT industry that no matter how good your security systems are, physical access to a machine is the greatest risk of all (as is, for that matter, unrestricted querying without sufficient access control). Put this alongside the continuing growth in genomic sequencing and the idea that at some stage everyone will have their genome sequenced, and the question has to be asked: why do we store these genomes at all and thus expose them to this risk in the first place? Why not just sequence or resequence the samples themselves whenever questions need to be asked, and discard the sequenced genome as soon as the answer is found, to minimise the risk of misuse? (Rare or continuously changing conditions need to be the exception, as does data collected in support of any regulated scenario, but there is certainly no need at all for a default situation where everyone is sequenced in full.)

The less data we store, the less risk there is of it being stolen. I've said it many times before but I'll say it again - think before you sequence.

Topics: Bioinformatics