March 3, 2017

Sequence once, read often?

I was recently invited to a personalised medicine conference that talked about the new 'mantra' of 'sequence once, read often'. It appeared to be suggesting that the common practice is heading in the direction of taking a single reference sequence for an individual, maybe taken at birth or at their first diagnosis of a particular disease, then storing that away somewhere secure, for example with their medical records or on some kind of centralised health service computer system. I find this trend somewhat worrying from a data analytics point of view, as it may well cause more problems than it solves.

Sequencers are not error-free, and there is always a good chance that a variation discovered in somebody's genome is a sequencing error rather than a genuine deviation from the norm. The chances of this vary according to the choice of sequencing platform, the circumstances of the experiment, the quality of the DNA sample, cleanliness and accuracy in sample prep, and many other factors, although overall error rates are much lower now than they were even a few years ago. To be confident that an observed variation is genuine rather than an error signal, you have to sequence to a reasonable depth of coverage, which means generating far more data than just a single copy of the genome.
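To illustrate why depth of coverage matters, here is a minimal sketch under some loud assumptions: every read covering a position is wrong independently with the same probability, the consensus is a simple majority vote, and all wrong reads happen to agree on the same wrong base (a worst case). The function name and the 1% per-read error rate are purely illustrative and are not taken from any particular platform; real variant callers use more elaborate models.

```python
from math import comb


def consensus_error_probability(depth: int, per_read_error: float) -> float:
    """Probability that a strict majority of `depth` reads is wrong at one position.

    Assumes independent reads with identical error probability (a simplification).
    """
    majority = depth // 2 + 1  # smallest number of wrong reads that wins the vote
    return sum(
        comb(depth, k) * per_read_error**k * (1 - per_read_error) ** (depth - k)
        for k in range(majority, depth + 1)
    )


if __name__ == "__main__":
    ILLUSTRATIVE_ERROR_RATE = 0.01  # hypothetical value, for illustration only
    for depth in (1, 3, 5, 11, 31):
        p = consensus_error_probability(depth, ILLUSTRATIVE_ERROR_RATE)
        print(f"depth {depth:>2}: chance of a wrong consensus call = {p:.3e}")
```

Under these assumptions the chance of a wrong call falls off very quickly as depth increases, which is the reason a trustworthy consensus needs many reads per position rather than one.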

The proposal is to store the final consensus of this in-depth coverage as the reference point for all future enquiries related to the patient's genetic make-up. The consensus could be stored as a complete genome readout, or as a set of variations from the 'standard' human genome in force at the time the consensus was generated. Either approach is problematic.

Taking 3 gigabases as an approximate size for the human genome, storing the complete genome readout of every individual in the health service on any kind of centralised system would require huge amounts of storage: about 186 petabases of sequence for the current UK population (~62 million), which could probably be compressed down to around a petabyte of storage if managed well. Maintaining this would be quite a headache, quite apart from the potential ethical outcry, not dissimilar to the one that met previous proposals to maintain a national DNA database of criminal suspects or to introduce national identity cards.
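The arithmetic behind that estimate can be sketched as follows. The genome size and population are the figures quoted above; the 2-bits-per-base packing is my own assumption (four bases, no ambiguity codes or quality scores), and the constant and function names are just for illustration.

```python
GENOME_SIZE_BASES = 3_000_000_000  # ~3 gigabases, the approximation used in the text
UK_POPULATION = 62_000_000         # ~62 million, as quoted in the text
BITS_PER_BASE = 2                  # assumption: A, C, G, T packed into 2 bits each
PETA = 10**15


def total_bases(genome_size: int, population: int) -> int:
    """Total number of bases if every individual's full genome is stored."""
    return genome_size * population


def packed_bytes(bases: int, bits_per_base: int = BITS_PER_BASE) -> int:
    """Bytes needed for naive fixed-width packing, with no further compression."""
    return bases * bits_per_base // 8


bases = total_bases(GENOME_SIZE_BASES, UK_POPULATION)
petabases = bases / PETA
naive_petabytes = packed_bytes(bases) / PETA
extra_compression_for_one_pb = naive_petabytes / 1.0

if __name__ == "__main__":
    print(f"sequence to store:          {petabases:.1f} petabases")
    print(f"naive 2-bit packing:        {naive_petabytes:.1f} petabytes")
    print(f"further reduction for 1 PB: about {extra_compression_for_one_pb:.1f}x")
```

Naive packing alone gives tens of petabytes, so getting down to around a petabyte would need a substantial further reduction, presumably by exploiting how similar one human genome is to another.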

Therefore, it is more likely that any whole-genome readout would be stored with the individual themselves, either on a USB stick that they carry or lock away at home, or with their local medical records, accessible only within their local doctor's surgery. This raises major issues of potential data loss (everyone loses USB sticks from time to time) and of unnecessary repetition (will the hospital have access to something stored only at the local surgery?). So it is unlikely to be practical.

Storing only the variations greatly reduces the data size for a centralised system, but still raises the data privacy concerns described above. On top of that, it is only as good as current knowledge. If later research identifies a potential new variant of interest that was not included on the existing maps then it will not show up on the patient's record, and they may have to be resequenced in order to detect it. This eliminates the savings of the sequence-once approach.

My personal opinion is that sequence-once, read-once is a much more practical approach to the issue of personal genomes. Sequencing technology is heading rapidly south in terms of price per genome, to the point where it will not be long now until the cost of resequencing an individual every time their genetics is needed as part of diagnosis will be less than the IT costs of having to store one copy of that data for their lifetime. Accuracy is also improving over time so it is likely that each new genome scan will be more reliable than the last.

There is no more efficient way of storing DNA data than DNA itself, so why not let the patient be their own repository for this information? It is impossible to lose and is always up-to-date, and as sequencing technology improves, each new readout is likely to be more accurate than the one before. Additionally, there is no central repository for privacy campaigners to be concerned about, no risk of the data being lost or revealed accidentally by a third party, and no chance that an important variation may be missed because the copy of the sequence on record predates the discovery of that vital piece of knowledge.

Topics: Bioinformatics