March 3, 2017

ENCODE: A beachcomber's guide to the genome

(CC BY 2.0)

Press coverage of ENCODE focused on the project's 'de-junking' of the genome. But semantic wrangling aside, what of the ENCODE legacy?

The public release of ENCODE (ENCyclopedia of DNA Elements) last week provides the likes of Eagle and our customers with a veritable cornucopia of new toys to play with. Of particular interest is the ENCODE Virtual Machine, which contains much of the software used in the analysis. Although it is supplied as a VirtualBox VM, migrating it to AWS should be feasible. Whether ENCODE follows modENCODE and 1000 Genomes to EC2 remains to be seen, but is clearly to be encouraged.

Much of the immediate scientific reaction to ENCODE (see the roundup from OpenHelix) concerns its 'de-junking' of DNA and the claim that "80% of the genome is functional". Really? Much of the argument stems from ambiguity in the term 'functional'. Following ENCODE it's now clear that most genomic DNA 'functions' (verb) in a biochemical sense (i.e., sticks to the cellular machinery), but it does not follow that all these interactions have a 'function' (noun) in a biological sense (implying some wider purpose). For the scientific context, see Sean Eddy's excellent post on the subject. This distinction between 'function' (verb) and 'function' (noun) has proved problematic in the past. For example, the EBI Functional Genomics Group annotates the biological function of genes, whereas Ensembl FuncGen (also at the EBI) focuses on DNA binding. The latter is now referred to as "Ensembl Regulation" to avoid confusion.

Such nuances in definition are not new to genomics; the term 'gene', for example, is variously used to refer to a 'unit of heredity' or to a 'genomic region producing an mRNA'. The distinct advantage of the latter is that it is much easier to comprehensively identify genes in genomic DNA using mRNA evidence than it is using the classical genetics definition. Which brings me to the reason I'm excited by ENCODE: much as RNA-Seq data are used to predict genes, the ENCODE data will be used to build a comprehensive catalogue of 'regulants' (genomic regions putatively regulating mRNA transcription). The Ensembl Regulatory Build is, of course, the epitome of this approach. Such catalogues will form invaluable frameworks for the systematic annotation of epigenetic processes in genomic context. We derive, therefore, a neat solution that satisfies both sides of the 'functional' vs. 'functional' debate.
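To make the idea of a catalogue of 'regulants' a little more concrete, here is a minimal Python sketch of one way such a catalogue could be assembled: overlapping evidence from different assays is merged into candidate regions, and each region remembers which assays support it. This is an illustration under stated assumptions, not the Ensembl Regulatory Build method or anything shipped in the ENCODE VM. The function name `merge_peaks`, the tuple layout, the half-open coordinates and the example values are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical record layout: (chromosome, start, end, assay), half-open coordinates.
Peak = Tuple[str, int, int, str]


def merge_peaks(peaks: Iterable[Peak]) -> List[dict]:
    """Merge overlapping peaks on each chromosome into candidate regions.

    Each returned region records the set of assays that contributed to it.
    """
    by_chrom: Dict[str, List[Peak]] = defaultdict(list)
    for peak in peaks:
        by_chrom[peak[0]].append(peak)

    regions: List[dict] = []
    for chrom in sorted(by_chrom):
        current = None
        # Sweep left to right so each peak only needs comparing with the open region.
        for _, start, end, assay in sorted(by_chrom[chrom], key=lambda p: (p[1], p[2])):
            if current is not None and start < current["end"]:
                # Overlap: extend the open region and note the supporting assay.
                current["end"] = max(current["end"], end)
                current["assays"].add(assay)
            else:
                if current is not None:
                    regions.append(current)
                current = {"chrom": chrom, "start": start, "end": end, "assays": {assay}}
        if current is not None:
            regions.append(current)
    return regions


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    example = [
        ("chr1", 100, 200, "DNase"),
        ("chr1", 150, 260, "CTCF"),
        ("chr1", 400, 450, "H3K4me3"),
        ("chr2", 10, 50, "DNase"),
    ]
    for region in merge_peaks(example):
        print(region["chrom"], region["start"], region["end"], sorted(region["assays"]))
```

A real build would weigh assay types and cell lines rather than treating every peak equally, but the shape of the result is the same: a list of genomic regions, each annotated with its evidence, that other annotation can hang from.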

To end with the thoughts of Chief ENCODE-ian Ewan Birney: "The real measure of a foundational resource such as ENCODE is not the press reaction, nor the papers, but the use of its data by many scientists in the future."

Topics: AWS, Bioinformatics, eaglensembl, ENCODE, epigenetics, Functional genomics, high throughput, science