BIOCURATION IS THE GLUE
In recent years science has moved rapidly from identifying the first genes involved in cancer (1) to characterising entire cancer genomes (2). These advances herald an understanding of the molecular basis of this entire class of complex diseases and the development of personalized medicines for their treatment.
This massive worldwide collaborative undertaking has resulted in several genomic resources for cancer, which provide a vital foundation for continued advances in the field. Each resource has different and complementary strengths and weaknesses. Here we take the reanalysis of “whole exome sequences” (WXS) from large numbers of cancer patients as an important task that these resources enable. Integrating WXS data files from multiple resources increases the number of genomes available, and hence the power of downstream analyses.
A single resource often contains multiple samples for an individual patient donor, notably paired tumor/normal samples, but also multiple tumor samples. There is also significant overlap in samples between the resources. The overlap in samples between EGA, ICGC and TCGA is shown in the figure below.
This overlap, established through extensive semi-automated curation of records spanning the resources, yields a total of over 17,000 genomes with WXS data available for analysis. It also allows us to combine the unique characteristics of each resource. ICGC, for example, has an extensive collection of standardized clinical metadata for donors, which adds considerable value to the primary sequences in EGA and TCGA.
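As a rough illustration of what this overlap means in practice, the sketch below groups samples by the exact set of resources in which they appear, and counts unique genomes. It assumes curation has already mapped each record to a shared canonical sample identifier. The function name `venn_regions` and the toy identifiers are hypothetical and are not taken from any of the resources.

```python
def venn_regions(resources):
    """Group samples by the exact set of resources that contain them.

    `resources` maps a resource name to a set of canonical sample IDs
    (assumed to have been harmonised by curation beforehand).
    Returns {frozenset of resource names: set of sample IDs}.
    """
    regions = {}
    all_samples = set().union(*resources.values())
    for sample in all_samples:
        members = frozenset(
            name for name, ids in resources.items() if sample in ids
        )
        regions.setdefault(members, set()).add(sample)
    return regions


# Toy, made-up identifiers purely for illustration.
resources = {
    "EGA": {"S1", "S2", "S3"},
    "ICGC": {"S2", "S3", "S4"},
    "TCGA": {"S3", "S5"},
}

regions = venn_regions(resources)

# Each genome is counted once, however many resources hold it.
unique_total = len(set().union(*resources.values()))

for members, samples in sorted(regions.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(members), sorted(samples))
print("unique genomes:", unique_total)
```

Counting the union rather than summing per-resource totals avoids double-counting samples shared between resources. Knowing which resources hold a given sample also shows where its complementary metadata can be drawn from.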
- “A point mutation is responsible for the acquisition of transforming properties by the T24 human bladder carcinoma oncogene” E. P. Reddy, R. K. Reynolds, E. Santos & M. Barbacid. Nature 300, 149-152 (1982)
- “A small-cell lung cancer genome with complex signatures of tobacco exposure” Erin D. Pleasance, Philip J. Stephens, Sarah O’Meara, et al., Michael R. Stratton, P. Andrew Futreal & Peter J. Campbell. Nature 463, 184-190 (2010), doi:10.1038/nature08629