March 3, 2017

Genome assembly at scale

A consortium of researchers from EMBL-EBI, TGAC, and the University of Oxford has released a new open-source genome assembly tool called Cortex. The tool has its own website and has been reviewed on GenomeWeb. The main developer behind the consensus genome assembly portion is TGAC's Mario Caccamo, who will be speaking at Eagle's 2nd annual symposium on bioinformatics next week.

Cortex has interesting implications for the migration of bioinformatics infrastructure to cloud computing. Most bioinformatics tools either do not require unusually large resources or can easily be made to run in parallel on a number of smaller machines, but genome assembly has so far resisted most attempts to make it behave this way. Standard approaches tend to insist on holding most, if not all, of the assembled genome and its supporting data structures in memory during the assembly process, which can lead to demands for machines with a terabyte or more of RAM. Such machines are easy (if not cheap) to obtain in physical data centres, but very hard to find amongst the offerings of cloud computing vendors: they are simply not in demand and are therefore not a cost-effective service to offer.

Without a very-large-memory machine available in the cloud, and given the need for assembly data and results to be integrated with other bioinformatics software, many bioinformatics groups have been hesitant to migrate any of their resources to the cloud. Keeping low-resource machines in the cloud and high-resource machines locally not only complicates integration at the network and software level, but also raises the problem of transporting data back and forth between the two sets of computational hardware.

The arrival of Cortex (and other parallel or cloud-designed algorithms for genome assembly, such as SGA) means that researchers can now seriously look at running assembly on standard-sized resources and thus move much more of their bioinformatics infrastructure into the cloud. It will take time for the research community to review, comment on, and improve Cortex and its competitors until they are fully proven and accepted to the point where journal reviewers do not question their use. If these tools stand up to the scrutiny and prove their worth, however, this could spell the end for heavyweight local bioinformatics infrastructure and make flexible, lightweight cloud-based alternatives a much more attractive proposition.

Topics: Bioinformatics, genome assembly