James Bonfield of the Wellcome Trust Sanger Institute Develops Best New Compression Algorithm for NGS.
BOSTON, MA, April 23, 2012 – The Pistoia Alliance, a precompetitive alliance of life science companies, technology vendors, publishers, and academic groups, today announced the winner of the Pistoia Alliance Sequence Squeeze Competition. James Bonfield, a member of the sequence informatics team at the Wellcome Trust Sanger Institute, has won the US$15,000 prize for developing the best new algorithm for compressing next-generation sequencing (NGS) data. Bonfield will receive his prize at the Pistoia Alliance Conference, held on the Thomson Reuters campus in Boston on April 24, 2012. The prize is awarded on behalf of a prestigious judging panel comprising representatives from BGI, the Broad Institute, the Wellcome Trust Sanger Institute, and the Pistoia Alliance.
Labs rely on compression to enable them to store data from NGS runs, which includes sequencing reads and associated quality scores. Yet compression technologies are themselves faltering under the data volumes produced by NGS. The Pistoia Alliance Sequence Squeeze Competition encouraged anyone with expertise in data compression to tackle this problem. Ultimately more than 100 entries were received.
The judging panelists evaluated five key dimensions of the competitors’ entries:
- Compression ratio (a measure of how much the algorithm squeezes the data)
- Compress time
- Decompress time
- Compress memory
- Decompress memory
The judges weighted three of these elements more heavily because of their pivotal role in real-world usage: compression ratio, compress time, and decompress time. Compression ratio and compress time affect how quickly data can be packaged for analysis and how easily it can be stored long term. Decompress time affects how readily scientists can extract value from NGS data sets. Compress and decompress time were also deemed important because they determine how quickly data can be transferred between proprietary datacenters and the cloud-based systems for genomics storage and analysis that are becoming increasingly prevalent in the life science industry.
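As an illustration of the three most heavily weighted criteria, the following minimal Python sketch measures compression ratio, compress time, and decompress time for gzip, the competition baseline, on a small synthetic FASTQ sample. It is not the competition's scoring code. The helper names `make_fastq` and `measure_gzip`, the synthetic reads, and the definition of ratio as original size divided by compressed size are all assumptions made for this example. Memory use, the remaining two criteria, is not measured here.

```python
import gzip
import random
import time


def make_fastq(n_reads, read_len=100, seed=0):
    """Build a synthetic FASTQ sample (hypothetical test data, not real reads)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        # Phred+33 quality characters for scores between 20 and 40.
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def measure_gzip(data, level=6):
    """Return compression ratio and wall-clock timings for one gzip round trip."""
    t0 = time.perf_counter()
    packed = gzip.compress(data, compresslevel=level)
    t1 = time.perf_counter()
    restored = gzip.decompress(packed)
    t2 = time.perf_counter()
    if restored != data:
        raise ValueError("round trip did not reproduce the input")
    return {
        # Assumed definition: original size / compressed size (higher is better).
        "ratio": len(data) / len(packed),
        "compress_s": t1 - t0,
        "decompress_s": t2 - t1,
    }


if __name__ == "__main__":
    stats = measure_gzip(make_fastq(1000))
    print(f"ratio={stats['ratio']:.2f} "
          f"compress={stats['compress_s']:.4f}s "
          f"decompress={stats['decompress_s']:.4f}s")
```

Random reads like these compress far less well than real sequencing data, so the numbers show only how the measurements relate to one another, not what a real NGS run would yield.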
Bonfield submitted a set of algorithms that all performed strongly on the three most heavily weighted criteria. His approach recognized the importance of preserving alignment data in addition to raw FASTQ output, using fqzcomp to compress FASTQ files and sam_comp to compress SAM/BAM output.
“The competition exposed two important elements: First, that the gzip algorithm that served as the competition baseline is actually quite sufficient for run-of-the-mill compression, and second, that it’s extraordinarily difficult to make huge improvements in all three of the judged dimensions,” said Nick Lynch, external liaison of the Pistoia Alliance and chair of the judging panel. “Tradeoffs were made, which means that ultimately a compression toolkit might be the best approach to handle specific workflows.”
Bonfield praised the structure of the competition, which included an open “leaderboard” for tracking submissions. “I can confidently say my entries benefited by the open nature of the contest. In a closed competition with a score table only visible after the submission deadline, I might have sat back and waited for the results. Instead, seeing an entry beaten spurred me to improve my submissions,” said Bonfield.
The judges also praised the quality of entries received and the conversation sparked by the competition. “During the competition itself, entrants discussed ideas openly on a variety of forums, and many entrants are already talking about merging the best parts of their algorithms together to address particular sequencing workflows,” said Lynch. “This is what’s so special about the open innovation the Pistoia Alliance promotes—the end results are significantly better than what could be achieved by individuals acting alone.”
Other members of the judging panel were Yingrui Li of the BGI, Tim Fennell of the Broad Institute, and Guy Coates of the Wellcome Trust Sanger Institute. The competition closed on 15 March 2012 and was administered on the Pistoia Alliance’s behalf by Eagle Genomics Ltd., a bioinformatics services and software company. Michael Braxenthaler, president of the Pistoia Alliance, will award the prize to Bonfield at the Pistoia Alliance Conference in Boston at the campus of Thomson Reuters on April 24, 2012. Bonfield plans to donate a portion of his prize to the Wellcome Trust Sanger Institute and the remainder to the British Heart Foundation.