March 3, 2017

Heat is on at Sequence Squeeze

Over at the Pistoia Alliance Sequence Squeeze contest, which Eagle is administering on behalf of the Alliance, the number of entries is rapidly approaching 40 - an impressive number considering the complexity of the task. Of particular interest is the number of repeat entries from individuals trying to better their previous attempts. Each one shaves just a fraction of a second off the runtime or a few extra points off the compression ratio, pushing its author above the competition and back to the top of the leaderboard. I don't think anyone anticipated that entrants would end up competing directly in this way, but it certainly isn't doing the quality of the work any harm - if anything, quite the contrary!

Many questions have been asked about how subjective the judging of entries will be. Given the diversity of possible input formats and the variations between platforms, it is not possible to provide a test dataset that represents all of them. Indeed, many compression tricks work only if the dataset is known to be of a particular type or to relate to a particular organism.

The contest judging script takes a simplistic approach: it runs each entry, using default settings only, on a fairly limited test dataset that is at least internally consistent (i.e. it all comes from the same organism and the same sequencing platform). This provides an easy way to rank entries in a number of useful categories (ratio, speed, etc.) and to identify the top few in each category for further scrutiny by the human judges. The leaderboard shows the results of this automated process.
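To make the shortlisting step concrete, here is a minimal sketch of ranking entries per category and keeping the top few. It is not the contest's actual judging script: the function name `shortlist`, the metric names, the sample numbers and the assumption that a lower value is better in every category are all hypothetical.

```python
def shortlist(results, top_n=3):
    """Return the best `top_n` entry names for each metric.

    `results` maps an entry name to a dict of metric name -> value.
    Assumption for this sketch: lower is better for every metric
    (e.g. compressed size / original size, or seconds of runtime).
    """
    metrics = sorted({m for scores in results.values() for m in scores})
    ranked = {}
    for metric in metrics:
        # Only rank entries that actually reported this metric.
        candidates = [name for name in results if metric in results[name]]
        candidates.sort(key=lambda name: results[name][metric])
        ranked[metric] = candidates[:top_n]
    return ranked


# Hypothetical entries and scores, for illustration only.
example_results = {
    "entry_a": {"ratio": 0.20, "seconds": 50},
    "entry_b": {"ratio": 0.25, "seconds": 30},
    "entry_c": {"ratio": 0.30, "seconds": 90},
}
example_shortlist = shortlist(example_results, top_n=2)
```

With these made-up numbers, `entry_a` leads on ratio and `entry_b` leads on speed, which is exactly the kind of split that the human judges then have to weigh.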

Whether the ratio or the speed is considered more important is largely subjective, so the judging panel will consider performance in all categories. It will also look into the effectiveness of any optional flags/optimisations included in the code, plus the quality and robustness of the code itself. There's no benefit in having an open-source algorithm contest if the source code of the winner is indecipherable or does things that would cause concern in a production environment. Admittedly the subjectivity of this process may cause concern, which is why the judging panel consists of leading bioinformaticians from each of the three main sequencing centres of the world - the Broad, BGI, and Sanger. As a team they stand the greatest chance of identifying what would best suit the needs of NGS data managers.

One question that comes up frequently concerns mismatches. Just how important is it to reproduce the input data exactly when decompressing, and can low-information data be discarded? To avoid complicating the judging even further, it was decided that input data should be fully reconstructed at decompression. The sequences do not have to be in the same order or even in the same files, but they should all be present. Entries are scored on the number of sequence headers, lines of bases, or lines of quality scores that do not match. A few mismatches might indicate a minor issue with the code that could be fixed with a bit of investigation - and the human judges will take this into account - but a large number of mismatches would be a definite problem.
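As an illustration of order-independent mismatch counting, here is a minimal sketch. It assumes plain four-line FASTQ records and compares headers, bases and quality lines as independent multisets. The function names `read_fastq_fields` and `count_mismatches` are hypothetical, and the real judging script may well count differently.

```python
from collections import Counter


def read_fastq_fields(lines):
    """Collect headers, bases and quality lines as multisets.

    Assumes strict 4-line FASTQ records: header, bases, '+', qualities.
    """
    lines = [line.rstrip("\n") for line in lines]
    headers, bases, quals = Counter(), Counter(), Counter()
    for i in range(0, len(lines) - 3, 4):
        headers[lines[i]] += 1
        bases[lines[i + 1]] += 1
        quals[lines[i + 3]] += 1
    return headers, bases, quals


def count_mismatches(original_lines, restored_lines):
    """Count original lines of each kind that are missing after decompression.

    Record order is ignored, so a reordered but complete output scores zero.
    """
    original = read_fastq_fields(original_lines)
    restored = read_fastq_fields(restored_lines)
    kinds = ("headers", "bases", "qualities")
    # Counter subtraction keeps only positive counts, i.e. lines that
    # appear in the original more often than in the restored data.
    return {
        kind: sum((orig - rest).values())
        for kind, orig, rest in zip(kinds, original, restored)
    }


# Two records restored in reverse order, with one quality line altered.
original = ["@r1", "ACGT", "+", "IIII", "@r2", "GGCC", "+", "FFFF"]
restored = ["@r2", "GGCC", "+", "FFFF", "@r1", "ACGT", "+", "III#"]
mismatches = count_mismatches(original, restored)
```

Here the reordering costs nothing, while the altered quality line is counted as a single mismatch. To handle sequences that moved between files, the lines of all output files could simply be concatenated before the comparison.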

The contest closes to new entries in just over 6 weeks on March 15th, and the winners will be announced at the Pistoia Alliance Annual Conference in Boston, MA (USA) on April 23rd. We eagerly await the results!

Topics: Announcements, Bioinformatics