March 3, 2017

Crowdsourcing in bioinformatics

Tuesday's announcement that James Bonfield won the Sequence Squeeze contest organised by the Pistoia Alliance was interesting for two reasons.

The first reason is that although there was an overall winner, there was no single best solution to the problem that Pistoia posed - the compression of FASTQ data. Rather, James authored several entries, each of which performed well on a distinct set of criteria that may matter to different users. Some centres may value fast random access to data later on, in which case fast decompression is important. Other centres value archiving data quickly as it is produced and do not necessarily need to retrieve much of it later, in which case fast compression is best. Yet others may simply need to minimise disk space, in which case the best compression ratio matters most. It is very hard to achieve best-of-class across all these factors, and so the Alliance chose to recognise James' efforts in producing a series of specialised solutions rather than one catch-all approach. Had it not been for the crowdsourcing/open-innovation approach used in this contest, this might never have happened - and the idea that it would be worth doing might never have occurred to Pistoia or its members.
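To make the trade-off between compression ratio, compression speed and decompression speed a little more concrete, here is a minimal sketch that times three general-purpose compressors from the Python standard library on synthetic FASTQ-like data. None of this comes from the contest itself: the helper names `make_fastq` and `benchmark` are hypothetical, the reads are random rather than real sequencer output, and zlib, bz2 and lzma are stand-ins rather than any of the contest entries. Real data and specialised FASTQ compressors would behave differently.

```python
import bz2
import lzma
import random
import time
import zlib


def make_fastq(n_reads=2000, read_len=100, seed=0):
    """Build synthetic FASTQ text (hypothetical helper, random reads only)."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        seq = "".join(rng.choice("ACGT") for _ in range(read_len))
        # Phred+33 quality characters drawn from an arbitrary range.
        qual = "".join(chr(33 + rng.randint(20, 40)) for _ in range(read_len))
        records.append(f"@read{i}\n{seq}\n+\n{qual}\n")
    return "".join(records).encode("ascii")


def benchmark(data):
    """Return {name: (ratio, compress_seconds, decompress_seconds)}."""
    codecs = {
        "zlib": (zlib.compress, zlib.decompress),
        "bz2": (bz2.compress, bz2.decompress),
        "lzma": (lzma.compress, lzma.decompress),
    }
    results = {}
    for name, (compress, decompress) in codecs.items():
        t0 = time.perf_counter()
        packed = compress(data)
        t1 = time.perf_counter()
        restored = decompress(packed)
        t2 = time.perf_counter()
        assert restored == data  # the round trip must be lossless
        results[name] = (len(data) / len(packed), t1 - t0, t2 - t1)
    return results


if __name__ == "__main__":
    for name, (ratio, c, d) in benchmark(make_fastq()).items():
        print(f"{name:5s} ratio={ratio:5.2f} "
              f"compress={c:.3f}s decompress={d:.3f}s")
```

Running it prints one line per compressor. The point is only that the three numbers are measured separately, so a tool can lead on one and trail on another, which is why a single ranking hides the differences that matter to different centres.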

The second reason is that this crowdsourcing approach does indeed produce genuinely innovative solutions, but the legal problems in doing so can be prohibitive. The Sequence Squeeze contest worked because it required all entries to be completely open-source (BSD licence) and free of all commercial code, and required the entrants to warrant that this was the case. Users of the code (all source code is linked to from the contest website) will still have to satisfy themselves that the licence is valid and appropriate and that the code does not contain any third-party proprietary material. John Overington blogs at ChEMBL on just this subject, but his words apply not just to crowdsourcing but to all open-source projects. Submitting an entry to a crowdsourcing competition is in principle no different to submitting code to an open-source project on the web. The author still needs their employer's permission if the work is done on company time, and still needs to check that their code does not include any third-party material they have no right to distribute. The users of solutions developed by crowdsourcing methods need to treat them in exactly the same way as any other open-source project - i.e. with care, attention to detail, and due diligence.

Another crowdsourcing approach to solving a genomics problem caught my eye this week as well, admittedly because it featured an old colleague of mine who emailed the article to me. Guillaume Bourque's team at McGill University in Montreal developed a computer game to improve the quality of comparative genomics research. The game requires players to match up colour-coded blocks between species - blocks that in reality correspond to similar genetic regions that may have been transposed or relocated at the point that two species diverged in evolutionary history. This is computationally difficult but easy to do by eye given the right information - so by crowdsourcing the alignment process in the form of a game, the authors hope to get a much more accurate picture of the rearrangement of genetic data through the evolution of distinct species than is currently possible using computational approaches alone.

Topics: Bioinformatics, Open source