March 3, 2017

Crowdsourcing pros and cons

I wanted to blog about this paper on crowdsourcing in bioinformatics, but struggled to find a way of getting the message across without simply cutting and pasting the entire paper verbatim. Readers with an interest in the field would do well to read the original in full, but here I'll try to summarise it in a few bullet points.

  1. Crowd-sourcing, for the purposes of this article, means using the collective intelligence of a crowd to carry out directed tasks that require intelligent thought and produce discrete solutions. We're not talking about writing wiki articles here, but about problem-solving. The problems can be small and easy to solve but many in number (microtasks), or large and difficult to solve but few in number (megatasks).
  2. Microtasks lend themselves to volunteer effort, to micropayments to casual workers, to embedding in educational or casual games, or to incorporation into formal workflows requiring human input (the last is best done with your own employees, not the general public).
  3. Megatasks are better suited to harder games, where the quality of the output can be scored in direct correlation with how successfully the game was played, or to open innovation contests along the lines of the Pistoia Alliance's Sequence Squeeze contest, which was administered by Eagle.
  4. Microtasks are suitable for things like interpreting images, annotating function based on qualitative data, correcting text, or confirming assertions of fact. For example, users might interact with a simple mobile app that asks them to answer yes or no to simple questions designed to establish the truth of a series of statements (see the sketch after this list). Success relies on attracting a large base of participants willing to interact with the project regularly.
  5. Megatasks require significant marketing to reach the few talented individuals able to solve the complex tasks being posed. Hard games attract those who prefer to solve problems indirectly through play, but large-scale open innovation contests require major algorithm-design or coding effort and so will attract far fewer participants than the easier microtasks.
  6. The social impact is interesting: pay too little and you are in effect resorting to slave labour, while making a game too addictive could harm the health of your players.
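
To make the microtask point concrete, here is a minimal, hypothetical sketch of how the yes/no answers from such an app might be aggregated into a consensus. The thresholds, names and structure are my own illustrative assumptions, not anything specified in the paper.

```python
# Hypothetical sketch: aggregate yes/no microtask responses into a consensus,
# accepting a statement's answer only once enough participants have voted and
# a clear majority has emerged. Thresholds below are illustrative assumptions.
from collections import Counter

MIN_VOTES = 25        # assumed minimum participation per statement
MIN_AGREEMENT = 0.8   # assumed fraction of votes needed to call a consensus

def aggregate(responses):
    """responses: iterable of (statement_id, answer) pairs, answer is 'yes' or 'no'."""
    votes_by_statement = {}
    for statement_id, answer in responses:
        votes_by_statement.setdefault(statement_id, Counter())[answer] += 1

    consensus = {}
    for statement_id, votes in votes_by_statement.items():
        total = sum(votes.values())
        answer, count = votes.most_common(1)[0]
        if total >= MIN_VOTES and count / total >= MIN_AGREEMENT:
            consensus[statement_id] = answer   # confident crowd answer
        else:
            consensus[statement_id] = None     # undecided: needs more participants
    return consensus

# Example: statement 's1' reaches consensus only because enough users agree.
print(aggregate([("s1", "yes")] * 24 + [("s1", "no")]))  # {'s1': 'yes'}
```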

I'd like to add my own observations in areas not covered by the paper:

  1. Making a task that is obviously of benefit to society will encourage participation. If the link between the task and the societal benefit is too long or abstract, people will struggle to understand why they are being asked to do the task and will be less enthusiastic about it.
  2. Making a task too hard may leave putting the problem out to tender through a traditional consulting or contracting process as the only realistic option. Nobody will want to risk spending much time on a problem in an open innovation contest if the chances of being compensated for that time are too slim.
  3. All crowd-sourced tasks rest on an assumption about the intelligence of the participants. There is no real way to assess whether the solutions gathered are based on good-quality experience and expertise, other than by limiting participation to those who have passed some kind of selection process. Selection processes are themselves open to abuse if self-assessment is permitted, so any crowd-sourced project that requires large-scale participation will have to accept that a good portion of the data generated may be of very low quality.
  4. Similarly, crowd-sourcing is not appropriate where private or sensitive data needs to be assessed. Individual chunks of data can be randomised or anonymised before distribution, but it would not take too great an effort for a crowd-source hacker to gather enough chunks together to reconstruct part or all of the original dataset. This approach should therefore never be used to process confidential data.
  5. Who owns the IP of a crowd-sourced answer? The commissioner of the project, the participants (particularly if they were paid), or the public? Is it right to profit from an answer generated this way without distributing at least part of that profit amongst the unpaid volunteers who helped generate it?

Topics: Big data technology, Bioinformatics