March 3, 2017

More Big Data

After last week's post on Big Data, here's another one, this time drawing inspiration from a completely different industry - Defence and Intelligence!

Thayne Coffman, CSO at 21CT, gave a presentation at FloCon 2012 in Texas on lessons learned from network analysis R&D in defence and intel. His executive summary raised three main points:

1. Analysts need tools that enable flexible workflows.

2. Analysts need tools that run mid-complexity analytics.

3. Anomaly detection is worth continued investment, but it will never be the whole answer.

Why is this relevant? Although these points were made in the context of defence and intelligence gathering, they could just as well have been made about bioinformatics (or physics, or indeed any other big data field). The point is that when faced with vast amounts of data, we can never hope to mine and understand every last detail. There is a concept of enough detail from enough data: a borderline that separates the understanding needed to answer a question robustly from time wasted learning things that will never add any significant value to the answer.

Thayne's first point confirms Eagle's belief that taking the analyst out of the data-mining loop will never lead to correct answers. Over-reliance on automated, pre-defined workflows gives a false impression of security (no pun intended): the results look impressive but are close-but-not-quite, and so of little real value. Involving an analyst who can tweak and adjust the workflow, drawing on in-depth domain understanding of the specific data available and the questions being asked, will always improve the quality of the results over any fully automated, off-the-shelf solution. Prototyping and iterative development using platform suites are the way forward.
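
To make that concrete, here is a minimal sketch of what an analyst-in-the-loop workflow might look like in Python. Everything in it is invented for illustration - the function names, the thresholds, the toy variant records - the point is simply that the tunable parameters live in a small config object the analyst adjusts between runs, rather than being baked into the code.

```python
from dataclasses import dataclass

@dataclass
class FilterParams:
    min_quality: float = 30.0   # starting guess; the analyst tunes this
    min_depth: int = 10

def filter_variants(variants, params):
    """Keep variants that pass the analyst-chosen thresholds."""
    return [v for v in variants
            if v["qual"] >= params.min_quality and v["depth"] >= params.min_depth]

def summarise(kept, total):
    """Report enough detail for the analyst to judge the thresholds."""
    print(f"kept {len(kept)}/{total} variants "
          f"({100 * len(kept) / max(total, 1):.1f}%)")

# Iterative loop: run, inspect, tweak, re-run. The decision to stop
# rests with the analyst, not with the pipeline.
variants = [{"qual": 45.0, "depth": 22}, {"qual": 12.0, "depth": 5},
            {"qual": 33.0, "depth": 8}]
params = FilterParams()
kept = filter_variants(variants, params)
summarise(kept, len(variants))          # kept 1/3

params.min_depth = 5   # analyst relaxes depth after inspecting the results
kept = filter_variants(variants, params)
summarise(kept, len(variants))          # kept 2/3
```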

The second point, regarding mid-complexity analytics, relates to the ever-changing nature of the data analysts face (for analysts, read bioinformaticians). New sequencing platforms come online, new format standards are developed, and new levels of detail are made possible by never-ending upgrades to existing lab hardware. By the time you have developed a method that is perfect for detecting the signal you want from one particular data source, that source has either changed or disappeared, rendering your method pointless. Much better to develop something flexible, less tightly coupled to the specific source of the data, which can easily (and rapidly) be adapted to take into account lessons learned and changes in technology along the way. Signatures (Thayne talks about contract murders; we'll use SNP function prediction as an equivalent of comparable complexity) should be treated as guidelines rather than absolute truth, and should always be open to constant, preferably dynamic, modification.
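
One way to read "open to dynamic modification" is to keep the signature as editable data rather than hard-coded logic. The sketch below is only a hedged illustration of that idea - the rule structure, field names and weights are all made up - but it shows how revising a signature becomes an edit to a list, not a rewrite of the detection code.

```python
# Each rule is plain data: a field, a predicate, and a weight. Updating
# the signature means editing this list, not changing the scoring code.
snp_signature = [
    {"field": "conservation", "op": "gt", "value": 0.8, "weight": 2.0},
    {"field": "in_coding_region", "op": "eq", "value": True, "weight": 1.5},
]

OPS = {"gt": lambda a, b: a > b, "eq": lambda a, b: a == b}

def score(record, rules):
    """Sum the weights of every rule the record satisfies."""
    total = 0.0
    for rule in rules:
        value = record.get(rule["field"])
        if value is not None and OPS[rule["op"]](value, rule["value"]):
            total += rule["weight"]
    return total

snp = {"conservation": 0.91, "in_coding_region": True}
print(score(snp, snp_signature))   # 3.5 - a guideline, not a verdict
```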

The recommendation here is to build small, special-purpose, independent workflows that can be chained together into networks of greater functionality. Then, if any one component changes, it can be substituted or amended without having to reconstruct the entire workflow network. This seems obvious, but it isn't always done. Letting the analyst get hands-on and embedded in the process, rather than over-automating it, is the best way to ensure success.
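
A toy Python sketch of that composition idea follows; the step names (trim_reads, align_reads and so on) are placeholders rather than real tools. Because the pipeline is just a list of small functions, substituting one component is a one-line change.

```python
from functools import reduce

def trim_reads(data):       # small, independent, individually testable
    return data + ["trimmed"]

def align_reads(data):
    return data + ["aligned:bwa"]

def call_variants(data):
    return data + ["called"]

def run(pipeline, data):
    """Apply each step in turn; the pipeline is just a list of functions."""
    return reduce(lambda d, step: step(d), pipeline, data)

pipeline = [trim_reads, align_reads, call_variants]
print(run(pipeline, []))   # ['trimmed', 'aligned:bwa', 'called']

# Substituting one component: swap the aligner without reconstructing
# the rest of the network.
def align_reads_v2(data):
    return data + ["aligned:minimap2"]

pipeline[1] = align_reads_v2
print(run(pipeline, []))   # ['trimmed', 'aligned:minimap2', 'called']
```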

Thayne mentions the Intelligence Analysis Bathtub in passing - it might as well be called the NGS Analysis Bathtub, as the concept is identical. Users spend much time and money collecting their data, and much time and money reporting on and disseminating the results, but very little time in between on properly analysing the information they have gathered to generate those results. Inverting this bathtub model can only be a good thing, and the suggestions made here are a step towards that, but no golden solution yet exists.

The last point, regarding anomaly detection, revolves around the very intelligence-specific observation that the most advanced adversaries are the ones who try hardest to look normal. Gross anomalies are therefore not what is of interest; what matters is spotting where the attempt to look normal has gone slightly wrong - it is the minor deviations that are interesting, not the obvious or extreme ones. Current systems rely on advanced AI and are still nowhere near as good as a human mind at spotting anomalous patterns in data. Could these techniques be applied, for example, to spotting rare SNPs in an otherwise normal population that conceal as-yet-unknown genetic disorders? Probably far too general a question, but worth thinking about.
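
As a purely speculative illustration of that "subtle beats extreme" idea, here is a toy sketch that scores observations against an expected distribution and flags only the mild deviations, skipping extreme ones as likely data artifacts. Every number in it is invented, and real anomaly detection is far more involved.

```python
def mild_anomalies(values, mean, sd, low=2.0, high=6.0):
    """Return indices whose |z-score| falls in a mild band: above the
    noise floor (low) but below the obviously-broken threshold (high)."""
    flagged = []
    for i, v in enumerate(values):
        z = abs(v - mean) / sd
        if low <= z <= high:
            flagged.append(i)
    return flagged

# Hypothetical allele frequencies with an expected value of 0.50 +/- 0.02:
# index 3 drifts mildly (z = 2.5) and is flagged; index 4 is so far out
# (z = 15) that it is skipped as a likely artifact rather than a signal.
freqs = [0.50, 0.49, 0.51, 0.55, 0.80, 0.50]
print(mild_anomalies(freqs, mean=0.50, sd=0.02))   # [3]
```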

The take-home message from this blog, at least, is that bioinformatics is not unique. We can learn a lot from our colleagues in other industries, even those dealing with data entirely unrelated to ours.

Topics: Big data technology