March 3, 2017

Fibre optics 1, FedEx 0

Moving data around is probably the single biggest headache for modern bioinformaticians. You might have a super duper compute facility in your local university data centre, or the ability to tap into Amazon or other cloud services on demand, but unless you can get your data there and your results back quickly, there's very little added value in using those facilities over deploying your own cluster much closer to the source of your data files. The main bottleneck is usually the local network within your university campus, or your connection to the nearest internet backbone if the data is going off-site.

Amazon has a very effective workaround which allows users to ship them hard drives full of data. Load up a lorry with drives and there isn't a network in the world that would get the data there faster, but for a small job with one or two drives the delay introduced by packing, couriering and loading the hardware doesn't stack up well against a decent network connection.

An article on Bio-IT World describes a US company called Courtagen who discovered that by upgrading their local internet link to 100-gigabit ethernet they could massively improve their upload speeds and remove their reliance on shipping hard drives. They had previously depended on a slower network link which often failed, and then on shipping drives, which added two days or more to their project turnaround; the simple act of installing a high-speed fibre ethernet link to the nearest backbone cut that two-day drive-shipping process down to minutes of upload time.
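
To put rough numbers on that claim, here is a back-of-envelope comparison of upload time against a fixed drive-shipping turnaround. The dataset size, the old link speeds and the 50% end-to-end efficiency factor are illustrative assumptions of mine, not figures from the article.

```python
# Back-of-envelope: how long does a batch of sequencing data take to upload
# at various link speeds, and does that beat a ~two-day courier turnaround?
# All figures below are illustrative assumptions, not from the article.

DATASET_TB = 2.0                      # hypothetical batch of sequencing data
DATASET_BITS = DATASET_TB * 1e12 * 8  # terabytes -> bits
SHIPPING_HOURS = 48                   # roughly two days door to door

links = {
    "100 Mbit/s office link": 100e6,
    "1 Gbit/s campus link": 1e9,
    "100 Gbit/s fibre link": 100e9,
}

for name, bits_per_second in links.items():
    # Assume ~50% of nominal bandwidth is achievable end to end
    # (protocol overhead, shared links, TCP behaviour on long paths).
    hours = DATASET_BITS / (bits_per_second * 0.5) / 3600
    verdict = "beats shipping" if hours < SHIPPING_HOURS else "ship the drive"
    print(f"{name:>25}: {hours:8.2f} h  ({verdict})")
```

On those assumed numbers, a 2 TB batch takes several days over a 100 Mbit/s link, a working day over gigabit, and a handful of minutes over 100-gigabit fibre, which is consistent with the article's "two days down to minutes" framing.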

Admittedly, they had two particular things working in their favour. One was their proximity to the local backbone - just a couple of blocks away. Any data transfer is only as fast as the slowest link in the chain, so by removing as many steps as possible between themselves and the backbone they avoided being routed through numerous traffic aggregators before hitting the highway. The other was that the data was headed for their nearest Amazon data centre, in Virginia; as a major cloud provider, Amazon connects its data centres directly to the same backbone, so there were no excessive additional network hops once the data reached the other end of the pipe.

Courtagen's experience is fairly unusual, and owes more to the happy accident of their physical location and the internet infrastructure available down the US East Coast than to anything else, but it demonstrates that there is no one-size-fits-all approach to managing data transfer. Anyone considering moving data around the internet needs to think carefully about what they're moving, where it is headed and where they are sending it from, and to understand the network that lies between the two points and how it might affect the transfer. Having done this, Courtagen realised that all they needed to do was swap a couple of network cards in their existing hardware, sign a new contract with their provider and negotiate with downstream providers to ensure their traffic arrived securely and with priority - none of which involved investing in expensive new routers or digging up streets to lay new cables.

Interestingly, the article almost glosses over the fact that the data Courtagen are dealing with is covered by HIPAA, and that they have found a way of managing it securely in Amazon (through pre-upload anonymisation) which satisfies the auditors. This just goes to show that it is possible to analyse private patient data in the cloud, as long as you know what you're doing and have identified and addressed the risks. Hopefully we'll see more projects like this coming on-stream in the near future.
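
The article doesn't spell out what Courtagen's anonymisation actually involves, but the general idea - strip or pseudonymise direct identifiers before anything leaves the building - might look roughly like the sketch below. The field names and the salted-hash scheme are my assumptions for illustration, not their method, and genuine HIPAA de-identification has a longer list of identifiers to deal with.

```python
import csv
import hashlib

# Illustrative sketch only: pseudonymise sample identifiers and drop direct
# identifiers from a sample sheet before upload. Field names and the
# salted-hash scheme are assumptions, not Courtagen's actual pipeline.

SITE_SECRET = b"keep-this-out-of-the-cloud"   # held locally, never uploaded
DIRECT_IDENTIFIERS = {"patient_name", "date_of_birth", "address", "mrn"}

def pseudonym(sample_id: str) -> str:
    """Stable, non-reversible replacement for a sample identifier."""
    return hashlib.sha256(SITE_SECRET + sample_id.encode()).hexdigest()[:16]

def anonymise(in_path: str, out_path: str) -> None:
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        fields = [f for f in reader.fieldnames if f not in DIRECT_IDENTIFIERS]
        writer = csv.DictWriter(dst, fieldnames=fields)
        writer.writeheader()
        for row in reader:
            row["sample_id"] = pseudonym(row["sample_id"])
            writer.writerow({k: row[k] for k in fields})

# anonymise("samples_with_phi.csv", "samples_for_upload.csv")
```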

Topics: Bioinformatics