March 3, 2017

Cloud BioLinux paper published

About 10 days ago BMC Bioinformatics published a paper entitled "Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community" (Krampis et al.). Two of the co-authors, Tim Booth and Brad Chapman, are long-time friends of Eagle and we are currently making use of Cloud BioLinux in some of our projects.

Bioinformatics Linux distributions have been done before (DNALinux, BioSlax, and BioKnoppix, to name but a few), but they are all 'standard' Linux distros, i.e. they are built to be installed and run on bare metal. This still works when installing inside standalone virtualisation environments such as VMware, but rules out their use on the cloud where only specific kernel and distro combinations are supported (others won't even boot without jumping through some very difficult hoops, and those that depend on kernel-level modifications such as BioSlax are completely incompatible).

Cloud BioLinux's approach is slightly different. Rather than provide a standard distro, it provides pre-packaged virtual machines compatible with common virtualisation environments so that it works out-of-the-box on modern virtualised/cloud infrastructure (currently supported are Eucalyptus, VirtualBox, and Amazon EC2). Once one of their VMs is fired up it still behaves like any other distro because it is essentially just a reconfigured Ubuntu, but it is the removal of that initial installation headache that is its biggest and best feature. Its second best feature is its ability to self-replicate into specialised custom versions configured to meet local needs by using a simple and straightforward piece of software to carry out the process.

Of course, bioinformatics Linux distros are only complete if they have easy-to-install or pre-installed bioinformatics tools ready to go. Cloud BioLinux achieves this by shipping with a small selection of common standards and making the rest available through a dedicated Ubuntu package repository. The repository is already added and configured so you don't need to do anything special to access it. In this respect it competes with DebianMed, but the software provided by the two repositories is largely interchangeable (they're both Debian-derived) and we at Eagle would certainly expect that one day the two will be fully compatible.
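
Because the repository is already configured, installing an extra tool should come down to an ordinary `apt-get install`. As a rough sketch only (the package names below are hypothetical placeholders rather than real Cloud BioLinux packages, and the helper `apt_install_command` is our own illustration, not part of the distro), a small Python script could build that command for a list of tools:

```python
import shlex


def apt_install_command(packages, assume_yes=True):
    """Build an apt-get install command line for the given package names.

    This only constructs the command; it does not run it.
    """
    if not packages:
        raise ValueError("at least one package name is required")
    cmd = ["sudo", "apt-get", "install"]
    if assume_yes:
        # -y answers "yes" to apt-get's confirmation prompt
        cmd.append("-y")
    cmd.extend(packages)
    return cmd


# Hypothetical package names, used purely for illustration.
wanted = ["example-aligner", "example-assembler"]

# Print the command so it can be reviewed before being run by hand.
print(shlex.join(apt_install_command(wanted)))
```

Printing the command rather than executing it keeps the sketch safe to run anywhere; on a real instance you would substitute actual package names and either paste the output into a shell or pass the list to `subprocess.run`.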

All the technical wizardry for installation/replication aside, there is often very little left to distinguish between the various bioinformatics Linux distributions - they all offer pretty much the same tools, just with differing choices of desktop and command line interfaces. It is hard to see anything different in this respect with Cloud BioLinux either. We chose it at Eagle simply because of its ready-to-go Amazon compatibility and relatively quick responsiveness to new feature/application packaging requests; otherwise we would probably have gone with the more established DebianMed.

We're not huge fans of duplication of effort at Eagle, so it would be nice one day to see future projects develop around the idea of improving pre-existing bioinformatics distros rather than creating an entirely new one - but sadly that is not the way that most funding agencies design their grants for bioinformatics research, and it won't get you any papers published either. New is always better, and old is yesterday's news and not worth improving.

Topics: Big data technology, Cloud, Shell scripting