March 3, 2017

The problems of information management in life science research

How do life science organisations manage their experimental information and what problems do they face? These are complex questions, and to understand why, it is worth looking at how information is managed from the individual level (the simple case) up to the organisation level (the more complex case).

What’s the value of experimental datasets?
Before delving into the detail, it is worth explaining why life science organisations value the data they generate so much: having already spent a lot of time and money on research, these organisations want to obtain maximum value from their experimental assets and datasets. This includes reusing datasets in new experiments or for new analyses. How can they do this if they a) don’t know what data they have in the first place, or b) know what they have but not where to find it?

Standard management of information
An individual researcher might manage all the information themselves. This includes the location and storage of the raw data files, the metadata (data describing the data) and the experimental details. If someone asked the researcher about a specific experiment they had carried out, the researcher would generally know where to find this information, and where the raw data was located. In the simplest case, the experimental information might be recorded in an ELN (electronic laboratory notebook) and the raw data files might be organised into folders and files on the researcher’s computer.

Imagine what happens as the researcher carries out more experiments over time. More experiments mean more information to capture, and organising it becomes harder as its volume and complexity grow. For example, file names may be duplicated (e.g. if the raw data is automatically named as it comes off the machine, all experiments may have data files called sample001, sample002…), so separate folders have to be created to tell apart identically named files from different experiments.
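The file-naming problem above can be made concrete with a short Python sketch. It is illustrative only: the experiment identifiers (EXP-001, EXP-002), the .raw extension and the function names are assumptions, not part of any particular instrument or product.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def qualified_name(experiment_id: str, filename: str) -> str:
    """Prefix a machine-generated file name with its experiment folder."""
    return str(PurePosixPath(experiment_id) / filename)


def name_clashes(paths):
    """Return the bare file names that appear under more than one path."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[PurePosixPath(path).name].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}


# Hypothetical files from two experiments run on the same machine.
paths = [
    qualified_name("EXP-001", "sample001.raw"),
    qualified_name("EXP-002", "sample001.raw"),
    qualified_name("EXP-002", "sample002.raw"),
]
clashes = name_clashes(paths)
```

Here the bare name sample001.raw is ambiguous on its own, and only the experiment folder distinguishes the two files. That is exactly the information that is lost if files are later copied out of their folders.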

It only gets harder
Utilising all of this raw data also becomes harder. What happens, for example, when the researcher wants to reuse existing datasets in a new analysis or experiment? How do they keep track of and maintain the provenance of the data?

This increase in experimental information may also have other consequences that add to the complexity of its overall management. For example, the researcher may run out of disk space on their computer or might simply upgrade their computer over time. This can leave them managing their datasets and experimental information across multiple locations.



What about the department manager?
Now consider a department manager who is required to have an overview of all research carried out by the different researchers in the department. How do these managers access the datasets? How do they know what to look for and where to find it? Perhaps datasets are centralised on a shared network drive or in SharePoint. That is all very well, but how does a manager then find the information? Search? Ask the dataset owner?

Lack of efficiencies
The problem with search is that it may not be efficient: searches are carried out on the metadata, rather than on the raw data files themselves. Although metadata might be captured somewhere (perhaps in an Excel spreadsheet or an ELN), this file may not reside alongside the raw data files, and so an additional step is required to find the correct information and cross-check it. On top of this, not all relevant metadata might have been recorded, or it might not have been captured consistently between different experiments and between different researchers (for example, by using synonyms such as sex and gender). This results in time and effort wasted carrying out multiple searches to ensure all relevant results are found.
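The synonym problem described above can be illustrated with a minimal Python sketch. The records, the field names and the PREFERRED_TERM table are hypothetical. A real catalogue would more likely draw its preferred terms from a curated ontology than from a hand-written dictionary.

```python
# Hypothetical synonym table: maps each variant field name to one preferred term.
PREFERRED_TERM = {"gender": "sex"}


def canonical(field: str) -> str:
    """Lower-case a field name and replace known synonyms with the preferred term."""
    key = field.strip().lower()
    return PREFERRED_TERM.get(key, key)


def naive_search(records, field, value):
    """Match only records that use exactly the field name being searched for."""
    return [r for r in records if r.get(field) == value]


def normalised_search(records, field, value):
    """Match records whose field name maps to the same preferred term."""
    wanted = canonical(field)
    return [
        r for r in records
        if any(canonical(k) == wanted and v == value for k, v in r.items())
    ]


# Hypothetical metadata captured inconsistently by different researchers.
records = [
    {"id": "EXP-001", "sex": "female"},
    {"id": "EXP-002", "gender": "female"},
    {"id": "EXP-003", "Sex": "male"},
]
```

With these records, naive_search(records, "sex", "female") finds only EXP-001, so a second search on "gender" is needed to find EXP-002. normalised_search returns both in a single pass.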

Asking the dataset owner to find experimental information is fine when that person is still working at the organisation, but what happens if the datasets of interest are historic and the owner has since left or retired?

Breaking down the problem
Luckily, organisations are acknowledging that these dataset management problems exist, and that they are growing in number and complexity. As a result, they are beginning to take steps to address them before they get out of hand. Central to any solution is organising and linking the raw data files and metadata in such a way that they can be easily discovered, reused, and shared with collaborators.

The importance of cataloguing
A ‘metadata catalogue’ is one way to achieve this: it allows all the experimental information to be brought together in one place. Experimental information and metadata can be mined from various sources, such as an ELN. Careful curation (for example, with ontologies) imposes a controlled vocabulary of metadata terms, which makes datasets easier to search for and retrieve. The catalogue can link to the raw data files (either by acting as a data warehouse or by data federation) and can even be set up as part of an overall information management solution that allows data analysis to be initiated (through a third-party program), with the results being linked back to the catalogue.
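As a rough illustration of the linking idea, the following Python sketch stores metadata terms in one table and the location of each raw data file in another, so that a metadata search returns where the files live. It uses an in-memory SQLite database from the standard library. The table layout, function names and file locations are assumptions made for the example and do not describe any particular catalogue product.

```python
import sqlite3


def build_catalogue() -> sqlite3.Connection:
    """Create an empty in-memory catalogue with two linked tables."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        """
        CREATE TABLE dataset (
            id INTEGER PRIMARY KEY,
            experiment TEXT NOT NULL,
            location TEXT NOT NULL      -- where the raw data file actually lives
        );
        CREATE TABLE metadata (
            dataset_id INTEGER NOT NULL REFERENCES dataset(id),
            term TEXT NOT NULL,         -- ideally a controlled-vocabulary term
            value TEXT NOT NULL
        );
        """
    )
    return con


def register(con, experiment, location, metadata):
    """Record a dataset's location and its metadata; return the new dataset id."""
    cur = con.execute(
        "INSERT INTO dataset (experiment, location) VALUES (?, ?)",
        (experiment, location),
    )
    dataset_id = cur.lastrowid
    con.executemany(
        "INSERT INTO metadata (dataset_id, term, value) VALUES (?, ?, ?)",
        [(dataset_id, term, value) for term, value in metadata.items()],
    )
    return dataset_id


def find_locations(con, term, value):
    """Return the locations of all datasets annotated with term = value."""
    rows = con.execute(
        "SELECT d.location FROM dataset d "
        "JOIN metadata m ON m.dataset_id = d.id "
        "WHERE m.term = ? AND m.value = ? ORDER BY d.id",
        (term, value),
    )
    return [row[0] for row in rows]


# Hypothetical datasets held in two different storage locations.
con = build_catalogue()
register(con, "EXP-001", "//labshare/exp001/sample001.raw", {"sex": "female", "tissue": "liver"})
register(con, "EXP-002", "/archive/exp002/sample001.raw", {"sex": "male", "tissue": "liver"})
```

In this sketch the catalogue only points at the files rather than copying them, which is closer to the federation approach than to a data warehouse. The two identically named files stay distinguishable because each is tied to its own experiment and metadata.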

Having a central repository also allows the integration of public datasets, such as the European Bioinformatics Institute’s (EMBL-EBI’s) ArrayExpress, and these can be used alongside proprietary datasets to add insight and value to experiments. Another advantage of such a catalogue is that it allows data to be shared easily and securely with both internal and external partners. For further background information, you may want to read our blog posts on Biocuration in the Enterprise, Part I and Part II.

Solution: metadata catalogue
Implementing such a metadata catalogue can be achieved in several ways: 1) an in-house solution, based on the optimisation of existing infrastructure and systems; 2) the adoption of a specific catalogue software product (such as eaglecore from Eagle) that is used in conjunction with other existing systems to provide overall information management of the datasets; or 3) a comprehensive information management solution, covering the capture, recording, storage, cataloguing and analysis of the raw data and metadata.

Regardless of which option is chosen, a metadata catalogue is a simple way to gather all experimental assets in one place and allows organisations to easily discover the information they already hold. This removes the roadblock of finding information, which speeds up research and ultimately reduces costs.
