March 3, 2017

Biocuration in the enterprise part I: From a trip to the movies to curating tumor data

A post by: Eleanor Stanley, biocurator and information security manager, and Yasmin Alam-Faruque, biocurator.


Cautionary tales for bioinformaticians (with apologies to Hilaire Belloc): For Frederick, who didn't pay enough attention to biocuration, and fell foul of his metadata as a result.

For Frederick, and anyone out there who hasn't been paying attention, biocuration is a growing field, with an increasingly important role in academic and enterprise research. This is a two-part blog about biocuration; the first part introduces the concepts of biocuration and metadata, and the second part explores standards in biocuration and the application of metadata management and biocuration.

We live in a world filled with data, with volumes increasing every day. The aim of biocuration is to make it easier for computational scientists and bioinformaticians to find the important information amongst all this noise. It does this by capturing, translating and integrating the relevant datasets and metadata, and filtering out any irrelevancies.

"Biocuration involves the translation and integration of information relevant to biology into a database or resource that enables integration of the scientific literature as well as large data sets. Accurate and comprehensive representation of biological knowledge, as well as easy access to this data for working scientists and a basis for computational analysis, are primary goals of biocuration." (International Society for Biocuration)

What is metadata?

Metadata is an important part of data content, and collating it is a necessary part of the biocurator's role. But what is it? Metadata is simply a description of datasets, and is essential across many aspects of life, because it puts information in context. As individuals, we use metadata all the time, perhaps without realising it or considering its implications.

"Simply put, metadata is data about data. It is descriptive information about a particular dataset, object, or resource, including how it is formatted, and when and by whom it was collected. Although metadata most commonly refers to web resources, it can be about either physical or electronic resources. It may be created automatically using software or entered by hand." (Indiana University)
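The definition above can be made concrete with a small sketch: descriptive metadata recorded alongside a dataset. The field names and values here are invented for illustration and are not drawn from any particular metadata standard.

```python
# A minimal sketch of descriptive metadata for a hypothetical dataset:
# "data about data" - how it is formatted, when and by whom collected.
dataset_metadata = {
    "title": "Tumour RNA-seq expression counts",   # what the data are
    "format": "CSV",                               # how they are formatted
    "collected_on": "2016-11-04",                  # when they were collected
    "collected_by": "J. Smith",                    # by whom
    "organism": "Homo sapiens",
    "record_count": 20531,
}

# Metadata lets us answer questions about a dataset without opening it.
print(dataset_metadata["format"])
print(dataset_metadata["collected_by"])
```

Even a structure this simple shows the point: a downstream user (or a program) can discover, filter and interpret the dataset from its description alone.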

As a way to understand metadata, we can use making a film as an example. To make decisions about the cast and the location for the next instalment of the popular franchise, the directors of Walt Disney's 'Pirates of the Caribbean 5' effectively used metadata from the previous films. Johnny Depp wasn't killed off in the previous film, because the absence of this main character could have had a detrimental effect on the film's viewing figures. Geoffrey Rush had been killed and resurrected once already in the series, but he is also popular, and so he survived the fourth film. Australia offers great location shooting and financial incentives, so why not shoot the film there?

Curating the data

Collecting metadata, categorising datasets and using controlled terms are all important steps in data discovery and management of legacy data collections.

Deciding precisely what descriptive metadata to capture and how to capture it is a very complex task. Taking an example from everyday life, metadata are like the variables that have to be taken into account when doing the family food shopping – is there a birthday coming up? Has someone become vegetarian? Is the weather going to stay good enough for a barbecue?

The layout of the shop relies on categorisation – fruit and vegetables, meat, rice and pasta, breakfast cereals – and the suppliers dictate standardised choices (the terms), from Royal Gala apples to Savoy cabbage, and from fettuccine to Frosted Shredded Wheat, so that we know what to expect and where to find it. The fact that similar groupings of items will be found in supermarkets across the country (or even across the world) means that shoppers can find what they want, wherever they are, and reuse the same shopping list in different places. Missing out something vital can cause major disruptions – just imagine the undesired consequences of not being able to find the candles for the birthday cake.

But there's more to categorisation and standardisation than just being able to find the right brand of peanut butter: in biocuration, it's essential for being able to find the vital pieces of data. In the biological sciences, several standards have been developed and are actively maintained. These standards define a framework that supports the collection and communication of complex metadata. Within this framework, controlled vocabularies of defined terms (often organised as ontologies) are used to annotate and define individual data items. Like shopping lists and supermarket categories, using controlled terms makes it easier to access, reuse and share information. The controlled terms can range from study designs through phenotypes to technologies, and the ontologies are under constant review, being improved by the revision of existing terms, the generation of new terms and the removal of redundant terms.
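The value of a controlled vocabulary can be sketched in a few lines of code: free-text entries are mapped onto agreed terms before a record is accepted, so the same concept is always stored the same way. The vocabulary and synonym mapping below are invented examples, not taken from any real ontology.

```python
# A minimal sketch of controlled-vocabulary checking (illustrative terms).
# Real biocuration pipelines would draw terms from a maintained ontology.
CONTROLLED_TERMS = {"Homo sapiens", "Mus musculus", "Danio rerio"}
SYNONYMS = {"human": "Homo sapiens", "mouse": "Mus musculus"}

def normalise_organism(value: str) -> str:
    """Map a submitted organism name onto its controlled term, or raise."""
    term = SYNONYMS.get(value.strip().lower(), value.strip())
    if term not in CONTROLLED_TERMS:
        raise ValueError(f"'{value}' is not a recognised organism term")
    return term

print(normalise_organism("human"))         # maps a synonym to the agreed term
print(normalise_organism("Mus musculus"))  # accepts an exact controlled term
```

Because "human", "Human" and "Homo sapiens" all resolve to one term, datasets annotated by different people remain findable with a single query, which is exactly the benefit the shopping-list analogy describes.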

Curation of data and metadata is an important step in every discipline, from a night at the movies, through the shopping trip, to the categorisation and analysis of tumour data. Come back soon for Biocuration in the enterprise part II, which looks more closely at standards in biocuration, and at the application of metadata management and biocuration through Eagle Genomics' solution eaglecore.


About Yasmin Alam-Faruque

Yasmin joined Eagle as a biocurator in early 2014.

Q: "Why do I enjoy data curation at Eagle?"
A: "It gives me the opportunity to find out about new industries, their areas of research, investigate and organize new datasets and work with the biomedical scientists who create and submit the data to make the data more accessible."

Yasmin came to biocuration from a start as a bench scientist, and brings an understanding of biomedical science from an academic perspective, with an MSc in immunology comparing the immunological mechanisms involved in corneal and skin graft rejection, a PhD in differential gene expression in mucosal cancers, and postdoctoral experience in autoimmune skin disease.

In her previous role as a scientific database curator at the European Bioinformatics Institute (EMBL-EBI), she worked on the Renal Gene Ontology Annotation Initiative, a project funded by the charity Kidney Research UK, to produce a resource that can be used in the interpretation of data from small- and large-scale experiments investigating the molecular mechanisms of kidney function and development, providing new biological insights and thereby helping towards alleviating renal disease. She also worked on the curation of various proteins, across species, in the UniProt Knowledgebase, including contributions to the Gene Ontology Annotation and IntAct protein-protein interaction databases.

Topics: Big data, biocuration, Bioinformatics, Blog, computational biology, eaglecore