Biocuration in the enterprise part II: Managing the data

A post by: Eleanor Stanley, biocurator and information security manager and Yasmin Alam-Faruque, biocurator

Cautionary tales for bioinformaticians (with apologies to Hilaire Belloc): For Jane, who didn't believe that standardisation and biocuration were important, and paid with the quality of her research

The previous part of this blog looked at the concepts of biocuration and metadata, and why standardisation is important. This second part looks more closely at standards in biocuration, and at how metadata management and biocuration are applied in practice through software such as Eagle Genomics' solution eaglecore.

Standardisation in data management and biocuration is critical: without it, it is hard to manage large datasets, find key pieces of information, or compare like with like. One of the emerging standards used in large-scale experimentation is the ISA framework.

"The Investigation/Study/Assay (ISA) metadata tracking framework provides a tool kit to facilitate standards compliant collection, curation and local management of experiments used by an increasingly diverse set of life science groups world-wide" (Oxford e-Research Centre, University of Oxford)

The ISA framework is made up of three layers:

  • ‘Investigation’ (the project context)
  • ‘Study’ (a unit of research, often encapsulating the hypothesis under investigation)
  • ‘Assay’ (analytical measurement)

The framework is a general-purpose skeleton to provide a rich description of the experimental metadata (i.e. sample characteristics, technologies used, type of measurements made) from 'omics-based' experiments so that the results and discoveries are reproducible and reusable.
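The three layers can be pictured as nested records. The following is a minimal sketch in Python, not the ISA specification or the API of any ISA tool: the class names, field names and example values are all hypothetical and chosen only to illustrate how an investigation contains studies, which in turn contain assays.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Assay:
    """An analytical measurement (hypothetical fields for illustration)."""
    measurement_type: str
    technology: str


@dataclass
class Study:
    """A unit of research, with its sample characteristics and assays."""
    title: str
    sample_characteristics: Dict[str, str] = field(default_factory=dict)
    assays: List[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    """The project context that groups related studies."""
    title: str
    studies: List[Study] = field(default_factory=list)

    def technologies(self) -> List[str]:
        """Return the sorted, de-duplicated technologies used across all studies."""
        return sorted({a.technology for s in self.studies for a in s.assays})


# Made-up example data, not taken from any real experiment.
investigation = Investigation(
    title="Example investigation",
    studies=[
        Study(
            title="Example expression study",
            sample_characteristics={"organism": "Homo sapiens"},
            assays=[
                Assay("transcription profiling", "DNA microarray"),
                Assay("transcription profiling", "RNA-Seq"),
            ],
        )
    ],
)

print(investigation.technologies())
```

Because every assay sits under a study and every study under an investigation, a question such as "which technologies were used in this project?" becomes a simple walk down the hierarchy.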

Managing metadata

Managing metadata is an important part of the process, and a number of developers have created software to support the organisation and storage of various types of heterogeneous datasets. These programs aim to provide experimental context to the datasets and describe the data processing steps as completely as possible. This is achieved by using ontologies to standardise data capture and by bringing data together from multiple sources, such as literature, array databases, sequence read databases and clinical measurements, into a centralised, secure resource.
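As a rough illustration of standardised data capture, the sketch below maps free-text organism labels from two sources onto a single preferred term before merging the records. It is an assumption-laden toy: the synonym table, the function names and the record layout are hypothetical, and a real system would take its terms from a published ontology rather than a hand-written dictionary.

```python
from typing import Dict, List

# Hypothetical controlled vocabulary: free-text synonym -> preferred term.
ORGANISM_SYNONYMS: Dict[str, str] = {
    "human": "Homo sapiens",
    "h. sapiens": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "fruit fly": "Drosophila melanogaster",
    "drosophila melanogaster": "Drosophila melanogaster",
}


def normalise_organism(raw: str) -> str:
    """Map a free-text organism label to its preferred term."""
    # Lower-case and collapse whitespace so trivial variants match.
    key = " ".join(raw.lower().split())
    if key not in ORGANISM_SYNONYMS:
        # Unknown labels are rejected rather than guessed, so a curator can review them.
        raise ValueError(f"Unrecognised organism label: {raw!r}")
    return ORGANISM_SYNONYMS[key]


def merge_records(sources: Dict[str, List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Combine records from several sources into one list with standardised terms."""
    merged = []
    for source_name, records in sources.items():
        for record in records:
            merged.append({
                "sample_id": record["sample_id"],
                "organism": normalise_organism(record["organism"]),
                "source": source_name,  # keep provenance for each record
            })
    return merged


# Made-up records standing in for two of the source types mentioned above.
sources = {
    "array_database": [{"sample_id": "S1", "organism": "Human"}],
    "clinical_measurements": [{"sample_id": "S2", "organism": " h. sapiens "}],
}
merged = merge_records(sources)
print({record["organism"] for record in merged})
```

Once both records carry the same preferred term, a single query for that term finds samples from either source, which is what makes comparing like with like possible.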

Using the hierarchical ISA standard for the intelligent management and sharing of genomics data and metadata creates a structured and standardised format. The ISA hierarchy is flexible and scalable, accommodating transcription profiling by array data, RNA-Seq expression data and the results of clinical studies, among many other study types.

This structured metadata provides specifics on samples or subjects and the methodological steps involved in generating and assaying them, and can link to the publicly-archived raw or processed data, provided in machine-readable formats that suit data users.

Metadata management systems allow researchers to tackle scientific data problems within life science research and development. For example, if there is legacy data stored in an ad hoc manner in multiple locations, these systems could allow datasets to be brought together into one central location, systematically capturing all data items and allowing reuse for relevant current studies, and for re-analysis using new technologies. Connected platforms also improve communication and efficiency across an organisation, by allowing authorised researchers to see which studies or investigations have already been undertaken and hence reduce the potential for duplication.

Metadata management: eaglecore

In collaboration with market-leading life science R&D customers, Eagle Genomics has developed eaglecore for metadata management.

"eaglecore is Eagle Genomics' enterprise cloud solution for the management of genomic data and metadata" (Eagle Genomics)

Curation of many data types is possible within eaglecore. This curation can be shared internally within a team or externally between different groups. The data is fully queryable and can be exported in a number of different formats for reuse in other tools, as desired.

The developers behind eaglecore have understood the importance of data security, and so the platform is compatible with ISO 27001:2013 certification, NHS Information Governance Commercial third-party registration, and HIPAA compliance. The system includes multiple levels of security, such as the use of secure web interfaces, user authentication via Enterprise Access Management integration, role-based access control at both application and storage level, and data encryption both at rest and in transit.

Effective management of datasets, from setting standards in biocuration through to storing and organising metadata using software such as Eagle Genomics' solution eaglecore, is vital to help bioinformaticians get the most out of the vast quantity of information that is being generated every day by researchers around the world.

About Eleanor Stanley

Biocurator and information security manager Eleanor Stanley is a biocurator at Eagle Genomics, and is also responsible for information security. She joined the company in mid-2014 from the Wellcome Trust Sanger Institute (WTSI), where she worked as a bioinformatician building a pipeline for genome annotation within the 50 Helminth Genomes Initiative, which is part of the Global health research project at WTSI. Eleanor’s entire career since university has been biocuration, though she had a flutter and gained a Master's degree in bioinformatics in 2012. She began as a literature curator with FlyBase at the University of Cambridge and then the UniProt initiative at the European Bioinformatics Institute (EMBL-EBI), focusing on Drosophila, worms, alternative splicing and complete proteome sets. From here she mixed bioinformatics and biocuration at WTSI, taking the automatically generated gene models for Onchocerca volvulus and manually improving them for WormBase. "While fly biology and biocuration of worm datasets isn't the most common route into human genomics, it's still all about getting new data and finding out what it's about. Eagle has given me a great opportunity to learn all about a new area."

About Yasmin Alam-Faruque

Biocurator Yasmin Alam-Faruque is a member of Eagle Genomics' Biocuration team, joining in early 2014. "Why do I enjoy data curation at Eagle? It gives me the opportunity to find out about new industries, their areas of research, investigate and organise new datasets and work with the biomedical scientists who create and submit the data to make the data more accessible." Yasmin came to biocuration from a start as a bench scientist, and brings an understanding of biomedical science from an academic perspective, with an MSc in immunology comparing the immunological mechanism involved in corneal and skin graft rejection, a PhD in differential gene expression in mucosal cancers and postdoctoral experience in autoimmune skin disease. In her previous role as a scientific database curator at the European Bioinformatics Institute (EMBL-EBI), she worked on the Renal Gene Ontology Annotation Initiative, a project funded by the charity Kidney Research UK, to produce a resource that can be utilised in the interpretation of data from small- and large-scale experiments investigating molecular mechanisms of kidney function and development, providing new biological insights and thereby helping towards alleviating renal disease. She also worked on the curation of various proteins, across species, in the UniProt Knowledgebase, including contributions to the Gene Ontology Annotation and the IntAct protein-protein interaction databases.