The assertion is that every phenotype has a molecular basis: our genome holds the riddle that, once solved, could explain our appearance, behaviour, disease susceptibility, mental and physical health, and many other traits of interest. But how can we navigate the vast amount of available data to find the answers?
To achieve this ultimate dream, biological data needs to be captured in a defined structure or standard, and the terms used to describe it need to be sourced from controlled vocabularies and ontologies. Together, standards and ontologies allow more effective queries in scientific research, for example across the genomic and biological data sets of patient cohorts. Such queries can identify elusive biomarkers that could inform a consultant of the probability that a particular patient is predisposed to a specific disease, e.g. cardiovascular disease, skin cancer or Alzheimer's. Such beneficial research can be achieved if the data is structured well. Unfortunately, not all R&D information comes in a ready-to-use format.
Data standards are documented agreements on the representation, format, definition, structuring, tagging, transmission, manipulation, use and management of data. A simple example of how complex data representation can be is a calendar date. Is a date really that complicated, you may ask? Programmers would agree that dates and times are easily misunderstood! A date's representation varies in the order of the basic components (day, month, year), the format of the components and the separators, for example:
- 1st April 2008
- (or January 4, 2008)
Never fear, there is a standard to solve this problem: “ISO 8601, Data elements and interchange formats – Representation of date and time”. This standard allows an unambiguous and well-defined method of representing dates and times, avoiding misinterpretation of numeric representations. For the example above, the correct format would be 2008-04-01.
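To make the ambiguity concrete, here is a minimal Python sketch. It assumes the date was written numerically as 01/04/2008 (an illustrative form, not one quoted above) and shows how the same string yields two different ISO 8601 dates depending on whether it is read day-first or month-first.

```python
from datetime import datetime

# Assumed numeric spelling of the date, chosen for illustration only.
ambiguous = "01/04/2008"

# The same string parses to two different dates depending on the assumed order.
day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()
month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()

# date.isoformat() emits the ISO 8601 calendar form YYYY-MM-DD.
iso_day_first = day_first.isoformat()
iso_month_first = month_first.isoformat()

print(iso_day_first)    # 2008-04-01 (1st April 2008)
print(iso_month_first)  # 2008-01-04 (January 4, 2008)
```

Once a date is stored in the ISO 8601 form, no knowledge of the writer's local convention is needed to read it back.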
An example of a complex data standard comes from CDISC, a global non-profit charitable organization with over 300 supporting member organizations from across the clinical research and healthcare arenas. Its vision is to inform patient care and safety through higher quality medical research. The U.S. Food and Drug Administration (FDA) mandates CDISC standards for the exchange of clinical trial data; this enables advances in data evaluation and, as a consequence, speeds the delivery of new discoveries to the public. The potential for this area of research cannot be overstated.
Implementing a standard does not solve inherent data quality problems. One way to tackle this issue is to use controlled vocabularies, which provide a way to organise knowledge. Another is to use ontologies, which are designed to enable knowledge sharing and reuse. A bio-ontology provides formal names and definitions for the types, properties and inter-relationships of entities within an area of scientific research.
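As a rough illustration of what "types and inter-relationships" can mean in practice, the sketch below models a toy ontology as named terms linked by "is a" relationships. The identifiers (`EX:0001` and so on) and the hierarchy are invented for this example and are not taken from any real ontology.

```python
# Hypothetical term IDs and labels, invented for illustration only.
TERMS = {
    "EX:0001": "disease",
    "EX:0002": "cardiovascular disease",
    "EX:0003": "cancer",
    "EX:0004": "skin cancer",
}

# Each term points to its single parent via an "is a" relationship.
IS_A = {
    "EX:0002": "EX:0001",
    "EX:0003": "EX:0001",
    "EX:0004": "EX:0003",
}


def ancestors(term_id):
    """Return the chain of parent term IDs, nearest parent first."""
    chain = []
    while term_id in IS_A:
        term_id = IS_A[term_id]
        chain.append(term_id)
    return chain


def is_a(term_id, ancestor_id):
    """True if term_id is ancestor_id itself or one of its descendants."""
    return term_id == ancestor_id or ancestor_id in ancestors(term_id)


print([TERMS[t] for t in ancestors("EX:0004")])  # ['cancer', 'disease']
```

Real bio-ontologies are far richer (multiple parents, other relationship types, formal definitions), but even this small structure lets a query for "disease" also find data annotated with "skin cancer".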
Health care systems (GPs, medical informaticians, researchers) use several ontologies (SNOMED CT, ICD-9/10, MedDRA, Read codes, HPO, EFO, MeSH), each with a distinct role in describing disease symptoms, lab tests, diagnoses, patient care and outcome statistics. Using controlled vocabularies and bio-ontologies allows consistent, unambiguous terms, which in turn aids the searching and browsing of the harmonised metadata. It is widely held that this harmonisation benefits scientific and medical research. In certain environments, however, data standards have produced controversy and been attacked as potentially hindering medical practice and existing instead to serve the insurance industry.
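The search benefit of harmonised metadata can be sketched in a few lines of Python. The synonym table, sample records and function names below are hypothetical and hand-written for illustration; a real system would draw preferred terms from one of the vocabularies listed above rather than from a small dictionary.

```python
# Hypothetical synonym table mapping free-text labels to one preferred term.
SYNONYMS = {
    "heart attack": "myocardial infarction",
    "mi": "myocardial infarction",
    "myocardial infarction": "myocardial infarction",
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
}


def harmonise(label):
    """Map a free-text label to its preferred term, or None if unknown."""
    return SYNONYMS.get(label.strip().lower())


# Invented sample metadata, as it might arrive from different sources.
records = [
    {"sample": "S1", "diagnosis": "Heart attack"},
    {"sample": "S2", "diagnosis": "MI"},
    {"sample": "S3", "diagnosis": "High blood pressure"},
    {"sample": "S4", "diagnosis": "unclear"},
]


def search(records, preferred_term):
    """Return sample IDs whose harmonised diagnosis matches the preferred term."""
    return [r["sample"] for r in records if harmonise(r["diagnosis"]) == preferred_term]


print(search(records, "myocardial infarction"))  # ['S1', 'S2']
```

Without the harmonisation step, a plain text search for "myocardial infarction" would miss both matching samples, and unmapped labels such as "unclear" would go unnoticed instead of being flagged for curation.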
We have written a bit more information on the use of metadata in scientific organisations. See our other two blog posts – Biocuration in the enterprise Part 1, and Biocuration Part 2: Managing the Data.
Data stewardship is an ethic that embodies the responsible planning and management of information-based resources. Aspects of proper data stewardship include ensuring structure to the capture of metadata to improve the retrieval, sharing and preservation of information over time. Eagle Genomics services have traditionally emphasised proper data stewardship by providing biocuration services to overcome the hurdle of unstructured datasets.
Data stewardship appears on the surface to be a prosaic problem, but I hope you can now see that it is a difficult one: unambiguous and well-defined methods are required to enable the capture, organisation, linking and interpretation of data. For example, data collected for a disease area of interest may involve 15,000-20,000 arrays, with the information spread over multiple data sets, file systems and research centres. From multi-omic datasets, researchers need to find the complementarity between cohorts to identify profiles in the data that may aid clinical interpretation and decision-making. The rate at which all this information can be processed becomes the limiting factor in reaching the correct clinical conclusion.
What’s the solution?
Well, for one, it’s important for scientists, researchers, and R&D decision makers to really understand how standards impact the effective use of scientific data. Eagle Genomics firmly believes in helping life science companies navigate these murky waters. We developed our information management platform, eaglecore, to give researchers a unified view of their datasets, provide easy and consistent access to quality data, and integrate applications into one platform to aid the exploration of scientific data and the building of patient cohorts. We believe this ultimately enables major discoveries on issues facing our world, such as the identification of unique biomarkers and the development of more diagnostic tests for genetic diseases.
A few questions for our readers:
- What do you think is holding research data back from being used effectively?
- How much do you think data standards really matter?