Can I reuse my data?
As data scientists are fond of telling us, data reuse reduces time to discovery, promotes cost efficiency and ultimately leads to discoveries that would not otherwise have been made. Can organisations currently reuse their existing datasets? Often they cannot. Once a dataset has fulfilled its primary purpose, there is limited incentive to augment it with the context and provenance required to enable secondary use. Simply put, access to consistent, high-quality metadata is critical to finding, understanding and reusing scientific data, but that metadata is too often missing.
The Eagle DataCatalog+ process
The current state is inefficient, forcing data scientists to rely on institutional knowledge or their network of colleagues to locate datasets. There is no guarantee that all relevant data will be identified. By extension, data scientists new to the company have nothing to help them become effective quickly. These issues are far from restricted to the life sciences. Last year the New York Times coined the term ‘Data Janitor’ for the work of finding, selecting and preparing data prior to analysis, which typically consumes 50%-80% of a data scientist’s time.
This vacuum of knowledge can be filled by building a dataset catalog: a single place where the most valuable systems and data sources in the business are described. Such catalogs allow researchers to conveniently find what is available, where it is, who produced it and what format it is in.
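To make the idea concrete, here is a minimal sketch of what a catalog entry and a keyword search might look like in Python. It is an illustration only, not a description of DataCatalog+ itself: the class names, fields and the sample dataset are all hypothetical, chosen to mirror the four questions above (what, where, who, what format).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    """One described dataset. All field names are illustrative assumptions."""
    name: str                       # what is available
    location: str                   # where it is
    producer: str                   # who produced it
    data_format: str                # what format it is in
    keywords: Tuple[str, ...] = ()  # free-text tags to aid discovery


class DatasetCatalog:
    """A single in-memory registry of dataset descriptions (hypothetical)."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # A later registration under the same name replaces the earlier one.
        self._entries[entry.name] = entry

    def search(self, term: str) -> List[CatalogEntry]:
        # Case-insensitive substring match against the name and keywords.
        needle = term.lower()
        return [
            entry
            for entry in self._entries.values()
            if needle in entry.name.lower()
            or any(needle in keyword.lower() for keyword in entry.keywords)
        ]


# Hypothetical usage: the dataset, bucket and team below are invented.
catalog = DatasetCatalog()
catalog.register(
    CatalogEntry(
        name="assay_results",
        location="s3://example-bucket/assays/",
        producer="Screening team",
        data_format="CSV",
        keywords=("assay", "screening"),
    )
)
for hit in catalog.search("screening"):
    print(hit.name, hit.location, hit.producer, hit.data_format)
```

Even a toy like this shows why the metadata matters: a dataset that was never registered, or was registered without keywords, simply cannot be found by a newcomer.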
The construction of a dataset catalog that is fit for purpose is not a simple undertaking. Eagle evangelise an approach called “DataCatalog+”.
The benefits of data reuse are potentially transformative: bringing disparate data types together, providing the capability to aggregate and analyse common datasets, and ultimately delivering scientific insight.