Data management for large collaborative projects: challenges and solutions
By Arek Kasprzyk, Head of Data Management at the Center for Translational Genomics and Bioinformatics,
Speaker for our 4th Annual Symposium:
Talk time: 12:30 pm, 27.3.2014
Talk abstract: "Biological data management is a challenging undertaking. It is challenging for database designers, because biological concepts are complex and not always well defined, and therefore the data models that are used to represent them are constantly changing as new techniques are developed and new information becomes available. It is challenging for collaborating groups based in different geographical locations who wish to have unified access to their distributed data sources, because combining and presenting their data creates logistical difficulties. Finally, it is challenging for users of biological databases, because in order to correctly interpret the experimental data located in one database, additional information from other databases is frequently needed, requiring the user to learn multiple systems. The BioMart project (www.biomart.org) was initiated to address these challenges.
BioMart is a freely available, open source, federated database system that provides unified access to disparate, geographically distributed data sources. It is designed to be data agnostic and platform independent, such that existing databases can easily be incorporated into the BioMart framework. BioMart offers different types of access tailored to different groups of users. For biologists, BioMart offers a number of interactive and customisable web-based graphical user interfaces. For bioinformaticians, BioMart provides data access through a range of application programming interfaces. For service providers, BioMart offers a highly customisable system that can be installed locally and tailored to support different types of data management needs.
In this talk I will share my experiences in managing data for large international collaborations involving academic and industry partners. I will also outline the current status of BioMart’s software and services, and describe its new features, such as tools for analysing next-generation sequencing data."
Bio: "After earning a medical degree and a PhD in molecular biology, Arek Kasprzyk decided to pursue his passion for information technology and obtained an MSc in Bioinformatics. This unique background enabled him to obtain a position at the European Bioinformatics Institute in Hinxton, United Kingdom, where he designed and implemented BioMart, an innovative open source software system for biomedical research, which provided the first large-scale federated data management solution. While BioMart’s original goal was to manage data from the Human Genome Project hosted by the Sanger Institute, it has since grown to become a multi-institute collaboration involving a large number of different database projects and 28 different scientific organizations on five continents: Asia, Australia, Europe, North America and South America.
As a result of this success, he was personally recruited by the Ontario Institute for Cancer Research (OICR) in Toronto, Canada to lead the International Cancer Genome Consortium Data Coordination Centre and create the architecture to manage its data, which will eventually be equivalent to 50,000 Human Genome Projects. Under his leadership and guidance, a group of developers and scientists re-engineered the BioMart software and achieved this ambitious goal two years ahead of schedule. As a result of this accomplishment, the November 2011 issue of Database: The Journal of Biological Databases and Curation was dedicated solely to BioMart. Shortly after, BioMart was featured in the January 2012 issue of Nature Methods, which highlighted novel adaptive technologies.
Kasprzyk currently holds the position of Head of Data Management at the Center for Translational Genomics and Bioinformatics, San Raffaele Research Institute in Milan, Italy. He provides data management support for the institute’s initiatives and oversees the development of tools for analysing next-generation sequencing data."