Here is the first abstract from the talks to be given at our 2nd Symposium, "The Next 10 Years of Genome Content Management", on 29th March at the Babraham Research Campus, Cambridge. Don't forget to register before 15th January to benefit from the early-bird discount.
The title is: A data warehouse approach for large genotype-by-sequencing datasets
Presented by: Mario Caccamo, Head of Bioinformatics, TGAC
"As the cost of sequencing continues to drop, the use of these technologies to directly genotype large populations is becoming the method of choice. Direct sequencing goes beyond the detection of single-nucleotide polymorphisms, allowing screening for more complex variants, including insertions/deletions and translocations. The manipulation of large genotype datasets using conventional relational database solutions, however, does not scale, and in many cases the size of the indexes surpasses the size of the data. There are many characteristics of genotype data that cannot be efficiently exploited by relational approaches. For instance, genotype data are generated once and used many times (so-called WORM data, for write-once read-many), so an indexing structure supporting updates is not required. Another important observation is that, even for complex structural variants, simple and uniform data types can model genotypes. This suggests that simple flat files on high-performance disks can be a satisfactory solution for implementing genotype warehouse databases. Unfortunately, when genotype data are combined with phenotype information, this simple solution will not suffice. In this presentation we will discuss the development of a database platform that can efficiently support genotype-by-sequencing datasets, with some examples of applications to plant genomics data."
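To make the flat-file idea in the abstract concrete, here is a minimal Python sketch. It is not the platform the talk describes. The one-byte genotype codes, the row-major variants-by-samples layout, and the function names (write_matrix, read_call, read_variant) are all assumptions made for illustration. Because every call has the same width and the file is written once, the position of any genotype can be computed arithmetically, so no index structure is needed.

```python
import mmap
import os
import tempfile

# Hypothetical encoding (an assumption for this sketch): one byte per call.
CODES = {"0/0": 0, "0/1": 1, "1/1": 2, "./.": 255}
DECODE = {code: call for call, code in CODES.items()}


def write_matrix(path, rows):
    """Write a variants-by-samples matrix once, row-major, one byte per call."""
    n_samples = len(rows[0])
    with open(path, "wb") as fh:
        for row in rows:
            if len(row) != n_samples:
                raise ValueError("all rows must have the same number of samples")
            fh.write(bytes(CODES[call] for call in row))
    return n_samples


def read_call(path, n_samples, variant, sample):
    """Read one genotype by computing its byte offset; no index is consulted."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return DECODE[mm[variant * n_samples + sample]]


def read_variant(path, n_samples, variant):
    """Read all samples for one variant as a single contiguous slice."""
    with open(path, "rb") as fh:
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = variant * n_samples
            return [DECODE[b] for b in mm[start:start + n_samples]]


if __name__ == "__main__":
    # Toy data: two variants genotyped in three samples.
    path = os.path.join(tempfile.mkdtemp(), "genotypes.bin")
    n = write_matrix(path, [["0/0", "0/1", "1/1"], ["./.", "0/0", "0/1"]])
    print(read_call(path, n, variant=0, sample=2))
    print(read_variant(path, n, variant=1))
```

A layout like this serves lookups by variant and sample well, but it has no natural place for phenotype attributes or ad hoc queries across them, which is the limitation the abstract points to.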