What taxonomy database should I use?

There is no single answer to this question.

I'm not going to attempt a comprehensive review of the various taxonomy databases and their pros and cons here. Instead, I'll show some examples of why you might choose a particular database in particular circumstances. The database you choose depends on several factors, including the type of sequences you are analysing, the overall goal of the study and the data volume being analysed.

The sequence type (amplicon or shotgun) makes a big difference to which database you will want to use. To give you some specific examples, I'm going to assume that you are looking at 16S amplicon sequences. In this case, the most commonly used databases are Greengenes, RDP and Silva. These are all good choices: they cover a large number of species, and they are manually curated and revised as necessary. There are also sequence databases for particular environments, e.g. HOMD for the oral microbiome.

Because there are several good taxonomy databases, the one you choose depends heavily on the overall goal of your study. RDP and Silva* are updated more frequently than Greengenes, so if you want the most up-to-date and comprehensive taxonomies, they can be the best choice. However, Greengenes was (and still is) used for analysing many human microbiome studies. If your aim is to compare your results to these studies, then Greengenes would be a better choice. Alternatively, if you are analysing a very specific microbiome, you might want to use a database tailored to that particular environment, e.g. HOMD for oral microbiomes.
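As a rough illustration, the reasoning above could be written down as a small Python helper. This is only a sketch: the function name, its parameters and the order in which the rules are applied are my own assumptions, not part of any tool, and a real decision would weigh more factors than these.

```python
from typing import List, Optional

# Environment-specific databases mentioned in this post; extend as needed.
ENVIRONMENT_DATABASES = {"oral": "HOMD"}


def suggest_databases(compare_with_greengenes_studies: bool = False,
                      environment: Optional[str] = None,
                      silva_licensed: bool = True) -> List[str]:
    """Return candidate 16S taxonomy databases (hypothetical helper).

    The rule order below is an assumption made for illustration:
    1. a tailored database wins if one is listed for the environment,
    2. Greengenes if results must be comparable with Greengenes-based studies,
    3. otherwise the more frequently updated RDP and Silva.
    """
    if environment is not None:
        tailored = ENVIRONMENT_DATABASES.get(environment.lower())
        if tailored is not None:
            return [tailored]
    if compare_with_greengenes_studies:
        return ["Greengenes"]
    candidates = ["RDP", "Silva"]
    if not silva_licensed:
        # Silva requires a license for non-academic use (see the footnote).
        candidates.remove("Silva")
    return candidates


if __name__ == "__main__":
    print(suggest_databases(environment="oral"))
    print(suggest_databases(compare_with_greengenes_studies=True))
    print(suggest_databases(silva_licensed=False))
```

The `silva_licensed` flag is there only to reflect the licensing footnote; set it to `False` if you are working non-academically without a Silva license.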

Once you have decided on the taxonomy database, you need to consider which taxonomy assignment methodology you are going to use. This is where data volume becomes important. Taking Qiime as an example, you can assign taxonomies to sequences against your chosen taxonomy database with any of the following: uclust, SortMeRNA, BLAST, the RDP classifier, RTAX and Mothur (see the Qiime documentation for more information). The different assignment methodologies will give differing results, and their speed varies enormously, e.g. SortMeRNA is hundreds of times faster than BLAST. The data volume is therefore critical to choosing the appropriate methodology. For example, if you clustered your sequences at 97% identity to generate OTUs, there will typically be at most a few thousand sequences to analyse, so BLAST would be an appropriate tool. On the other hand, if you wanted to assign taxonomies to unique sequences and then generate OTUs by taxonomy clustering (this is less common than generating OTUs by identity clustering), there may be millions of sequences to analyse. In this case SortMeRNA would probably be a better choice.
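To make the data-volume argument concrete, here is a minimal Python sketch that picks between the two methods used as examples above. The function name and the `blast_limit` cut-off of 100,000 sequences are assumptions chosen purely for illustration; the post only says that a few thousand sequences suit BLAST and millions suit SortMeRNA, so the right threshold for your hardware and time budget may be quite different.

```python
def suggest_assignment_method(n_sequences: int, blast_limit: int = 100_000) -> str:
    """Suggest a taxonomy assignment method from the number of sequences.

    blast_limit is a hypothetical cut-off, not a value from any tool:
    at or below it the slower BLAST is assumed to be acceptable,
    above it the much faster SortMeRNA is suggested instead.
    """
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    return "BLAST" if n_sequences <= blast_limit else "SortMeRNA"


if __name__ == "__main__":
    # A few thousand OTU representative sequences (97% identity clustering).
    print(suggest_assignment_method(3_000))
    # Millions of unique sequences (taxonomy-based clustering).
    print(suggest_assignment_method(2_500_000))
```

The same idea extends to the other methods Qiime supports, but I have left them out because the post gives no speed figures for them.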

In summary, many considerations factor into the choice of taxonomy database. The decision depends on your needs and on the resources you have available.

* RDP and Greengenes are free to use for everyone, while Silva requires a license for nonacademic use (this may also affect your choice).

Craig McAnulla

About Craig McAnulla

Craig earned his Ph.D. in Microbiology at the University of Warwick. He then completed his postdoctoral research at the John Innes Centre before joining EBI and developing their bioinformatics capability. Dr McAnulla has extensive experience in developing large-scale pipelines for analyzing microbiome datasets. His skills cover both software and biology, and he often provides deep insights to industry experts on these topics.