March 3, 2017

Importing Ensembl into Amazon RDB

Ensembl very kindly distributes their databases on Amazon as public datasets in a variety of forms. The most straightforward to use is the set of MySQL database files that can simply be copied into a MySQL server's data directory and they will become instantly available.

However, the most cloud-like way to use databases within Amazon is to use their RDB (MySQL) service, but of course because this is a cloud shared service they do not allow direct access to the data directory, or indeed any other aspect of the operating system upon which MySQL is running.

So how to get the data in? You have no option but to use mysqlimport against what Ensembl calls the 'flat file dumps', currently available as a volume snapshot in Amazon's US-East region only.

This task can be tedious, so here's a script that does it all for you (based on the conventions that Ensembl use for distributing their flatfiles):


# Change these variables before execution

# Do not alter below here
for db in `ls -1 $FLATFILE_PATH`
echo "Creating database $db."
echo "create or replace database $db;" |
mysql -h $RDB_HOST -u $RDB_ADMIN -p$RDB_PWD
echo "Extracting and executing $db DDL."
zcat $FLATFILE_PATH/$db/$db.sql.gz |
mysql -h $RDB_HOST -u $RDB_ADMIN -p$RDB_PWD $db
for tabf in `ls -1 $FLATFILE_PATH/$db |
grep -v CHECKSUMS | grep -v $db`
tab=`echo $tabf | sed 's/(.*).txt.gz/1/'`
echo "Uncompressing and splitting data for $tab."
zcat $FLATFILE_PATH/$db/$tabf |
split -C 256m -d - $tab.part_
echo "Importing data for $tab."
mysqlimport -d --local --compress  -h $RDB_HOST
-u $RDB_ADMIN -p$RDB_PWD $db $tab.part_*
echo "Cleaning up $tab temporary files."
rm $tab.part_*
echo "Finished table $tab."
echo "Finished database $db."

This script needs to be started from a directory where there is plenty of free space for unpacking the largest Ensembl table. 100Gb should do it.

Minor things to watch out for are the warnings about importing data into tables that don't exist. In the Ensembl 59 flatfile dump, these are all in the funcgen databases and end in '_view'. Don't worry, you can ignore these warnings. Ensembl have accidentally dumped out the data from temporary tables used during production and these can be safely discarded.

Topics: AWS, Big data technology, Bioinformatics, eaglensembl, Ensembl, import, MySQL, Open source, RDB