Tuesday, 6 December 2011

Danio rerio: five species in one ... BIN!

So, I've just got back from the 4th International Barcode of Life Conference in Adelaide. An enjoyable time was had by all, and there's plenty to think about. Now, if you don't quite understand the title of this blog post, bear with me, and hopefully all will be explained by the end. There were three main themes I got from the conference, and I will try to draw them together.

Bonython Hall, University of Adelaide

Data access

We heard this again and again. Having data languishing in private projects is helping nobody, but publishing on other people's hard-collected data is certainly not cool either. The "Fort Lauderdale Agreement" aims to make a comprise between the two, and allow fair use where appropriate. As an incentive for the rest of us, leading researchers and museums will be releasing significant barcode datasets very soon.

A problem with early data release is the massive accumulation of sequences on GenBank without proper binomials; these have been termed "dark taxa" by Prof. Rod Page in his thoughtful blog post on the subject. Much of these data have come from BOLD. This has caused something of a problem for GenBank, especially where taxon names had subsequently been changed on BOLD. It was announced that a system of phases is to be introduced to differentiate data with different levels of annotation. The "phase zero" data with very little information other than the sequence will be "cleansed" off GenBank soon (removed from searches, but remain in the system). BOLD and GenBank databases are now expected to update each other more regularly too.

However, in answer to Rod's question of what can we do with "bad data" like this, we saw several excellent presentations on the kind of science that can be done on large datasets even without taxonomic names (I will try to get some links up to the videos when they are available).

BINs (barcode index numbers)

These were unveiled with perhaps a little less fanfare than expected given their importance; they had apparently been around since the last barcode conference two years ago, but have only now been made visible in BOLD 3.0 beta.

They are essentially clusters recognised by BOLD as putative species or species-like groups, independent of the taxonomic name system. Importantly, they are indexed and can be treated just like taxonomic names (i.e., created, stored and synonymised). I think a system like this is required, due to the fact that modern biodiversity science is as much a problem of information management as it is of species concepts and taxon definitions.

They offer many attractive advantages by: (1) linking sequences together with taxonomic names, literature, databases and museum vouchers; (2) simplifying the identification process; (3) tracking conflicting identifications and species with interim code names; and (4) offering scalable assessment of biodiversity.

Although announced as an "interim" taxonomic system I can't help but think that this endeavour may obviate the need for Linnaean names altogether in many groups. This could particularly be the case where one is more interested in say broad phylogenetic patterns across geographic areas or ecological guilds. It will now be all but impossible for "traditional taxonomy" to catch up with these BINs given the rate at which barcode data are now generated. Those who believe taxonomy is but a "service industry" to other branches of science will rejoice, as there is now the potential for a rapid, semi-automated, and fully scalable biodiversity assessment tool commensurate to the challenge at hand. Therefore there may no longer be room to argue that traditional taxonomy is required to document our deteriorating world. Those who prefer a "whole organism" approach may not be so impressed. The onus is perhaps on them now to justify why such a holistic science is valuable in the short-medium term. Of course the reasons are obvious to me*, but it may be a hard sell in today's output driven world.

Specifically, some issues also need to be ironed out with the BIN framework, particularly the repeatability of these clusters, as the algorithms under which they were generated are yet to be published and scrutinised, despite BOLD 3.0 going live and effectively hitting the detonate button.

Conflicting IDs

Now, this issue of BINs brings me nicely back to the title. In case you didn't get it, it's a play on the paper in PNAS entitled "Ten species in one: DNA barcoding reveals cryptic species in the Neotropical skipper butterfly Astraptes fulgerator". There the authors reported cryptic diversity in a widely dispersed species.

In contrast, here the problem is that currently the BIN for the zebrafish Danio rerio contains five different binomials! Given that of all 40,000+ fishes this species is arguably the one we humans know most about, this is perhaps surprising and worrying. One record was D. rerio proper, one was labelled D. cf. rerio, another was a legitimate synonym of D. rerio, another was what looked like a misspelling of a legitimate synonym of D. rerio, and the last was labelled Xiphophorus hellerii, a fish in a completely different order! Some of the public D. rerio records were just identified as "Cypriniformes sp.".

Danio cf. rerio (BIN AAE3739)

This certainly calls into question the utility of barcoding for regulatory purposes such as seafood substitution, or monitoring invasive aquarium fish imports. Non-biologist regulators will be relying on good barcode reference libraries, and may end up acting conservatively, e.g. by rejecting all imports of aquarium zebra danios because BOLD was unable to give an unambiguous ID to species level. In a presentation by Dr Bob Hanner, it was estimated that for fishes, one in ten BINs contain more than one species. This I can only assume will rise especially where a number of labs are working on the same groups.

This type of data conflict was a hot topic at the conference, especially among the fish people. Having a database of synonyms would certainly help getting rid of the legitimate synonyms, but the other problems will require more work. A community-based curation and ranking system for the quality of the supporting data was proposed, and BOLD 3.0 already offers a Wiki-like annotation feature. A great idea, but will end users (e.g. regulatory agencies) really understand the technicalities, and will project managers bother to actively maintain their records after the manuscript has been published and they move onto the next project/job? It's a lot easier to upload some dodgy data than it is to prove someone else's data are dodgy.

I think one of the keys lies in access to literature. Getting hold of taxonomic literature is as good as impossible for many groups, yet thoroughly demonstrating the characters used to identify your specimens will make the whole system more transparent and reliable. Conflicts cannot be resolved without universal access to this literature. But ultimately, the best prevention lies with collaboration, and working through identification uncertainties between labs before data are uploaded as reference specimens.

* How would we ever know that Cypriniformes BIN AAF7369 shows "spectacular morphological novelty" from its COI sequence. Even though most of the big or important creatures have now been described, I think many startling discoveries are yet to come ...

Wednesday, 27 July 2011

Batch extracting GenBank data from journal articles

Repeatability, one of the central tenets of science. In theory, any published study should be repeatable. But, if anyone actually wants to do this, it's not always as straightforward as it sounds.

Molecular people do have it easy in comparison to say ecologists—we use GenBank, the Web based repository of all things DNA. Simple, just want to re-analyse some data, grab it from GenBank. Not so easy*, and here's an example.

Take the recent cypriniform relationships proposed by Mayden and Chen (2010). Being as polite as possible, their results are "interesting" to say the least. Lets assume we want to get our grubby hands on this data and see whether their conclusions have any meaningful support. They used six genes, and a table listing the accession numbers is presented:

These data are not all generated in this study though, so it makes it tricky to access them from the GenBank Web site. Copying and pasting each of them into GenBank is just not an option, so here's a hack that might just save you some time:

(1) Copy and paste the whole table from the pdf into a text file. Save the text file (e.g. as "input.txt").

(2) Open a terminal session, cd to the directory, and copy this command:

grep -o '[A-Z][A-Z][0-9][0-9][0-9][0-9][0-9][0-9]' input.txt | sed -e ':a;N;$!ba;s/\n/", "/g' -e 's/^/acc <- c("/g' -e 's/$/")/g' > output.txt

Eugh, that looks horrible, but what it does is create a ready made vector (called "acc") of accession numbers, which can be copied straight into R. Now, we assume a few things first though: that the table copied perfectly, and the fonts are compatible between pdf and txt; that the accessions are common eight digit GenBank accessions—some of the older GenBank accessions may be six digits (the regex can be modified though).

(3) Now, copy the output straight into R. We could use ape's read.GenBank function, but we would like to access the gene names too**, so we will use Samuel Brown's read.GB function instead.

Run the following R code to download the data and add taxon labels:

dat <- read.GB(acc)
names(dat) <- paste(acc, "|", attr(dat, "gene"), sep="")

(4) Now, this dumps us with all of the data in one vector, but we really want to analyse the seperate loci (e.g. rhodopsin). No worries, grep comes to the rescue again by grabbing all the sequences with "rhodopsin" in the gene attribute:

rho <- grep("rhodopsin", names(dat))
rhoSeqs <- dat[rho,]

The names ascribed to genes in GenBank from different studies may not be consistent, so make sure you check that your identifying phrase will work as expected. Now all you need to do is write this into a fasta file, align it, and you're away.

write.dna(rhoSeqs, file="mayden2010_rho.fas", format="fasta", colw=10000)


* It really should be mandatory that researchers use services such as TreeBASE to upload their alignments.

** read.GenBank in ape 2.7 now has an optional "gene" attribute, but I couldn't get it to work ...

Thursday, 21 July 2011

Importing pdf R plots into Inkscape: the "q" problem

I know some people like to do absolutely everything in R (it's like an admission of failure if they can't), but my approach is somewhat more pragmatic; I like to get things done as quickly and as easily as possible.

As is my usual way, I get most of my plot basics done in R, then touch them up and move things around afterwards in Inkscape. This really does make life simple, but recently I came unstuck with this method.

R was exporting my plot just fine, and the pdf looked okay, but when I imported it into Inkscape, the points had all turned into the letter "q". I think this a a fairly well known problem, and is caused by R using letters (i.e. "o") as circles in the plot, but Inkscape not being able to deal with the same fonts.

There are multitude of fixes out there on the Web, but here is what worked for me, and took about 30 seconds. Instead of using pdf as output, try postscript. The code is as follows:


Inkscape can now import the eps file, and the points are nicely rendered as circles. I would be interested to know if this works across other platforms using other font packages ...

Monday, 16 May 2011

Dealing with large R plots

Had a small problem today. I was preparing several plots for a manuscript, but there was a lot of number crunching involved. One plot in particular comprised over 1.5 million data points! My preferred format of choice from saving out of R is as .pdf, because this vector-based format is scalable at any resolution, searchable, and easily compiled into another document by pdfLaTeX. However, due to the amount of data the pdf ended up over 25mb—a little too large! I could have gone down the low-quality raster route (e.g. as a .png) to reduce file size, but didn't really want to due to the crappy, pixelated appearance of the text.

What I needed was a compromise. Normally, I like to get the bulk of a plot done in R, and then polish it up a bit in Inkscape (it's so much easier this way). I adopted a similar approach here. Using the ggplot2 package, I removed all extraneous axis labelling using options including:

qplot() + opts(axis.title.x=theme_blank(), axis.title.x=theme_blank())

I also did the same when not using ggplot2, with the standard R graphics commands:

plot(dat, xlab="", ylab="", xaxt="n", yaxt="n")
axis(side=1, labels=FALSE)
axis(side=2, labels=FALSE)

Then, I exported it as a high quality (600 dpi) png:

png(file="file.png", width=11.7, height=11.7, units = "in", res=600)

This resulted in a series of plain plots something like this:

Next, I simply imported this png into Inkscape and added the axis labels myself as text. This was then saved as .svg (master copy), and then copies exported as .pdf.

The resulting file was much smaller at under 1mb (could have probably made it smaller still), and retained the all-important vector graphic text. The actual plotting was still embedded as raster, but it was high enough quality and I could live with that.