
Wednesday, 27 July 2011

Batch extracting GenBank data from journal articles

Repeatability is one of the central tenets of science. In theory, any published study should be repeatable. But if anyone actually tries to repeat one, it's not always as straightforward as it sounds.

Molecular people do have it easy compared with, say, ecologists: we have GenBank, the web-based repository of all things DNA. Simple, then: want to re-analyse some data, just grab it from GenBank. Not so easy*, and here's an example.

Take the recent cypriniform relationships proposed by Mayden and Chen (2010). Being as polite as possible, their results are "interesting" to say the least. Let's assume we want to get our grubby hands on these data and see whether their conclusions have any meaningful support. They used six genes, and a table listing the accession numbers is presented:


Not all of these data were generated in that study, though, which makes them tricky to access from the GenBank web site. Copying and pasting each accession into GenBank one at a time is just not an option, so here's a hack that might just save you some time:

(1) Copy and paste the whole table from the pdf into a text file. Save the text file (e.g. as "input.txt").

(2) Open a terminal session, cd to the directory, and run this command:

grep -o '[A-Z][A-Z][0-9][0-9][0-9][0-9][0-9][0-9]' input.txt | sed -e ':a;N;$!ba;s/\n/", "/g' -e 's/^/acc <- c("/g' -e 's/$/")/g' > output.txt

Eugh, that looks horrible, but what it does is create a ready-made vector (called "acc") of accession numbers that can be copied straight into R. We assume a few things, though: that the table copied perfectly and the fonts are compatible between pdf and txt, and that the accessions are the common eight-character GenBank accessions (two letters followed by six digits). Some of the older GenBank accessions may be shorter, but the regex can be modified to suit.
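As a sanity check, the pipeline can be tried on a fabricated sample first. A sketch, with entirely made-up species and accession numbers standing in for the pasted table:

```shell
# Made-up sample of pasted table text (the accessions are hypothetical)
printf 'Danio rerio AB123456 rhodopsin\nCyprinus carpio CD654321 RAG1\n' > input.txt

# Same pipeline as above; widen the character class or digit count if the
# table also contains older-style accessions
grep -o '[A-Z][A-Z][0-9][0-9][0-9][0-9][0-9][0-9]' input.txt | \
  sed -e ':a;N;$!ba;s/\n/", "/g' -e 's/^/acc <- c("/g' -e 's/$/")/g' > output.txt

cat output.txt
# acc <- c("AB123456", "CD654321")
```

Note the sed label syntax (`:a;N;$!ba`) is GNU sed; on BSD/Mac sed the labels need to be split across separate -e expressions.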

(3) Now, copy the output straight into R. We could use ape's read.GenBank function, but we would like to access the gene names too**, so we will use Samuel Brown's read.GB function instead.

Run the following R code to download the data and add taxon labels:

dat <- read.GB(acc)
names(dat) <- paste(acc, "|", attr(dat, "gene"), sep="")

(4) This leaves us with all of the data in one object, but we really want to analyse the separate loci (e.g. rhodopsin). No worries: grep comes to the rescue again, grabbing all the sequences with "rhodopsin" in the name:

rho <- grep("rhodopsin", names(dat))
rhoSeqs <- dat[rho]

The names ascribed to genes in GenBank records from different studies may not be consistent, so check that your identifying phrase works as expected. Now all you need to do is write this to a fasta file, align it, and you're away.

write.dna(rhoSeqs, file="mayden2010_rho.fas", format="fasta", colw=10000)
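Because GenBank gene annotations are rarely uniform, it is worth eyeballing the labels actually present in the written file before aligning. A quick shell sketch, assuming headers in the "accession|gene" format produced above (the file contents here are invented for illustration):

```shell
# Invented fasta file using the "accession|gene" header format from above
printf '>AB123456|rhodopsin\nACGTACGT\n>CD654321|rhodopsin, partial cds\nTTGACCAA\n>EF111111|RAG1\nGGCCTTAA\n' > seqs.fas

# Pull out the gene part of each header and tally the naming variants
grep '^>' seqs.fas | cut -d'|' -f2 | sort | uniq -c
```

Any unexpected variants (e.g. "rhodopsin, partial cds" alongside plain "rhodopsin") will show up in the tally, and the grep phrase can be adjusted accordingly.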

---------------------------------------------------------------------------------

* It really should be mandatory that researchers use services such as TreeBASE to upload their alignments.

** read.GenBank in ape 2.7 now has an optional "gene" attribute, but I couldn't get it to work ...

Sunday, 10 October 2010

Negative branch lengths in neighbour-joining analyses

A recent analysis of some fish COI data revealed these really odd branch tips going backwards on the tree. I'd never heard of this before and neither had most people I asked. So what are they, what causes them, and how do I get rid of them?

NJ phylogram of cyprinid COI sequences
Well, they seem to be an artefact of the stepwise NJ clustering algorithm and how it adds new branches. I don't pretend to know the details, but it occurs on most analytical platforms (e.g. R, MEGA, PAUP*). These negative branches don't have any biological meaning, so it is best to remove them before the tree is presented. Ideally they would be redistributed to adjacent branches, but practically this isn't feasible. Apparently PAUP* is able to deal with this, but I didn't have any joy when I tried, so setting them to zero seems the most justifiable approach at this stage. Of course, the original distance matrix on which the NJ tree is based remains the same, so identifications should be checked against the data rather than by looking at just the tree.

Using the ape package in R, my simple workaround is to generate a neighbour-joining tree as usual with the nj() function. Next I save this tree object to file (in Newick parenthetical format) using the write.tree() command, and open the Newick text file in a text editor (I use SciTE). To remove all the negative branch lengths and replace them with zero, you need a regular expression search and replace, which is quite a powerful feature if you know how to use it. As well as ordinary negative numbers (e.g. -0.0005558445244), ape also writes negative exponents for really short branches (e.g. -5.199093188e-17), so these need to be dealt with too. First, replace the exponent form by entering:
-\d\.\d*e-\d\d
Next, the normal negative number can be addressed with:
-\d\.\d*
I won't go into the details of how these expressions work, but a good tutorial on regular expressions can be found here. Make sure you have no other hyphens in your Newick file that may interfere (e.g. in the taxon labels), and always test with find first before you replace all and save. Now your modified Newick file can be reloaded into R and plotted using the respective read.tree() and plot() commands. Hopefully someone will eventually develop a more sophisticated way of dealing with this natively in R.
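The same search-and-replace can also be scripted with sed rather than done by hand in an editor. A sketch on a made-up Newick string (branch lengths invented); as above, the exponent form must be handled before the plain form:

```shell
# Made-up Newick tree containing both styles of negative branch length
printf '((A:0.01,B:-0.0005558445244):0.002,C:-5.199093188e-17);' > tree.nwk

# Exponent-form negatives first, then plain negatives; order matters,
# otherwise the plain pattern would leave a stray "e-17" behind
sed -E -e 's/-[0-9]+\.[0-9]+e-[0-9]+/0/g' -e 's/-[0-9]+\.[0-9]+/0/g' \
  tree.nwk > tree_fixed.nwk

cat tree_fixed.nwk
# ((A:0.01,B:0):0.002,C:0);
```

The same caveat applies: any hyphens in taxon labels would also be caught, so check the result before reloading it.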

EDIT 18.10.10 ...
And sure enough, yes, there is a much easier way of doing this straight in R. Simply create or load your tree object, then access the branch (edge) lengths with the $ operator and replace all negative values with zero.
TREEOBJECT$edge.length[TREEOBJECT$edge.length < 0] <- 0
Many thanks to Samuel Brown for pointing this out. For the R-phobic, the more long-winded approach posted above can still be used for trees produced in other programs such as MEGA or PAUP*.