Sunday, 10 October 2010

Negative branch lengths in neighbour-joining analyses

A recent analysis of some fish COI data revealed these really odd branch tips going backwards on the tree. I'd never heard of this before and neither had most people I asked. So what are they, what causes them, and how do I get rid of them?

[image: NJ phylogram of cyprinid COI sequences]
Well, they seem to be an artefact of the stepwise NJ clustering algorithm and how it adds new branches. I don't pretend to know the details, but it seems to occur on most analytical platforms (e.g. R, MEGA, PAUP*). These negative branches don't really have any biological meaning, so it is best to remove them before the tree is presented. Ideally the negative length would be redistributed to the adjacent branches, but in practice this isn't feasible. Apparently PAUP* is able to deal with this, but I didn't have any joy when I tried. Setting them to zero seems the most justified approach at this stage. Of course, the original distance matrix on which the NJ tree is based remains the same, so identifications should be checked against the data rather than just by looking at the tree.
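For anyone curious where the minus sign comes from, here is a minimal sketch in base R of the usual NJ branch-length calculation for one pair of tips. The distance matrix is made up for illustration (it is not the fish data), and the sketch simply assumes that A and B are the pair being joined. When two taxa are very close to each other but one of them sits noticeably further from everything else, the correction term can outweigh half the pairwise distance and push one branch below zero.

```r
# Toy distance matrix for four taxa (invented values, for illustration only).
# B is a bit "too far" from C and D given how close it is to A.
taxa <- c("A", "B", "C", "D")
d <- matrix(c(0.0, 0.1, 1.0, 1.0,
              0.1, 0.0, 1.3, 1.3,
              1.0, 1.3, 0.0, 1.0,
              1.0, 1.3, 1.0, 0.0),
            nrow = 4, byrow = TRUE,
            dimnames = list(taxa, taxa))

n <- nrow(d)
r <- rowSums(d)  # summed distance from each taxon to all the others

# Standard NJ branch lengths from tips A and B to their new shared node
limb_A <- unname(0.5 * d["A", "B"] + (r["A"] - r["B"]) / (2 * (n - 2)))
limb_B <- d["A", "B"] - limb_A

limb_A  # negative (about -0.1)
limb_B  # positive (about 0.2)
```

The two branches still add up to the original A-B distance of 0.1, which is why zeroing the negative one changes the look of the tree but not the underlying distance matrix.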

Using the ape package in R, my simple workaround is to generate a neighbour-joining tree as usual with the nj() function. Next I save this tree object to file (in Newick parenthetical format) using the write.tree() command, and open the Newick text file in a text editor (I use SciTE). To replace all the negative branch lengths with zero, you need to use a regular expression search and replace. This is quite a powerful feature if you know how to use it. As well as normal negative numbers (e.g. -0.0005558445244), ape also writes really short branches in exponent notation (e.g. -5.199093188e-17), so these need to be dealt with too. Deal with the exponent form first, otherwise the second pattern will match only the front part of these numbers and leave a stray e-17 behind. Search for the following pattern and replace each match with 0:
-\d\.\d*e-\d\d
Next, the normal negative number can be addressed with:
-\d\.\d*
I won't go into the details of how these instructions work, but a good tutorial on regular expressions can be found here. Make sure you have no other hyphens in your Newick file that may interfere (e.g. in the taxon labels), and always test the pattern with plain find before you replace all and save. The modified Newick file can then be read back into R with read.tree() and plotted with plot(). Hopefully someone will eventually develop a more sophisticated way of dealing with this natively in R.
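If you would rather not leave R for the text editor step, the same search and replace can be sketched with base R's gsub(). This is only a rough equivalent of the two patterns above, not a tested replacement for them: the Newick string is a made-up example using the two numbers quoted earlier, and the pattern is my own variation that only matches a negative number sitting directly after a colon, so hyphens in taxon labels should be left alone.

```r
# Hypothetical Newick string with one ordinary negative branch length and
# one negative branch length in exponent notation
newick <- "((A:0.012,B:-0.0005558445244):0.03,(C:-5.199093188e-17,D:0.02):0.01);"

# A colon, a minus sign, digits, an optional decimal part, and an optional
# exponent. Anchoring on the colon avoids touching hyphens in taxon labels.
pattern <- ":-\\d+\\.?\\d*([eE][-+]?\\d+)?"

# Replace every negative branch length with zero in one pass
fixed <- gsub(pattern, ":0", newick, perl = TRUE)

fixed
```

For a tree saved with write.tree(), the string could be read in with readLines() and written back out with writeLines() before reloading it with read.tree(). Because the exponent is handled as an optional part of the same pattern, the order-of-replacement issue does not arise here.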

EDIT 18.10.10 ...
And sure enough, there is a much easier way of doing this directly in R. Simply create or load your tree object, then access the branch (edge) lengths with the $ operator and replace all the negative values with zero:
TREEOBJECT$edge.length[TREEOBJECT$edge.length < 0] <- 0
Many thanks to Samuel Brown for pointing this out. For the R-phobic, the more long-winded approach posted above can still be used for trees produced in other programs such as MEGA or PAUP*.
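To put that one-liner in context, here is a small sketch of the whole round trip in ape. The tree is a made-up four-taxon example typed in as Newick text rather than a real nj() result, and the object names are just placeholders. It also counts the negative branches first, which is a handy check on how much of the tree is being altered.

```r
library(ape)

# Made-up tree with two negative branch lengths, standing in for nj() output
tree <- read.tree(text = "((A:0.012,B:-0.0005):0.03,(C:-5.2e-17,D:0.02):0.01);")

# How many branches are affected?
n_neg <- sum(tree$edge.length < 0)

# Set every negative branch length to zero
tree$edge.length[tree$edge.length < 0] <- 0

# With no file argument, write.tree() returns the Newick string
zeroed <- write.tree(tree)
zeroed
```

The cleaned tree object can then go straight to plot(), or be saved with write.tree() by supplying a file name.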


Why do I have a blog?

Is it because I'm putting off more important things; is it because everyone else has one now; is it because Samuel Brown has one and I copy everything he does; or is it because it's a handy record of stuff I would otherwise forget? Well, all of the above really, but mostly because blogs are a jolly helpful resource for other people. I'm always googling various questions, and it's so useful when people have taken the time to post up their solutions. If I spend a day investigating a problem, it's nonsensical that others should waste their time repeating the same process when the answer could be just a couple of clicks away.
 
And the name? Well, I had really wanted "The Blogfish" in reference to my mascot, the charismatic blobfish, but predictably that domain had been snapped up ages ago. Instead, I plumped for possibly the most amusingly named of all fishes, Boops boops. Unfortunately it's a relatively unremarkable-looking sparid, but I like the name nonetheless.
 
Boops boops [image: http://www.eol.org/pages/203866]
 
What'll be on this blog? I dunno, mostly boring science stuff. Sorry. If you are still reading, however, what better place to start than with a relatively simple method of dealing with negative branch lengths in neighbour-joining analyses. Enjoy ...