Boops boops: DNA barcoding

Friday, 28 September 2012

Self publishing "failed" thesis chapters on Figshare

Sometimes in life, things just don't work out, and this is especially the case when doing scientific research. Experiments fail, you run out of time or money, you don't collect as much data as you wanted, you get a boring negative result, the conclusions are littered with caveats, or maybe the idea was just a duff one in the first place. Unfortunately, one of my thesis chapters ended up suffering from pretty much all of these problems, but is the time I spent on it now wasted?

Perhaps not. The Web site Figshare was set up by a "frustrated Imperial College PhD student" and it looks great (not that I'm biased you understand). It's a "community-based, open science project", allowing "researchers to publish all of their research outputs in seconds in an easily citable, shareable and discoverable manner".

Although I felt this chapter was not of the quality, rigour, and interest expected by a peer-reviewed journal, there are still elements I think would be useful in the public domain (particularly to aquarists). More importantly though, by putting it in the public domain, an editor, a reviewer, or even I don't have to make that subjective decision. This is a bit like the PLoS ONE model of publishing, except without the all-important peer review stage to check that the science is sound. Seeing as I don't really have any strong conclusions other than "more work is required", I can't see much of a problem there.

A hybrid Synodontis catfish. Image used with permission (Mike Norén).

The study investigates a simple way to find out whether an aquarium fish is a hybrid or not. Hybrid fishes are quite commonly sold in the ornamental trade (especially African Synodontis catfishes), and this has implications for biosecurity agencies, who have a responsibility to know which exotic organisms are entering their country. There is also the possibility of fraud, with these "fakes" often passed off as high-value species such as Synodontis granulosa. Finding experts experienced enough to know what they are is hard, and often all they are able to do is make an educated guess based on a photo. One solution is using DNA.

Given a good reference library, mitochondrial DNA will tell you what the maternal species is, but will not by itself give you an indication that the fish is a hybrid, or what the paternal species is. Enter nuclear DNA. Microsatellites or SNPs are the best options, but these are too expensive and time-consuming for a simple at-the-border test.

What I tried to do was see if a single nuclear gene could give me what I wanted. Results were mixed. It worked nicely for the control (hybrid danios bred in the lab), and some purchased hybrids too. However, for various unexplored reasons, it didn't work so well for the Synodontis (which was really the aim here).

Anyway, see for yourself at http://dx.doi.org/10.6084/m9.figshare.96149. Comments are welcome; if they are about self publishing, add them to this blog, if they are about the manuscript use the comment feature on Figshare, and if they are on catfish hybrids, then please add them to the PlanetCatfish discussion thread on the subject.

Tuesday, 6 December 2011

Danio rerio: five species in one ... BIN!

So, I've just got back from the 4th International Barcode of Life Conference in Adelaide. An enjoyable time was had by all, and there's plenty to think about. Now, if you don't quite understand the title of this blog post, bear with me, and hopefully all will be explained by the end. There were three main themes I got from the conference, and I will try to draw them together.

Bonython Hall, University of Adelaide

Data access

We heard this again and again. Having data languishing in private projects is helping nobody, but publishing on other people's hard-collected data is certainly not cool either. The "Fort Lauderdale Agreement" aims to strike a compromise between the two, and allow fair use where appropriate. As an incentive for the rest of us, leading researchers and museums will be releasing significant barcode datasets very soon.

A problem with early data release is the massive accumulation of sequences on GenBank without proper binomials; these have been termed "dark taxa" by Prof. Rod Page in his thoughtful blog post on the subject. Many of these data have come from BOLD. This has caused something of a problem for GenBank, especially where taxon names had subsequently been changed on BOLD. It was announced that a system of phases is to be introduced to differentiate data with different levels of annotation. The "phase zero" data, with very little information other than the sequence, will be "cleansed" off GenBank soon (removed from searches, but remaining in the system). BOLD and GenBank databases are now expected to update each other more regularly too.

However, in answer to Rod's question of what can we do with "bad data" like this, we saw several excellent presentations on the kind of science that can be done on large datasets even without taxonomic names (I will try to get some links up to the videos when they are available).

BINs (barcode index numbers)

These were unveiled with perhaps a little less fanfare than expected given their importance; they had apparently been around since the last barcode conference two years ago, but have only now been made visible in BOLD 3.0 beta.

They are essentially clusters recognised by BOLD as putative species or species-like groups, independent of the taxonomic name system. Importantly, they are indexed and can be treated just like taxonomic names (i.e., created, stored and synonymised). I think a system like this is required, due to the fact that modern biodiversity science is as much a problem of information management as it is of species concepts and taxon definitions.

They offer many attractive advantages by: (1) linking sequences together with taxonomic names, literature, databases and museum vouchers; (2) simplifying the identification process; (3) tracking conflicting identifications and species with interim code names; and (4) offering scalable assessment of biodiversity.

Although announced as an "interim" taxonomic system, I can't help but think that this endeavour may obviate the need for Linnaean names altogether in many groups. This could particularly be the case where one is more interested in, say, broad phylogenetic patterns across geographic areas or ecological guilds. It will now be all but impossible for "traditional taxonomy" to catch up with these BINs given the rate at which barcode data are now generated. Those who believe taxonomy is but a "service industry" to other branches of science will rejoice, as there is now the potential for a rapid, semi-automated, and fully scalable biodiversity assessment tool commensurate with the challenge at hand. Therefore there may no longer be room to argue that traditional taxonomy is required to document our deteriorating world. Those who prefer a "whole organism" approach may not be so impressed. The onus is perhaps on them now to justify why such a holistic science is valuable in the short to medium term. Of course the reasons are obvious to me*, but it may be a hard sell in today's output-driven world.

Specifically, some issues also need to be ironed out with the BIN framework, particularly the repeatability of these clusters, as the algorithms under which they were generated are yet to be published and scrutinised, despite BOLD 3.0 going live and effectively hitting the detonate button.

Conflicting IDs

Now, this issue of BINs brings me nicely back to the title. In case you didn't get it, it's a play on the paper in PNAS entitled "Ten species in one: DNA barcoding reveals cryptic species in the Neotropical skipper butterfly Astraptes fulgerator". There the authors reported cryptic diversity in a widely dispersed species.

In contrast, here the problem is that currently the BIN for the zebrafish Danio rerio contains five different binomials! Given that of all 40,000+ fishes this species is arguably the one we humans know most about, this is perhaps surprising and worrying. One record was D. rerio proper, one was labelled D. cf. rerio, another was a legitimate synonym of D. rerio, another was what looked like a misspelling of a legitimate synonym of D. rerio, and the last was labelled Xiphophorus hellerii, a fish in a completely different order! Some of the public D. rerio records were just identified as "Cypriniformes sp.".

Danio cf. rerio (BIN AAE3739)

This certainly calls into question the utility of barcoding for regulatory purposes such as detecting seafood substitution, or monitoring invasive aquarium fish imports. Non-biologist regulators will be relying on good barcode reference libraries, and may end up acting conservatively, e.g. by rejecting all imports of aquarium zebra danios because BOLD was unable to give an unambiguous ID to species level. In a presentation by Dr Bob Hanner, it was estimated that for fishes, one in ten BINs contains more than one species. This, I can only assume, will rise, especially where a number of labs are working on the same groups.

This type of data conflict was a hot topic at the conference, especially among the fish people. Having a database of synonyms would certainly help get rid of the legitimate synonyms, but the other problems will require more work. A community-based curation and ranking system for the quality of the supporting data was proposed, and BOLD 3.0 already offers a Wiki-like annotation feature. A great idea, but will end users (e.g. regulatory agencies) really understand the technicalities, and will project managers bother to actively maintain their records after the manuscript has been published and they move on to the next project/job? It's a lot easier to upload some dodgy data than it is to prove someone else's data are dodgy.

I think one of the keys lies in access to literature. Getting hold of taxonomic literature is as good as impossible for many groups, yet thoroughly demonstrating the characters used to identify your specimens will make the whole system more transparent and reliable. Conflicts cannot be resolved without universal access to this literature. But ultimately, the best prevention lies with collaboration, and working through identification uncertainties between labs before data are uploaded as reference specimens.

* How would we ever know that Cypriniformes BIN AAF7369 shows "spectacular morphological novelty" from its COI sequence? Even though most of the big or important creatures have now been described, I think many startling discoveries are yet to come ...

Tuesday, 21 December 2010

A method of photographing and preserving fishes for molecular studies

Voucher specimens are important in molecular studies, maybe almost as important as they are in morphological studies. A good voucher will be useful to both molecular and morphological research for many years to come; it will allow any misidentified specimens to be easily corrected, and will permit any interesting molecular results to be effectively corroborated with morphology.

But generating good vouchers in molecular studies is hard. Formalin, the fixative chemical of choice for ichthyologists, degrades DNA and makes extraction/PCR difficult (but see Zhang, 2010). Instead, ethanol can be used as a fixative, but ethanol-fixed specimens are often brittle, faded, and of poorer long-term quality.

It's often best to take a tissue sample from your specimen, store this in ethanol, and formalin fix the rest of the fish as a voucher. This is fine, but you'll want to know which tissue sample comes from which specimen, and for small fishes it's not possible to permanently attach the label to the specimen without causing damage. Of course, you could put them all in individual jars, but you could soon run out of jars or space. Transporting them is a big problem too, and this is where you really need to save space.

So, after trying out some quite unsatisfactory methods, and ruining many good specimens, I have developed a nice method of generating quality molecular vouchers:

Step 1. Fill vials for tissue samples with high-grade 100% ethanol. Label the tubes internally with pencil on archive quality "goatskin" paper, and externally with marker pen. The vouchers can be kept separate using small polythene zip-seal bags. They need to be perforated first, however, with a paper hole punch (do several at a time). They should also have their bottom corners cut off to allow the bags to drain. Place another label in the bag.

Step 2. Get everything ready in advance. Here I have:

Latex gloves
10% formalin (clearly labelled)
MS-222 (fish anaesthetic)
Spirit burner to decontaminate tools
Variety of forceps and scalpel
Pencil
Squares of cardboard to use as a clean surface for tissue preparation
Vials for tissue samples
Bags for voucher

Step 3. Assemble your light source and photo rig. Here I use an adjustable microscopy light (halogen desk lamps can be substituted) and a shallow white tray. I used a piece of folded graph paper as a scale for these photos. Now, mix an overdose of MS-222 with water in a shallow clear tray (the lid of a tube rack), and add the fish (wait for 10 mins to ensure death). Make sure the fish is only just covered.

Step 4. Adjust the light angle and photograph the left-hand side of the fish, always adding the label. Remember to set your camera's white balance correctly (usually using the custom mode). The picture can then be cropped and the file name changed.

Step 5. Take the fish out of the solution and place it on the card sheet. Use the scalpel to carefully excise a tissue sample from the right-hand side of the fish. Pectoral fin clips can also be taken to cause less damage, but on small fishes this won't yield much tissue, and using mitochondrion-rich muscle may reduce the likelihood of amplifying numts, i.e. nuclear copies of mitochondrial DNA (Song et al., 2008).

Note: don't cut from the caudal peduncle area if characters such as caudal peduncle scale counts may be important for identifying your fish.

Step 6. Next, place the fish into the plastic bag with the forceps, and place into the formalin. The position of the fish and fins can be manipulated through the holes in the bag with the forceps. This ensures the fish is not bent and the fins are not folded down.

Step 7. Throw away the card sheet and replace it with a new one. Clean the implements with a wet tissue and then sterilise them with the spirit burner. Repeat the process for the rest of the specimens.

Step 8. Leave vouchers in formalin for approximately three days (longer for larger fishes). After three days, remove from formalin and wash thoroughly with water. Leave in water for 24 hours to dilute remaining formalin. Place into weak 35% alcohol (ethanol or clear methylated spirit) solution for three days before final storage in 70% alcohol. The voucher will have lost a lot of its colour by now, but can be photographed again to document the preserved colour pattern.

Of course, these bags have not been tested for long-term (i.e. indefinite) storage, and are only recommended as a temporary (<5 yr) storage or transport solution.

In addition, although I haven't yet tested it, this method could hopefully be adapted for use in the field.

Sunday, 10 October 2010

Negative branch lengths in neighbour-joining analyses

A recent analysis of some fish COI data revealed these really odd branch tips going backwards on the tree. I'd never heard of this before and neither had most people I asked. So what are they, what causes them, and how do I get rid of them?

NJ phylogram of cyprinid COI sequences

Well, they seem to be an artefact of the stepwise NJ clustering algorithm and how it adds new branches. I don't pretend to know the details, but it seems to occur on most analytical platforms (e.g. R, MEGA, PAUP*). These negative branches don't really have any biological meaning, so it is best to remove them before the tree is presented. Ideally their length would be redistributed to adjacent branches, but practically this isn't feasible. Apparently, PAUP* is able to deal with this, but I didn't have any joy when I tried. Setting them to zero seems the most justified approach at this stage. Of course the original distance matrix on which the NJ tree is based remains the same, so identifications should be checked against the distance data, rather than by looking at just the tree.
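For what it's worth, here is a minimal sketch of how a negative branch can arise, assuming the usual textbook NJ branch-length formula: half the distance between the two taxa being joined, plus a correction based on how far each of them is, in total, from everything else. The four-taxon distance matrix is made up for illustration, and the names d, r, len_A and len_B are hypothetical; this is not the cyprinid dataset, and not necessarily how nj() in ape computes things internally.

```r
# Hypothetical pairwise distances for four taxa (made-up numbers).
# A and B are nearly identical, yet B is much further from C and D
# than A is, so the distances do not fit a tree perfectly.
taxa <- c("A", "B", "C", "D")
d <- matrix(c(0.00, 0.01, 0.10, 0.10,
              0.01, 0.00, 0.15, 0.15,
              0.10, 0.15, 0.00, 0.05,
              0.10, 0.15, 0.05, 0.00),
            nrow = 4, byrow = TRUE, dimnames = list(taxa, taxa))

n <- nrow(d)
r <- rowSums(d)  # total distance from each taxon to all the others

# Branch lengths from A and B to the new node formed when they are joined
len_A <- unname(d["A", "B"] / 2 + (r["A"] - r["B"]) / (2 * (n - 2)))
len_B <- unname(d["A", "B"] - len_A)

print(c(A = len_A, B = len_B))  # A is negative (about -0.02), B is about 0.03
```

In this toy case, B's total distance to the other taxa exceeds A's by more than the A-B distance allows, so the correction term overshoots and A's branch drops below zero.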

Using the ape package in R, my simple workaround is to generate a neighbour-joining tree as usual with the nj() function. Next I save this tree object to file (in Newick parenthetical format) using the write.tree() command, and open the Newick text file in a text editor (I use SciTE). To remove all the negative branch lengths and replace them with zero, you need to use a regular expression search and replace. This is quite a powerful feature if you know how to use it. As well as normal negative numbers (e.g. -0.0005558445244), ape also writes really short negative branches in exponent notation (e.g. -5.199093188e-17), so these need to be dealt with too. First off you need to replace the exponent strings with 0 by searching for:

-\d\.\d*e-\d\d

Next, the normal negative number can be addressed with:

-\d\.\d*

I won't go into the details of how these instructions work, but a good tutorial on regular expressions can be found here. Make sure you have no other hyphens in your Newick file that may interfere (e.g. in the taxon labels), and always test it first with just find before you replace all and save. Now your modified Newick file can be reloaded into R with read.tree() and drawn with plot(). Hopefully someone will eventually develop a more sophisticated way of dealing with this natively in R.
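The same two-step search and replace can be sketched in base R with gsub(), rather than in a text editor. This is only an illustration: the Newick string and the variable names (newick, step1, fixed) are hypothetical, and the backslashes are doubled because they sit inside an R string.

```r
# Hypothetical Newick string containing one ordinary negative branch
# length and one written in exponent notation.
newick <- "((A:0.012,B:-0.0005558445244):0.003,(C:0.020,D:-5.199093188e-17):0.004);"

# Step 1: replace the exponent-style negative numbers with 0
step1 <- gsub("-\\d\\.\\d*e-\\d\\d", "0", newick, perl = TRUE)

# Step 2: replace the remaining ordinary negative numbers with 0
fixed <- gsub("-\\d\\.\\d*", "0", step1, perl = TRUE)

print(fixed)  # "((A:0.012,B:0):0.003,(C:0.020,D:0):0.004);"
```

As with the text editor version, this assumes there are no hyphens anywhere else in the string (e.g. in taxon labels) that the patterns could catch.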

EDIT 18.10.10 ...
And sure enough, yes, there is a much easier way of doing this straight in R. Simply create or load your tree object, and then access the branch (edge) lengths with the $ operator, replacing all the negative values with zero.

TREEOBJECT$edge.length[TREEOBJECT$edge.length < 0] <- 0

Many thanks to Samuel Brown for pointing this out. For the R-phobic, the more long-winded approach posted above can still be used for trees produced in other programs such as MEGA or PAUP*.