## Friday, 24 April 2015

### Submitting to GenBank

Few things fill the hearts of molecular biologists with as much dread as embarking on a GenBank submission. We all benefit from GenBank, but getting our data up there can be a somewhat onerous process. I am sure that if this community-minded attitude were not insisted upon by journals, GenBank would be a rather empty place.

Here, I will try to help with some of the organisational and practical aspects to make the process a little more painless. I will assume that you are familiar with at least the basics of R and running programs on the Unix command line.

Proper Planning and Preparation Prevents Piss Poor Performance

To most, GenBank submission is something of an afterthought, to be done once your paper has been accepted. Here I will try to challenge that, and suggest that you should be putting your data up on GenBank before you submit your manuscript. This is because:
1. You have an opportunity to notice potentially serious errors in your data at a much earlier stage in your project.
2. Your paper will benefit from fewer delays during the revision process, and is likely to get published faster.
3. If you send the GenBank flatfile as information supporting your manuscript, reviewers will have access to your raw data and will be able to review that data.
4. It forces you to get organised up front, which is generally a good thing all round.
5. If you are worried about being scooped, you always have the option of releasing the data after publication.

What are the options?

There are three main ways you can get data onto GenBank: 'BankIt', 'Sequin', and 'tbl2asn'. The main differences are described here, but essentially BankIt is a Web-based submission tool, Sequin is a standalone offline program with a GUI (graphical user interface), and tbl2asn is the scary command line option.

In my experience, BankIt is suitable for trivial uploads of small numbers of sequences, and it is pretty easy to use. Sequin seems to be the most popular method, but I would avoid it, as it is not as straightforward as it seems, and many find it quite confusing and time consuming. I favour tbl2asn, as it means you can script your submission and save yourself considerable time. This is the method I will demonstrate now.

Where to start

First, you will need to get your data into a suitable file format as a master copy. Master copies are important, because when you return to a project after six months, you might want to know which of the 347 fasta files in your folder is the correct one to use. For a master copy, I recommend CSV or TSV, mainly for simplicity's sake. This file should be version controlled of course, using git. The benefit of CSV over, say, fasta or nexus is that an unlimited amount of ancillary information can also be stored in it, including for example: scientific names, higher classifications, geography, GPS coordinates, and specimen voucher data (essentially whatever metadata are available and useful). Try to use a controlled vocabulary here wherever possible (Darwin Core is a good place to start). Having all the data in this format is also a big help when running your analyses and making your figures.

Here's a fictitious example of a format of a master file (much simplified, and do not copy this, as extra tabs were added to visualise). As you can see, I sampled three Boops boops all from the same MNHN museum lot and generated data for both mitochondrial cytb and 16S.

Here, we read the table into R, and then because some of our individuals were not sequenced for both genes, we need to remove those represented with 'NA' (for the cytb gene first). Also remember that when working with text in R, strings should not be factors!
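To make the idea concrete without opening R, here is a minimal command-line sketch of the same filtering step. It is a stand-in for the R step, not a copy of it: the master file is hypothetical, its column names are the Darwin Core terms used later in this post plus one column per gene, and every value (tissue numbers, catalogue number, the very short sequences) is invented for illustration.

```shell
# Scratch directory so nothing real gets overwritten
workdir=$(mktemp -d)

# Hypothetical master file: one row per tissue sample, 'NA' where a gene was not sequenced.
# printf reuses the 7-column format for each row of arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  otherCatalogNumbers institutionCode catalogNumber genus specificEpithet cytb 16S \
  T001 MNHN 2015-0001 Boops boops ATGGCAAGCCTACGAAAAAC CGCCTGTTTACCAAAAACAT \
  T002 MNHN 2015-0001 Boops boops NA CGCCTGTTTACCAAAAACAT \
  T003 MNHN 2015-0001 Boops boops ATGGCAAGCCTACGAAAAAT NA > "$workdir/master.tsv"

# Keep the header, plus every row whose cytb column (found by name) is not NA
awk -F'\t' -v gene="cytb" '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == gene) col = i; print; next }
  $col != "NA"
' "$workdir/master.tsv" > "$workdir/cytb.tsv"

cat "$workdir/cytb.tsv"
```

Looking the column up by name rather than by position means the same command still works if you later add more metadata columns to the master file.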

Gene names

Now, we need to use consistent nomenclature in our submission, which means using the correct gene and product names. Anyone who has ever searched GenBank for specific genes will know how frustrating it is when one gene is known under half a dozen names. Here I looked up the NCBI organelle resources for the official name of cytb, which is 'CYTB' (all in upper-case). The official name of its product—in this case the protein it codes for—is "cytochrome b" (all lower case).

You'll also need to be aware of general usage patterns, as you'll notice that COI is officially called 'COX1' on this list, but very few people use this name when submitting. For nuclear gene names, take a look here and here.

Source modifiers

Here, we write a GenBank formatted fasta file containing our important source modifying annotations, specifically:
1. The 'Sequence_ID', in our case the number of the tissue sample under 'otherCatalogNumbers' in our spreadsheet pasted together with the gene name to make it unique for each gene.
2. Organism name made by pasting 'genus' and 'specificEpithet' fields together.
3. The 'Bio_material' code, which is the code number used in the lab tissue collection, i.e. the same as our 'otherCatalogNumbers' field.
4. The specimen voucher code made by pasting together the 'institutionCode' and the 'catalogNumber' fields.
5. The genomic location of our sequence, in this case it is 'mitochondrion' (leave this blank for nuclear DNA).
6. The genetic code to translate the sequences, here '2' is the mitochondrial genetic code (again, leave this blank for nuclear DNA).

There are many more source modifiers available. See the list here.

Here's the result:
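As a rough illustration of the six annotations listed above, the following sketch builds the definition lines with awk. The input table is a toy one with invented values. The underscore joining tissue number and gene name, and the colon joining institution code and catalogue number, are my assumptions. The bracketed modifier names are written as I understand them from NCBI's source modifier list, so check them against that list before relying on them.

```shell
workdir=$(mktemp -d)

# Toy master file (hypothetical values; cytb only, to keep it short)
printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
  otherCatalogNumbers institutionCode catalogNumber genus specificEpithet cytb \
  T001 MNHN 2015-0001 Boops boops ATGGCAAGCCTACGAAAAAC \
  T002 MNHN 2015-0001 Boops boops NA \
  T003 MNHN 2015-0001 Boops boops ATGGCAAGCCTACGAAAAAT > "$workdir/master.tsv"

# One FASTA record per sequenced individual, with the source modifiers in the definition line
awk -F'\t' -v gene="cytb" '
  NR == 1 { for (i = 1; i <= NF; i++) h[$i] = i; next }   # look columns up by name
  $(h[gene]) != "NA" {
    id = $(h["otherCatalogNumbers"])
    printf ">%s_%s [organism=%s %s] [Bio_material=%s] [specimen-voucher=%s:%s] [location=mitochondrion] [mgcode=2]\n",
      id, gene, $(h["genus"]), $(h["specificEpithet"]), id, $(h["institutionCode"]), $(h["catalogNumber"])
    print $(h[gene])
  }
' "$workdir/master.tsv" > "$workdir/sequences.fsa"

cat "$workdir/sequences.fsa"
```

For a nuclear gene you would drop the location and mgcode modifiers from the format string, as noted in points 5 and 6.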

Feature tables

Now we create the feature table containing the locations of attributes of the sequence. The table is tab separated. Pay attention to the angle brackets '< >', as these signify that the coding sequence starts or ends outside the range of our nucleotides, i.e. if it is a partial sequence that does not code for the complete gene you may need to use the angle brackets. Here, our sequence starts on the first base of cytb, but it is only a partial sequence, so we use the angle bracket at the end only.

We are also assuming here that your coding sequence is in the correct reading frame, in other words that the first nucleotide in the alignment corresponds to the first position of a codon. While it is possible to specify the coding sequence to start on the second or third bases, please never do this, because this information is lost when the sequences are downloaded as fasta files from GenBank, and it then makes the very simple task of aligning your sequences using codon models difficult without manually correcting these errors.

Here's the result:
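A feature table of this kind can be sketched with awk as follows. This assumes the layout I understand tbl2asn to expect: a '>Feature' line carrying the sequence ID, then start, stop and feature type separated by tabs, with each qualifier on its own line after three leading tabs. The input is a single invented row, and the sequence is far shorter than any real cytb fragment.

```shell
workdir=$(mktemp -d)

# Toy input: one hypothetical tissue sample with a 20-base partial cytb sequence
printf '%s\t%s\n' \
  otherCatalogNumbers cytb \
  T001 ATGGCAAGCCTACGAAAAAC > "$workdir/master.tsv"

# The sequence starts on base 1 of the gene but stops short of its end,
# so only the stop coordinate carries an angle bracket
awk -F'\t' -v gene="cytb" -v name="CYTB" -v product="cytochrome b" '
  NR == 1 { for (i = 1; i <= NF; i++) h[$i] = i; next }
  $(h[gene]) != "NA" {
    n = length($(h[gene]))
    printf ">Feature %s_%s\n", $(h["otherCatalogNumbers"]), gene
    printf "1\t>%d\tgene\n\t\t\tgene\t%s\n", n, name
    printf "1\t>%d\tCDS\n\t\t\tproduct\t%s\n", n, product
  }
' "$workdir/master.tsv" > "$workdir/features.tbl"

cat "$workdir/features.tbl"
```

The ID after '>Feature' has to match the ID in the fasta definition line exactly, which is why both are built from the same two fields.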

Repeat for 16S

That's it, all done for cytb. Now we repeat for our 16S sample, and append the new data to the files we already made. Remember we are overwriting our previous objects, but that's okay as these were already written to disk.

Here's the resulting fasta file and feature table for both genes:

Author info template

Next, we need to generate a file for the author/publication information. This is done online, and you simply type the details into the Web form here and download the file into your working directory. Easy.

Running 'tbl2asn'

We have generated three files so far: 'sequences.fsa' contains our DNA sequences and source modifiers, 'features.tbl' contains the locations of the features in our data, and 'template.sbt' contains our author and publication information. Now, we need to pass these files to the 'tbl2asn' program.

The latest version of the program is available from the NCBI via their FTP site (versions exist for Windows, Mac, and Linux). If you find your institution blocks FTP connections such as this, you can find the tool as part of the NCBI's published toolkit—simply 'sudo apt-get install ncbi-tools-bin' to install in Ubuntu—but this version is over a year old, so I do not know if its output would be acceptable for GenBank today. Run this program from the terminal as follows:
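A hedged sketch of the call, wrapped in a small function that first checks the three input files are in place. The function name is made up for this example. The flags are as I understand them from the tbl2asn documentation (-t template, -i fasta file, -f feature table, -a s to treat the fasta as a set of sequences, -V vb to validate and write a GenBank flatfile), so confirm them against the help output of your own version.

```shell
# Hypothetical helper: run from the directory holding the three input files
run_tbl2asn() {
  for f in template.sbt sequences.fsa features.tbl; do
    [ -s "$f" ] || { echo "missing or empty input: $f" >&2; return 1; }
  done
  # Build the command once so it can be either run or just shown
  set -- tbl2asn -t template.sbt -i sequences.fsa -f features.tbl -a s -V vb
  if command -v tbl2asn >/dev/null 2>&1; then
    "$@"
  else
    echo "tbl2asn not found on PATH; would run: $*"
  fi
}
```

Call it with `run_tbl2asn` from your working directory. If the program is not installed yet, it only prints the command it would have run.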

Output

When the program has run (it's quick), it outputs a series of files. This includes one called 'errorsummary.val', which is important, as it contains a list of errors. In this case, we have none, and the file is empty. The two main output files are called 'sequences.gbf', and 'sequences.sqn'. The gbf file is a GenBank flatfile, the same as you would access through GenBank. This file is designed to be human-readable, so you need to review this in order to check you are happy with everything. You can also read the flatfile into Geneious, and also check there for any errors.
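The first of those checks is easy to script. This is a small sketch using the file names given above (the function name is made up), and it does not replace reading the flatfile by eye.

```shell
# Hypothetical helper: run in the directory where tbl2asn wrote its output
check_submission() {
  [ -e errorsummary.val ] || { echo "errorsummary.val not found; did tbl2asn run?"; return 1; }
  # -s is true only if the file has content, i.e. the validator listed something
  if [ -s errorsummary.val ]; then
    echo "validator reported problems:"
    cat errorsummary.val
    return 1
  fi
  for f in sequences.gbf sequences.sqn; do
    [ -s "$f" ] || { echo "expected output missing or empty: $f"; return 1; }
  done
  echo "no errors listed; now read through sequences.gbf"
}
```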

Here's an example:

Submit

If all looks okay, you are ready to submit to GenBank! This is as simple as emailing the 'sequences.sqn' file to GenBank staff (gb-sub@ncbi.nlm.nih.gov). In a couple of days, you should receive via email your shiny new accession numbers, although it may take several further weeks before the sequences go live.

Have a go ...

If you wish to try this out yourself, all the files to repeat this example (except tbl2asn) can be found at https://github.com/boopsboops/genbank-submit. For the time being you will have to adapt these R scripts for your own needs, but one day I might wrap this up into a nice function.

While this is okay for a small-medium sized project, you might want to be thinking about relational databases for more complicated structures linking several projects.

## Wednesday, 5 February 2014

### Making figures for scientific papers

The next few blog posts (when I get round to doing them) are going to be a series of tutorials on common but neglected aspects of the academic process, such as how to make figures (this post), and how to submit/prepare data for GenBank (coming later). Very few students get taught how to do these things, so we just muddle our way through and learn along the way. Considering that these kinds of activities are a central part of our research time, however, a little advice at the early stages can save a lot of effort.

What I won't be covering today though, is how to design good figures in the first place. There's enough information on that out there already. See, for example, the paper "A brief guide to designing effective figures for the scientific paper"; this article runs through dos and don'ts of making a figure to express your results clearly and effectively. What I will talk about, however, is how you actually put your figures together—what tools/processes are used, and how to be organised.

The basics

The really basic stuff is probably worth repeating. There are essentially two types of digital graphics: "raster" graphics (also known as "bitmap") and "vector" graphics. Raster graphics are composed entirely of coloured/shaded pixels, and as a result lose quality when you zoom in. Vector graphics (e.g. SVG, PDF, EPS) are composed of instructions to draw objects, and so are scalable and therefore do not lose quality when zoomed in. Raster graphic formats (e.g. JPG, PNG, TIF) are for photographic images, and vector formats are for graphs, plots, diagrams, and cartoons. Raster images can be inserted into vector files (where they remain as raster), but not vice versa. Importantly in vector plots, the text remains as text. It amazes me how many rasterised phylogenetic trees I still see in journal articles. It might seem a trivial detail, but when you rasterise a tree, the names of the taxa in those trees become invisible to searching, both within the PDF article and by Web search engines.

 Fig 1. Difference between raster and vector graphics

Software

So, what programs do we use to create/edit our figures? Well first of all, I will not be recommending any proprietary software. The reasons are threefold: (a) a lot of students work on their own personal machines, so it's unfair to make them spend their limited money on these expensive software packages; (b) with open-source software you can often use the same program on any operating system (Mac, Windows, Linux); and (c) there's no need, as the free tools are good enough. So, for generating the data plots I use the R statistical package with ggplot2 (the sort of equivalent of Microsoft Excel), for editing raster images I use Gimp (the equivalent of Adobe Photoshop), for maps I use QGIS (the equivalent of ArcGIS), and for putting it all together and making adjustments to vector figures I use Inkscape (the equivalent of Adobe Illustrator). Now I'm not pretending that Gimp, QGIS, and Inkscape are better than their paid-for equivalents, but they are free, and for academic use I think they do pretty much all that is needed.

Before we get started, though, here's one useful piece of advice: go to the Web site of the journal that you're intending to submit your manuscript to, and read the instructions for authors. Each journal has slightly different requirements, and knowing the specifics at the beginning makes things a lot easier. They will tell you the dimensions of the figures, the label formats, the fonts to use, etc. If you don't know where you will submit the article, just do something generic, but importantly, make it easy for yourself to change it at a later date; you will have to. The usual format journals want for vector graphics is EPS (sometimes PDF), while for pure raster graphics, it is usually TIFs at > 300 pixels per inch (ppi, but often also called dpi).

For plots/graphs

So, for a simple plot, my workflow is straightforward. In basic R graphics I just run the code (see below) which saves my plot as a PDF (I'd prefer to save directly as SVG, but R Cairo does not currently support editable text objects in SVG). You may have to make minor changes to the plot (such as trimming up margins, moving the axis labels or changing fonts) before it's good enough for publication. Here, we have two choices: (a) do all the plot polishing in R, and export as a final version EPS (to save as EPS, just change "pdf" to "postscript" in the code below); or (b) save as PDF, import into Inkscape, and make minor adjustments there. Generally I find that it's easier to do this kind of thing in Inkscape, rather than messing around with details too much in R. However, there is a law of diminishing returns here, which means that only very minor changes are worth doing by hand instead of taking the time to perfect the code in R. You will almost certainly have to make the plot several times in the end (co-author doesn't like it, reviewers don't like it, need to submit to a different journal, etc.), so think about this in advance and do the bulk of the work in R. If working on a plot in Inkscape, it is best to save it as a master copy in SVG (the native format of Inkscape), or if it's a big file, as a compressed SVG (SVGZ). You can export to EPS, PDF and a variety of other vector formats in Inkscape via the "File > Save a Copy" menu.

data(mtcars)#load up the car data that comes with R
pdf("carplot.pdf", useDingbats=FALSE)#open PDF file
plot(mtcars$mpg~mtcars$wt)#create plot
abline(lm(mtcars$mpg~mtcars$wt), col="red")#fit linear model
dev.off()#close PDF file

 Fig 2. A basic, no frills R plot showing car MPG (efficiency) as function of car weight.
 Fig 3. After editing the plot in Inkscape to reduce the margins and change the axis font (please never use comic sans, it's just for illustrative purposes).

In a ggplot2 session, I do this (ggplot2 plots are pretty, so usually good-to-go with no editing at all):

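library(ggplot2)#load the ggplot2 package
data(mtcars)#load up the car data that comes with R
p <- ggplot(mtcars, aes(mpg, wt))#create plot (named p to avoid masking base R's c function)
p + stat_smooth(method="lm", se=TRUE) + geom_point()#layers
ggsave("carplotgg.pdf")#save the last plot displayed as PDF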

 Fig 4. A ggplot2 plot.

For mixed figure types such as multi-panel figures

Sometimes you'll need to combine several photographic images onto one figure. Again, I use Inkscape for this, but if I need to make any alterations such as contrast, brightness, or resolution, I do this first in Gimp (Inkscape cannot edit the image in this way). What I don't do is crop the photo in Gimp first. This can be done in Inkscape, and importantly, if you need to change the proportions of the figure, it can be reversed or re-edited at any later date from within the same file. A useful thing worth mentioning at this point is the value of keeping a backed-up history of each version of the figure. This can be really handy, not only for getting out of trouble if you make mistakes, but also to keep your working folders free of dozens of copies of the same file. I recommend looking at git for this purpose.

Now, to import the raster images into Inkscape, go to "File > Import", choose the file and select the "embed" box. Align the imported objects using "View > Grid". You can also resize them by selecting them, and going to the "Object > Transform > Scale" menu, remembering to "scale proportionally" to keep the width/height ratio. Add any arrows or text boxes that are required. The format/colour of these can be changed by selecting the object and going to "Object > Fill and Stroke". A useful thing to know when making arrows in Inkscape, is that you need to go to "Extensions > Modify Path > Colour Markers to Match Stroke" in order to make the arrow head match the arrow shaft.

Journals generally want minimal margins on plots. In Inkscape I use "File > Document Properties > Resize page to drawing or selection" to trim the plot to just a few pixels round each of the edges. Experiment to see what looks best. Here's one I made earlier.

 Fig 5. Two pygmy seahorse images edited onto a single figure with arrows and labels.

Again, I save this as a master copy SVG and work in that format. The figure can be exported as an EPS or PDF for the final version, but pay attention to the "resolution for rasterisation" option of the internal raster elements when you save (journals usually require 300 ppi or more, even if embedded in a vector file). Sometimes the file sizes can end up being quite big here, so what I often do when emailing figures to co-authors, or even when submitting a manuscript at the peer review stage, is to generate a low-resolution raster image for this purpose (high quality images are submitted later on in the review process). You can do this by going to "File > Export Bitmap", selecting the page as the export area, and changing the dpi to, say, 90. This will export a PNG file. Repeat at lower resolution if the file size is still too big. Follow this process if the journal, for some strange reason, requires the figure to be a raster image; if they do not accept PNG, you will need to open Gimp and export the PNG as a TIF.

In conclusion, I hope this provides some useful advice to get people started on making figures for their publications. Please don't hesitate to add any of your own nuggets of advice in the comments.

## Saturday, 17 August 2013

### Writing scientific papers with git and LaTeX

I wrote my last paper using the 'git' version control software.

You may have heard of git, and you may have even downloaded a program or R script from GitHub. It's been around a while (8 years), but it's only recently starting to be used by non-technical types (= biologists!). It's mainly used by programmers and web designers to keep track of changes to their code, but this applies equally to writing a manuscript. The principles are the same.

No matter how organised you are, everyone must have had at some time folders containing files called 'final.doc', 'finalfinal.doc', 'finalfinalfinal.doc', 'finalfinalfinal-version2_july12.doc', 'finalfinalfinal-version2_aug12_TJedit3_submitted.doc'. You get the picture. With git, this is a thing of the past. You have one file for your manuscript, and one file only. Git's magic happens in the background. Should you need to, you can roll back to any previous version of your manuscript, and it instantly changes in your working directory.

One of the things with writing manuscripts is different journals will require different formatting, or even an entirely different structure/focus of your work. Using git, we can accommodate this using the branching functions. To set up a new branch is simple, and it acts effectively as an additional, independent copy of your manuscript (although only the changes are actually stored by git—it's very efficient like that). The beauty is in the way that git allows you to transfer changes from one version to another, or merge them completely should you want.

Let's say you want to submit to Nature, but realistically you have to admit that they're unlikely to publish your important research on the length of ants' legs (but it's worth a try just in case). You branch off from 'master' into a new version called 'nature' and alter all the formatting to their requirements, but it's not the final version and you notice some typos or something more you want to change. It's easy to switch between branches, so you make the corrections, and using the 'cherry-pick' tool in git, you send only these specific changes you made in the nature branch back to master, while ignoring the new formatting. Unfortunately, you get rejected by Nature, and you decide that perhaps the Bulgarian Journal of Myrmecology is more appropriate. No problem, your master is up to date, and you just create another new branch from master (which can also be cherry picked if needed). If you are organised and wrote informative commit messages, you can even do the cherry picking at a much later date.
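That scenario can be played out in a throwaway repository. This is only a sketch: the manuscript text, the author details, and the 'nature' document class are placeholders invented for the example, and the manuscript is written one sentence per line so the formatting change and the typo fix sit well apart in the file.

```shell
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is called 'master'
git config user.name "A. Author"               # placeholder identity for this scratch repo
git config user.email "author@example.org"

# A tiny manuscript, one sentence per line, with a deliberate typo ('lenght')
printf '%s\n' '\documentclass{article}' '\begin{document}' \
  'Ants have six legs.' 'We measured them all.' 'Leg lenght varied widely.' \
  '\end{document}' > ms.tex
git add ms.tex
git commit -q -m "first draft"

# Branch off and apply journal formatting (hypothetical class name)
git checkout -q -b nature
sed 's/{article}/{nature}/' ms.tex > ms.tmp && mv ms.tmp ms.tex
git commit -q -a -m "format for Nature"

# Spot and fix the typo while still on the nature branch
sed 's/lenght/length/' ms.tex > ms.tmp && mv ms.tmp ms.tex
git commit -q -a -m "fix typo in results"
typo_fix=$(git rev-parse HEAD)

# Send only the typo fix back to master, leaving the formatting behind
git checkout -q master
git cherry-pick "$typo_fix" > /dev/null
cat ms.tex
```

Because the two edits were committed separately, master ends up with the corrected spelling but keeps its original document class. Had both changes gone into one commit, there would have been nothing clean to pick.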

I hope I've demonstrated that this is a more intelligent way than copy/pasting, but one of the key features is that all authors can work on the same document at the same time, without fear of screwups. This is where git really shines. No longer will you have to email drafts out to all authors and then clean up the mess afterwards using track changes. Each person can independently work on the project at the same time and changes can be incorporated as desired.

However, there's a big snag in adopting this git approach, and as usual, it's other people. Let's be honest, it's not easy to persuade busy/important people to drop what they're doing to learn how to do something new, even if they are genuinely curious. This could make collaborating on a paper hard, which is ironic, as this is one of git's big strengths. So in my case I sent a pdf to my co-authors and received back comments annotated on the pdf. I was relatively fortunate in that my co-authors only wanted minor changes to the text, so this was not a problem to do manually. If they needed to get really stuck in, then the pdf option would have been a no-go (same goes for the dead tree option if they are in the same building).

But what's this about pdfs and this 'LaTeX' thing? Why can't git manage my Word documents? Well, git can track Word documents, but it's a bad idea. Word stores its content as binary or compressed data, and while git can in theory be set to handle this, it gets complicated and unless you know exactly what you're doing, you can lose the main benefits of git, i.e. how it tracks differences between files and effortlessly merges them. Git works best with a plain text file, and therefore the LaTeX system is the obvious choice, as it stores the content of your manuscript in this format. You simply run the text file through compiler software, and a fully typeset pdf is produced. The formatting relies on 'markup language', so for example italic text would be presented as follows: \textit{Homo sapiens}. If you've ever written anything in html it's a similar idea, and not as difficult as it sounds.

However, again, the big problem with LaTeX is other people. If the journal you want to submit to is a nice modern one, then a LaTeX template will be available on their site. Conforming to their punctilious formatting rules is a doddle: you just use the template, and all is good. I've submitted to PLoS, Springer, and Elsevier* journals, and each was very straightforward (almost a pleasure). If your chosen journal does not accept LaTeX, however, you're in a world of pain. Converting to .rtf and then .doc via latex2rtf is straightforward enough, but how do you conform to their ridiculous rules (nobody could possibly peer review a manuscript if the subfigures are numbered with lower case rather than upper case letters, right?). You could do this by hand in the Word doc, but our time is just too valuable to be wasted like that. Changing these things in LaTeX is possible, but it's a royal pain in the arse sometimes, especially when you need to change minutiae in the reference formatting. Besides, it goes against the LaTeX mantra of letting LaTeX take care of these things for you.

So, if I haven't put you off, how does it all work? First you install git. Next you create a local folder on your machine to hold your documents. This would just be the same as for any other file. Next, you need to set up a repository, or 'repo' as it's known in the trade. This repo is usually online, but need not be; it could be another folder on the same computer. Obviously if you wish to work on multiple machines, or you wish to collaborate, it needs to be online. There are a few options out there for that. GitHub is the most well known; public accounts are free, but if you want a private project you'll need to pay (I think some academics can apply for a free private account). BitBucket is another option that does offer free, private repos. So, what do I use? I use Dropbox. If you're reading this with any knowledge of git, you'll know that Dropbox is not recommended to be used as a git repo, as it is simply not designed for it. However, the main problem lies with the fact that it can't deal with two people working on the same files at the same time, as the repo becomes corrupted. But in my experience, if you are the only user of the repo, then that isn't a problem and it works fine.

Git essentially works by tracking your files and noticing when they have been changed. Once you 'add' those changes, they sit in what's called the 'staging area'. When you are happy with the changes, you can 'commit' them to git, and they are assigned a unique 'hash', which acts as a permanent record of those changes. At the end of the day, you can 'push' your commits to the remote repo, and they can be accessed by you on another machine later, or by a collaborator. The main advantage of this three step system is that you can craft exactly who sees which changes and when. Git is a command line program, and although GUIs are available, it is good to start familiarising yourself with the basic commands when you learn. They are very simple (see below), and any problems/questions can be easily Googled. There's tons of information out there.
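Here is the three step cycle in miniature, using a second folder on the same machine as the remote repo. The folder names, author details, and file contents are placeholders for the example, and the remote is a 'bare' repository, which is the usual choice for a repo you only push to.

```shell
base=$(mktemp -d)

# The remote 'repo': here just another folder on the same computer
git init -q --bare "$base/remote.git"

# The working folder where the manuscript lives
git init -q "$base/paper"
cd "$base/paper" || exit 1
git symbolic-ref HEAD refs/heads/master        # make sure the first branch is called 'master'
git config user.name "A. Author"               # placeholder identity
git config user.email "author@example.org"
git remote add origin "$base/remote.git"

echo 'First sentence of the introduction.' > ms.tex

git add ms.tex                                 # step 1: stage the change
git commit -q -m "start the introduction"      # step 2: commit it, which assigns the hash
git push -q origin master                      # step 3: send the commit to the remote repo

git log --oneline                              # shows the short hash and the commit message
```

Nothing leaves your machine until the final push, which is what lets you decide who sees which changes and when.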

So, I will definitely be using the git/LaTeX combination in future, assuming I can convince people to join me. There's a lot more to it than I've mentioned here, but here's a few commonly used commands below, mainly to illustrate how simple it is. For further information, read these helpful git tutorials here, here, and here. If you're interested in LaTeX, the Wikibook is here.

Git is not limited to dealing with manuscripts either. I also added my figures and data there too. In fact, any version of the whole project at any time can be accessed with a single command. Another cool feature is that a repo such as GitHub can double as a preprint server, should you wish to share your results with the research community prior to journal submission.

#adds file(s) to be tracked by git
git add <file_name>

#you can make some changes to several files and commit these changes (e.g. a day's work) all at once with one message (the -a specifies all)
#you can alternatively tailor your commits to apply to just one file, or just one specific edit, and this makes rolling back a specific file a lot easier
#your future self will thank you for informative commit messages!
git commit -a -m "a message describing what you did"

#view the history of commits
git log

#send your commits to the repo
#can be set up to push automatically with just 'git push'
git push remotename branchname

#to create a new branch called 'newbranch'
git branch newbranch

#switch to new branch
git checkout newbranch

#switch back to master
git checkout master

#rollback to a previous commit
#the commits are stored as unique alphanumerical 'hashes' and can be accessed with 'git log'
#they can be truncated too.
git checkout c96c8009

#permanently reset to a previous commit: you lose all later commits
git reset --hard c96c8009

#cherry pick a specific commit and incorporate into your current branch
#need to have checked out the branch you want to cherry-pick IN to to do this.
#a tip for using with LaTeX is to write each sentence on a separate line. This will minimise conflicts (the same line getting edited in different places by different people).
git cherry-pick c96c8009

#for significant points in your timeline, add a version tag to a commit
git tag -a v2.0 -m 'version submitted to Nature' c96c8009

#compare two versions of the same file
#there are many additional options including colouring and word differences
git diff <commit1> <commit2> <file_name>

*Say what you like about Elsevier, their LaTeX support is very good.

## Sunday, 11 November 2012

### Non-zero exit status

I have been recently attempting to install and update some new R packages on my Ubuntu 12.10 machine, namely "rfishbase" and "phytools" (and their depends).

Unfortunately I got the fairly opaque error message: "installation of package had non-zero exit status".

After a bit of hunting I realised I was missing some development files from the Ubuntu install that are used to compile the package code. After installing these with the following commands, the packages installed in R no problem.

sudo apt-get update
sudo apt-get install libxml2-dev libcurl4-gnutls-dev libglu1-mesa-dev

## Friday, 28 September 2012

### Self publishing "failed" thesis chapters on Figshare

Sometimes in life, things just don't work out, and this is especially the case when doing scientific research. Experiments fail, you ran out of time/money, you didn't collect as much data as you wanted, you get a boring negative result, the conclusions are littered with caveats, or maybe the idea was just a duff one in the first place? Unfortunately, one of my thesis chapters ended up suffering from pretty much all of these problems, but is that time I spent on it now wasted?

Perhaps not. The Web site Figshare was set up by a "frustrated Imperial College PhD student" and it looks great (not that I'm biased you understand). It's a "community-based, open science project", allowing "researchers to publish all of their research outputs in seconds in an easily citable, shareable and discoverable manner".

Despite the fact that I felt this chapter was not of the expected quality, rigour, and interest required by a peer-reviewed journal, there are still elements I think would perhaps be useful in the public domain (particularly to aquarists). More importantly though, by putting it in the public domain, an editor, a reviewer, or even myself, doesn't have to make that subjective decision. This is a bit like the PLoS ONE model of publishing, except without the all-important peer review stage to check that the science is sound. Seeing as I don't really have any strong conclusions other than "more work is required", I can't see much of a problem there.

 A hybrid Synodontis catfish. Image used with permission (Mike Norén).

The study investigates a simple way to find out if an aquarium fish is a hybrid or not. Hybrid fishes are quite commonly sold in the ornamental trade (especially African Synodontis catfishes), and this has implications for biosecurity agencies who have a responsibility to know which exotic organisms are entering their country. There is also the possibility of fraud, with these "fakes" often passed off as high-value species such as Synodontis granulosa. Finding experts experienced enough to know what they are is hard, and often all they are able to do is make an educated guess based on a photo. One solution is using DNA.

Given a good reference library, mitochondrial DNA will tell you who the maternal species is, but will not itself give you an indication that the fish is a hybrid, or what the paternal species is. Enter nuclear DNA. Microsatellites or SNPs are the best options, but these are too expensive and time consuming for a simple at-the-border test.

What I tried to do was see if a single nuclear gene could give me what I wanted. Results were mixed. It worked nicely for the control (hybrid danios bred in the lab), and some purchased hybrids too. However, for various unexplored reasons, it didn't work so well for the Synodontis (which was really the aim here).
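To make the idea concrete, here is a minimal Python sketch of how a single-nuclear-gene test could be framed. It is not necessarily the approach used in the manuscript. It assumes that a directly sequenced hybrid shows a heterozygous base, written as an IUPAC ambiguity code, at sites where the two parental species differ. The function name `hybrid_signal` and the toy sequences are made up for illustration.

```python
# IUPAC codes for two-base ambiguities, keyed by the unordered pair of bases.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def hybrid_signal(parent_a, parent_b, candidate):
    """Return (diagnostic sites, diagnostic sites heterozygous in candidate).

    Hypothetical helper: all three sequences are assumed to be aligned
    and of equal length.
    """
    if not (len(parent_a) == len(parent_b) == len(candidate)):
        raise ValueError("sequences must be aligned to the same length")
    diagnostic = 0
    heterozygous = 0
    for a, b, c in zip(parent_a.upper(), parent_b.upper(), candidate.upper()):
        pair = frozenset((a, b))
        if pair not in IUPAC_PAIRS:
            # Parents agree here, or the difference is not a clean A/C/G/T one.
            continue
        diagnostic += 1
        if c == IUPAC_PAIRS[pair]:
            heterozygous += 1
    return diagnostic, heterozygous


# Toy, made-up sequences: the parents differ at two sites.
species_a = "ACGTACGT"
species_b = "ACATACCT"
suspect = "ACRTACST"  # ambiguity codes at both diagnostic sites

result_suspect = hybrid_signal(species_a, species_b, suspect)
result_pure = hybrid_signal(species_a, species_b, species_a)
print(result_suspect, result_pure)
```

A fish that is heterozygous at most diagnostic sites would be a hybrid candidate. A pure specimen should match one parent throughout. In practice this only works if the parental species are fixed for different alleles at the chosen locus, which may be part of why results were mixed.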

Anyway, see for yourself at http://dx.doi.org/10.6084/m9.figshare.96149. Comments are welcome: if they are about self publishing, add them to this blog; if they are about the manuscript, use the comment feature on Figshare; and if they are on catfish hybrids, then please add them to the PlanetCatfish discussion thread on the subject.

## Friday, 6 July 2012

### Research round-up

Unfortunately there has been little activity on the blog of late, mainly due to the small matter of getting my PhD thesis handed in, submitting manuscripts to journals, and finding a job etc! Having said that, I have been somewhat busy in other parts of the Web. Boopsboops now has a sister Twitter feed for science-related things (@boopsboops), and I have now coded up a Website promoting my CV, publications, and research skills etc, etc.

So, in absence of anything better, and as I've been meaning to do for a while, I thought I'd write about my favourite fish papers of 2009, 2010, and 2011.

#### (1) Larmuseau et al. (2009) To see in different seas: spatial variation in the rhodopsin gene of the sand goby (Pomatoschistus minutus). Molecular Ecology. doi:10.1111/j.1365-294X.2009.04331.x

I like the idea of looking at how organisms adapt to their surroundings. This study compared variation in the rhodopsin visual pigment locus with phylogeographic patterns in "neutral" mitochondrial and microsatellite markers (i.e. likely to detect any population-genetic structure), and found that in the sand goby, the two were discordant. Variation in the rhodopsin gene (RHO/RH1/RHOD) was partitioned differently and corresponded to photic environment (light penetration, water turbidity etc). There were also signs of positive selection at sites coding for amino acid changes relevant to spectral adaptation.

It's also interesting to note that rhodopsin is a commonly used marker for phylogenetic studies, which is probably due to early studies on vertebrate visual systems providing easy to use primer sets. However, I would be cautious about its use now, as these apparent convergences due to environmental conditions may not give a good indication of common ancestry for a species tree!

#### (2) Lavoué et al. (2011) Remarkable morphological stasis in an extant vertebrate despite tens of millions of years of divergence. Proceedings of the Royal Society B. doi:10.1098/rspb.2010.1639

If you've ever kept a tropical aquarium, you may have seen the African butterfly fish (Pantodon buchholzi) lurking in the oddball tanks. They're indeed a strange fish and are great fun to keep, clinging to the surface and greedily snapping up any insects that you feed them. Pantodon buchholzi is the only species in a monotypic genus and family, known from the Niger and Congo basins.

When their mitochondrial genomes were sequenced, the researchers estimated that the two isolated populations had diverged over 50 million years ago, despite looking almost identical in terms of shape and meristics!

Evolution is taking place on the DNA clearly, but not on the external anatomy it seems. The reasons as to why and how this has happened are fascinating. The authors state "Proposed mechanisms of morphological stasis include stabilizing selection, ecological niche conservatism and genetic and developmental constraints". I look forward to further studies on this.

#### (3) Mims et al. (2010) Geography disentangles introgression from ancestral polymorphism in Lake Malawi cichlids. Molecular Ecology. doi:10.1111/j.1365-294X.2010.04529.x

The cichlid flocks of the African Rift Lakes are an almost extreme opposite example to the one presented above. There is huge phenotypic diversity, but often very little in the way of molecular differences. The mbuna cichlids Labeotropheus fuelleborni and Metriaclima zebra are quite different in appearance, but share mitochondrial DNA haplotypes typical of very recently diverged, or hybridising, species. The authors also report "greater mtDNA differentiation among localities than between species".

Information from the nuclear genome can help in understanding levels of gene flow in these situations, but has limited resolving power when too few loci are used. Enter NGS. Modern sequencing methods can now provide orders of magnitude more data, and with a large SNP (single nucleotide polymorphism) set, here the authors report that the two species are indeed genetically distinct, and that recent hybridisation between the two species is unlikely. Certainly a useful tool for exploring these questions further.

## Tuesday, 6 December 2011

### Danio rerio: five species in one ... BIN!

So, I've just got back from the 4th International Barcode of Life Conference in Adelaide. An enjoyable time was had by all, and there's plenty to think about. Now, if you don't quite understand the title of this blog post, bear with me, and hopefully all will be explained by the end. There were three main themes I got from the conference, and I will try to draw them together.

Data access

We heard this again and again. Having data languishing in private projects is helping nobody, but publishing on other people's hard-collected data is certainly not cool either. The "Fort Lauderdale Agreement" aims to make a compromise between the two, and allow fair use where appropriate. As an incentive for the rest of us, leading researchers and museums will be releasing significant barcode datasets very soon.

A problem with early data release is the massive accumulation of sequences on GenBank without proper binomials; these have been termed "dark taxa" by Prof. Rod Page in his thoughtful blog post on the subject. Many of these data have come from BOLD. This has caused something of a problem for GenBank, especially where taxon names had subsequently been changed on BOLD. It was announced that a system of phases is to be introduced to differentiate data with different levels of annotation. The "phase zero" data with very little information other than the sequence will be "cleansed" off GenBank soon (removed from searches, but remaining in the system). BOLD and GenBank databases are now expected to update each other more regularly too.

However, in answer to Rod's question of what can we do with "bad data" like this, we saw several excellent presentations on the kind of science that can be done on large datasets even without taxonomic names (I will try to get some links up to the videos when they are available).

BINs (barcode index numbers)

These were unveiled with perhaps a little less fanfare than expected given their importance; they had apparently been around since the last barcode conference two years ago, but have only now been made visible in BOLD 3.0 beta.

They are essentially clusters recognised by BOLD as putative species or species-like groups, independent of the taxonomic name system. Importantly, they are indexed and can be treated just like taxonomic names (i.e., created, stored and synonymised). I think a system like this is required, due to the fact that modern biodiversity science is as much a problem of information management as it is of species concepts and taxon definitions.

They offer many attractive advantages by: (1) linking sequences together with taxonomic names, literature, databases and museum vouchers; (2) simplifying the identification process; (3) tracking conflicting identifications and species with interim code names; and (4) offering scalable assessment of biodiversity.

Although announced as an "interim" taxonomic system I can't help but think that this endeavour may obviate the need for Linnaean names altogether in many groups. This could particularly be the case where one is more interested in say broad phylogenetic patterns across geographic areas or ecological guilds. It will now be all but impossible for "traditional taxonomy" to catch up with these BINs given the rate at which barcode data are now generated. Those who believe taxonomy is but a "service industry" to other branches of science will rejoice, as there is now the potential for a rapid, semi-automated, and fully scalable biodiversity assessment tool commensurate to the challenge at hand. Therefore there may no longer be room to argue that traditional taxonomy is required to document our deteriorating world. Those who prefer a "whole organism" approach may not be so impressed. The onus is perhaps on them now to justify why such a holistic science is valuable in the short-medium term. Of course the reasons are obvious to me*, but it may be a hard sell in today's output driven world.

Specifically, some issues also need to be ironed out with the BIN framework, particularly the repeatability of these clusters, as the algorithms under which they were generated are yet to be published and scrutinised, despite BOLD 3.0 going live and effectively hitting the detonate button.

Conflicting IDs

Now, this issue of BINs brings me nicely back to the title. In case you didn't get it, it's a play on the paper in PNAS entitled "Ten species in one: DNA barcoding reveals cryptic species in the Neotropical skipper butterfly Astraptes fulgerator". There the authors reported cryptic diversity in a widely dispersed species.

In contrast, here the problem is that currently the BIN for the zebrafish Danio rerio contains five different binomials! Given that of all 40,000+ fishes this species is arguably the one we humans know most about, this is perhaps surprising and worrying. One record was D. rerio proper, one was labelled D. cf. rerio, another was a legitimate synonym of D. rerio, another was what looked like a misspelling of a legitimate synonym of D. rerio, and the last was labelled Xiphophorus hellerii, a fish in a completely different order! Some of the public D. rerio records were just identified as "Cypriniformes sp.".

 Danio cf. rerio (BIN AAE3739)

This certainly calls into question the utility of barcoding for regulatory purposes such as seafood substitution, or monitoring invasive aquarium fish imports. Non-biologist regulators will be relying on good barcode reference libraries, and may end up acting conservatively, e.g. by rejecting all imports of aquarium zebra danios because BOLD was unable to give an unambiguous ID to species level. In a presentation by Dr Bob Hanner, it was estimated that for fishes, one in ten BINs contains more than one species. This, I can only assume, will rise, especially where a number of labs are working on the same groups.
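The kind of conflict described here is easy to screen for once records are in a table. Below is a minimal Python sketch that groups records by BIN and reports any BIN carrying more than one name. The BIN identifiers and records are invented for illustration, and `conflicting_bins` is a hypothetical helper, not part of BOLD or any existing package.

```python
from collections import defaultdict


def conflicting_bins(records):
    """Return {bin_id: sorted names} for BINs holding more than one name.

    `records` is an iterable of (bin_id, taxon_name) pairs.
    """
    names_by_bin = defaultdict(set)
    for bin_id, name in records:
        names_by_bin[bin_id].add(name.strip())
    return {
        bin_id: sorted(names)
        for bin_id, names in names_by_bin.items()
        if len(names) > 1
    }


# Made-up example records; the BIN IDs are placeholders, not real BOLD BINs.
records = [
    ("BIN-0001", "Danio rerio"),
    ("BIN-0001", "Danio cf. rerio"),
    ("BIN-0001", "Xiphophorus hellerii"),
    ("BIN-0002", "Danio albolineatus"),
    ("BIN-0002", "Danio albolineatus"),
]

conflicts = conflicting_bins(records)
all_bins = {bin_id for bin_id, _ in records}
share_conflicting = len(conflicts) / len(all_bins)
print(conflicts, share_conflicting)
```

A count like this treats every distinct name string as a conflict, so legitimate synonyms, misspellings, and "cf." labels are all flagged alongside genuine misidentifications. Telling those apart still needs a synonym list and a human.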

This type of data conflict was a hot topic at the conference, especially among the fish people. Having a database of synonyms would certainly help getting rid of the legitimate synonyms, but the other problems will require more work. A community-based curation and ranking system for the quality of the supporting data was proposed, and BOLD 3.0 already offers a Wiki-like annotation feature. A great idea, but will end users (e.g. regulatory agencies) really understand the technicalities, and will project managers bother to actively maintain their records after the manuscript has been published and they move onto the next project/job? It's a lot easier to upload some dodgy data than it is to prove someone else's data are dodgy.

I think one of the keys lies in access to literature. Getting hold of taxonomic literature is as good as impossible for many groups, yet thoroughly demonstrating the characters used to identify your specimens will make the whole system more transparent and reliable. Conflicts cannot be resolved without universal access to this literature. But ultimately, the best prevention lies with collaboration, and working through identification uncertainties between labs before data are uploaded as reference specimens.

* How would we ever know that Cypriniformes BIN AAF7369 shows "spectacular morphological novelty" from its COI sequence? Even though most of the big or important creatures have now been described, I think many startling discoveries are yet to come ...