Showing posts with label R. Show all posts
Showing posts with label R. Show all posts

Friday, 24 April 2015

Submitting to GenBank

Few things fill the hearts of molecular biologists with as much dread as embarking on a GenBank submission. We all benefit from GenBank, but getting our data up there can be a somewhat onerous process. I am sure that if this community-minded attitude were not insisted upon by journals, GenBank would be a rather empty place.

Here, I will try to help with some of the organisational and practical aspects to make the process a little more painless. I will assume that you are familiar with at least the basics of R and running programs on the Unix command line.

Proper Planning and Preparation Prevents Piss Poor Performance

To most, GenBank submission is something of an afterthought, to be done once your paper has been accepted. Here I will try to challenge that, and suggest that you should be putting your data up on GenBank before you submit your manuscript. This is because:
  1. You have an opportunity to notice potentially serious errors in your data at a much earlier stage in your project.
  2. Your paper will benefit from fewer delays during the revision process, and is likely to get published faster.
  3. If you send the GenBank flatfile as information supporting your manuscript, reviewers will have access to your raw data and will be able to review that data.
  4. It forces you to get organised up front, which is generally a good thing all round.
  5. If you are worried about being scooped, you always have the option of releasing the data after publication.
What are the options?

There are three main ways you can get data onto GenBank: 'BankIt', 'Sequin', and 'tbl2asn'. The main differences are described here, but essentially BankIt is a Web-based submission tool, Sequin is a standalone offline program with a GUI (graphical user interface), and tbl2asn is the scary command line option.

In my experience, BankIt is suitable for trivial uploads of small numbers of sequences, and it pretty easy to use. Sequin seems to be the most popular method, but I would avoid it, as it not as straightforward as it seems, and many find it quite confusing and time consuming. I favour tbl2asn, as it means you can script your submission and save yourself considerable time. This is the method I will demonstrate now.

Where to start

First, you will need get your data into a suitable file format as a master copy. Master copies are important, because when you return to a project after six months, you might want to know which of the 347 fasta files in your folder is the correct one to use. For a master copy, I recommend CSV or TSV, mainly for simplicity's sake. This file should be version controlled of course, using git. The benefit of CSV over say fasta or nexus, is that an unlimited amount of ancillary information can also be stored in it, including for example: scientific names, higher classifications, geography, GPS coordinates, and specimen voucher data (essentially whatever metadata are available and useful). Try to use a controlled vocabulary here wherever possible (Darwin Core is a good place to start). Having all the data in this format is also a big help when running your analyses and making your figures.

Here's a fictitious example of a format of a master file (much simplified, and do not copy this, as extra tabs were added to visualise). As you can see, I sampled three Boops boops all from the same MNHN museum lot and generated data for both mitochondrial cytb and 16S.


Importing your data into R

Here, we read the table into R, and then because some of our individuals were not sequenced for both genes, we need to remove those represented with 'NA' (for the cytb gene first). Also remember that when working with text, that strings should not be factors!


Gene names

Now, we need to use consistent nomenclature in our submission, which means using the correct gene and product names. Anyone who has ever searched GenBank for specific genes will know how frustrating it is when one gene is known under half a dozen names. Here I looked up the NCBI organelle resources for the official name of cytb, which is 'CYTB' (all in upper-case). The official name of its product—in this case the protein it codes for—is "cytochrome b" (all lower case).

You'll need to also be aware of general usage patterns, as you'll notice that COI is officially called 'COX1' on this list, but very few people use this name when submitting. For nuclear gene names, take a look here and here .


Source modifiers

Here, we write a GenBank formatted fasta file containing our important source modifying annotations, specifically:
  1. The 'Sequence_ID', in our case the number of the tissue sample under 'otherCatalogNumbers' in our spreadsheet pasted together with the gene name to make it unique for each gene.
  2. Organism name made by pasting 'genus' and 'specificEpithet' fields together.
  3. The 'Bio_material' code, which is the code number used in the lab tissue collection, i.e. the same as our 'otherCatalogNumbers' field.
  4. The specimen voucher code made by pasting together the 'institutionCode' and the 'catalogNumber' fields.
  5. The genomic location of our sequence, in this case it is 'mitochondrion' (leave this blank for nuclear DNA).
  6. The genetic code to translate the sequences, here '2' is the mitochondrial genetic code (again, leave this blank for nuclear DNA)
There are many more source modifiers available. See the list here.


Here's the result:


Feature tables

Now we create the feature table containing the locations of attributes of the sequence. The table is tab separated. Pay attention to the angle brackets '< >', as these signify that the coding sequence starts or ends outside of range of our nucleotides, i.e. if it is a partial sequence that does not code for the complete gene you may need to use the angle brackets. Here, our sequence starts on the first base of cytb, but it is only a partial sequence, so we use the angle bracket at the end only.

We are also assuming here that your coding sequence is in the correct reading frame, in other words that the first nucleotide in the alignment corresponds to the first nucleotide of the amino acid. While it is possible to specify the coding sequence to start on the second or third bases, please never do this, because this information is lost when the sequences are downloaded as fasta files from GenBank, and it then makes the very simple task of aligning your sequences using codon models difficult without manually correcting these errors.


Here's the result:

Repeat for 16S

That's it, all done for cytb. Now we repeat for our 16S sample, and append the new data to the files we already made. Remember we are overwriting our previous objects, but that's okay as these were already written to disk.


Here's the resulting fasta file and feature table for both genes:

Author info template

Next, we need to generate a file for the author/publication information. This is done online, and you simply type the details into the Web form here and download the file into your working directory. Easy.

Running 'tbl2asn'

We have generated three files so far: 'sequences.fsa' contains our DNA sequences and source modifiers, 'features.tbl' contains the locations of the features in our data, and 'template.sbt' contains our author and publication information. Now, we need to pass these files to the 'tbl2asn' program.

The latest version of the program is available from the NCBI via their FTP site (versions exist for Windows, Mac, and Linux). If you find your institution blocks FTP connections such as this, you can find the tool as part of the NCBI's published toolkit—simply 'sudo apt-get install ncbi-tools-bin' to install in Ubuntu—but this version is over a year old, so I do not know if its output would be acceptable for GenBank today. Run this program from the terminal as follows:


Output

When the program has run (it's quick), it outputs a series of files. This includes one called 'errorsummary.val', which is important, as it contains a list of errors. In this case, we have none, and the file is empty. The two main output files are called 'sequences.gbf', and 'sequences.sqn'. The gbf file is a GenBank flatfile, the same as you would access through GenBank. This file is designed to be human-readable, so you need to review this in order to check you are happy with everything. You can also read the flatfile into Geneious, and also check there for any errors.

Here's an example:

Submit

If all looks okay, you are ready to submit to GenBank! This is as simple as emailing the 'sequences.sqn' file to GenBank staff (gb-sub@ncbi.nlm.nih.gov). In a couple of days, you should receive via email your shiny new accession numbers, although it may take several further weeks before the sequences go live.

Have a go ...

If you wish to try this out yourself, all the files to repeat this example (except tbl2asn) can be found at https://github.com/boopsboops/genbank-submit. For the time being you will have to adapt these R scripts for your own needs, but one day I might wrap this up into a nice function.


While this is okay for a small-medium sized project, you might want to be thinking about relational databases for more complicated structures linking several projects.

Wednesday, 5 February 2014

Making figures for scientific papers

The next few blog posts (when I get round to doing them) are going to be a series of tutorials into common but neglected aspects of the academic process, such as how to make figures (this post), and how to submit/prepare data for GenBank (coming later). Very few students get taught how to do these things, so we just muddle our way through and learn along the way. Considering that these kind of activities are a central part of our research time, however, a little advice at the early stages can save a lot of effort.

What I won't be covering today though, is how to design good figures in the first place. There's enough information on that out there already. See, for example, the paper "A brief guide to designing effective figures for the scientific paper"; this article runs through dos and don'ts of making a figure to express your results clearly and effectively. What I will talk about, however, is how you actually put your figures together—what tools/processes are used, and how to be organised.

The basics

The really basic stuff is probably worth repeating. There are essentially two types of digital graphics: "raster" graphics (also known as "bitmap") and "vector" graphics. Raster graphics are comprised of entirely of coloured/shaded pixels, and as a result lose quality when you zoom in. Vector graphics (e.g. SVG, PDF, EPS) are comprised of instructions to draw objects, and so are scalable and therefore do not lose quality when zoomed in. Raster graphic formats (e.g. JPG, PNG, TIF) are for photographic images, and vector formats are for graphs, plots, diagrams, and cartoons. Raster images can be inserted into vector files (where they remain as raster), but not vice versa. Importantly in vector plots, the text remains as text. It amazes me how many rasterised phylogenetic trees I still see in journal articles. It might seems a trivial detail, but when you rasterise a tree, the names of the taxa in those trees become invisible to searching both within the PDF article, or by Web search engines.

Fig 1. Difference between raster and vector graphics

Software

So, what programs do we use to create/edit our figures? Well first of all, I will not be recommending any proprietary software. The reasons are threefold: (a) a lot of students work on their own personal machines, so it's unfair to make them spend their limited money on these expensive software packages; (b) with open-source software you can often use the same program on any operating system (Mac, Windows, Linux); and (c) there's no need, as the free tools are good enough. So, for generating the data plots I use the R statistical package with ggplot2 (the sort of equivalent of Microsoft Excel), for editing raster images I use Gimp (the equivalent of Adobe Photoshop), for maps I use QGIS (the equivalent of ArcGIS), and for putting it all together and making adjustments to vector figures I use Inkscape (the equivalent of Adobe Illustrator). Now I'm not pretending that Gimp, QGIS, and Inkscape are better than their paid-for equivalents, but they are free, and for academic use I think they do pretty much all that is needed.

Before we get started, though, here's one useful piece of advice: go to the Web site of the journal that you're intending to submit your manuscript to, and read the instructions for authors. Each journal has slightly different requirements, and knowing the specifics at the beginning makes things a lot easier. They will tell you the dimensions of the figures, the label formats, the fonts to use, etc. If you don't know where you will submit the article, just do something generic, but importantly, make it easy for yourself to change it as a later date; you will have to. The usual format journals want for vector graphics is EPS (sometimes PDF), while for pure raster graphics, it is usually TIFs at > 300 pixels per inch (ppi, but often also called dpi).

For plots/graphs

So, for a simple plot, my workflow is straightforward. In basic R graphics I just run the code (see below) which saves my plot as a PDF (I'd prefer to save directly as SVG, but R Cairo does not currently support editable text objects in SVG). You may have to make minor changes to the plot—such as trimming up margins, moving the axis labels or changing fonts etc—before it's good enough for publication. Here, we have two choices: (a) do all the plot polishing in R, and export as a final version EPS (to save as EPS, just change "pdf" to "postscript" in the code below); or (b) save as PDF, import into Inkscape, and make minor adjustments there. Generally I find that it's easier to do this kind of thing in Inkscape, rather than messing around with details too much in R. However, there is a law of diminishing returns here, which means that only very minor changes are worth doing by hand instead of taking the time to perfect the code in R. There's a good chance you will certainly have to make the plot several times in the end—co-author doesn't like it, reviewers don't like it, need to submit to different journal, etc—so think about this in advance and do the bulk of the work in R. If working on a plot in Inkscape, it is best to save it as a master copy in SVG (the native format of Inkscape), or if it's a big file, as a compressed SVG (SVGZ). You can export to EPS, PDF and a variety of other vector formats in Inkscape via the "File > Save a Copy" menu.

data(mtcars)#load up the car data that comes with R
pdf("carplot.pdf", useDingbats=FALSE)#open PDF file
plot(mtcars$mpg~mtcars$wt)#create plot
abline(lm(mtcars$mpg~mtcars$wt), col="red")#fit linear model
dev.off()#close PDF file

Fig 2. A basic, no frills R plot showing car MPG (efficiency) as function of car weight.
Fig 3. After editing the plot in Inkscape to reduce the margins and change the axis font
(please never use comic sans, it's just for illustrative purposes).

In a ggplot2 session, I do this (ggplot2 plots are pretty, so usually good-to-go with no editing at all):

library(ggplot2)#load ggplot2 library
data(mtcars)#load up the car data that comes with R
c <- ggplot(mtcars, aes(mpg, wt))#create plot
c + stat_smooth(method="lm", se=TRUE) + geom_point()#layers
ggsave("carplotgg.pdf")#save as PDF

Fig 4. A ggplot2 plot.

For mixed figure types such as multi-panel figures

Sometimes you'll need to combine several photographic images onto one figure. Again, I use Inkscape for this, but if I need to make any alterations such as contrast, brightness, or resolution, I do this first in Gimp (Inkscape cannot edit the image in this way). What I don't do is crop the photo in Gimp first. This can be done in Inkscape, and importantly, if you need to change the proportions of the figure, it can be reversed or re-edited at any later date from within the same file. A useful thing worth mentioning at this point, is the value of keeping a backed-up history of each version of the figure. This can be really handy, not only for getting out of trouble if you make mistakes, but also to keep your working folders free of dozens of copies of the same file. I recommend looking as git for this purpose.

Now, to import the raster images into Inkscape, go to "File > Import", choose the file and select the "embed" box. Align the imported objects using "View > Grid". You can also resize them by selecting them, and going to the "Object > Transform > Scale" menu, remembering to "scale proportionally" to keep the width/height ratio. Add any arrows or text boxes that are required. The format/colour of these can be changed by selecting the object and going to "Object > Fill and Stroke". A useful thing to know when making arrows in Inkscape, is that you need to go to "Extensions > Modify Path > Colour Markers to Match Stroke" in order to make the arrow head match the arrow shaft.

Journals generally want minimal margins on plots. In Inkscape I use "File > Document Properties > Resize page to drawing or selection" to trim the plot to just a few pixels round each of the edges. Experiment to see what looks best. Here's one I made earlier.

Fig 5. Two pygmy seahorse images edited onto a single figure with arrows and labels.

Again, I save this as a master copy SVG and work in that format. The figure can be exported as an EPS or PDF for the final version, but pay attention to the "resolution for rasterisation" option of the internal raster elements when you save (journals usually require 300 ppi or more, even if embedded in a vector file). Sometimes the file sizes can end up being quite big here, so what I often do when emailing figures to co-authors, or even when submitting a manuscript at the peer review stage, is to generate a low-resolution raster image for this purpose (high quality images are submitted later on in the review process). You can do this by going to "File > Export Bitmap" and select the page as the export area, and change the dpi to, say, 90. This will export a PNG file. Repeat at lower resolution if file size is still too big. Follow this process if the journal, for some strange reason, require the figure to be a raster image; if they do not accept PNG, you will need to open Gimp and export the PNG as a TIF.

In conclusion, I hope this provides some useful advice to get people started on making figures for their publications. Please don't hesitate to add any of your own nuggets of advice in the comments.

Sunday, 11 November 2012

Non-zero exit status

I have been recently attempting to install and update some new R packages on my Ubuntu 12.10 machine, namely "rfishbase" and "phytools" (and their depends).

Unfortunately I got the fairly opaque error message: "installation of package had non-zero exit status".

After a bit of hunting I realised I was missing some development files from the Ubuntu install that are used to compile the package code. After installing these with the following commands, the packages installed in R no problem.

sudo apt-get update
sudo apt-get install libxml2-dev libcurl4-gnutls-dev libglu1-mesa-dev

Wednesday, 27 July 2011

Batch extracting GenBank data from journal articles

Repeatability, one of the central tenets of science. In theory, any published study should be repeatable. But, if anyone actually wants to do this, it's not always as straightforward as it sounds.

Molecular people do have it easy in comparison to say ecologists—we use GenBank, the Web based repository of all things DNA. Simple, just want to re-analyse some data, grab it from GenBank. Not so easy*, and here's an example.

Take the recent cypriniform relationships proposed by Mayden and Chen (2010). Being as polite as possible, their results are "interesting" to say the least. Lets assume we want to get our grubby hands on this data and see whether their conclusions have any meaningful support. They used six genes, and a table listing the accession numbers is presented:


These data are not all generated in this study though, so it makes it tricky to access them from the GenBank Web site. Copying and pasting each of them into GenBank is just not an option, so here's a hack that might just save you some time:

(1) Copy and paste the whole table from the pdf into a text file. Save the text file (e.g. as "input.txt").

(2) Open a terminal session, cd to the directory, and copy this command:

grep -o '[A-Z][A-Z][0-9][0-9][0-9][0-9][0-9][0-9]' input.txt | sed -e ':a;N;$!ba;s/\n/", "/g' -e 's/^/acc <- c("/g' -e 's/$/")/g' > output.txt

Eugh, that looks horrible, but what it does is create a ready made vector (called "acc") of accession numbers, which can be copied straight into R. Now, we assume a few things first though: that the table copied perfectly, and the fonts are compatible between pdf and txt; that the accessions are common eight digit GenBank accessions—some of the older GenBank accessions may be six digits (the regex can be modified though).

(3) Now, copy the output straight into R. We could use ape's read.GenBank function, but we would like to access the gene names too**, so we will use Samuel Brown's read.GB function instead.

Run the following R code to download the data and add taxon labels:

dat <- read.GB(acc)
names(dat) <- paste(acc, "|", attr(dat, "gene"), sep="")

(4) Now, this dumps us with all of the data in one vector, but we really want to analyse the seperate loci (e.g. rhodopsin). No worries, grep comes to the rescue again by grabbing all the sequences with "rhodopsin" in the gene attribute:

rho <- grep("rhodopsin", names(dat))
rhoSeqs <- dat[rho,]

The names ascribed to genes in GenBank from different studies may not be consistent, so make sure you check that your identifying phrase will work as expected. Now all you need to do is write this into a fasta file, align it, and you're away.

write.dna(rhoSeqs, file="mayden2010_rho.fas", format="fasta", colw=10000)

---------------------------------------------------------------------------------

* It really should be mandatory that researchers use services such as TreeBASE to upload their alignments.

** read.GenBank in ape 2.7 now has an optional "gene" attribute, but I couldn't get it to work ...

Thursday, 21 July 2011

Importing pdf R plots into Inkscape: the "q" problem

I know some people like to do absolutely everything in R (it's like an admission of failure if they can't), but my approach is somewhat more pragmatic; I like to get things done as quickly and as easily as possible.

As is my usual way, I get most of my plot basics done in R, then touch them up and move things around afterwards in Inkscape. This really does make life simple, but recently I came unstuck with this method.

R was exporting my plot just fine, and the pdf looked okay, but when I imported it into Inkscape, the points had all turned into the letter "q". I think this a a fairly well known problem, and is caused by R using letters (i.e. "o") as circles in the plot, but Inkscape not being able to deal with the same fonts.


There are multitude of fixes out there on the Web, but here is what worked for me, and took about 30 seconds. Instead of using pdf as output, try postscript. The code is as follows:

postscript(file="plot.eps")
plot(object)
dev.off()


Inkscape can now import the eps file, and the points are nicely rendered as circles. I would be interested to know if this works across other platforms using other font packages ...

Monday, 16 May 2011

Dealing with large R plots

Had a small problem today. I was preparing several plots for a manuscript, but there was a lot of number crunching involved. One plot in particular comprised over 1.5 million data points! My preferred format of choice from saving out of R is as .pdf, because this vector-based format is scalable at any resolution, searchable, and easily compiled into another document by pdfLaTeX. However, due to the amount of data the pdf ended up over 25mb—a little too large! I could have gone down the low-quality raster route (e.g. as a .png) to reduce file size, but didn't really want to due to the crappy, pixelated appearance of the text.

What I needed was a compromise. Normally, I like to get the bulk of a plot done in R, and then polish it up a bit in Inkscape (it's so much easier this way). I adopted a similar approach here. Using the ggplot2 package, I removed all extraneous axis labelling using options including:

qplot() + opts(axis.title.x=theme_blank(), axis.title.x=theme_blank())

I also did the same when not using ggplot2, with the standard R graphics commands:

plot(dat, xlab="", ylab="", xaxt="n", yaxt="n")
axis(side=1, labels=FALSE)
axis(side=2, labels=FALSE)


Then, I exported it as a high quality (600 dpi) png:

png(file="file.png", width=11.7, height=11.7, units = "in", res=600)

This resulted in a series of plain plots something like this:


Next, I simply imported this png into Inkscape and added the axis labels myself as text. This was then saved as .svg (master copy), and then copies exported as .pdf.

The resulting file was much smaller at under 1mb (could have probably made it smaller still), and retained the all-important vector graphic text. The actual plotting was still embedded as raster, but it was high enough quality and I could live with that.

Sunday, 10 October 2010

Negative branch lengths in neighbour-joining analyses

A recent analysis of some fish COI data revealed these really odd branch tips going backwards on the tree. I'd never heard of this before and neither had most people I asked. So what are they, what causes them, and how do I get rid of them?

NJ phylogram of cyprinid COI sequences
Well they seem to be an artefact of the stepwise NJ clustering algorithm and how it adds new branches. I don't pretend to know the details, but it seems to occur on most analytical platforms (e.g. R, MEGA, PAUP*). These negative branches don't really have any biological meaning, so it is best to remove them before the tree is presented. Ideally they are redistributed to adjacent branches, but practically this isn't feasible. Apparently, PAUP* is able to deal with this, but I didn't have any joy when I tried. Setting them to zero seems the most justified approach at this stage. Of course the original distance matrix on which the NJ tree is based remains the same, so identifications should obviously be checked against the data, rather than by looking at just the tree.

Using the ape package in R, my simple workaround is to generate a neighbour-joining tree as usual with the nj() function. Next I save this tree object to file (in Newick parenthetical format) using the write.tree() command, and open the Newick text file in a text editor (I use SciTE). To remove all the negative branch numbers and replace them with zero, you need to use a regular expression search and replace. This is quite a powerful feature if you know how to use it. As well as normal negative numbers (e.g. -0.0005558445244), ape also adds negative exponents for really short branches (e.g. -5.199093188e-17), so these need to be dealt with too. First off you need to replace the exponent string by entering:
-\d\.\d*e-\d\d
Next, the normal negative number can be addressed with:
-\d\.\d*
I won't go into the details of how these instructions work, but a good tutorial on regular expressions can be found here. Make sure you have no other hyphens in your Newick file that may interfere (e.g. in the taxon labels), and always test it first with just find before you replace all and save. Now your modified Newick file can be reloaded into R and printed using the respective read.tree() and plot() commands. Hopefully someone will eventually develop a more sophisticated way of dealing with this natively in R.

EDIT 18.10.10 ...
And sure enough, yes, there is a much easier way of doing this straight in R. Simply create or load your tree object, and then access the branch (edge) lengths with the $ command, replacing all with zero.
TREEOBJECT$edge.length[TREEOBJECT$edge.length<0]<-0
Many thanks to Samuel Brown for pointing this out. For the R-phobic, the more long-winded approach posted above can still be used for trees produced in other programs such as MEGA or PAUP*.