Friday, 24 April 2015

Submitting to GenBank

Few things fill the hearts of molecular biologists with as much dread as embarking on a GenBank submission. We all benefit from GenBank, but getting our data up there can be a somewhat onerous process. I am sure that if this community-minded attitude were not insisted upon by journals, GenBank would be a rather empty place.

Here, I will try to help with some of the organisational and practical aspects to make the process a little more painless. I will assume that you are familiar with at least the basics of R and running programs on the Unix command line.

Proper Planning and Preparation Prevents Piss Poor Performance

To most, GenBank submission is something of an afterthought, to be done once your paper has been accepted. Here I will try to challenge that, and suggest that you should be putting your data up on GenBank before you submit your manuscript. This is because:
  1. You have an opportunity to notice potentially serious errors in your data at a much earlier stage in your project.
  2. Your paper will benefit from fewer delays during the revision process, and is likely to get published faster.
  3. If you send the GenBank flatfile as information supporting your manuscript, reviewers will have access to your raw data and will be able to review that data.
  4. It forces you to get organised up front, which is generally a good thing all round.
  5. If you are worried about being scooped, you always have the option of releasing the data after publication.
What are the options?

There are three main ways you can get data onto GenBank: 'BankIt', 'Sequin', and 'tbl2asn'. The main differences are described here, but essentially BankIt is a Web-based submission tool, Sequin is a standalone offline program with a GUI (graphical user interface), and tbl2asn is the scary command line option.

In my experience, BankIt is suitable for trivial uploads of small numbers of sequences, and it pretty easy to use. Sequin seems to be the most popular method, but I would avoid it, as it not as straightforward as it seems, and many find it quite confusing and time consuming. I favour tbl2asn, as it means you can script your submission and save yourself considerable time. This is the method I will demonstrate now.

Where to start

First, you will need get your data into a suitable file format as a master copy. Master copies are important, because when you return to a project after six months, you might want to know which of the 347 fasta files in your folder is the correct one to use. For a master copy, I recommend CSV or TSV, mainly for simplicity's sake. This file should be version controlled of course, using git. The benefit of CSV over say fasta or nexus, is that an unlimited amount of ancillary information can also be stored in it, including for example: scientific names, higher classifications, geography, GPS coordinates, and specimen voucher data (essentially whatever metadata are available and useful). Try to use a controlled vocabulary here wherever possible (Darwin Core is a good place to start). Having all the data in this format is also a big help when running your analyses and making your figures.

Here's a fictitious example of a format of a master file (much simplified, and do not copy this, as extra tabs were added to visualise). As you can see, I sampled three Boops boops all from the same MNHN museum lot and generated data for both mitochondrial cytb and 16S.

We can make this file beautiful and searchable if this error is corrected: It looks like row 2 should actually have 8 columns, instead of 14 in line 1.
otherCatalogNumbers genus specificEpithet institutionCode catalogNumber country nucleotides_CYTB nucleotides_16S
BB-001 Boops boops MNHN 1978-0632 Spain NA TATGGAGCTTAA
BB-002 Boops boops MNHN 1978-0632 Spain ATGGCTAGCCT NA
BB-003 Boops boops MNHN 1978-0632 Spain ATGGCTAGCCT TATGGAGCTTAA
view raw master_fake.tsv hosted with ❤ by GitHub

Importing your data into R

Here, we read the table into R, and then because some of our individuals were not sequenced for both genes, we need to remove those represented with 'NA' (for the cytb gene first). Also remember that when working with text, that strings should not be factors!

tab <- read.table("master.tsv", header=TRUE, sep="\t", stringsAsFactors=FALSE)
reduced_table <- tab[-which(is.na(tab$nucleotides_CYTB)), ]
view raw read_reduce.R hosted with ❤ by GitHub

Gene names

Now, we need to use consistent nomenclature in our submission, which means using the correct gene and product names. Anyone who has ever searched GenBank for specific genes will know how frustrating it is when one gene is known under half a dozen names. Here I looked up the NCBI organelle resources for the official name of cytb, which is 'CYTB' (all in upper-case). The official name of its product—in this case the protein it codes for—is "cytochrome b" (all lower case).

You'll need to also be aware of general usage patterns, as you'll notice that COI is officially called 'COX1' on this list, but very few people use this name when submitting. For nuclear gene names, take a look here and here .

gene_name <- "CYTB"
prod_name <- "cytochrome b"
view raw prod_name.R hosted with ❤ by GitHub

Source modifiers

Here, we write a GenBank formatted fasta file containing our important source modifying annotations, specifically:
  1. The 'Sequence_ID', in our case the number of the tissue sample under 'otherCatalogNumbers' in our spreadsheet pasted together with the gene name to make it unique for each gene.
  2. Organism name made by pasting 'genus' and 'specificEpithet' fields together.
  3. The 'Bio_material' code, which is the code number used in the lab tissue collection, i.e. the same as our 'otherCatalogNumbers' field.
  4. The specimen voucher code made by pasting together the 'institutionCode' and the 'catalogNumber' fields.
  5. The genomic location of our sequence, in this case it is 'mitochondrion' (leave this blank for nuclear DNA).
  6. The genetic code to translate the sequences, here '2' is the mitochondrial genetic code (again, leave this blank for nuclear DNA)
There are many more source modifiers available. See the list here.

fasta_description <- paste0(">", paste0(reduced_table$otherCatalogNumbers, "_", gene_name), " ", #
"[organism=", reduced_table$genus, " ", reduced_table$specificEpithet, "]", " ", #
"[Bio_material=", reduced_table$otherCatalogNumbers, "]", " ", "[Specimen-voucher=", #
reduced_table$institutionCode, ":", reduced_table$catalogNumber, "]", " ", "[location=mitochondrion] [mgcode=2]")
fasta_complete <- paste(fasta_description, reduced_table$nucleotides_CYTB, sep="\n")# add data to fasta
write(fasta_complete, file="sequences.fsa", append=FALSE)# write out the fasta file
view raw write_fasta.R hosted with ❤ by GitHub

Here's the result:

>BB-002_CYTB [organism=Boops boops] [Bio_material=BB-002] [Specimen-voucher=MNHN:1978-0632] [location=mitochondrion] [mgcode=2]
ATGGCTAGCCTTCGAAAAACGCACCCCCTATTAAAAATTGCTAATCACGCATTAGTTGATCTCCCTGCACCCTCCAATATTTCCGTCTGATGAAATTTTGGCTCCCTGCTTGGCCTCTGTCTTATTTCCCAGCTCCTTACAGGGCTATTCCTCGCCATACACTATACCTCCGATATCGCTACAGCCTTCTCTTCCGTTGCC
>BB-003_CYTB [organism=Boops boops] [Bio_material=BB-003] [Specimen-voucher=MNHN:1978-0632] [location=mitochondrion] [mgcode=2]
ATGGCTAGCCTTCGAAAAACGCACCCCCTATTAAAAATTGCTAATCACGCATTAGTTGATCTCCCTGCACCCTCCAATATTTCCGTCTGATGAAATTTTGGCTCCCTGCTTGGCCTCTGTCTTATTTCCCAGCTCCTTACAGGGCTATTCCTCGCCATACACTATACCTCCGATATCGCTACAGCCTTCTCTTCCGTTGCC

Feature tables

Now we create the feature table containing the locations of attributes of the sequence. The table is tab separated. Pay attention to the angle brackets '< >', as these signify that the coding sequence starts or ends outside of range of our nucleotides, i.e. if it is a partial sequence that does not code for the complete gene you may need to use the angle brackets. Here, our sequence starts on the first base of cytb, but it is only a partial sequence, so we use the angle bracket at the end only.

We are also assuming here that your coding sequence is in the correct reading frame, in other words that the first nucleotide in the alignment corresponds to the first nucleotide of the amino acid. While it is possible to specify the coding sequence to start on the second or third bases, please never do this, because this information is lost when the sequences are downloaded as fasta files from GenBank, and it then makes the very simple task of aligning your sequences using codon models difficult without manually correcting these errors.

feature_tab <- paste0(paste0(">Feature", " ", reduced_table$otherCatalogNumbers, "_", gene_name),"\n", #
"1", "\t", ">", nchar(reduced_table$nucleotides_CYTB), "\t", "gene", "\n", #
"\t", "\t", "\t", "gene", "\t", gene_name, "\n", #
"1", "\t", ">", nchar(reduced_table$nucleotides_CYTB), "\t", "CDS", "\t", "\t", "\n", #
"\t", "\t", "\t", "product", "\t", prod_name, "\n", #
"\t", "\t", "\t", "codon_start", "\t", "1")
write(feature_tab, file="features.tbl", append=FALSE)# write out
view raw feature_table.R hosted with ❤ by GitHub

Here's the result:

>Feature BB-002_CYTB
1 >201 gene
gene CYTB
1 >201 CDS
product cytochrome b
codon_start 1
>Feature BB-003_CYTB
1 >201 gene
gene CYTB
1 >201 CDS
product cytochrome b
codon_start 1
Repeat for 16S

That's it, all done for cytb. Now we repeat for our 16S sample, and append the new data to the files we already made. Remember we are overwriting our previous objects, but that's okay as these were already written to disk.

reduced_table <- tab[-which(is.na(tab$nucleotides_16S)), ]
gene_name <- "rRNA"
prod_name <- "16S ribosomal RNA"
# note that we deleted the genetic code element, as this is not a coding sequence
# also note that 'append' is now set to 'TRUE' to add the data to previously written files
fasta_description <- paste0(">", paste0(reduced_table$otherCatalogNumbers, "_", gene_name), #
" ", "[organism=", reduced_table$genus, " ", reduced_table$specificEpithet, "]", " ", #
"[Bio_material=", reduced_table$otherCatalogNumbers, "]", " ", "[Specimen-voucher=", #
reduced_table$institutionCode, ":", reduced_table$catalogNumber, "]", " ", "[location=mitochondrion]")
fasta_complete <- paste(fasta_description, reduced_table$nucleotides_16S, sep="\n")# add data to fasta
write(fasta_complete, file="sequences.fsa", append=TRUE)# write out
#
feature_tab <- paste0(paste0(">Feature", " ", reduced_table$otherCatalogNumbers, "_", gene_name),"\n", #
"<1", "\t", ">", nchar(reduced_table$nucleotides_16S), "\t", gene_name, "\n", #
"\t", "\t", "\t", "product", "\t", prod_name, "\n")
write(feature_tab, file="features.tbl", append=TRUE)# write out
view raw 16S.R hosted with ❤ by GitHub

Here's the resulting fasta file and feature table for both genes:

>BB-002_CYTB [organism=Boops boops] [Bio_material=BB-002] [Specimen-voucher=MNHN:1978-0632] [location=mitochondrion] [mgcode=2]
ATGGCTAGCCTTCGAAAAACGCACCCCCTATTAAAAATTGCTAATCACGCATTAGTTGATCTCCCTGCACCCTCCAATATTTCCGTCTGATGAAATTTTGGCTCCCTGCTTGGCCTCTGTCTTATTTCCCAGCTCCTTACAGGGCTATTCCTCGCCATACACTATACCTCCGATATCGCTACAGCCTTCTCTTCCGTTGCC
>BB-003_CYTB [organism=Boops boops] [Bio_material=BB-003] [Specimen-voucher=MNHN:1978-0632] [location=mitochondrion] [mgcode=2]
ATGGCTAGCCTTCGAAAAACGCACCCCCTATTAAAAATTGCTAATCACGCATTAGTTGATCTCCCTGCACCCTCCAATATTTCCGTCTGATGAAATTTTGGCTCCCTGCTTGGCCTCTGTCTTATTTCCCAGCTCCTTACAGGGCTATTCCTCGCCATACACTATACCTCCGATATCGCTACAGCCTTCTCTTCCGTTGCC
>BB-001_rRNA [organism=Boops boops] [Bio_material=BB-001] [Specimen-voucher=MNHN:1978-0632] [location=mitochondrion]
TATGGAGCTTAAGACGCCAGGGCAGCTCACGTTAAACGCCCCTAATAAAGGAATAAAACCTAGTGAATCCTGCTCTAATGTCTTTGGTTGGGGCGACCACGGGGAATCATAAAACCCCCACGTGGAATGGGAGCACCACACTCCTAAACCCAAGAGCTTCCGCTCTAATGAACAGAACTTCTGGCCATATTAGATCCGGT
>BB-003_rRNA [organism=Boops boops] [Bio_material=BB-003] [Specimen-voucher=MNHN:1978-0632] [location=mitochondrion]
TATGGAGCTTAAGACGCCAGGGCAGCTCACGTTAAACGCCCCTAATAAAGGAATAAAACCTAGTGAATCCTGCTCTAATGTCTTTGGTTGGGGCGACCACGGGGAATCATAAAACCCCCACGTGGAATGGGAGCACCACACTCCTAAACCCAAGAGCTTCCGCTCTAATGAACAGAACTTCTGGCCATATTAGATCCGGT
view raw sequences.fsa hosted with ❤ by GitHub
>Feature BB-002_CYTB
1 >201 gene
gene CYTB
1 >201 CDS
product cytochrome b
codon_start 1
>Feature BB-003_CYTB
1 >201 gene
gene CYTB
1 >201 CDS
product cytochrome b
codon_start 1
>Feature BB-001_rRNA
<1 >200 rRNA
product 16S ribosomal RNA
>Feature BB-003_rRNA
<1 >200 rRNA
product 16S ribosomal RNA
view raw features.tbl hosted with ❤ by GitHub
Author info template

Next, we need to generate a file for the author/publication information. This is done online, and you simply type the details into the Web form here and download the file into your working directory. Easy.

Running 'tbl2asn'

We have generated three files so far: 'sequences.fsa' contains our DNA sequences and source modifiers, 'features.tbl' contains the locations of the features in our data, and 'template.sbt' contains our author and publication information. Now, we need to pass these files to the 'tbl2asn' program.

The latest version of the program is available from the NCBI via their FTP site (versions exist for Windows, Mac, and Linux). If you find your institution blocks FTP connections such as this, you can find the tool as part of the NCBI's published toolkit—simply 'sudo apt-get install ncbi-tools-bin' to install in Ubuntu—but this version is over a year old, so I do not know if its output would be acceptable for GenBank today. Run this program from the terminal as follows:

# Here we use -a flag to specify fasta file format type,
# the -V flag to request verification (v) and a GenBank flatfile as part of the output (b),
# and the -T flag to tell the program to generate the higher taxonomic classifications for our record.
# Use 'tbl2asn --help' for a full list of the options
# To run if in PATH
tbl2asn -t template.sbt -i sequences.fsa -f features.tbl -a s -V vb -T
# To run if local
./tbl2asn -t template.sbt -i sequences.fsa -f features.tbl -a s -V vb -T
view raw tlb2asn.sh hosted with ❤ by GitHub

Output

When the program has run (it's quick), it outputs a series of files. This includes one called 'errorsummary.val', which is important, as it contains a list of errors. In this case, we have none, and the file is empty. The two main output files are called 'sequences.gbf', and 'sequences.sqn'. The gbf file is a GenBank flatfile, the same as you would access through GenBank. This file is designed to be human-readable, so you need to review this in order to check you are happy with everything. You can also read the flatfile into Geneious, and also check there for any errors.

Here's an example:

LOCUS BB-002_CYTB 201 bp DNA linear VRT 24-APR-2015
DEFINITION Boops boops.
ACCESSION
VERSION
KEYWORDS .
SOURCE mitochondrion Boops boops
ORGANISM Boops boops
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Actinopterygii; Neopterygii; Teleostei; Neoteleostei;
Acanthomorphata; Eupercaria; Spariformes; Sparidae; Boops.
REFERENCE 1 (bases 1 to 201)
AUTHORS Mouse,M.
TITLE Phylogeographic structure of Boops boops
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 201)
AUTHORS Mouse,M.
TITLE Direct Submission
JOURNAL Submitted (24-APR-2015) Disney, Hollywood, Los Angeles, CA, USA
FEATURES Location/Qualifiers
source 1..201
/organism="Boops boops"
/organelle="mitochondrion"
/mol_type="genomic DNA"
/specimen_voucher="MNHN:1978-0632"
/bio_material="BB-002"
/db_xref="taxon:36219"
gene 1..>201
/gene="CYTB"
CDS 1..>201
/gene="CYTB"
/codon_start=1
/transl_table=2
/product="cytochrome b"
/translation="MASLRKTHPLLKIANHALVDLPAPSNISVWWNFGSLLGLCLISQ
LLTGLFLAMHYTSDIATAFSSVA"
ORIGIN
1 atggctagcc ttcgaaaaac gcacccccta ttaaaaattg ctaatcacgc attagttgat
61 ctccctgcac cctccaatat ttccgtctga tgaaattttg gctccctgct tggcctctgt
121 cttatttccc agctccttac agggctattc ctcgccatac actatacctc cgatatcgct
181 acagccttct cttccgttgc c
//
view raw cytb.gbf hosted with ❤ by GitHub
Submit

If all looks okay, you are ready to submit to GenBank! This is as simple as emailing the 'sequences.sqn' file to GenBank staff (gb-sub@ncbi.nlm.nih.gov). In a couple of days, you should receive via email your shiny new accession numbers, although it may take several further weeks before the sequences go live.

Have a go ...

If you wish to try this out yourself, all the files to repeat this example (except tbl2asn) can be found at https://github.com/boopsboops/genbank-submit. For the time being you will have to adapt these R scripts for your own needs, but one day I might wrap this up into a nice function.


While this is okay for a small-medium sized project, you might want to be thinking about relational databases for more complicated structures linking several projects.

1 comment:

  1. Thank you. I was just able to modify your example to work with our sequences, once we have everything in a table Genbank submission should be relatively painless! Huge help, can't thank you enough.

    ReplyDelete