Prokaryotic and Eukaryotic Genomes Submission Guide

Both WGS and non-WGS genomes, including gapless complete bacterial chromosomes, can be submitted via the Submission Portal. You will be asked to choose whether the genome being submitted is considered WGS or not. The differences for GenBank purposes are:

Each chromosome in a non-WGS genome is in a single piece and there are no extra sequences. A WGS genome may still have chromosomes in multiple pieces and/or unplaced sequences.

non-WGS

WGS

In both cases

Type of submission

Submit a single genome

This is the simplest submission route because you just fill in a web form in the Submission Portal and upload fasta (or sqn) files of the genome sequences. You will need to:

Submit a batch of genomes

This submission route allows you to submit as many as 400 WGS or non-WGS genomes in a single batch submission. In this route you choose Batch/multiple in the Genome Submission Portal, fill in the web form, upload a Genome Info file with genome metadata, and upload or preload fasta files (or sqn files if there is annotation) of the genome sequences. All the genomes within a batch must:

You will need to:

Events

  1. Register the genome sequencing project with the BioProject and BioSample databases before submission only if you will be submitting a genome with annotation and you have not yet registered a BioProject and BioSample for this genome. Registering in advance allows a locus_tag prefix to be assigned to the BioProject:BioSample pair. If you have already registered a BioProject and BioSample for this genome, e.g., when submitting the reads to SRA, then a locus_tag prefix should already have been assigned. A file of the locus_tag prefix(es) for the BioSamples within a BioProject is linked to the BioProject submission. Write to genomes@ncbi.nlm.nih.gov if you did not receive a locus_tag prefix. Do not register a duplicate BioProject or BioSample for the same genome. Provide these preregistered BioProject and BioSample accessions in the genome submission. Remember that annotation is optional for genome submissions. If you are submitting a genome without annotation, even if you will be requesting PGAP annotation, then you will create the BioSample (and BioProject, if necessary) during the genome submission. Genomes sequenced as part of the same research effort can belong to a single BioProject, so it is common to create a BioProject during the submission of one genome and then include that BioProject during the submission of additional genomes.
  2. Make the genome assembly data files.
    • Unannotated genomes just need fasta files
    • Annotated genomes need .sqn files, which you make by running the command-line program table2asn (the replacement for tbl2asn) and then fixing the Errors and Fatals that are indicated in the .val and .dr files. Failure to fix them will cause serious delays in processing.
  3. If you have higher-level assembly information, scaffolds and/or chromosomes, then generate an AGP file to build those objects from the wgs-contigs.
  4. If you are submitting a batch of genomes (maximum of 400 per batch), then create a Genome Info file. Note that for batch submissions all chromosome and plasmid assignment information must be included in the header of the relevant fasta sequence, as described in the 'see details' section of the Additional requirements for batch submissions.
  5. Submit via the Genome Submission Portal.
  6. What happens after submission

Submission Files

Fasta files

Put the sequences file into fasta format
IMPORTANT Additional requirements for batch submissions

[1] All the sequences of a single genome must be in one file.

[2] The chromosome, plasmid, and organelle assignment information must be encoded in the input files of a batch submission, as described in these details:

.sqn files

These are generally required only when the submitter wants to include annotation. Annotation is optional for GenBank genome submissions.

Prepare a .sqn file for submission using table2asn. table2asn reads a template file along with the fasta sequence and annotation table files, and outputs an ASN (.sqn) file for submission to GenBank. Follow these three steps:

  1. Prepare data files

Prepare fasta files as above, with one file per genome.

Prepare these additional files:

  2. Run table2asn

A. Annotation is in GenBank-specific GFF files: follow the instructions for GFF files.

B. Annotation is in .tbl files: follow these instructions. Note that a few of the arguments in table2asn have changed relative to tbl2asn, e.g., -indir instead of -p. The table2asn page provides more details. Here are the instructions for creating annotated genome files when the annotation is in .tbl files:

Sample command line when the sequences are contigs (overlapping reads with no Ns representing gaps) is

table2asn -indir path_to_files -t template -M n -Z

If the sequences contain Ns that represent gaps, then run the appropriate table2asn command line with the -l and -gaps-min arguments, as described in the Gapped Genome Submission page. The command line for the most common situation (runs of 10 or more Ns represent a gap, and there are no gaps of completely unknown size, and the evidence for linkage across the gaps is "paired-ends") is:

table2asn -indir path_to_files -t template -M n -Z -gaps-min 10 -l paired-ends
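Before choosing between the two command lines above, it can help to check whether your sequences actually contain runs of Ns long enough to count as gaps. The following is a minimal sketch, not an NCBI tool: the function names `read_fasta` and `find_gap_runs` are hypothetical, and the default threshold of 10 simply mirrors the `-gaps-min 10` example.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    seq_id = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            header = line[1:].split()
            # The sequence ID is the first whitespace-delimited token of the defline.
            seq_id = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def find_gap_runs(sequence: str, min_len: int = 10) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) coordinates of N runs >= min_len."""
    pattern = re.compile(r"[Nn]{%d,}" % min_len)
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]


def sequences_with_gaps(lines: Iterable[str], min_len: int = 10) -> List[str]:
    """List the IDs of sequences that contain at least one qualifying N run."""
    return [sid for sid, seq in read_fasta(lines) if find_gap_runs(seq, min_len)]
```

To use it, pass an open file handle, for example `sequences_with_gaps(open("genome.fsa"))`, where `genome.fsa` is a placeholder file name. If the result is empty, the ungapped command line is likely the one to use; otherwise consult the Gapped Genome Submission page for the appropriate arguments.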

For either case you can include the source information in the definition line of each contig, as described in the fasta defline components section, above. Alternatively, the organism and strain (or breed or isolate) can be included with -j in the table2asn command line. The additional source qualifiers will be obtained from the registered BioSample. However, chromosome, plasmid, and organelle assignment information must be included in the fasta definition lines. In addition, if the submission is an annotated prokaryotic genome, then include the genetic code with -j in the command line, for example:

table2asn -indir path_to_files -t template -M n -Z -j "[organism=Clostridium difficile ABDC] [strain=ABDC] [gcode=11]"
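When the same command has to be assembled for many genomes, building the argument list in a script avoids quoting mistakes in the -j string. The sketch below only assembles the arguments shown in the sample command lines above (-indir, -t, -M n, -Z, -gaps-min, -l, -j); the function name `build_table2asn_command` and its parameters are hypothetical, and it does not run table2asn itself.

```python
import shlex
from typing import Dict, List, Optional


def build_table2asn_command(
    indir: str,
    template: str,
    source_qualifiers: Optional[Dict[str, str]] = None,
    gaps_min: Optional[int] = None,
    linkage_evidence: Optional[str] = None,
) -> List[str]:
    """Assemble a table2asn argument list from the options used in this guide."""
    cmd = ["table2asn", "-indir", indir, "-t", template, "-M", "n", "-Z"]
    if gaps_min is not None:
        cmd += ["-gaps-min", str(gaps_min)]
    if linkage_evidence is not None:
        cmd += ["-l", linkage_evidence]
    if source_qualifiers:
        # Qualifiers are written as space-separated [key=value] pairs in one string.
        joined = " ".join(f"[{key}={value}]" for key, value in source_qualifiers.items())
        cmd += ["-j", joined]
    return cmd


if __name__ == "__main__":
    command = build_table2asn_command(
        "path_to_files",
        "template",
        source_qualifiers={"organism": "Escherichia coli", "strain": "ABC1", "gcode": "11"},
        gaps_min=10,
        linkage_evidence="paired-ends",
    )
    # shlex.join quotes the -j value so the line can be pasted into a shell.
    print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run` if table2asn is installed and on your PATH; passing a list rather than a single string means the -j value needs no manual quoting.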

Here are some commonly used arguments when there is no annotation or when the annotation input is .tbl file:

| Option | Description |
| --- | --- |
| -M n | Runs genome-specific functions and the validator, and fixes some known product name problems. |
| -Z | Runs the sequence discrepancy report, which looks for subtle inconsistencies within a set of related records, and outputs a file with the .dr suffix. See the Discrepancy Report page for information about its output. NOTE: this argument is changed from tbl2asn because it no longer requires (or accepts) an output file name. |
| -t template.sbt | Specifies the template file (.sbt), which can be created at GenBank Submission Template. If the .sbt file is in a different directory, the full path must be specified. |
| -j | Allows the addition of source qualifiers that are the same for every sequence in every fasta file being read. Examples: -j "[organism=Mus musculus] [tissue-type=liver]" and -j "[organism=Escherichia coli] [strain=ABC1] [gcode=11]" |
| -V b | Generates a GenBank flatfile with a .gbf suffix. This file is only for viewing; it is not for submission. Adding this option could slow table2asn, so you may choose to include it only for the first run to make sure that the annotation looks as expected. |
| -c s | Adds an exception to every CDS with an intron shorter than 11 bp. Adds /artificial_location="low-quality sequence region" to the CDS, allowing the CDS to pass the ShortIntron error, and causes the protein definition line to be prefaced with "LOW QUALITY PROTEIN:". This option should only be used if you are confident that the protein translation is correct. Do not use short introns to force a translation containing frameshifts or large deletions. |
| -Y File_name | Imports a file that is a text comment. |
| -w assembly.cmt | Imports a Structured Comment Table. This is optional, but can be helpful when there are multiple genomes, because there will be less information to supply on the web form during submission. This file can be created at Structured Comment Template. |

  3. Check the output of the validation and discrepancy report and fix problems

A. Check the .stats file for the number, severity and type of errors that are present in the .val files. All Errors and Rejects need to be fixed. The presence of errors will slow processing. See the genome validation errors for guidance. Contact genomes@ncbi.nlm.nih.gov with any questions about the validation output. During processing there may be some questions about other aspects of the submission.

B. Check the .dr file for the results of the discrepancy report. Categories prefaced with FATAL are nearly always unacceptable and must be fixed. (The exceptions are FATALs about bacteria when the genome is not bacterial.) Some of the categories are informational, for example PROTEIN_NAMES: All proteins have same name "hypothetical protein". Reports that are not flagged as fatal should be examined to determine if they represent annotation artifacts that need to be corrected or if they are acceptable due to the biology of the genome. See the discrepancy report examples and explanations and common discrepancy reports for guidance. Write to genomes@ncbi.nlm.nih.gov and send the .dr file with questions about this report.

Some common discrepancy reports of which to be aware:

C. Make any necessary fixes to the input .fsa and/or .tbl files and run table2asn again.
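Because FATAL categories in the .dr file must nearly always be fixed before submission, a quick scan for them after each table2asn run can save a round trip. This is a minimal sketch that assumes, based only on the description above, that such categories appear on lines beginning with the word FATAL; the exact layout of the .dr file is not specified here, so treat the matching rule and the function name `fatal_lines` as assumptions.

```python
from typing import Iterable, List


def fatal_lines(report_lines: Iterable[str]) -> List[str]:
    """Return the report lines that start with FATAL (assumed .dr convention)."""
    return [line.strip() for line in report_lines if line.lstrip().startswith("FATAL")]


def summarize(report_lines: Iterable[str]) -> str:
    """Build a short human-readable summary of the FATAL lines found."""
    found = fatal_lines(report_lines)
    if not found:
        return "No FATAL lines found; still review the remaining categories."
    return "%d FATAL line(s) to fix:\n%s" % (len(found), "\n".join(found))
```

For a real report, read the file first, for example `summarize(open("genome.dr"))` with `genome.dr` standing in for your own output file. Categories that are not flagged as FATAL still need to be read by eye, as described in step B.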

AGP file (optional)

AGP files provide the ordering and orientation information to construct scaffolds from contigs, or to construct chromosomes from scaffolds and/or contigs. However, remember that we do accept the gapped scaffolds themselves as the basic sequences of the genome. If you choose to submit a multi-layer submission with an AGP file, then know that the AGP file defines these genome assemblies, so be sure to include in the AGP file all wgs-contigs that are considered to be part of the genome. However, if the sequences in the fasta (or .sqn) files are already the scaffolds or chromosomes, then do not make an AGP file.

See this page for the AGP format.

There are 3 types of AGP files:

Some specific requests are:

You can validate the basic format of your AGP file at http://www.ncbi.nlm.nih.gov/projects/genome/assembly/agp/agp_validate.cgi. In addition, the standalone command-line program agp_validate is available by anonymous FTP so that you can validate the AGP file more extensively yourself. Its -help option details the arguments and command line format.
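To make the idea of building scaffolds from contigs concrete, here is a minimal sketch that writes the rows for one scaffold in the tab-delimited, nine-column layout of AGP version 2.x. It is a simplification under stated assumptions: every contig is used in full (component type W), every gap has the same known length (gap line type N, gap type "scaffold", linked), and the function name, contig names, and lengths are made up for illustration. Always run the result through the AGP validator before submitting.

```python
from typing import List, Sequence, Tuple


def scaffold_agp_lines(
    scaffold: str,
    contigs: Sequence[Tuple[str, int, str]],
    gap_length: int = 100,
    linkage_evidence: str = "paired-ends",
) -> List[str]:
    """Return AGP rows placing contigs on a scaffold with fixed-size gaps.

    contigs is a sequence of (contig_id, contig_length, orientation) tuples,
    where orientation is "+" or "-".
    """
    lines: List[str] = []
    position = 1  # next unused 1-based coordinate on the scaffold
    part = 1      # running part number within the scaffold
    for index, (contig_id, length, orientation) in enumerate(contigs):
        if index > 0:
            # Gap row: object, start, end, part, N, gap length, gap type, linkage, evidence
            gap_end = position + gap_length - 1
            lines.append("\t".join([
                scaffold, str(position), str(gap_end), str(part),
                "N", str(gap_length), "scaffold", "yes", linkage_evidence,
            ]))
            position = gap_end + 1
            part += 1
        # Component row: object, start, end, part, W, contig ID, contig start, contig end, orientation
        end = position + length - 1
        lines.append("\t".join([
            scaffold, str(position), str(end), str(part),
            "W", contig_id, "1", str(length), orientation,
        ]))
        position = end + 1
        part += 1
    return lines
```

Calling `scaffold_agp_lines("scaffold1", [("contig1", 1000, "+"), ("contig2", 500, "-")])` yields three rows: the first contig, a 100-base gap, and the second contig in reverse orientation. Gaps of unknown size and partial components need different values and are not handled by this sketch.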

Genome Info table

The Genome Info table is required for batch submissions and is used to provide the Genome Assembly Data of each genome. You can either fill in the table during the genome submission or prepare the file ahead of time and upload it during the submission. To prepare it ahead of time, download the Genome Info file template. The instructions are on the first tab of this file and the template is on the second tab. Complete the second tab (Genome_Data), then save the worksheet as a Text (Tab-delimited) file -- (use 'File, Save as, Save as type: Text (Tab-delimited)' ).
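After saving the worksheet as tab-delimited text, a quick structural check can catch export problems before upload. The sketch below deliberately makes no claim about the template's column names: it assumes only that the first non-empty row is a header, that each later row describes one genome, and that a batch holds at most 400 genomes. The function name `check_genome_info` is hypothetical.

```python
import csv
import io
from typing import List

MAX_GENOMES_PER_BATCH = 400  # batch limit stated in this guide


def check_genome_info(tsv_text: str) -> List[str]:
    """Return a list of structural problems found in tab-delimited text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # Ignore rows that are entirely blank, which spreadsheets often append.
    rows = [row for row in reader if any(cell.strip() for cell in row)]
    if not rows:
        return ["file contains no rows"]
    header, data = rows[0], rows[1:]
    problems: List[str] = []
    if len(data) > MAX_GENOMES_PER_BATCH:
        problems.append(
            "%d genome rows exceed the limit of %d per batch"
            % (len(data), MAX_GENOMES_PER_BATCH)
        )
    for index, row in enumerate(data, start=1):
        if len(row) != len(header):
            problems.append(
                "genome row %d has %d columns, header has %d"
                % (index, len(row), len(header))
            )
    return problems
```

An empty list means the file has a consistent tabular shape; it does not mean the required fields are filled in correctly, which the Submission Portal checks separately.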

see details of the required and optional information

Each row in the template represents a genome. The required fields are:

Optional fields:

Definitions of these fields are in the Genome Assembly Data section and also as comments in the template itself. Instructions:

Metadata required for all genome submissions

BioProject

The BioProject contains the description of the research effort and relevant grant(s), and has links to the public data for the project. Each genome must belong to a BioProject, and genomes sequenced as part of the same research effort can belong to a single BioProject. Use the same BioProject for the sequence reads and the genome assembly made from those reads; do not create duplicate BioProjects. If a new BioProject is necessary for unannotated (or PGAP-annotated) genomes, then registering during the genome submission process is simplest. However, genomes submitted with annotation will need to be pre-registered so that a locus_tag prefix can be assigned to the BioProject/BioSample pair and used to identify each gene within that genome uniquely. A file of the locus_tag prefix(es) for the BioSamples within a BioProject is linked to the BioProject submission. Write to genomes@ncbi.nlm.nih.gov if you did not receive a locus_tag prefix after preregistering a BioSample for your BioProject.

BioSample

The BioSample contains the source information of the sample that was sequenced. Use the same BioSample for the sequence reads and genome assembly made from those reads; do not create duplicate BioSamples. Registering a new BioSample can be done during the genome submission process for unannotated (or PGAP-annotated) genomes; however, genomes submitted with annotation will need to be pre-registered to get a locus_tag prefix. Include the registered BioProject when you register the BioSample so that a locus_tag prefix is assigned to the pair. You'll find the locus_tag assignment(s) in a file linked to the BioProject submission.

During processing of the genome the relevant information from the genome and BioSample will be merged so that they are in agreement. If the genome and BioSample have a conflict in the value of an attribute, we will stop and ask the submitter to clarify what the correct value is. There is some extra validation in GenBank compared to BioSample, e.g., ‘altitude’ is defined as being in meters in BioSample but it must have ‘m’ present in the GenBank genome. We will fix simple issues like this.
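Since a conflict between genome and BioSample attribute values pauses processing, you may want to compare the two sets of values yourself before submitting. The following sketch is a submitter-side pre-check only and does not reproduce NCBI's merge logic; the function name `find_conflicts` and the attribute dictionaries are illustrative assumptions.

```python
from typing import Dict, Tuple


def find_conflicts(
    genome_attrs: Dict[str, str],
    biosample_attrs: Dict[str, str],
) -> Dict[str, Tuple[str, str]]:
    """Return attributes present in both records whose values differ.

    Values are compared after trimming surrounding whitespace; attributes
    found in only one record are not treated as conflicts.
    """
    conflicts: Dict[str, Tuple[str, str]] = {}
    for key in sorted(genome_attrs.keys() & biosample_attrs.keys()):
        genome_value = genome_attrs[key].strip()
        biosample_value = biosample_attrs[key].strip()
        if genome_value != biosample_value:
            conflicts[key] = (genome_value, biosample_value)
    return conflicts
```

The comparison is intentionally strict, so a unit difference such as "100" versus "100 m" for altitude is reported; as noted above, simple issues of that kind are fixed during processing, so such a hit is informational rather than a blocker.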

Genome Assembly Data and other information about a genome assembly

Gap Information: What the Ns represent

Chromosome and plasmid assignments

Plasmid and chromosome names rules

Chromosome and plasmid names can contain only digits, dots, underscores, and plain-text ASCII letters of the standard English alphabet. In addition, there are rules specific to each.
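The shared character rule can be checked with a short regular expression. This sketch covers only the general rule stated above (ASCII letters, digits, dots, and underscores); it does not implement the additional chromosome-specific or plasmid-specific rules, and the function name `has_allowed_characters` is hypothetical.

```python
import re

# ASCII letters, digits, dot, and underscore only; at least one character.
NAME_PATTERN = re.compile(r"[A-Za-z0-9._]+")


def has_allowed_characters(name: str) -> bool:
    """Return True if the name uses only the generally permitted characters."""
    return NAME_PATTERN.fullmatch(name) is not None
```

A name with a space or an accented letter fails this check, while a name that passes can still violate one of the rules specific to chromosomes or plasmids.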

Chromosome names
Plasmid names

Submit the genomes to the Genome Submission Portal

All files must be submitted via the Genome Submission Portal. Choose "Single genome" or "Batch/multiple genomes". Answer the questions and upload the necessary files. Review the summary page and click the "Submit" button. The submission will be given a 'SUB' temporary identifier, which you can use in correspondence before an accession number is assigned to the genome submission.

What happens next

Once we receive your genome submission, several automated validations are run and a member of our staff conducts an initial review. If no significant issues are found, the genome will be assigned an accession number.

If there are problems

The submitted files will be marked in the submission portal as "Error" and you will receive an email with details of the problems. Errors found in the automated validations are automatically reported back to the submission portal and an email is automatically sent to the submitter with instructions on how to proceed. In addition, a member of our staff conducts an initial review of each submission and reviews several additional validations. The problems, including those described in the Fix problems section, could be:

Once you have made the fixes, log back into the Genome Submission Portal, retrieve that submission by its 'SUB' identifier, and click the "FIX" button of that submission. You will be back in the original submission and will need to delete the files that are marked as having errors, and then upload new files in their place.

Once your submission is assigned an accession number it undergoes a thorough review by our staff. This review is critical because we are striving to present genome annotation in an accurate and consistent manner so that database users can make maximum use of the data. If we encounter problems during this review, we will contact you by email.

Submission statuses in the submission portal

If you elected to hold your genome until a particular date (or publication, whichever is first), we ask that you provide us with the expected publication date and also notify us in a timely manner of the upcoming publication and the relevant citation details. This will allow us to coordinate the release of your genome with the appearance of the paper. Please provide at least two weeks' notice of any upcoming publication.

NOTE: As of January 2017, genomes will be released on their release date without additional communication, as is the normal GenBank policy. Be sure to request an extension of the release date if the genome is not yet published and you wish to continue to keep it confidential.

Requesting PGAP annotation of prokaryotic genomes

Requesting annotation by the Prokaryotic Genomes Annotation Pipeline is a step during submission of the genome to GenBank. Prepare a regular GenBank genome submission and request PGAP annotation during the submission process by clicking on the box "Annotate this prokaryotic genome in the NCBI Prokaryotic Annotation Pipeline before being released". The annotated genome will be posted back to the Submission Portal for your review. You may edit the file and resubmit it to GenBank; however, this is not required and is generally not recommended, as it will slow processing and may introduce problems that you would need to fix.

Running PGAP yourself

If you would like to annotate your prokaryotic genome with the NCBI Prokaryotic Genomes Annotation Pipeline (PGAP) before or without submitting your data to GenBank, NCBI has made an external version available for you to download and run. It will generate a GenBank-compliant annotated genome that is submission-ready. If you are interested in running PGAP yourself, please see the NCBI Insights announcement and find more details at github, or see this short video.

After your genome is annotated using external PGAP, you may choose to submit it to GenBank: