NCBI Genomes FTP

The Genomes FTP site offers a consistent core set of files for the genome sequence and annotation products of all organisms and assembled genomes in scope. It supports download needs including:

Retrieving the genome sequence for a specific assembled genome
Retrieving GenBank or RefSeq gene, RNA, and protein annotation for a specific organism and a specific assembled genome, or for a specific RefSeq annotation release
Retrieving annotation in GenBank flat file, GFF3, or GTF format
Retrieving RefSeq annotation for mitochondria, plastids, and plasmids
Retrieving assembly summary reports containing metadata for the latest and historical GenBank and RefSeq assembled genomes
Retrieving a matching set of sequence (FASTA) and annotation (GFF3 or GTF) files with identical sequence identifiers to facilitate reproducible analyses
Verifying that downloaded content is complete using MD5 checksums
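
The MD5 check mentioned in the last item can be scripted. The following is a minimal sketch in Python, not an NCBI tool. It assumes the checksum listing uses the common md5sum layout (a hex digest, whitespace, then a relative path such as ./file.gz) and that the listing is saved locally as md5checksums.txt; both are assumptions, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_listing(text):
    """Map relative file name -> expected digest.

    Assumption: md5sum-style lines of the form '<digest>  ./<path>'.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        name = name.strip().lstrip("*")  # drop md5sum binary-mode marker
        if name.startswith("./"):
            name = name[2:]
        expected[name] = digest.lower()
    return expected


def verify_directory(directory, listing_name="md5checksums.txt"):
    """Return {file name: True/False} for listed files that exist locally."""
    directory = Path(directory)
    expected = parse_md5_listing((directory / listing_name).read_text())
    results = {}
    for name, digest in expected.items():
        target = directory / name
        if target.is_file():
            results[name] = md5_of_file(target) == digest
    return results
```

Files that are listed but were not downloaded are skipped rather than reported as failures, which suits partial downloads; a stricter check could treat them as errors.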

FAQs

What is the easiest way to download data for one or more assembled genomes?

Using NCBI Datasets. This is the most user-friendly way to download genome data. Please see NCBI Datasets Documentation for more details.
From the Genomes FTP site: Users interested in additional files that are currently not included in the Datasets package (see table) can browse the Genomes FTP site to download them piecemeal or download in bulk using command-line tools such as lftp and rsync.
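
As a rough illustration of the bulk-download route, the sketch below builds an rsync command for one directory. It assumes the host also serves the same paths over the rsync:// scheme and that the flags shown (all standard rsync options) are appropriate for your use; the function names are hypothetical. By default it only prints the command instead of running it.

```python
import shlex
import subprocess


def to_rsync_url(url):
    """Swap an https://, http://, or ftp:// scheme for rsync://.

    Assumption: the same host and path are reachable over rsync.
    """
    for scheme in ("https://", "http://", "ftp://"):
        if url.startswith(scheme):
            return "rsync://" + url[len(scheme):]
    raise ValueError(f"unsupported URL scheme: {url}")


def build_rsync_command(remote_dir_url, local_dir):
    """Return the rsync argument list for mirroring one remote directory."""
    # A trailing slash makes rsync copy the directory contents.
    url = to_rsync_url(remote_dir_url).rstrip("/") + "/"
    return ["rsync", "--copy-links", "--recursive", "--times", "--verbose",
            url, local_dir]


def download_directory(remote_dir_url, local_dir, dry_run=True):
    """Print the command (dry run) or execute it with subprocess."""
    command = build_rsync_command(remote_dir_url, local_dir)
    if dry_run:
        print(shlex.join(command))
    else:
        subprocess.run(command, check=True)
    return command
```

Calling download_directory(url, "my_genome", dry_run=False) would run rsync, which must be installed locally.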

How can I download only the current annotation for an organism?

Most users will want to download data only for the latest annotation for the reference assembled genome of an organism. This data is available in NCBI Datasets.

How can I stay informed about changes to the NCBI genomes FTP site?

Subscribe to the Genomes-announce mailing list or follow the NCBI Insights blog.

Are files on the FTP site updated following annotation updates?

All new annotation releases are published to the Genomes FTP site.

How can I download older annotation files?

NCBI Datasets delivers the latest annotation for any assembled genome version. In some cases, when an assembled genome is annotated multiple times by NCBI and users need data for a specific older annotation release for that assembled genome, they can download it from the annotation_releases directory on the Genomes FTP site.

What is the file content within each specific assembled genome directory?

Directories for all current assembled genomes, and for many previous versions, include a core set of files, plus additional files relevant to the specific assembled genome. Directories for old assembled genome versions that predate the Genomes FTP site reorganization contain only the assembly report, assembly stats, and assembly status files (see table).

Table: Sequence and Annotation Files Available on Genomes FTP

File	Format	Description
*_ani_contam_ranges.tsv [G/R]	Tab-delimited text	Reports potentially contaminated regions in the assembly identified based on Average Nucleotide Identity (ANI).
*_ani_report.txt [G/R]	Tab-delimited text	Reports Average Nucleotide Identity (ANI)-based evaluation of the taxonomic identity of the assembly.
*_assembly_regions.txt [R]	Tab-delimited text	Reports the location of genomic regions and lists the alt/patch scaffolds placed within those regions.
*_assembly_report.txt [G/R]	Tab-delimited text	Reports the name, role, and sequence accession.version for objects in the assembly.
*_assembly_stats.txt [G/R]	Tab-delimited text	Reports statistics for the assembly.
*_cds_from_genomic.fna.gz [D/G/R]	FASTA	Nucleotide sequences corresponding to all CDS features annotated on the assembly, based on the genome sequence.
*_fcs_report.txt [G/R]	Tab-delimited text	Reports potentially contaminated regions in the assembly identified by the Foreign Contamination Screen (FCS).
*_feature_count.txt.gz [G/R]	Tab-delimited text	Reports counts of gene, RNA, CDS, and similar features based on data reported in the *_feature_table.txt.gz file.
*_feature_table.txt.gz [G/R]	Tab-delimited text	Reports locations and attributes for a subset of annotated features.
*_gene_expression_counts.txt.gz [R]	Tab-delimited text	Reports counts of RNA-seq reads mapped to each gene.
*_normalized_gene_expression_counts.txt.gz [R]	Tab-delimited text	Reports normalized counts (TPM) of RNA-seq reads mapped to each gene.
*_gene_ontology.gaf.gz [R]	GO Annotation File (GAF)	Gene Ontology (GO) annotation of the annotated genes.
*_genomic.fna.gz [D/G/R]	FASTA	Genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case.
*_genomic.gbff.gz [G/R]	GenBank flat file	Genomic sequence(s) in the assembly.
*_genomic.gff.gz [D/G/R]	GFF3	Annotation of the genomic sequence(s).
*_genomic.gtf.gz [D/G/R]	GTF	Annotation of the genomic sequence(s).
*_genomic_gaps.txt.gz [G/R]	Tab-delimited text	Reports the coordinates of all gaps in the top-level genomic sequences.
*_protein.faa.gz [D/G/R]	FASTA	Sequences of accessioned protein products annotated on the genome assembly.
*_protein.gpff.gz [G/R]	GenPept flat file	Sequences of accessioned protein products annotated on the genome assembly.
*_pseudo_without_product.fna.gz [R]	FASTA	Genomic sequence corresponding to pseudogene and other gene regions which do not have any associated transcribed RNA products or translated protein products.
*_rm.out.gz [R]	Text	RepeatMasker output (provided for some eukaryotes).
*_rm.run [R]	Text	Documentation of the RepeatMasker version, parameters, and library (provided for some eukaryotes).
*_rna.fna.gz [D/R]	FASTA	Sequences of accessioned RNA products annotated on the genome assembly.
*_rna.gbff.gz [R]	GenBank flat file	RNA products annotated on the genome assembly (provided for RefSeq assemblies as relevant).
*_rna_from_genomic.fna.gz [G/R]	FASTA	Nucleotide sequences corresponding to all RNA features annotated on the assembly, based on the genome sequence.
*_rnaseq_alignment_summary.txt [R]	Tab-delimited text	Reports counts of alignments classified by Subread featureCounts.
*_rnaseq_runs.txt [R]	Tab-delimited text	Information about RNA-seq runs used for gene expression analyses.
*_translated_cds.faa.gz [G/R]	FASTA	Individual CDS features annotated on the genomic records, conceptually translated into protein sequences.
*_wgsmaster.gbff.gz [G]	GenBank flat file	WGS master for the assembly (present only if a WGS master record exists for the sequences in the assembly).
.annotation_hashes.txt [G/R]	Tab-delimited text	Reports hash values for different aspects of the annotation data.
.assembly_status.txt [G/R]	Text	Reports the current status of this assembly version.
.md5checksums.txt [G/R]	Text	File checksums are provided for all data files in the directory.
*_knownrefseq_alns.bam (RefSeq_transcripts_alignments sub-directory) [R]	BAM	Alignments of the annotated Known RefSeq transcripts (identified with accessions prefixed with NM_ and NR_) to the genome.
*_knownrefseq_alns.bam.bai (RefSeq_transcripts_alignments sub-directory) [R]	BAM Index	Index of the BAM alignments of the annotated Known RefSeq transcripts to the genome.
*_modelrefseq_alns.bam (RefSeq_transcripts_alignments sub-directory) [R]	BAM	Alignments of the annotated Model RefSeq transcripts (identified by accessions prefixed with XM_ and XR_) to the genome.
*_modelrefseq_alns.bam.bai (RefSeq_transcripts_alignments sub-directory) [R]	BAM Index	Index of the BAM alignments of the annotated Model RefSeq transcripts to the genome.
*_compare_prev.txt.gz (Annotation_comparison sub-directory) [R]	Tab-delimited text	Annotation comparison report.
*_cross_species_tx_alns.gff.gz (Evidence_alignments sub-directory) [R]	GFF3	Alignments of cDNAs, ESTs, and TSAs from other species to the genomic sequence(s).
*_same_species_tx_alns.gff.gz (Evidence_alignments sub-directory) [R]	GFF3	Alignments of same species cDNAs, ESTs, and TSAs to the genomic sequence(s).
*_gnomon_model.gff.gz (Gnomon_models sub-directory) [R]	GFF3	Gnomon annotation of the genomic sequence(s).
*_gnomon_protein.faa.gz (Gnomon_models sub-directory) [R]	FASTA	Gnomon protein models annotated on the genome assembly.
*_gnomon_rna.fna.gz (Gnomon_models sub-directory) [R]	FASTA	Gnomon transcript models annotated on the genome assembly.
*_graph.bw (RNASeq_coverage_graphs sub-directory) [R]	UCSC BigWig	RNA-seq read coverage graphs.

D: Datasets; G: GenBank; R: RefSeq

* Sample file path with either a GCA or GCF prefix, where each hash sign (#) represents a digit in the actual path: https://ftp.ncbi.nih.gov/genomes/all/GC[A/F]/###/###/###/GC[A/F]_#########_(assembly name)/GC[A/F]_#########_(assembly name)
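
The path pattern above can be assembled programmatically. This is a sketch under stated assumptions: the accession may carry a ".version" suffix (not shown in the pattern above), spaces in the assembly name are replaced with underscores, and the file suffix is one of the names from the table without its leading "*_". The function names are illustrative.

```python
import re

BASE_URL = "https://ftp.ncbi.nih.gov/genomes/all"

# GCA/GCF prefix, nine digits, optional ".version" (assumed to be allowed).
_ACCESSION = re.compile(r"^(GC[AF])_(\d{9})(\.\d+)?$")


def assembly_directory(accession, assembly_name):
    """Build the directory URL for one assembled genome."""
    match = _ACCESSION.match(accession)
    if match is None:
        raise ValueError(f"not a GCA/GCF accession: {accession!r}")
    prefix, digits = match.group(1), match.group(2)
    # Split the nine digits into the three ###/###/### path levels.
    triplets = [digits[i:i + 3] for i in range(0, 9, 3)]
    # Assumption: spaces in assembly names appear as underscores in paths.
    folder = f"{accession}_{assembly_name.replace(' ', '_')}"
    return "/".join([BASE_URL, prefix, *triplets, folder])


def assembly_file(accession, assembly_name, suffix):
    """Build the URL of one file, e.g. suffix='genomic.fna.gz'."""
    directory = assembly_directory(accession, assembly_name)
    folder = directory.rsplit("/", 1)[1]
    return f"{directory}/{folder}_{suffix}"
```

For example, assembly_file("GCF_000001405.40", "GRCh38.p14", "genomic.gff.gz") yields a URL ending in GCF_000001405.40_GRCh38.p14_genomic.gff.gz; whether that file exists still has to be checked against the site.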

Where can I find information to help me choose between the many different assembled genomes for a species?

Many different assembled genomes are available for species with medical, agricultural, or scientific relevance. The Genus_species directories under the “genbank” and “refseq” directory trees each contain an assembly_summary.txt file that provides general information on all assembled genome versions included in the directory such as release date, submitter organization, assembly level, and annotation status. For example, see ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Sulfolobus_islandicus/assembly_summary.txt.

After assembled genomes of interest have been identified using the data from the species-specific assembly_summary.txt file, they can be accessed via the “all_assembly_versions” directory for that species. Alternatively, any assembled genome that the NCBI Reference Sequence Database (RefSeq) has selected as a reference or representative genome can be readily accessed via the directories named “reference” or “representative” in the Genus_species directories under the “genbank” and “refseq” directory trees.
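
Filtering an assembly_summary.txt file can be done with a few lines of Python. The sketch below assumes the file is tab-delimited, that comment lines start with "#", and that the last comment line before the data holds the column names. The column names used in the example (assembly_accession, assembly_level) are assumptions and should be checked against the header of the file you download.

```python
import csv


def read_assembly_summary(text):
    """Parse assembly_summary.txt content into a list of dicts."""
    header = None
    data_lines = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Assumption: the last '#' line before the data is the header.
            if not data_lines:
                header = line.lstrip("# ").split("\t")
            continue
        if line.strip():
            data_lines.append(line)
    if header is None:
        raise ValueError("no header line found")
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(header, row)) for row in reader]


def select(records, **criteria):
    """Keep records whose columns equal all of the given values."""
    return [
        record for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]
```

With records = read_assembly_summary(open("assembly_summary.txt").read()), a call such as select(records, assembly_level="Complete Genome") would narrow the list to one assembly level, assuming that column and value exist in the file.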

Do you provide assembled genome data formatted for use by sequence read alignment pipelines?

Genomic FASTA files with modified sequence identifiers, together with index files convenient for analysis with Next Generation Sequencing tools, are currently provided for the Genome Reference Consortium’s human and mouse assembled genomes GRCh38, GRCm38.p3, and GRCm39. RefSeq annotation in GFF3 and GTF formats, with sequence identifiers matching those in the FASTA files, is also provided to facilitate use in RNA-Seq analysis pipelines.
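
Before running a pipeline, it can be worth confirming that the annotation and sequence files really share identifiers. The sketch below is one way to do that, assuming the FASTA identifier is the first whitespace-delimited token after ">" and that the sequence identifier is the first tab-delimited column of a GFF3 or GTF line. The helper names are illustrative.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def fasta_ids(lines):
    """Collect sequence identifiers from FASTA header lines."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def annotation_seqids(lines):
    """Collect column-1 identifiers from GFF3 or GTF feature lines."""
    seqids = set()
    for line in lines:
        if line.startswith("##FASTA"):
            break  # GFF3 directive: embedded sequence follows, no more features
        if not line.strip() or line.startswith("#"):
            continue
        seqids.add(line.split("\t", 1)[0])
    return seqids


def unmatched_seqids(fasta_lines, annotation_lines):
    """Return annotation identifiers that are absent from the FASTA."""
    return annotation_seqids(annotation_lines) - fasta_ids(fasta_lines)
```

An empty result from unmatched_seqids(open_text(fasta_path), open_text(gff_path)) suggests the two files can be used together; any identifiers returned point to a mismatched pair.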

The four analysis sets provided for GRCh38 (no_alt_analysis_set, full_analysis_set, full_plus_hs38d1_analysis_set, and no_alt_plus_hs38d1_analysis_set) and the two analysis sets provided for GRCm38.p3 (no_alt_analysis_set and full_analysis_set) differ from the corresponding full assembled genomes by one or more of the following:

omission of alternate locus and patch scaffolds that cause complications for sequence read alignment programs that are not alternate contig aware (alt-aware)
hard masking of duplicate copies of pseudo-autosomal regions and centromeric arrays
addition of “decoy” sequences

Index files generated by BWA, Samtools, Bowtie, and HISAT2 are also provided. See the GRCh38 README, GRCm38 README, or GRCm39 README for a full description.

Are repetitive sequences in eukaryotic genomes masked?

Yes. All eukaryotic genome sequences are soft-masked (repetitive sequence is shown in lower case) using WindowMasker or RepeatMasker. For genomes that are masked using RepeatMasker, an additional file with information about the masked regions is also provided (see table).
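
Because soft-masking is encoded as lower-case letters, the masked portion of a genomic FASTA file can be measured, or undone, with simple string operations. This is a minimal sketch with illustrative function names; it counts every lower-case character in sequence lines as masked.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of sequence characters that are lower-case."""
    masked = 0
    total = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # header line, not sequence
        sequence = line.strip()
        total += len(sequence)
        masked += sum(1 for base in sequence if base.islower())
    return masked / total if total else 0.0


def unmask(fasta_lines):
    """Yield lines with sequence upper-cased and headers left unchanged."""
    for line in fasta_lines:
        yield line if line.startswith(">") else line.upper()
```

Tools that treat lower-case bases specially will see different input after unmask, so it is worth checking how a given aligner handles case before converting.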

Generated October 18, 2024