NCBI Genomes FTP

The Genomes FTP site offers a consistent core set of files for the genome sequence and annotation products of all organisms and assembled genomes in scope. It supports download needs including:

FAQs

What is the easiest way to download data for one or more assembled genomes?

How can I download only the current annotation for an organism?

Most users will want to download data only for the latest annotation for the reference assembled genome of an organism. This data is available in NCBI Datasets.

How can I stay informed about changes to the NCBI genomes FTP site?

Subscribe to the Genomes-announce mailing list or follow the NCBI Insights blog.

Are files on the FTP site updated following annotation updates?

All new annotation releases are published to the Genomes FTP site.

How can I download older annotation files?

NCBI Datasets delivers the latest annotation for any assembled genome version. In some cases, when an assembled genome is annotated multiple times by NCBI and users need data for a specific older annotation release for that assembled genome, they can download it from the annotation_releases directory on the Genomes FTP site.

What is the file content within each specific assembled genome directory?

Directories for all current assembled genomes, and for many previous versions, include a core set of files plus additional files relevant to the specific assembled genome. Directories for old assembled genome versions that predate the Genomes FTP site reorganization contain only the assembly report, assembly stats, and assembly status files (see table).

Table: Sequence and Annotation Files Available on Genomes FTP

| File | Format | Description |
| --- | --- | --- |
| *_ani_contam_ranges.tsv [G/R] | Tab-delimited text | Reports potentially contaminated regions in the assembly identified based on Average Nucleotide Identity (ANI). |
| *_ani_report.txt [G/R] | Tab-delimited text | Reports Average Nucleotide Identity (ANI)-based evaluation of the taxonomic identity of the assembly. |
| *_assembly_regions.txt [R] | Tab-delimited text | Reports the location of genomic regions and lists the alt/patch scaffolds placed within those regions. |
| *_assembly_report.txt [G/R] | Tab-delimited text | Reports the name, role, and sequence accession.version for objects in the assembly. |
| *_assembly_stats.txt [G/R] | Tab-delimited text | Reports statistics for the assembly. |
| *_cds_from_genomic.fna.gz [D/G/R] | FASTA | Nucleotide sequences corresponding to all CDS features annotated on the assembly, based on the genome sequence. |
| *_fcs_report.txt [G/R] | Tab-delimited text | Reports potentially contaminated regions in the assembly identified by the Foreign Contamination Screen (FCS). |
| *_feature_count.txt.gz [G/R] | Tab-delimited text | Reports counts of gene, RNA, CDS, and similar features based on data reported in the *_feature_table.txt.gz file. |
| *_feature_table.txt.gz [G/R] | Tab-delimited text | Reports locations and attributes for a subset of annotated features. |
| *_gene_expression_counts.txt.gz [R] | Tab-delimited text | Reports counts of RNA-seq reads mapped to each gene. |
| *_normalized_gene_expression_counts.txt.gz [R] | Tab-delimited text | Reports normalized counts (TPM) of RNA-seq reads mapped to each gene. |
| *_gene_ontology.gaf.gz [R] | GO Annotation File (GAF) | Gene Ontology (GO) annotation of the annotated genes. |
| *_genomic.fna.gz [D/G/R] | FASTA | Genomic sequence(s) in the assembly. Repetitive sequences in eukaryotes are masked to lower-case. |
| *_genomic.gbff.gz [G/R] | GenBank flat file | Genomic sequence(s) in the assembly. |
| *_genomic.gff.gz [D/G/R] | GFF3 | Annotation of the genomic sequence(s). |
| *_genomic.gtf.gz [D/G/R] | GTF | Annotation of the genomic sequence(s). |
| *_genomic_gaps.txt.gz [G/R] | Tab-delimited text | Reports the coordinates of all gaps in the top-level genomic sequences. |
| *_protein.faa.gz [D/G/R] | FASTA | Sequences of accessioned protein products annotated on the genome assembly. |
| *_protein.gpff.gz [G/R] | GenPept flat file | Sequences of accessioned protein products annotated on the genome assembly. |
| *_pseudo_without_product.fna.gz [R] | FASTA | Genomic sequence corresponding to pseudogene and other gene regions that do not have any associated transcribed RNA products or translated protein products. |
| *_rm.out.gz [R] | Text | RepeatMasker output (provided for some eukaryotes). |
| *_rm.run [R] | Text | Documentation of the RepeatMasker version, parameters, and library (provided for some eukaryotes). |
| *_rna.fna.gz [D/R] | FASTA | Sequences of accessioned RNA products annotated on the genome assembly. |
| *_rna.gbff.gz [R] | GenBank flat file | RNA products annotated on the genome assembly (provided for RefSeq assemblies as relevant). |
| *_rna_from_genomic.fna.gz [G/R] | FASTA | Nucleotide sequences corresponding to all RNA features annotated on the assembly, based on the genome sequence. |
| *_rnaseq_alignment_summary.txt [R] | Tab-delimited text | Reports counts of alignments classified by Subread featureCounts. |
| *_rnaseq_runs.txt [R] | Tab-delimited text | Information about RNA-seq runs used for gene expression analyses. |
| *_translated_cds.faa.gz [G/R] | FASTA | Individual CDS features annotated on the genomic records, conceptually translated into protein sequences. |
| *_wgsmaster.gbff.gz [G] | GenBank flat file | WGS master for the assembly (present only if a WGS master record exists for the sequences in the assembly). |
| .annotation_hashes.txt [G/R] | Tab-delimited text | Reports hash values for different aspects of the annotation data. |
| .assembly_status.txt [G/R] | Text | Reports the current status of this assembly version. |
| .md5checksums.txt [G/R] | Text | File checksums are provided for all data files in the directory. |
| *_knownrefseq_alns.bam (RefSeq_transcripts_alignments sub-directory) [R] | BAM | Alignments of the annotated Known RefSeq transcripts (identified by accessions prefixed with NM_ and NR_) to the genome. |
| *_knownrefseq_alns.bam.bai (RefSeq_transcripts_alignments sub-directory) [R] | BAM Index | Index of the BAM alignments of the annotated Known RefSeq transcripts to the genome. |
| *_modelrefseq_alns.bam (RefSeq_transcripts_alignments sub-directory) [R] | BAM | Alignments of the annotated Model RefSeq transcripts (identified by accessions prefixed with XM_ and XR_) to the genome. |
| *_modelrefseq_alns.bam.bai (RefSeq_transcripts_alignments sub-directory) [R] | BAM Index | Index of the BAM alignments of the annotated Model RefSeq transcripts to the genome. |
| *_compare_prev.txt.gz (Annotation_comparison sub-directory) [R] | Tab-delimited text | Annotation comparison report. |
| *_cross_species_tx_alns.gff.gz (Evidence_alignments sub-directory) [R] | GFF3 | Alignments of cDNAs, ESTs, and TSAs from other species to the genomic sequence(s). |
| *_same_species_tx_alns.gff.gz (Evidence_alignments sub-directory) [R] | GFF3 | Alignments of same species cDNAs, ESTs, and TSAs to the genomic sequence(s). |
| *_gnomon_model.gff.gz (Gnomon_models sub-directory) [R] | GFF3 | Gnomon annotation of the genomic sequence(s). |
| *_gnomon_protein.faa.gz (Gnomon_models sub-directory) [R] | FASTA | Gnomon protein models annotated on the genome assembly. |
| *_gnomon_rna.fna.gz (Gnomon_models sub-directory) [R] | FASTA | Gnomon transcript models annotated on the genome assembly. |
| *_graph.bw (RNASeq_coverage_graphs directory) [R] | UCSC BigWig | RNA-seq read coverage graphs. |

D: Datasets; G: GenBank; R: RefSeq

* Sample file path with either a GCA or GCF prefix, where each # represents a digit in the actual path: https://ftp.ncbi.nih.gov/genomes/all/GC[A/F]/###/###/###/GC[A/F]_#########_(assembly name)/GC[A/F]_#########_(assembly name)
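
The path pattern above can be assembled programmatically. The following is a minimal Python sketch, not an NCBI-provided tool: the function names genome_dir_url and genome_file_url are hypothetical, and the code assumes that the accession is passed together with its version suffix (for example ".40"), which is treated as part of the accession when forming the directory name. The accession and assembly name in the usage note are for illustration only.

```python
import re

# Base URL taken from the sample path in the table footnote.
BASE_URL = "https://ftp.ncbi.nih.gov/genomes/all"


def genome_dir_url(accession: str, assembly_name: str) -> str:
    """Build the directory URL for one assembled genome version.

    Assumption: `accession` looks like GCA_######### or GCF_#########,
    optionally followed by a ".version" suffix.
    """
    match = re.fullmatch(r"(GC[AF])_(\d{9})(\.\d+)?", accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    prefix, digits = match.group(1), match.group(2)
    # Split the nine digits into three groups of three (###/###/###).
    parts = [digits[i:i + 3] for i in range(0, 9, 3)]
    dir_name = f"{accession}_{assembly_name}"
    return "/".join([BASE_URL, prefix, *parts, dir_name])


def genome_file_url(accession: str, assembly_name: str, suffix: str) -> str:
    """Build the URL of one file, e.g. suffix='genomic.fna.gz'."""
    dir_url = genome_dir_url(accession, assembly_name)
    # File names repeat the directory name, followed by "_" and the suffix.
    dir_name = dir_url.rsplit("/", 1)[1]
    return f"{dir_url}/{dir_name}_{suffix}"
```

For example, genome_file_url("GCF_000001405.40", "GRCh38.p14", "genomic.fna.gz") returns a URL ending in /GCF/000/001/405/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz. The sketch only builds the string; it does not check that the file exists on the site.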

Where can I find information to help me choose between the many different assembled genomes for a species?

Many different assembled genomes are available for species with medical, agricultural, or scientific relevance. The Genus_species directories under the “genbank” and “refseq” directory trees each contain an assembly_summary.txt file that provides general information on all assembled genome versions included in the directory such as release date, submitter organization, assembly level, and annotation status. For example, see ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/archaea/Sulfolobus_islandicus/assembly_summary.txt.
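
As a rough illustration of how an assembly_summary.txt file might be screened, the Python sketch below parses tab-delimited text and filters by assembly level. It assumes that comment lines start with "#" and that the last comment line before the data holds the column names; the column names assembly_accession, refseq_category, assembly_level, seq_rel_date, and asm_name are believed to match the real file but should be checked against the README for that file. The two data rows are invented placeholders, and the function names are hypothetical.

```python
# Invented sample content; real files contain many more columns and rows.
SAMPLE_SUMMARY = (
    "## See the README for a description of the columns\n"
    "#assembly_accession\trefseq_category\torganism_name\t"
    "assembly_level\tseq_rel_date\tasm_name\n"
    "GCF_000000001.1\treference genome\tExample organism\t"
    "Complete Genome\t2020/01/15\tASM1v1\n"
    "GCF_000000002.1\tna\tExample organism\t"
    "Scaffold\t2018/06/01\tASM2v1\n"
)


def read_assembly_summary(text):
    """Parse assembly_summary.txt content into a list of dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("#"):
            # Assumption: the last '#' line before the data names the columns.
            header = line.lstrip("#").strip().split("\t")
            continue
        if header is None:
            raise ValueError("no header line found before the first data row")
        rows.append(dict(zip(header, line.split("\t"))))
    return rows


def filter_by_level(rows, level):
    """Keep only rows whose assembly_level equals `level`."""
    return [row for row in rows if row.get("assembly_level") == level]
```

Calling filter_by_level(read_assembly_summary(SAMPLE_SUMMARY), "Complete Genome") keeps only the first placeholder row. The same pattern could be applied to other columns, such as the annotation status or submitter fields mentioned above, once their exact names are confirmed.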

After assembled genomes of interest have been identified using the data from the species-specific assembly_summary.txt file, they can be accessed via the “all_assembly_versions” directory for that species. Alternatively, any assembled genome that the NCBI Reference Sequence Database (RefSeq) selects as a reference or representative genome can be readily accessed via the directories named “reference” or “representative” in the Genus_species directories under the “genbank” and “refseq” directory trees.

Do you provide assembled genome data formatted for use by sequence read alignment pipelines?

Genomic FASTA files with modified sequence identifiers, along with index files convenient for analysis with next-generation sequencing tools, are currently provided for the Genome Reference Consortium’s human and mouse assembled genomes GRCh38, GRCm38.p3, and GRCm39. RefSeq annotation in GFF3 and GTF formats, with sequence identifiers matching those in the FASTA files, is also provided to facilitate use in RNA-Seq analysis pipelines.

The four analysis sets provided for GRCh38 (no_alt_analysis_set, full_analysis_set, full_plus_hs38d1_analysis_set, and no_alt_plus_hs38d1_analysis_set) and the two analysis sets provided for GRCm38.p3 (no_alt_analysis_set and full_analysis_set) differ from the corresponding full assembled genomes by one or more of the following:

Index files generated by BWA, Samtools, Bowtie, and HISAT2 are also provided. See the GRCh38 README, GRCm38 README, or GRCm39 README for a full description.

Are repetitive sequences in eukaryotic genomes masked?

Yes. All genome sequences are softmasked using WindowMasker or RepeatMasker. For genomes that are masked using RepeatMasker, an additional file with information about the masked regions is also provided (see table).
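
Because soft-masking marks repeats as lower-case letters, the masked share of a genome can be estimated directly from a *_genomic.fna.gz file. The Python sketch below is one simple way to do this under that assumption; the function name softmasked_fraction is hypothetical, and the code counts every lower-case character on sequence lines as masked without distinguishing WindowMasker from RepeatMasker output.

```python
import gzip


def softmasked_fraction(fasta_path):
    """Return the fraction of sequence characters that are lower-case.

    Works on plain or gzip-compressed FASTA; header lines (starting
    with '>') are skipped.
    """
    masked = 0
    total = 0
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                continue
            seq = line.strip()
            total += len(seq)
            # Lower-case characters are treated as soft-masked bases.
            masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0
```

The file is read line by line, so memory use stays small even for large genomes. Tools that are case-sensitive may need the sequence converted to upper-case before use.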

Generated October 18, 2024