Glossary of NGS terms (deepTools 3.5.6 documentation)

Like most specialized fields, next-generation sequencing has inspired many an acronym. We try to keep track of the abbreviations that we use heavily. Please let us know if something is unclear by opening an issue on GitHub.

Abbreviations

Reference genomes are usually referred to by their abbreviations, such as:

For a more comprehensive list of available reference genomes and their abbreviations, see the UCSC database.

| Acronym | Full phrase | Synonyms/Explanation |
|---|---|---|
| -seq | -sequencing | indicates that an experiment was completed by DNA sequencing using NGS |
| ChIP-seq | chromatin immunoprecipitation sequencing | NGS technique for detecting transcription factor binding sites and histone modifications (see entry Input for more information) |
| DNase | deoxyribonuclease I | DNase I digestion is used to determine active (“open”) chromatin regions |
| HTS | high-throughput sequencing | next-generation sequencing, massively parallel short read sequencing, deep sequencing |
| MNase | micrococcal nuclease | MNase digestion is used to determine sites with nucleosomes |
| NGS | next-generation sequencing | high-throughput (DNA) sequencing, massively parallel short read sequencing, deep sequencing |
| RPGC | reads per genomic content | normalizes reads to 1x sequencing depth; sequencing depth is defined as (mapped reads × fragment length) / effective genome size |
| RPKM | reads per kilobase per million reads | normalizes read numbers: RPKM (per bin) = reads per bin / (mapped reads (in millions) × bin length (kb)) |
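To make the two normalization formulas (RPKM and RPGC) more concrete, here is a minimal Python sketch. The function names and the example numbers are hypothetical and chosen for illustration only; the scale factor is simply the inverse of the sequencing depth as defined above, and an actual implementation may count mapped reads or fragment lengths differently.

```python
def rpkm_per_bin(reads_in_bin, mapped_reads, bin_length_bp):
    """RPKM (per bin) = reads per bin / (mapped reads in millions * bin length in kb)."""
    mapped_millions = mapped_reads / 1e6
    bin_length_kb = bin_length_bp / 1e3
    return reads_in_bin / (mapped_millions * bin_length_kb)


def sequencing_depth(mapped_reads, fragment_length_bp, effective_genome_size_bp):
    """Sequencing depth = (mapped reads * fragment length) / effective genome size."""
    return mapped_reads * fragment_length_bp / effective_genome_size_bp


def rpgc_scale_factor(mapped_reads, fragment_length_bp, effective_genome_size_bp):
    """Factor that rescales raw coverage so that the sequencing depth becomes 1x."""
    depth = sequencing_depth(mapped_reads, fragment_length_bp, effective_genome_size_bp)
    return 1.0 / depth


if __name__ == "__main__":
    # Hypothetical experiment: 20 million mapped reads, 200 bp fragments,
    # an effective genome size of 2 billion bp, and 50 bp bins.
    print(rpkm_per_bin(reads_in_bin=50, mapped_reads=20_000_000, bin_length_bp=50))
    print(sequencing_depth(20_000_000, 200, 2_000_000_000))
    print(rpgc_scale_factor(20_000_000, 200, 2_000_000_000))
```

With these hypothetical numbers the sequencing depth is 2x, so every coverage value would be multiplied by 0.5 to reach 1x.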

For a review of popular *-seq applications, see Zentner and Henikoff.

NGS and generic terminology

The following are terms that may be new to some:

bin

Input

read

File Formats

Data obtained from next-generation sequencing must be processed in several steps. Most of the processing steps aim to extract only the information needed for a specific downstream analysis, and redundant entries are often discarded. Therefore, specific data formats are often associated with different steps of a data processing pipeline.

Here, we give only very brief descriptions of the file formats; for more detailed information, we link to external websites. Be aware that the file formats are listed alphabetically here, not in the order in which they are used within an analysis pipeline, which is depicted here:

[Figure: flow chart of file formats within an analysis pipeline (../_images/flowChart_FileFormats.png)]

Follow the links for more information on the different tool collections mentioned in the figure:

samtools | UCSCtools | BEDtools

2bit

BAM

BED

bedGraph

    chr1 10 20 1.5
    chr1 20 30 1.7
    chr1 30 40 2.0
    chr1 40 50 1.8
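Each bedGraph record consists of four columns: chromosome, start, end, and a score for that interval, with 0-based, half-open coordinates. As a minimal sketch of how such a file could be read, the following Python code parses the example records above and computes a length-weighted mean score. The function names `parse_bedgraph` and `mean_signal` are hypothetical helpers, not part of any tool collection mentioned here.

```python
def parse_bedgraph(text):
    """Yield (chrom, start, end, value) tuples from bedGraph-formatted text."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional track, browser, or comment lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)


def mean_signal(intervals):
    """Length-weighted mean of the score column over all intervals."""
    total_bases = 0
    weighted_sum = 0.0
    for _chrom, start, end, value in intervals:
        length = end - start  # half-open interval: end is not included
        total_bases += length
        weighted_sum += value * length
    return weighted_sum / total_bases if total_bases else 0.0


example = """\
chr1 10 20 1.5
chr1 20 30 1.7
chr1 30 40 2.0
chr1 40 50 1.8
"""

if __name__ == "__main__":
    records = list(parse_bedgraph(example))
    print(records[0])
    print(mean_signal(records))
```

Because all four example intervals are 10 bp long, the weighted mean here is the same as the plain average of the four scores.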

bigWig

FASTA

    >gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
    LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
    EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
    LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
    GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
    IENY
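A FASTA record consists of a header line that starts with ">" followed by one or more lines of sequence. The following Python sketch shows one simple way to read such records into memory; `parse_fasta` is a hypothetical helper and the two short records are made up for illustration. For real analyses, an established library is usually the better choice.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are concatenated without the line breaks.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical two-record example.
example = """\
>seq1 first made-up record
ACGTAC
GTTT
>seq2 second made-up record
GGGCCC
"""

if __name__ == "__main__":
    for header, sequence in parse_fasta(example):
        print(header, len(sequence))
```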

FASTQ

!''((((+))%%%++)(%%%%).1-+''))**55CCF>>>>>>CCCCCCC65 # ASCII-encoded quality scores
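The ASCII-encoded quality string can be converted back to numeric Phred scores by subtracting an offset from each character code. The sketch below assumes the widely used offset of 33 (Sanger and recent Illumina encoding); older data may use a different offset, so treat this value as an assumption to verify for your own files. The function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Convert an ASCII-encoded quality string to a list of Phred scores.

    The default offset of 33 is an assumption (Phred+33 encoding).
    """
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Probability that a base call is wrong: P = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


if __name__ == "__main__":
    quality = "!''((((+))%%%++)(%%%%).1-+''))**55CCF>>>>>>CCCCCCC65"
    scores = phred_scores(quality)
    print(scores[:3])  # "!" encodes 0 and "'" encodes 6 with offset 33
    print(min(scores), max(scores))
    print(error_probability(20))
```

A Phred score of 20 corresponds to an error probability of about 1 in 100, and a score of 30 to about 1 in 1000.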

SAM

SAM alignment section

Warning

Although the SAM/BAM format is rather meticulously defined and documented, whether an alignment program produces a SAM/BAM file that adheres to these principles is completely up to the programmer. The mapping score, the CIGAR string, and particularly all optional flags (fields >11) are often defined very differently depending on the program. If you plan to filter your data based on any of these criteria, make sure you know exactly how these entries were calculated and set!
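To inspect how a particular aligner fills these entries, it can help to split a few alignment lines into the 11 mandatory fields and the optional fields that follow them. The Python sketch below does this for a single tab-separated SAM line. The function name `parse_sam_line` and the example alignment are hypothetical, and the sketch makes no attempt to interpret the values, since their meaning depends on the program that wrote them.

```python
# Names of the 11 mandatory SAM fields, in column order.
MANDATORY_FIELDS = [
    "QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
    "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL",
]
INTEGER_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line):
    """Split one SAM alignment line into mandatory fields and optional tags."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) < len(MANDATORY_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    record = dict(zip(MANDATORY_FIELDS, columns[:11]))
    for name in INTEGER_FIELDS:
        record[name] = int(record[name])
    # Optional fields (columns beyond 11) have the form TAG:TYPE:VALUE.
    tags = {}
    for field in columns[11:]:
        tag, type_code, value = field.split(":", 2)
        tags[tag] = (type_code, value)
    record["tags"] = tags
    return record


# Hypothetical alignment line, for illustration only.
example_line = "read1\t0\tchr1\t100\t42\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNM:i:0\tAS:i:10"

if __name__ == "__main__":
    record = parse_sam_line(example_line)
    print(record["MAPQ"], record["CIGAR"])
    print(record["tags"])
```

Printing the mapping score and the optional tags for a handful of reads, and comparing them with the aligner's documentation, is a quick way to check what a filter on these values would actually select.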