ATAC-seq Data Standards and Processing Pipeline – ENCODE

Assay overview

The Assay for Transposase-Accessible Chromatin followed by sequencing (ATAC-seq) experiment provides genome-wide profiles of chromatin accessibility. Briefly, the ATAC-seq method works as follows: a loaded transposase inserts sequencing primers into open chromatin sites across the genome, and the resulting tagged fragments are then sequenced. The ends of the reads mark open chromatin sites.

Updated July 2020

Pipeline Overview

The ATAC-seq pipeline was developed by Anshul Kundaje's lab at Stanford University. Upon revision and full implementation, it will be a part of the ENCODE Uniform Processing Pipelines series. The full ATAC-seq pipeline code is available on GitHub.

The ENCODE ATAC-seq pipeline is used for quality control and statistical signal processing of short-read sequencing data, producing alignments and measures of enrichment.

Pipeline schematic for replicated data

View the current instances of the pipeline for replicated data

Pipeline schematic for unreplicated data

View the current instances of the pipeline for unreplicated data

Inputs:

Outputs:

File format | Information contained in file | File description | Notes
bam | alignments and filtered alignments | Produced by mapping reads to the genome | The Bowtie2 aligner is used to produce raw bam files, followed by various filtering steps (mappability and quality) to produce filtered bams.
bigWig | signals | Two versions of nucleotide-resolution signal coverage tracks | These signals are the fold enrichment of signal over expected background and a p-value track representing statistical significance.
bed | peaks (regions of enrichment) | Punctate peaks (narrowPeaks) and larger regions of enrichment (gappedPeaks) | Regions of enrichment are called from filtered bams, using the MACS2 peak caller with custom parameters optimal for ATAC-seq data.

Merged Peak Sets
Using the replicates provided where available, three types of "merged" peak sets are produced. In the absence of replicates, the same procedure listed below is used on pseudoreplicates obtained by subsampling reads from a single replicate.

File format | Information contained in file | File description | Notes
bed and bigBed | replicated peaks (narrowPeak) | Punctate peaks both on individual replicates/pseudoreplicates and on data pooled across replicates | From the set of peaks on pooled data, we only retain those that have at least 50% overlap with a peak in both replicates.
bed and bigBed | replicated region (gappedPeak) | Broader regions of enrichment both on individual replicates/pseudoreplicates and on data pooled across replicates | From this set of peaks on pooled data, we only retain those that have at least 50% overlap with a peak in both replicates.
bed and bigBed | IDR peaks | A higher-confidence, reproducible set of peaks | A statistical procedure called the Irreproducible Discovery Rate (IDR) operates on the replicated peak set and compares consistency of ranks of these peaks in individual replicate/pseudoreplicate peak sets.

Quality control metrics are collected to determine library complexity, signal-to-noise ratios, fragment length distribution per replicate (where available), and reproducibility.
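
The retention rule for replicated peaks (keep a pooled peak only if it has at least 50% overlap with a peak in both replicates) can be illustrated with a minimal Python sketch. This is not the pipeline's own code. The function names and the example coordinates are hypothetical, and the sketch assumes that overlap is measured as a fraction of the pooled peak's length, which the text above does not specify.

```python
from typing import List, Tuple

# A peak as (chromosome, start, end), half-open like a BED interval.
Peak = Tuple[str, int, int]


def overlap_fraction(peak: Peak, other: Peak) -> float:
    """Fraction of `peak` covered by `other` (assumption: measured on `peak`)."""
    if peak[0] != other[0]:
        return 0.0
    shared = min(peak[2], other[2]) - max(peak[1], other[1])
    length = peak[2] - peak[1]
    return max(shared, 0) / length if length > 0 else 0.0


def is_supported(peak: Peak, replicate: List[Peak], min_fraction: float) -> bool:
    """True if at least one replicate peak covers `min_fraction` of `peak`."""
    return any(overlap_fraction(peak, r) >= min_fraction for r in replicate)


def replicated_peaks(pooled: List[Peak], rep1: List[Peak], rep2: List[Peak],
                     min_fraction: float = 0.5) -> List[Peak]:
    """Keep pooled peaks that are supported by a peak in both replicates."""
    return [p for p in pooled
            if is_supported(p, rep1, min_fraction)
            and is_supported(p, rep2, min_fraction)]


# Made-up coordinates for illustration only.
pooled = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
rep1 = [("chr1", 120, 220), ("chr1", 590, 700), ("chr2", 40, 160)]
rep2 = [("chr1", 90, 160), ("chr1", 480, 620), ("chr2", 140, 300)]

kept = replicated_peaks(pooled, rep1, rep2)
print(kept)  # [('chr1', 100, 200)]
```

Only the first pooled peak survives: the second has 10% overlap in replicate 1, and the third has 10% overlap in replicate 2. The all-pairs scan is quadratic and is meant for clarity; real peak sets would normally be compared with a sorted sweep or an interval tool.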

References

Genomic References

View the mapping assembly and genome annotation reference files used in this pipeline

Find data generated by the pipeline for unreplicated data
Find data generated by the pipeline for replicated data
Explore all ATAC-seq related publications on the ENCODE portal

Uniform Processing Pipeline Restrictions

Current Standards

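Annotation used | Value | Resulting Data Status
hg19 Refseq TSS annotation | < 6 | Concerning
hg19 Refseq TSS annotation | 6 - 10 | Acceptable
hg19 Refseq TSS annotation | > 10 | Ideal
GRCh38 Refseq TSS annotation | < 5 | Concerning
GRCh38 Refseq TSS annotation | 5 - 7 | Acceptable
GRCh38 Refseq TSS annotation | > 7 | Ideal
mm9 GENCODE TSS annotation | < 5 | Concerning
mm9 GENCODE TSS annotation | 5 - 7 | Acceptable
mm9 GENCODE TSS annotation | > 7 | Ideal
mm10 Refseq TSS annotation | < 10 | Concerning
mm10 Refseq TSS annotation | 10 - 15 | Acceptable
mm10 Refseq TSS annotation | > 15 | Ideal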
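
The thresholds in the table above can be applied with a small lookup, sketched here in Python. The dictionary keys and the function name are hypothetical, and the sketch assumes that the "Value" column is the TSS enrichment score computed against the named annotation and that both ends of each "Acceptable" range are inclusive.

```python
# (lower, upper) bounds of the "Acceptable" range for each annotation,
# copied from the table above. Keys are illustrative labels.
TSS_THRESHOLDS = {
    "hg19 Refseq": (6, 10),
    "GRCh38 Refseq": (5, 7),
    "mm9 GENCODE": (5, 7),
    "mm10 Refseq": (10, 15),
}


def classify_tss_value(annotation: str, value: float) -> str:
    """Map a value to Concerning, Acceptable, or Ideal for one annotation."""
    lower, upper = TSS_THRESHOLDS[annotation]
    if value < lower:
        return "Concerning"
    if value <= upper:  # assumption: range ends are inclusive
        return "Acceptable"
    return "Ideal"


print(classify_tss_value("GRCh38 Refseq", 6.2))  # Acceptable
print(classify_tss_value("mm10 Refseq", 9.5))    # Concerning
```

Because the cutoffs differ by annotation, the same value can fall into different categories: 9.5 is "Concerning" for mm10 Refseq but "Ideal" for GRCh38 Refseq.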