Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments
Author manuscript; available in PMC: 2018 Mar 12.
Abstract
Hi-C experiments explore the three-dimensional structure of the genome, generating terabases of data to create high resolution contact maps. Here, we introduce Juicer, an open-source tool for analyzing terabase-scale Hi-C datasets. Juicer allows users without a computational background to transform raw sequence data into normalized contact maps with one click. Juicer produces a hic file containing compressed contact matrices at many resolutions, facilitating visualization and analysis at multiple scales. Structural features, such as loops and domains, are automatically annotated. Juicer is available as open source software at http://aidenlab.org/juicer/
Main Text
Hi-C experiments probe the three-dimensional structure of DNA and chromatin by ligating and sequencing DNA loci that are spatially proximate to one another (Lieberman-Aiden and Van Berkum et al., 2009; Rao and Huntley et al., 2014). The resulting maps reflect patterns of physical contact between loci, making it possible to deduce how loci are organized in 3D.
Efforts to improve the resolution of 3D maps have caused the amount of DNA sequence produced from Hi-C experiments to skyrocket. Our original maps, derived from 30 million reads and 16 Gb of DNA sequence, described the genome at 1 megabase resolution (Lieberman-Aiden and Van Berkum et al., 2009). In contrast, we recently generated 6.5 billion reads and 1.6 Tb of DNA sequence in order to create a single 3D map of the genome at kilobase resolution (Rao and Huntley et al., 2014).
Although pipelines for Hi-C data analysis exist (Lieberman-Aiden and Van Berkum et al., 2009; Schmid et al., 2015; Servant et al., 2015; Suria et al., 2015), these packages are not designed to process datasets at the terabase scale or to annotate the structural features that these maps reflect. Moreover, when designing tools that require high-performance computation, ensuring reliability and ease-of-use across software platforms and hardware instances becomes a crucial desideratum. Ensuring such compatibility can be a considerable engineering challenge.
Here, we introduce Juicer, an easy-to-use, fully-automated pipeline for the processing and annotation of data from Hi-C and other contact mapping experiments. Juicer is closely based on the algorithms that we recently developed in order to analyze and annotate our terabase-scale Hi-C experiments (Rao and Huntley et al., 2014). In order to meet the engineering challenge of handling such massive datasets, Juicer supports the use of parallelization and hardware acceleration whenever possible, including CPU clusters, general-purpose graphics processing units (GP-GPUs), and field-programmable gate arrays (FPGAs). Juicer is also compatible with a variety of cloud and cluster architectures.
Juicer comprises three tools, which are designed to be run one after another.
First, Juicer transforms raw sequence data into a list of Hi-C contacts (pairs of genomic positions that were adjacent to each other in three-dimensional space during the experiment). To accomplish this, read pairs are aligned to the genome; both duplicates and near-duplicates are removed, and read pairs that align to three or more locations are set aside. When appropriate hardware is available, this procedure can be accelerated, either by parallelizing across multiple CPUs or by using an FPGA (see Table 1).
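The duplicate-removal step described above can be illustrated with a minimal sketch. It assumes a hypothetical contact record of the form (chrom1, pos1, strand1, chrom2, pos2, strand2) and treats two contacts as near-duplicates when their chromosomes and strands match and both positions differ by at most `wobble` base pairs. The record layout, the function name, and the 4 bp default are illustrative assumptions, not Juicer's actual implementation.

```python
from typing import List, Tuple

# Hypothetical contact record: (chrom1, pos1, strand1, chrom2, pos2, strand2)
Contact = Tuple[str, int, int, str, int, int]


def remove_duplicates(contacts: List[Contact], wobble: int = 4) -> List[Contact]:
    """Keep one representative of each group of exact or near-duplicate contacts.

    Two contacts count as duplicates here when chromosomes and strands match
    and both positions differ by at most `wobble` bp (an assumed tolerance).
    """
    kept: List[Contact] = []
    for c in sorted(contacts):
        duplicate = False
        # `kept` is sorted by (chrom1, pos1, ...), so scan backwards and stop
        # as soon as the first position is out of range.
        for k in reversed(kept):
            if k[0] != c[0] or c[1] - k[1] > wobble:
                break
            same_ends = (k[2], k[3], k[5]) == (c[2], c[3], c[5])
            if same_ends and abs(k[4] - c[4]) <= wobble:
                duplicate = True
                break
        if not duplicate:
            kept.append(c)
    return kept
```

Sorting first keeps the comparison local: each contact is checked only against retained contacts whose first position lies within the tolerance, which mirrors the sort-then-deduplicate order of the pipeline.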
Table 1.
Using Juicer to process 1.5 billion paired-end Hi-C reads on different cluster systems. “RAM (GB)” (resp., “VM (GB)”) is the maximum RAM (resp., virtual memory) used for each task. Loop annotation was not performed on the Broad cluster, which does not offer GPUs. See Table S1.
System | Amazon Web Services g2.8xlarge | Broad Univa Grid Engine | Rice PowerOmics | Rice PowerOmics + FPGA
---|---|---|---|---
CPU | Intel Xeon E5-2670 @2.60GHz | Intel Xeon X5650 @2.66GHz | IBM POWER8E @2.061GHz, revision 2.1 | IBM POWER8E @2.061GHz, revision 2.1
Cores/node | 4×8 cores | 4×6 cores | 2×24 cores | 2×24 cores
RAM | 60GB | 32GB | 256GB | 256GB
Cluster OS | OpenLava 2.2 (LSF Compatible) | UGE 8.3.0 | Slurm 14.11.8 | Slurm 14.11.8
GPU | NVIDIA Quadro K5000 | None | NVIDIA Tesla K80 | NVIDIA Tesla K80
FPGA | None | None | None | Edico Genome DRAGEN Bio-IT Platform
Max Parallel Cores | 32 | 1200 | 1536 | 1536

Task | AWS: Core Hours (hr:min) | AWS: RAM (GB) | AWS: VM (GB) | Broad: Core Hours (hr:min) | Broad: RAM (GB) | Broad: VM (GB) | Rice: Core Hours (hr:min) | Rice: RAM (GB) | Rice: VM (GB) | Rice + FPGA: Core Hours (hr:min) | Rice + FPGA: RAM (GB) | Rice + FPGA: VM (GB)
---|---|---|---|---|---|---|---|---|---|---|---|---
Align | 8744:49 | 12.3 | 13.5 | 11614:07 | 10.8 | 11.9 | 4221:29 | 13.1 | 14.0 | 1:29 | 0 | 0
Merge Sort | 35:36 | 9.9 | 10.1 | 117:03 | 8.7 | 198.1 | 452:13 | 14.0 | 120.0 | 426:30 | 30.0 | 120.0
Duplicate Removal | 12:21 | 0.5 | 0.5 | 17:04 | 0.4 | 0.5 | 3:12 | 0.4 | 0.0 | 1:28 | 0.4 | 0.0
.hic Creation | 112:43 | 21.8 | 34.9 | 209:43 | 13.4 | 19.5 | 139:17 | 19.3 | 8 | 177:04 | 19.3 | 8
Feature Annotation | 2:07 | 10.5 | 139.3 | 1:04 | 6.4 | 19.5 | 3:25 | 4.2 | 9.1 | 4:28 | 77.1 | 9.1
Total | 8906:11 | | | 11959:01 | | | 4819:36 | | | 608:59 | |
Next, the catalog of contacts is used to create contact matrices. To do so, the linear genome is partitioned into loci of a fixed size, or “resolution,” (e.g., 1Mb or 1Kb). These loci correspond to the rows and columns of a contact matrix; each entry in the matrix reflects the number of contacts observed between the corresponding pair of loci during a Hi-C experiment. Due to factors such as chromatin accessibility, certain loci are observed more frequently in Hi-C experiments. Juicer can adjust for these biases in multiple ways. The options include our original normalization scheme (Lieberman-Aiden and Van Berkum et al., 2009), as well as a matrix balancing scheme that ensures that each row and column of the contact matrix sums to the same value (Knight and Ruiz, 2012). A wide array of quality statistics are also calculated, making it possible to assess the success and reliability of a given experiment before the costly deep-sequencing step.
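As a rough illustration of the two steps just described, the sketch below bins contacts at a fixed resolution and then balances the resulting matrix so that every non-empty row and column sums to the same value. It works on a single hypothetical chromosome, and the balancing routine is a simple iterative scaling scheme chosen for brevity; it is not the faster algorithm of Knight and Ruiz (2012) that the text cites. The function names and the iteration count are assumptions.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def bin_contacts(contacts: List[Tuple[int, int]], length: int, resolution: int) -> Matrix:
    """Count contacts between fixed-size loci on one chromosome of `length` bp."""
    n = (length + resolution - 1) // resolution  # number of loci (bins)
    m = [[0.0] * n for _ in range(n)]
    for pos1, pos2 in contacts:
        i, j = pos1 // resolution, pos2 // resolution
        m[i][j] += 1.0
        if i != j:
            m[j][i] += 1.0  # keep the matrix symmetric
    return m


def balance(matrix: Matrix, iterations: int = 200) -> Matrix:
    """Scale a symmetric matrix so each non-empty row and column sums to ~1.

    Simple iterative scaling (illustrative only): each locus gets a bias
    factor, and entry (i, j) is divided by bias[i] * bias[j].
    """
    n = len(matrix)
    bias = [1.0] * n
    for _ in range(iterations):
        sums = [
            sum(matrix[i][j] / (bias[i] * bias[j]) for j in range(n))
            for i in range(n)
        ]
        for i in range(n):
            if sums[i] > 0:  # leave empty loci untouched
                bias[i] *= sums[i] ** 0.5
    return [
        [matrix[i][j] / (bias[i] * bias[j]) for j in range(n)]
        for i in range(n)
    ]
```

Dividing by a per-locus bias on both the row and the column side keeps the normalized matrix symmetric, which is why a single bias vector suffices.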
The contact matrices generated in this way are stored efficiently in a compressed format, which is designed to facilitate all subsequent computations. For instance, 1 terabyte of raw sequencing data is represented as an 80 gigabyte hic file containing normalized and non-normalized contact matrices at 18 different resolutions, from 2.5Mb resolution to single restriction fragment resolution for a 4-cutter restriction enzyme (~400bp). Contact matrices in the hic format can also be visualized using Juicebox, which is described in the accompanying paper.
Finally, Juicer contains a suite of algorithms that are designed to annotate contact matrices and thus identify features of genome folding. These features include loops, loop anchor motifs, and contact domains.
Loops are identified using the HiCCUPS algorithm (Rao and Huntley et al., 2014), which searches for clusters of contact matrix entries in which the frequency of contact is enriched relative to the local background. Since there are trillions of pixels in a kilobase-resolution Hi-C map, HiCCUPS is implemented using GP-GPUs. Given CTCF and/or cohesin ChIP-Seq tracks for the same cell type, HiCCUPS can frequently use FIMO (Grant et al., 2011) to identify the CTCF motif that serves as the anchor for each loop. We recently performed CRISPR experiments disrupting seven different CTCF motifs, each of which was identified by HiCCUPS as the anchor of one or more loops. In each case, disruption of the motif led to disruption of the corresponding loop, thus confirming the accuracy of HiCCUPS loop anchor annotations (Sanborn and Rao et al., 2015).
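The idea of enrichment relative to a local background can be shown with a deliberately simplified sketch. It compares each matrix entry with the mean of a surrounding ring of entries and reports entries whose ratio exceeds a threshold. This is not the HiCCUPS algorithm: it omits the statistical testing, clustering, and GPU implementation of the published method, and the ring sizes, threshold, and function names are assumptions made for illustration.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def local_enrichment(matrix: Matrix, i: int, j: int, inner: int = 1, outer: int = 3) -> float:
    """Ratio of entry (i, j) to the mean of a ring of entries around it.

    The ring contains entries within `outer` bins of (i, j), excluding the
    central square within `inner` bins. Both sizes are illustrative.
    """
    n = len(matrix)
    total, count = 0.0, 0
    for r in range(max(0, i - outer), min(n, i + outer + 1)):
        for c in range(max(0, j - outer), min(n, j + outer + 1)):
            if abs(r - i) <= inner and abs(c - j) <= inner:
                continue  # skip the central square
            total += matrix[r][c]
            count += 1
    if count == 0 or total == 0:
        return 0.0
    return matrix[i][j] / (total / count)


def enriched_pixels(matrix: Matrix, threshold: float = 2.0) -> List[Tuple[int, int]]:
    """List entries whose local enrichment exceeds `threshold` (assumed cutoff)."""
    n = len(matrix)
    return [
        (i, j)
        for i in range(n)
        for j in range(n)
        if local_enrichment(matrix, i, j) > threshold
    ]
```

Because every entry is scored independently of the others, this kind of computation parallelizes naturally, which is consistent with the GPU implementation mentioned above.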
Contact domains are identified using a dynamic programming algorithm that relies on applying the Arrowhead transformation, $A_{i,i+d} = \dfrac{M^*_{i,i-d} - M^*_{i,i+d}}{M^*_{i,i-d} + M^*_{i,i+d}}$, to a normalized contact matrix $M^*$ (Rao and Huntley et al., 2014). Many of these domains are associated with loops, and can be disrupted by manipulating the corresponding loop anchors (Sanborn and Rao et al., 2015).
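The Arrowhead transformation itself is straightforward to compute. The sketch below applies the formula above to a small normalized matrix stored as a list of lists; it covers only the transformation, not the dynamic programming step that calls domains from it. The function name and the choice to leave undefined entries at zero are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def arrowhead(m_star: Matrix) -> Matrix:
    """Compute A[i][i+d] = (M*[i][i-d] - M*[i][i+d]) / (M*[i][i-d] + M*[i][i+d]).

    Entries for which i-d or i+d falls outside the matrix, or for which the
    denominator is zero, are left at 0.0 (an illustrative convention).
    """
    n = len(m_star)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for d in range(1, n):
            if i - d < 0 or i + d >= n:
                break  # larger d would also be out of range
            upstream = m_star[i][i - d]
            downstream = m_star[i][i + d]
            total = upstream + downstream
            if total != 0:
                a[i][i + d] = (upstream - downstream) / total
    return a
```

Each entry compares how often locus i contacts a locus d bins upstream with how often it contacts a locus d bins downstream, so the value lies between -1 and 1 and is 0 when the two are equal.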
It is frequently useful to examine the cumulative signal from a large number of putative features at once, including both loops and domains. To this end, Juicer includes an implementation of Aggregate Peak Analysis (Rao and Huntley et al., 2014).
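The aggregation idea can be sketched as follows: square windows of the contact matrix, each centered on one putative feature, are summed entry by entry into a single small matrix. This is only an illustration of the cumulative-signal concept under assumed names and window sizes; it does not reproduce the scoring or filtering of the published Aggregate Peak Analysis implementation.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def aggregate_windows(
    matrix: Matrix, peaks: List[Tuple[int, int]], half_width: int
) -> Tuple[Matrix, int]:
    """Sum (2*half_width+1)-square windows centered on each peak.

    Peaks whose window would extend past the matrix edge are skipped.
    Returns the aggregate window and the number of peaks actually used.
    """
    n = len(matrix)
    size = 2 * half_width + 1
    aggregate = [[0.0] * size for _ in range(size)]
    used = 0
    for i, j in peaks:
        if (i - half_width < 0 or j - half_width < 0
                or i + half_width >= n or j + half_width >= n):
            continue
        for di in range(size):
            for dj in range(size):
                aggregate[di][dj] += matrix[i - half_width + di][j - half_width + dj]
        used += 1
    return aggregate, used
```

If the putative features are real, the central entry of the aggregate window should stand out from its surroundings even when individual features are too weak to see on their own.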
Juicer is an open-source project. It is available at github.com/theaidenlab/juicer as a series of packages designed for a variety of hardware configurations: either a single machine, or clusters that run LSF, Univa Grid Engine, or SLURM. In addition, Juicer is available on the cloud at Amazon Web Services. Table 1 displays different performance metrics on each cluster system; the details of each setup are in the supplemental text. Once installed, Juicer can be executed using a single command, by users without informatics experience.
Experimental Methods
All algorithms and data are drawn from Rao and Huntley et al., 2014, except as described in the supplement.
Figure 1. Juicer analyzes terabases of Hi-C data with one click.
(A) Sequenced read pairs (horizontal bars) are aligned to the genome in parallel. Color indicates genomic position. Read pairs aligning to more than two positions are excluded. Those remaining are sorted by position and merged into a single list, at which point duplicate reads are removed. The .hic file stores contact matrices at many resolutions, which can be loaded into Juicebox for visualization. See Table S2. (B) Contact domains (yellow) are annotated using the Arrowhead algorithm. (C) Loops (cyan) are annotated using HiCCUPS.
Acknowledgments
Supported by NIH New Innovator Award 1DP2OD008540, NIH 4D Nucleome Grant U01HL130010, NSF Physics Frontier Center PHY-1427654, NHGRI HG006193, Welch Foundation Q-1866, Cancer Prevention Research Institute of Texas Scholar Award R1304, an NVIDIA Research Center Award, an IBM University Challenge Award, a Google Research Award, a McNair Medical Institute Scholar Award, and the President’s Early Career Award in Science and Engineering to E.L.A.; an NHGRI grant (HG003067) to E.S.L.; and a PD Soros Fellowship to S.S.P.R. The Rice PowerOmics cluster was a gift from IBM.
Footnotes
Author Contributions: E.L.A. conceived of this project; N.C.D. created the pipeline; S.S.P.R. created HiCCUPS; M.H.H. created APA; M.H.H. and N.C.D. created Arrowhead; M.S.S. re-implemented all feature annotation algorithms in Java as fully-automated, end-to-end tools; I.M. ported the pipeline to SLURM and AWS; N.C.D., M.S.S., I.M., and E.S.L. contributed to tool development; N.C.D. and E.L.A. prepared the manuscript.
Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
References
- Grant CE, Bailey TL, Noble WS. FIMO: Scanning for occurrences of a given motif. Bioinformatics. 2011;27(7):1017–1018. doi: 10.1093/bioinformatics/btr064.
- Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA J Numer Anal. 2012;33:1029–1047.
- Lieberman-Aiden E, van Berkum N, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie B, Sabo P, Dorschner M, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326:289–293. doi: 10.1126/science.1181369.
- Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A Three-dimensional Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell. 2014;159:1665–1680. doi: 10.1016/j.cell.2014.11.021.
- Sanborn AL, Rao SSP, Huang S, Durand NC, Huntley MH, Jewett AI, Bochkov ID, Chinnappan D, Cutkosky A, Geeting KP, Gnirke A, Melnikov A, McKenna D, Stamenova EK, Lander ES, Aiden EL. Chromatin extrusion explains key features of loop and domain formation in wild-type and engineered genomes. Proceedings of the National Academy of Sciences. 2015;112(47):E6456–E6465. doi: 10.1073/pnas.1518552112.
- Servant N, Varoquaux N, Lajoie BR, Viara E, Chen CJ, Vert JP, Heard E, Dekker J, Barillot E. HiC-Pro: an optimized and flexible pipeline for Hi-C data processing. Genome Biology. 2015;16:259. doi: 10.1186/s13059-015-0831-x.
- Schmid MW, Grob S, Grossniklaus U. HiCdat: a fast and easy-to-use Hi-C data analysis tool. BMC Bioinformatics. 2015;16(1):277. doi: 10.1186/s12859-015-0678-x.
- Suria MEG, Phillips-Cremins JE, Corces VG, Taylor J. HiFive: a tool suite for easy and efficient HiC and 5C data analysis. Genome Biology. 2015;16:237. doi: 10.1186/s13059-015-0806-y.