GitHub - emaigne/HiCParser: R package to parse HiC data into R

HiCParser

The goal of HiCParser is to parse Hi-C data (several formats are supported) and import it into R as an InteractionSet object.

Installation instructions

Get the latest stable R release from CRAN. Then install HiCParser from Bioconductor using the following code:

if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

BiocManager::install("HiCParser")

Install the development version from GitHub with:

BiocManager::install("emaigne/HiCParser")

Then load the package:

library("HiCParser")

Supported formats

So far, HiCParser supports the hic, HiC-Pro, cool/mcool, and tabular formats.

Example

hic format

We show here how to parse a single hic file.

hicFilePath <- system.file("extdata", "hicsample_21.hic", package = "HiCParser")
data <- parseHiC(
    paths = hicFilePath,
    binSize = 5000000,
    conditions = 1,
    replicates = 1
)

Note that a hic file can include several matrices, with different bin sizes. This is why the bin size should be provided.

We show here how to parse several files (actually, the same file, several times). We suppose here that we have 2 conditions, with 3 replicates for each condition.

data <- parseHiC(
    paths = rep(hicFilePath, 6),
    binSize = 5000000,
    conditions = rep(seq(2), each = 3),
    replicates = rep(seq(3), 2)
)
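The printed colData in the Output section suggests that the conditions and replicates vectors are matched to paths by position, so the i-th file gets the i-th condition and the i-th replicate. Below is a minimal base-R sketch of the design used above. The file names and the `design` data frame are hypothetical and serve only to display the pairing; they are not part of HiCParser.

```r
# Hypothetical file names, for illustration only.
paths <- paste0("sample_", seq(6), ".hic")

# Same vectors as in the parseHiC() call above.
conditions <- rep(seq(2), each = 3)  # 1 1 1 2 2 2
replicates <- rep(seq(3), 2)         # 1 2 3 1 2 3

# One row per input file: the i-th path is paired with the
# i-th condition and the i-th replicate.
design <- data.frame(
    path = paths,
    condition = conditions,
    replicate = replicates
)
print(design)

# Sanity check: each (condition, replicate) pair appears exactly once.
stopifnot(!any(duplicated(design[, c("condition", "replicate")])))
```

Building such a table before parsing is a cheap way to catch a vector of the wrong length or a duplicated condition/replicate pair.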

Currently, HiCParser supports the hic format up to version 9.

HiC-Pro format

A HiC-Pro dataset consists of a matrix file and a bed file. A different bed file can be used for each matrix file, or the same bed file can be shared by all of them.

matrixFilePath <- system.file("extdata", "hicsample_21.matrix", package = "HiCParser")
bedFilePath <- system.file("extdata", "hicsample_21.bed", package = "HiCParser")
data <- parseHiCPro(
    matrixPaths = rep(matrixFilePath, 6),
    bedPaths = bedFilePath,
    conditions = rep(seq(2), each = 3),
    replicates = rep(seq(3), 2)
)

cool and mcool formats

Please note that the cool and mcool formats store data in HDF5. The rhdf5 package, which reads HDF5 files, is not installed by default, because it takes a substantial time to compile and many users will not need the cool/mcool parser. To use the cool/mcool parser, you must therefore install the rhdf5 package yourself.

The cool format includes only one bin size.

# rhdf5 is distributed through Bioconductor, not CRAN.
if (!requireNamespace("rhdf5", quietly = TRUE)) {
    BiocManager::install("rhdf5")
}
coolFilePath <- system.file("extdata", "hicsample_21.cool", package = "HiCParser")
data <- parseCool(
    paths = rep(coolFilePath, 6),
    conditions = rep(seq(2), each = 3),
    replicates = rep(seq(3), 2)
)

The mcool format may include several bin sizes, so the bin size must be specified. The same function is used for both the cool and mcool formats.

mcoolFilePath <- system.file("extdata", "hicsample_21.mcool", package = "HiCParser")
data <- parseCool(
    paths = rep(mcoolFilePath, 6),
    binSize = 5000000,
    conditions = rep(seq(2), each = 3),
    replicates = rep(seq(3), 2)
)

Tabular files

A tabular file is a tab-separated multi-replicate sparse matrix with a header:

chromosome    position 1    position 2    C1.R1    C1.R2    C1.R3    ...
Y             1500000       7500000       145      184      72       ...

The number of interactions between position 1 and position 2 of chromosome is reported in each condition.replicate column (for example, C1.R2 holds the counts for replicate 2 of condition 1). There is no limit to the number of conditions and replicates.

To load Hi-C data in this format:

hic.experiment <- parseTabular(
    system.file("extdata", "hicsample_21.tsv", package = "HiCParser"),
    sep = "\t"
)
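To make the tabular layout concrete, here is a small base-R sketch that writes a toy file with the header shown above and reads it back. The first row reuses the example row from the layout; the second row and the temporary file are invented for illustration. The sketch only demonstrates the file layout. It does not call parseTabular(), and the toy file has not been run through the parser.

```r
# Toy interaction counts laid out like the header shown above.
# check.names = FALSE keeps the spaces in "position 1" / "position 2".
# Integer literals (L suffix) avoid scientific notation in the file.
toy <- data.frame(
    "chromosome" = c("Y", "Y"),
    "position 1" = c(1500000L, 1500000L),
    "position 2" = c(7500000L, 9000000L),  # second row is invented
    "C1.R1" = c(145L, 12L),
    "C1.R2" = c(184L, 9L),
    "C1.R3" = c(72L, 15L),
    check.names = FALSE
)

# Write a tab-separated file with a header line and no row names.
tsvPath <- tempfile(fileext = ".tsv")
write.table(toy, tsvPath, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout survives the round trip.
reread <- read.delim(tsvPath, check.names = FALSE)
print(reread)
```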

Output

The output is an InteractionSet. This object can store one or several samples. Please read the corresponding vignette to learn more about this format.

library("HiCParser")
hicFilePath <- system.file("extdata", "hicsample_21.hic", package = "HiCParser")
hic.experiment <- parseHiC(
    paths = rep(hicFilePath, 6),
    binSize = 5000000,
    conditions = rep(seq(2), each = 3),
    replicates = rep(seq(3), 2)
)
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
#>
#> Parsing '/tmp/RtmpHFfOT6/temp_libpathc52a1c587e54/HiCParser/extdata/hicsample_21.hic'.
hic.experiment
#> class: InteractionSet
#> dim: 44 6
#> metadata(0):
#> assays(1): ''
#> rownames: NULL
#> rowData names(1): chromosome
#> colnames: NULL
#> colData names(2): condition replicate
#> type: StrictGInteractions
#> regions: 9

The conditions and replicates are reported in the colData slot:

SummarizedExperiment::colData(hic.experiment)
#> DataFrame with 6 rows and 2 columns
#>   condition replicate
#>
#> 1         1         1
#> 2         1         2
#> 3         1         3
#> 4         2         1
#> 5         2         2
#> 6         2         3

They correspond to the columns of the assay matrix (which contains the interaction values):

head(SummarizedExperiment::assay(hic.experiment))
#>      [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,]   79   79   79   79   79   79
#> [2,]   22   22   22   22   22   22
#> [3,]    3    3    3    3    3    3
#> [4,]    1    1    1    1    1    1
#> [5,]    1    1    1    1    1    1
#> [6,]    2    2    2    2    2    2

The positions of interactions are in the interactions slot of the object:

InteractionSet::interactions(hic.experiment)
#> StrictGInteractions object with 44 interactions and 1 metadata column:
#>        seqnames1           ranges1     seqnames2           ranges2 | chromosome
#>                                                                    |
#>    [1]        21  5000001-10000000 ---        21  5000001-10000000 |         21
#>    [2]        21  5000001-10000000 ---        21 10000001-15000000 |         21
#>    [3]        21  5000001-10000000 ---        21 15000001-20000000 |         21
#>    [4]        21  5000001-10000000 ---        21 20000001-25000000 |         21
#>    [5]        21  5000001-10000000 ---        21 25000001-30000000 |         21
#>    ...       ...               ... ...       ...               ... .        ...
#>   [40]        21 35000001-40000000 ---        21 40000001-45000000 |         21
#>   [41]        21 35000001-40000000 ---        21 45000001-50000000 |         21
#>   [42]        21 40000001-45000000 ---        21 40000001-45000000 |         21
#>   [43]        21 40000001-45000000 ---        21 45000001-50000000 |         21
#>   [44]        21 45000001-50000000 ---        21 45000001-50000000 |         21
#>   -------
#>   regions: 9 ranges and 1 metadata column
#>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
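The ranges printed above are consistent with 1-based, closed bins of width binSize (for example 5000001-10000000 when binSize = 5000000). The following base-R sketch illustrates that arithmetic. `binRange` is a hypothetical helper, not a HiCParser function, and the convention is an assumption inferred from the printed output rather than from the package internals.

```r
# Hypothetical helper: the 1-based, closed bin that contains a
# 1-based genomic position, for a given bin size.
binRange <- function(position, binSize) {
    index <- (position - 1) %/% binSize  # 0-based bin index
    start <- index * binSize + 1         # first base of the bin
    c(start = start, end = start + binSize - 1)
}

# Position 7500000 with 5 Mb bins falls in the bin 5000001-10000000.
binRange(7500000, 5000000)
```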

Citation

Below is the citation output from using citation('HiCParser') in R. Please run this yourself to check for any updates on how to cite HiCParser.

To cite the ‘HiCParser’ package in a publication, use:

Maigné E, Zytnicki M (2024). A multiple format Hi-C data parser. doi:10.18129/B9.bioc.HiCParser (https://doi.org/10.18129/B9.bioc.HiCParser), https://github.com/emaigne/HiCParser/HiCParser - R package version 0.1.0, http://www.bioconductor.org/packages/HiCParser.

As a BibTeX entry:

@Manual{hicparser,
  title = {A multiple format Hi-C data parser},
  author = {Elise Maigné and Matthias Zytnicki},
  year = {2024},
  url = {http://www.bioconductor.org/packages/HiCParser},
  note = {https://github.com/emaigne/HiCParser/HiCParser - R package version 0.1.0},
  doi = {10.18129/B9.bioc.HiCParser},
}

Please note that HiCParser was only made possible thanks to many other R and bioinformatics software authors, who are cited in the vignettes and/or the paper(s) describing this package.

Code of Conduct

Please note that the HiCParser project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

Development tools

For more details, check the dev directory.

This package was developed using biocthis.