# plyranges: fluent genomic data analysis \[!\[R-CMD-check-bioc\](https://github.com/tidyomics/plyranges/workflows/R-CMD-check-bioc/badge.svg)\](https://github.com/tidyomics/plyranges/actions?query=workflow%3AR-CMD-check-bioc) \[!\[BioC status\](http://www.bioconductor.org/shields/build/release/bioc/plyranges.svg)\](https://bioconductor.org/checkResults/release/bioc-LATEST/plyranges) \[plyranges\](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) provides a consistent interface for importing and wrangling genomics data from a variety of sources. The package defines a grammar of genomic data transformation based on \`dplyr\` and the Bioconductor packages \`IRanges\`, \`GenomicRanges\`, and \`rtracklayer\`. It does this by providing a set of verbs for developing analysis pipelines based on \*Ranges\* objects that represent genomic regions: - Modify genomic regions with the \`mutate()\` and \`stretch()\` functions. - Modify genomic regions while fixing the start/end/center coordinates with the \`anchor\_\` family of functions. - Sort genomic ranges with \`arrange()\`. - Modify, subset, and aggregate genomic data with the \`mutate()\`, \`filter()\`, and \`summarise()\`functions. - Any of the above operations can be performed on partitions of the data with \`group\_by()\`. - Find nearest neighbour genomic regions with the \`join\_nearest\_\` family of functions. - Find overlaps between ranges with the \`join\_overlaps\_\` family of functions. - Merge all overlapping and adjacent genomic regions with \`reduce\_ranges()\`. - Merge the end points of all genomic regions with \`disjoin\_ranges()\`. - Import and write common genomic data formats with the \`read\_/write\_\` family of functions. For more details on the features of plyranges, read the \[vignette\](https://tidyomics.github.io/plyranges/articles/an-introduction.html). For a complete case-study on using plyranges to combine ATAC-seq and RNA-seq results read the \[\*fluentGenomics\* workflow\](https://tidyomics.github.io/fluentGenomics). plyranges is part of the \[tidyomics\](https://github.com/tidyomics) project, providing a \`dplyr\`-based interface for many types of genomics datasets represented in Bioconductor. # Installation \[plyranges\](https://www.bioconductor.org/packages/release/bioc/html/plyranges.html) can be installed from the latest Bioconductor release: \`\`\` r # install.packages("BiocManager") BiocManager::install("plyranges") \`\`\` To install the development version from GitHub: \`\`\` r BiocManager::install("tidyomics/plyranges") \`\`\` # Quick overview ## About \`Ranges\` \`Ranges\` objects can either represent sets of integers as \`IRanges\` (which have start, end and width attributes) or represent genomic intervals (which have additional attributes, sequence name, and strand) as \`GRanges\`. In addition, both types of \`Ranges\` can store information about their intervals as metadata columns (for example GC content over a genomic interval). \`Ranges\` objects follow the tidy data principle: each row of a \`Ranges\` object corresponds to an interval, while each column will represent a variable about that interval, and generally each object will represent a single unit of observation (like gene annotations). We can construct a \`IRanges\` object from a \`data.frame\` with a \`start\` or \`width\` using the \`as\_iranges()\` method. \`\`\` r library(plyranges) df <- data.frame(start = 1:5, width = 5) as\_iranges(df) #> IRanges object with 5 ranges and 0 metadata columns: #> start end width #> #> \[1\] 1 5 5 #> \[2\] 2 6 5 #> \[3\] 3 7 5 #> \[4\] 4 8 5 #> \[5\] 5 9 5 # alternatively with end df <- data.frame(start = 1:5, end = 5:9) as\_iranges(df) #> IRanges object with 5 ranges and 0 metadata columns: #> start end width #> #> \[1\] 1 5 5 #> \[2\] 2 6 5 #> \[3\] 3 7 5 #> \[4\] 4 8 5 #> \[5\] 5 9 5 \`\`\` We can also construct a \`GRanges\` object in a similar manner. Note that a \`GRanges\` object requires at least a seqnames column to be present in the data.frame (but not necessarily a strand column). \`\`\` r df <- data.frame(seqnames = c("chr1", "chr2", "chr2", "chr1", "chr2"), start = 1:5, width = 5) as\_granges(df) #> GRanges object with 5 ranges and 0 metadata columns: #> seqnames ranges strand #> #> \[1\] chr1 1-5 \* #> \[2\] chr2 2-6 \* #> \[3\] chr2 3-7 \* #> \[4\] chr1 4-8 \* #> \[5\] chr2 5-9 \* #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths # strand can be specified with \`+\`, \`\*\` (mising) and \`-\` df$strand <- c("+", "+", "-", "-", "\*") as\_granges(df) #> GRanges object with 5 ranges and 0 metadata columns: #> seqnames ranges strand #> #> \[1\] chr1 1-5 + #> \[2\] chr2 2-6 + #> \[3\] chr2 3-7 - #> \[4\] chr1 4-8 - #> \[5\] chr2 5-9 \* #> ------- #> seqinfo: 2 sequences from an unspecified genome; no seqlengths \`\`\` # Example: finding GWAS hits that overlap known exons Let’s look at a more a realistic example (taken from HelloRanges vignette). Suppose we have two \*GRanges\* objects: one containing coordinates of known exons and another containing SNPs from a GWAS. The first and last 5 exons are printed below, there are two additional columns corresponding to the exon name, and a score. We could check the number of exons per chromosome using \`group\_by\` and \`summarise\`. \`\`\` r exons #> GRanges object with 459752 ranges and 2 metadata columns: #> seqnames ranges strand | name score #> | #> \[1\] chr1 11874-12227 + | NR\_046018\_exon\_0\_0\_c.. 0 #> \[2\] chr1 12613-12721 + | NR\_046018\_exon\_1\_0\_c.. 0 #> \[3\] chr1 13221-14409 + | NR\_046018\_exon\_2\_0\_c.. 0 #> \[4\] chr1 14362-14829 - | NR\_024540\_exon\_0\_0\_c.. 0 #> \[5\] chr1 14970-15038 - | NR\_024540\_exon\_1\_0\_c.. 0 #> ... ... ... ... . ... ... #> \[459748\] chrY 59338754-59338859 + | NM\_002186\_exon\_6\_0\_c.. 0 #> \[459749\] chrY 59338754-59338859 + | NM\_176786\_exon\_7\_0\_c.. 0 #> \[459750\] chrY 59340194-59340278 + | NM\_002186\_exon\_7\_0\_c.. 0 #> \[459751\] chrY 59342487-59343488 + | NM\_002186\_exon\_8\_0\_c.. 0 #> \[459752\] chrY 59342487-59343488 + | NM\_176786\_exon\_8\_0\_c.. 0 #> ------- #> seqinfo: 93 sequences from an unspecified genome; no seqlengths exons %>% group\_by(seqnames) %>% summarise(n = n()) #> DataFrame with 49 rows and 2 columns #> seqnames n #> #> 1 chr1 43366 #> 2 chr10 19420 #> 3 chr11 24476 #> 4 chr12 24949 #> 5 chr13 7974 #> ... ... ... #> 45 chrUn\_gl000222 20 #> 46 chrUn\_gl000223 22 #> 47 chrUn\_gl000228 85 #> 48 chrX 18173 #> 49 chrY 4128 \`\`\` Next we create a column representing the transcript\_id with \`mutate\`: \`\`\` r exons <- exons %>% mutate(tx\_id = sub("\_exon.\*", "", name)) \`\`\` To find all GWAS SNPs that overlap exons, we use \`join\_overlap\_inner\`. This will create a new \*GRanges\* with the coordinates of SNPs that overlap exons, as well as metadata from both objects. \`\`\` r olap <- join\_overlap\_inner(gwas, exons) olap #> GRanges object with 3439 ranges and 4 metadata columns: #> seqnames ranges strand | name.x name.y score #> | #> \[1\] chr1 1079198 \* | rs11260603 NR\_038869\_exon\_2\_0\_c.. 0 #> \[2\] chr1 1247494 \* | rs12103 NM\_001256456\_exon\_1\_.. 0 #> \[3\] chr1 1247494 \* | rs12103 NM\_001256460\_exon\_1\_.. 0 #> \[4\] chr1 1247494 \* | rs12103 NM\_001256462\_exon\_1\_.. 0 #> \[5\] chr1 1247494 \* | rs12103 NM\_001256463\_exon\_1\_.. 0 #> ... ... ... ... . ... ... ... #> \[3435\] chrX 153764217 \* | rs1050828 NM\_001042351\_exon\_9\_.. 0 #> \[3436\] chrX 153764217 \* | rs1050828 NM\_000402\_exon\_9\_0\_c.. 0 #> \[3437\] chrX 153764217 \* | rs1050828 NM\_001042351\_exon\_9\_.. 0 #> \[3438\] chrX 153764217 \* | rs1050828 NM\_000402\_exon\_9\_0\_c.. 0 #> \[3439\] chrX 153764217 \* | rs1050828 NM\_001042351\_exon\_9\_.. 0 #> tx\_id #> #> \[1\] NR\_038869 #> \[2\] NM\_001256456 #> \[3\] NM\_001256460 #> \[4\] NM\_001256462 #> \[5\] NM\_001256463 #> ... ... #> \[3435\] NM\_001042351 #> \[3436\] NM\_000402 #> \[3437\] NM\_001042351 #> \[3438\] NM\_000402 #> \[3439\] NM\_001042351 #> ------- #> seqinfo: 93 sequences from an unspecified genome; no seqlengths \`\`\` For each SNP we can count the number of times it overlaps a transcript. \`\`\` r olap %>% group\_by(name.x, tx\_id) %>% summarise(n = n()) #> DataFrame with 1619 rows and 3 columns #> name.x tx\_id n #> #> 1 rs10043775 NM\_001271723 1 #> 2 rs10043775 NM\_030793 1 #> 3 rs10078 NM\_001242412 1 #> 4 rs10078 NM\_020731 1 #> 5 rs10089 NM\_001046 1 #> ... ... ... ... #> 1615 rs9906595 NM\_001008777 1 #> 1616 rs9948 NM\_017623 1 #> 1617 rs9948 NM\_199078 1 #> 1618 rs995030 NM\_000899 4 #> 1619 rs995030 NM\_003994 4 \`\`\` We can also generate 2bp splice sites on either side of the exon using \`flank\_left\` and \`flank\_right\`. We add a column indicating the side of flanking for illustrative purposes. The \`interweave\` function pairs the left and right ranges objects. \`\`\` r left\_ss <- flank\_left(exons, 2L) right\_ss <- flank\_right(exons, 2L) all\_ss <- interweave(left\_ss, right\_ss, .id = "side") all\_ss #> GRanges object with 919504 ranges and 4 metadata columns: #> seqnames ranges strand | name score #> | #> \[1\] chr1 11872-11873 + | NR\_046018\_exon\_0\_0\_c.. 0 #> \[2\] chr1 12228-12229 + | NR\_046018\_exon\_0\_0\_c.. 0 #> \[3\] chr1 12611-12612 + | NR\_046018\_exon\_1\_0\_c.. 0 #> \[4\] chr1 12722-12723 + | NR\_046018\_exon\_1\_0\_c.. 0 #> \[5\] chr1 13219-13220 + | NR\_046018\_exon\_2\_0\_c.. 0 #> ... ... ... ... . ... ... #> \[919500\] chrY 59340279-59340280 + | NM\_002186\_exon\_7\_0\_c.. 0 #> \[919501\] chrY 59342485-59342486 + | NM\_002186\_exon\_8\_0\_c.. 0 #> \[919502\] chrY 59343489-59343490 + | NM\_002186\_exon\_8\_0\_c.. 0 #> \[919503\] chrY 59342485-59342486 + | NM\_176786\_exon\_8\_0\_c.. 0 #> \[919504\] chrY 59343489-59343490 + | NM\_176786\_exon\_8\_0\_c.. 0 #> tx\_id side #> #> \[1\] NR\_046018 left #> \[2\] NR\_046018 right #> \[3\] NR\_046018 left #> \[4\] NR\_046018 right #> \[5\] NR\_046018 left #> ... ... ... #> \[919500\] NM\_002186 right #> \[919501\] NM\_002186 left #> \[919502\] NM\_002186 right #> \[919503\] NM\_176786 left #> \[919504\] NM\_176786 right #> ------- #> seqinfo: 93 sequences from an unspecified genome; no seqlengths \`\`\` # Learning more - The \[\*fluentGenomics\* workflow\](https://sa-lee.github.io/fluentGenomics) package shows you how to combine differential expression genes and differential chromatin accessibility peaks using plyranges. It extends the \[case study\](https://github.com/mikelove/plyrangesTximetaCaseStudy) by Michael Love for using plyranges with \[tximeta\](https://bioconductor.org/packages/release/bioc/html/tximeta.html). - The \[extended vignette in the plyrangesWorkshops package\](https://github.com/sa-lee/plyrangesWorkshops) has a detailed walk through of using plyranges for coverage analysis. - The \[Bioc 2018 Workshop book\](https://bioconductor.github.io/BiocWorkshops/fluent-genomic-data-analysis-with-plyranges.html) has worked examples of using \`plyranges\` to analyse publicly available genomics data. # Citation If you found \`plyranges\` useful for your work please cite our \[paper\](http://dx.doi.org/10.1186/s13059-018-1597-8): @ARTICLE{Lee2019, title = "plyranges: a grammar of genomic data transformation", author = "Lee, Stuart and Cook, Dianne and Lawrence, Michael", journal = "Genome Biol.", volume = 20, number = 1, pages = "4", month = jan, year = 2019, url = "http://dx.doi.org/10.1186/s13059-018-1597-8", doi = "10.1186/s13059-018-1597-8", pmc = "PMC6320618" } # Contributing We welcome contributions from the R/Bioconductor community. We ask that contributors follow the \[code of conduct\](.github/CODE\_OF\_CONDUCT.md) and the guide outlined \[here\](.github/CONTRIBUTING.md).