GitHub - showteeth/ggcoverage: Visualize and annotate genomic coverage with ggplot2 (original) (raw)

ggcoverage - Visualize and annotate omics coverage with ggplot2

CRAN R-CMD-check GitHub issues GitHub last commit License CODE_SIZE

Introduction

The goal of ggcoverage is to visualize coverage tracks from genomics, transcriptomics or proteomics data. It contains functions to load data from BAM, BigWig, BedGraph, txt, or xlsx files, create genome/protein coverage plots, and add various annotations including base and amino acid composition, GC content, copy number variation (CNV), genes, transcripts, ideograms, peak highlights, HiC contact maps, contact links and protein features. It is based on and integrates well with ggplot2.

It contains three main parts:

Installation

ggcoverage is an R package distributed as part of the CRAN repository. To install the package, start R and enter one of the following commands:

install via CRAN (not yet available)

install.packages("ggcoverage")

OR install via Github

install.package("remotes") remotes::install_github("showteeth/ggcoverage")

In general, it is recommended to install from the Github repository (updated more regularly).

Once ggcoverage is installed, it can be loaded like every other package:

Manual

ggcoverage provides twovignettes:

RNA-seq data

Load the data

The RNA-seq data used here is from Transcription profiling by high throughput sequencing of HNRNPC knockdown and control HeLa cells. We select four samples to use as example: ERR127307_chr14,ERR127306_chr14, ERR127303_chr14, ERR127302_chr14, and all bam files were converted to bigwig files withdeeptools.

Load metadata:

load metadata

meta_file <- system.file("extdata", "RNA-seq", "meta_info.csv", package = "ggcoverage") sample_meta <- read.csv(meta_file) sample_meta #> SampleName Type Group #> 1 ERR127302_chr14 KO_rep1 KO #> 2 ERR127303_chr14 KO_rep2 KO #> 3 ERR127306_chr14 WT_rep1 WT #> 4 ERR127307_chr14 WT_rep2 WT

Load track files:

track folder

track_folder <- system.file("extdata", "RNA-seq", package = "ggcoverage")

load bigwig file

track_df <- LoadTrackFile( track.folder = track_folder, format = "bw", region = "chr14:21,677,306-21,737,601", extend = 2000, meta.info = sample_meta )

check data

head(track_df) #> seqnames start end width strand score Type Group #> 1 chr14 21675306 21675950 645 * 0 KO_rep1 KO #> 2 chr14 21675951 21676000 50 * 1 KO_rep1 KO #> 3 chr14 21676001 21676100 100 * 2 KO_rep1 KO #> 4 chr14 21676101 21676150 50 * 1 KO_rep1 KO #> 5 chr14 21676151 21677100 950 * 0 KO_rep1 KO #> 6 chr14 21677101 21677200 100 * 2 KO_rep1 KO

Prepare mark region:

create mark region

mark_region <- data.frame( start = c(21678900, 21732001, 21737590), end = c(21679900, 21732400, 21737650), label = c("M1", "M2", "M3") )

check data

mark_region #> start end label #> 1 21678900 21679900 M1 #> 2 21732001 21732400 M2 #> 3 21737590 21737650 M3

Load GTF

To add gene annotation, the gtf file should contain gene_typeand gene_name attributes in column 9; to add transcript annotation, the gtf file should contain a transcript_nameattribute in column 9.

gtf_file <- system.file("extdata", "used_hg19.gtf", package = "ggcoverage") gtf_gr <- rtracklayer::import.gff(con = gtf_file, format = "gtf")

Basic coverage

The basic coverage plot has two types:

joint view

Create line plot for every sample (facet.key = "Type") and color by every sample (group.key = "Type"):

basic_coverage <- ggcoverage( data = track_df, plot.type = "joint", facet.key = "Type", group.key = "Type", mark.region = mark_region, range.position = "out" )

basic_coverage

Create group average line plot (sample is indicated byfacet.key = "Type", group is indicated by group.key = "Group"):

basic_coverage <- ggcoverage( data = track_df, plot.type = "joint", facet.key = "Type", group.key = "Group", joint.avg = TRUE, mark.region = mark_region, range.position = "out" )

basic_coverage

Facet view

basic_coverage <- ggcoverage( data = track_df, plot.type = "facet", mark.region = mark_region, range.position = "out" )

basic_coverage

Custom Y-axis style

Change the Y-axis scale label in/out of plot region withrange.position:

basic_coverage <- ggcoverage( data = track_df, plot.type = "facet", mark.region = mark_region, range.position = "in" )

basic_coverage

Shared/Free Y-axis scale with facet.y.scale:

basic_coverage <- ggcoverage( data = track_df, plot.type = "facet", mark.region = mark_region, range.position = "in", facet.y.scale = "fixed" )

basic_coverage

Add gene annotation

basic_coverage + geom_gene(gtf.gr = gtf_gr)

Add transcript annotation

In “loose” style (default style; each transcript occupies one line):

basic_coverage + geom_transcript(gtf.gr = gtf_gr, label.vjust = 1.5)

In “tight” style (attempted to place non-overlapping transcripts in one line):

basic_coverage + geom_transcript( gtf.gr = gtf_gr, overlap.style = "tight", label.vjust = 1.5 )

Add ideogram

The ideogram is an overview plot about the respective position on a chromosome. The plotting of the ideogram is implemented by the ggbiopackage. This package needs to be installed separately (it is only ‘Suggested’ by ggcoverage).

library(ggbio) #> Loading required package: BiocGenerics #> #> Attaching package: 'BiocGenerics' #> The following objects are masked from 'package:stats': #> #> IQR, mad, sd, var, xtabs #> The following objects are masked from 'package:base': #> #> anyDuplicated, aperm, append, as.data.frame, basename, cbind, #> colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, #> get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, #> match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, #> Position, rank, rbind, Reduce, rownames, sapply, setdiff, table, #> tapply, union, unique, unsplit, which.max, which.min #> Loading required package: ggplot2 #> Registered S3 method overwritten by 'GGally': #> method from
#> +.gg ggplot2 #> Need specific help about ggbio? try mailing #> the maintainer or visit https://lawremi.github.io/ggbio/ #> #> Attaching package: 'ggbio' #> The following objects are masked from 'package:ggplot2': #> #> geom_bar, geom_rect, geom_segment, ggsave, stat_bin, stat_identity, #> xlim

basic_coverage + geom_gene(gtf.gr = gtf_gr) + geom_ideogram(genome = "hg19", plot.space = 0) #> Loading ideogram... #> Loading ranges... #> Scale for x is already present. #> Adding another scale for x, which will replace the existing scale.

basic_coverage + geom_transcript(gtf.gr = gtf_gr, label.vjust = 1.5) + geom_ideogram(genome = "hg19", plot.space = 0) #> Loading ideogram... #> Loading ranges... #> Scale for x is already present. #> Adding another scale for x, which will replace the existing scale.

DNA-seq data

CNV

Example 1

Load the data

The DNA-seq data used here are from Copy number work flow, we select tumor sample, and get bin counts withcn.mops::getReadCountsFromBAM with WL 1000.

prepare metafile

cnv_meta_info <- data.frame( SampleName = c("CNV_example"), Type = c("tumor"), Group = c("tumor") )

track file

track_file <- system.file("extdata", "DNA-seq", "CNV_example.txt", package = "ggcoverage" )

load txt file

track_df <- LoadTrackFile( track.file = track_file, format = "txt", region = "chr4:61750000-62,700,000", meta.info = cnv_meta_info )

check data

head(track_df) #> seqnames start end score Type Group #> 1 chr4 61748000 61748000 25 tumor tumor #> 2 chr4 61748001 61749000 24 tumor tumor #> 3 chr4 61749001 61750000 17 tumor tumor #> 4 chr4 61750001 61751000 23 tumor tumor #> 5 chr4 61751001 61752000 14 tumor tumor #> 6 chr4 61752001 61753000 22 tumor tumor

Basic coverage

basic_coverage <- ggcoverage( data = track_df, color = "grey", mark.region = NULL, range.position = "out" ) basic_coverage

Add GC annotations

Add GC, ideogram and gene annotations. The plotting of the GC content requires the genome annotation packageBSgenome.Hsapiens.UCSC.hg19. This package needs to be installed separately (it is only ‘Suggested’ by ggcoverage).

load genome data

library("BSgenome.Hsapiens.UCSC.hg19") #> Loading required package: BSgenome #> Loading required package: S4Vectors #> Loading required package: stats4 #> #> Attaching package: 'S4Vectors' #> The following object is masked from 'package:utils': #> #> findMatches #> The following objects are masked from 'package:base': #> #> expand.grid, I, unname #> Loading required package: IRanges #> Loading required package: GenomeInfoDb #> Loading required package: GenomicRanges #> Loading required package: Biostrings #> Loading required package: XVector #> #> Attaching package: 'Biostrings' #> The following object is masked from 'package:base': #> #> strsplit #> Loading required package: BiocIO #> Loading required package: rtracklayer #> #> Attaching package: 'rtracklayer' #> The following object is masked from 'package:BiocIO': #> #> FileForFormat

create plot

basic_coverage + geom_gc(bs.fa.seq = BSgenome.Hsapiens.UCSC.hg19) + geom_gene(gtf.gr = gtf_gr) + geom_ideogram(genome = "hg19") #> Loading ideogram... #> Loading ranges... #> Scale for x is already present. #> Adding another scale for x, which will replace the existing scale.

Example 2

Load the data

The DNA-seq data used here are from Genome-wide copy number analysis of single cells, and the accession number isSRR054616.

track file

track_file <- system.file("extdata", "DNA-seq", "SRR054616.bw", package = "ggcoverage")

load track

track_df <- LoadTrackFile( track.file = track_file, format = "bw", region = "4:1-160000000" ) #> No metadata provided, returning coverage as is.

add chr prefix

track_df$seqnames <- paste0("chr", track_df$seqnames)

check data

head(track_df) #> seqnames start end width strand score Type Group #> 1 chr4 1 50000 50000 * 197 SRR054616.bw SRR054616.bw #> 2 chr4 50001 100000 50000 * 598 SRR054616.bw SRR054616.bw #> 3 chr4 100001 150000 50000 * 287 SRR054616.bw SRR054616.bw #> 4 chr4 150001 200000 50000 * 179 SRR054616.bw SRR054616.bw #> 5 chr4 200001 250000 50000 * 282 SRR054616.bw SRR054616.bw #> 6 chr4 250001 300000 50000 * 212 SRR054616.bw SRR054616.bw

Basic coverage

basic_coverage <- ggcoverage( data = track_df, color = "grey", mark.region = NULL, range.position = "out" )

basic_coverage

Load CNV file

prepare files

cnv_file <- system.file("extdata", "DNA-seq", "SRR054616_copynumber.txt", package = "ggcoverage" )

read CNV

cnv_df <- read.table(file = cnv_file, sep = "\t", header = TRUE)

check data

head(cnv_df) #> chrom chrompos cn.ratio copy.number #> 1 chr4 1 11.518554 5 #> 2 chr4 90501 5.648878 5 #> 3 chr4 145220 4.031609 5 #> 4 chr4 209519 5.005852 5 #> 5 chr4 268944 4.874096 5 #> 6 chr4 330272 4.605368 5

Add annotations

Add GC, ideogram and CNV annotations.

create plot

basic_coverage + geom_gc(bs.fa.seq = BSgenome.Hsapiens.UCSC.hg19) + geom_cnv( cnv.df = cnv_df, bin.col = 3, cn.col = 4 ) + geom_ideogram( genome = "hg19", plot.space = 0, highlight.centromere = TRUE ) #> Loading ideogram... #> Loading ranges... #> Scale for x is already present. #> Adding another scale for x, which will replace the existing scale.

Single-nucleotide level

Load the data

prepare sample metadata

sample_meta <- data.frame( SampleName = c("tumorA.chr4.selected"), Type = c("tumorA"), Group = c("tumorA") )

load bam file

bam_file <- system.file("extdata", "DNA-seq", "tumorA.chr4.selected.bam", package = "ggcoverage" )

track_df <- LoadTrackFile( track.file = bam_file, meta.info = sample_meta, single.nuc = TRUE, single.nuc.region = "chr4:62474235-62474295" ) #> No 'region' specified; extracting coverage for an example range #> (<=100,000 bases, first annotated sequence) #> Coverage extracted from sequence/chromosome: chr10

head(track_df) #> seqnames start end width strand score Type Group #> 1 chr4 62474235 62474236 1 * 5 tumorA tumorA #> 2 chr4 62474236 62474237 1 * 5 tumorA tumorA #> 3 chr4 62474237 62474238 1 * 5 tumorA tumorA #> 4 chr4 62474238 62474239 1 * 6 tumorA tumorA #> 5 chr4 62474239 62474240 1 * 6 tumorA tumorA #> 6 chr4 62474240 62474241 1 * 6 tumorA tumorA

Default color scheme

For base and amino acid annotation, the package comes with the following default color schemes. Color schemes can be changed with nuc.color andaa.color parameters.

THe default color scheme for base annotation is Clustal-style, more popular color schemes are availablehere.

color scheme

nuc_color <- c( "A" = "#ff2b08", "C" = "#009aff", "G" = "#ffb507", "T" = "#00bc0d" )

create plot

graphics::image( seq_along(nuc_color), 1, as.matrix(seq_along(nuc_color)), col = nuc_color, xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n" ) graphics::text(seq_along(nuc_color), 1, names(nuc_color)) graphics::mtext( text = "Base", adj = 1, las = 1, side = 2 )

Default color scheme for amino acid annotation is from Residual colours: a proposal for aminochromography:

aa_color <- c( "D" = "#FF0000", "S" = "#FF2400", "T" = "#E34234", "G" = "#FF8000", "P" = "#F28500", "C" = "#FFFF00", "A" = "#FDFF00", "V" = "#E3FF00", "I" = "#C0FF00", "L" = "#89318C", "M" = "#00FF00", "F" = "#50C878", "Y" = "#30D5C8", "W" = "#00FFFF", "H" = "#0F2CB3", "R" = "#0000FF", "K" = "#4b0082", "N" = "#800080", "Q" = "#FF00FF", "E" = "#8F00FF", "*" = "#FFC0CB", " " = "#FFFFFF", " " = "#FFFFFF", " " = "#FFFFFF", " " = "#FFFFFF" )

graphics::image( 1:5, 1:5, matrix(seq_along(aa_color), nrow = 5), col = rev(aa_color), xlab = "", ylab = "", xaxt = "n", yaxt = "n", bty = "n" )

graphics::text(expand.grid(1:5, 1:5), names(rev(aa_color))) graphics::mtext( text = "Amino acids", adj = 1, las = 1, side = 2 )

Add base and amino acid annotation

Use twill to mark position with SNV:

create plot with twill mark

ggcoverage( data = track_df, color = "grey", range.position = "out", single.nuc = TRUE, rect.color = "white" ) + geom_base( bam.file = bam_file, bs.fa.seq = BSgenome.Hsapiens.UCSC.hg19, mark.type = "twill" ) + geom_ideogram(genome = "hg19", plot.space = 0) #> Loading ideogram... #> Loading ranges... #> Scale for x is already present. #> Adding another scale for x, which will replace the existing scale.

Use star to mark position with SNV:

create plot with star mark

ggcoverage( data = track_df, color = "grey", range.position = "out", single.nuc = TRUE, rect.color = "white" ) + geom_base( bam.file = bam_file, bs.fa.seq = BSgenome.Hsapiens.UCSC.hg19, mark.type = "star" ) + geom_ideogram(genome = "hg19", plot.space = 0) #> Loading ideogram... #> Loading ranges... #> Scale for x is already present. #> Adding another scale for x, which will replace the existing scale.

Highlight position with SNV:

highlight one base

ggcoverage( data = track_df, color = "grey", range.position = "out", single.nuc = TRUE, rect.color = "white" ) + geom_base( bam.file = bam_file, bs.fa.seq = BSgenome.Hsapiens.UCSC.hg19, mark.type = "highlight" ) + geom_ideogram(genome = "hg19", plot.space = 0) #> Loading ideogram... #> Loading ranges... #> Scale for x is already present. #> Adding another scale for x, which will replace the existing scale.

ChIP-seq data

The ChIP-seq data used here is fromDiffBind. Four samples are selected as examples: Chr18_MCF7_input,Chr18_MCF7_ER_1, Chr18_MCF7_ER_3, Chr18_MCF7_ER_2, and all bam files were converted to bigwig files withdeeptools.

Create metadata:

load metadata

sample_meta <- data.frame( SampleName = c( "Chr18_MCF7_ER_1", "Chr18_MCF7_ER_2", "Chr18_MCF7_ER_3", "Chr18_MCF7_input" ), Type = c("MCF7_ER_1", "MCF7_ER_2", "MCF7_ER_3", "MCF7_input"), Group = c("IP", "IP", "IP", "Input") )

sample_meta #> SampleName Type Group #> 1 Chr18_MCF7_ER_1 MCF7_ER_1 IP #> 2 Chr18_MCF7_ER_2 MCF7_ER_2 IP #> 3 Chr18_MCF7_ER_3 MCF7_ER_3 IP #> 4 Chr18_MCF7_input MCF7_input Input

Load track files:

track folder

track_folder <- system.file("extdata", "ChIP-seq", package = "ggcoverage")

load bigwig file

track_df <- LoadTrackFile( track.folder = track_folder, format = "bw", region = "chr18:76822285-76900000", meta.info = sample_meta )

check data

head(track_df) #> seqnames start end width strand score Type Group #> 1 chr18 76820285 76820400 116 * 219.658005 MCF7_ER_1 IP #> 2 chr18 76820401 76820700 300 * 0.000000 MCF7_ER_1 IP #> 3 chr18 76820701 76821000 300 * 439.316010 MCF7_ER_1 IP #> 4 chr18 76821001 76821300 300 * 219.658005 MCF7_ER_1 IP #> 5 chr18 76821301 76821600 300 * 0.000000 MCF7_ER_1 IP #> 6 chr18 76821601 76821900 300 * 219.658005 MCF7_ER_1 IP

Prepare mark region:

create mark region

mark_region <- data.frame( start = c(76822533), end = c(76823743), label = c("Promoter") )

check data

mark_region #> start end label #> 1 76822533 76823743 Promoter

Basic coverage

basic_coverage <- ggcoverage( data = track_df, mark.region = mark_region, show.mark.label = FALSE ) basic_coverage

Add annotations

Add gene, ideogram and peak annotations. To create peak annotation, we first get consensus peaks withMSPC.

get consensus peak file

peak_file <- system.file("extdata", "ChIP-seq", "consensus.peak", package = "ggcoverage" )

basic_coverage + geom_gene(gtf.gr = gtf_gr) + geom_peak(bed.file = peak_file) + geom_ideogram(genome = "hg19", plot.space = 0) #> Loading ideogram... #> Loading ranges... #> Scale for x is already present. #> Adding another scale for x, which will replace the existing scale.

Hi-C data

The Hi-C method maps chromosome contacts in eukaryotic cells. For this purpose, DNA and protein complexes are cross-linked and DNA fragments then purified. As a result, even distant chromatin fragments can be found to interact due to the spatial organization of the DNA and histones in the cell. Hi-C data shows these interactions for example as a contact map.

The Hi-C data is taken from pyGenomeTracks: reproducible plots for multivariate genomic datasets.

The Hi-C matrix visualization is implemented byHiCBricks. This package needs to be installed separately (it is only ‘Suggested’ byggcoverage).

Load track data

prepare track dataframe

track_file <- system.file("extdata", "HiC", "H3K36me3.bw", package = "ggcoverage")

track_df <- LoadTrackFile( track.file = track_file, format = "bw", region = "chr2L:8050000-8300000", extend = 0 ) #> No metadata provided, returning coverage as is.

track_df$score <- ifelse(track_df$score < 0, 0, track_df$score)

check the data

head(track_df) #> seqnames start end width strand score Type Group #> 1 chr2L 8050000 8050009 10 * 1.66490245 H3K36me3.bw H3K36me3.bw #> 2 chr2L 8050015 8050049 35 * 1.59976900 H3K36me3.bw H3K36me3.bw #> 3 chr2L 8050057 8050091 35 * 1.60730922 H3K36me3.bw H3K36me3.bw #> 4 chr2L 8050097 8050131 35 * 1.65555012 H3K36me3.bw H3K36me3.bw #> 5 chr2L 8050137 8050171 35 * 1.71025538 H3K36me3.bw H3K36me3.bw #> 6 chr2L 8050176 8050210 35 * 1.75198197 H3K36me3.bw H3K36me3.bw

Load Hi-C data

Matrix:

matrix

hic_mat_file <- system.file("extdata", "HiC", "HiC_mat.txt", package = "ggcoverage" ) hic_mat <- read.table(file = hic_mat_file, sep = "\t") hic_mat <- as.matrix(hic_mat)

Bin table:

bin

hic_bin_file <- system.file("extdata", "HiC", "HiC_bin.txt", package = "ggcoverage") hic_bin <- read.table(file = hic_bin_file, sep = "\t") colnames(hic_bin) <- c("chr", "start", "end") hic_bin_gr <- GenomicRanges::makeGRangesFromDataFrame(df = hic_bin)

transfrom func

failsafe_log10 <- function(x) { x[is.na(x) | is.nan(x) | is.infinite(x)] <- 0 return(log10(x + 1)) }

Data transfromation method:

prepare arcs

link_file <- system.file("extdata", "HiC", "HiC_link.bedpe", package = "ggcoverage")

Basic coverage

basic_coverage <- ggcoverage( data = track_df, color = "grey", mark.region = NULL, range.position = "out" )

basic_coverage

Add annotations

Add link, contact mapannotations:

library(HiCBricks) #> Loading required package: curl #> Using libcurl 7.81.0 with OpenSSL/3.0.2 #> Loading required package: rhdf5 #> Loading required package: R6 #> Loading required package: grid #> #> Attaching package: 'grid' #> The following object is masked from 'package:Biostrings': #> #> pattern

basic_coverage + geom_tad( matrix = hic_mat, granges = hic_bin_gr, value.cut = 0.99, color.palette = "viridis", transform.fun = failsafe_log10, top = FALSE, show.rect = TRUE ) + geom_link( link.file = link_file, file.type = "bedpe", show.rect = TRUE ) #> Read 534 lines after Skipping 0 lines #> Inserting Data at location: 1 #> Data length: 534 #> Loaded 2315864 bytes of data... #> Read 534 records... #> Scale for y is already present. #> Adding another scale for y, which will replace the existing scale. #> Scale for x is already present. #> Adding another scale for x, which will replace the existing scale.

Mass spectrometry protein coverage

Mass spectrometry(MS) is an important method for the accurate mass determination and characterization of proteins, and a variety of methods and instruments have been developed for its many uses. With ggcoverage, we can easily inspect the peptide coverage of a protein in order to learn about the quality of the data.

Load coverage

The exported coverage from Proteome Discoverer:

library(openxlsx)

prepare coverage dataframe

coverage_file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.xlsx", package = "ggcoverage" ) coverage_df <- openxlsx::read.xlsx(coverage_file, sheet = "Sheet1")

check the data

head(coverage_df) #> Confidence Annotated.Sequence #> 1 High [K].ATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPK.[L] #> 2 High [K].ATEEQLKTVMENFVAFVDKCCAADDKEACFAVEGPK.[L] #> 3 High [K].TVMENFVAFVDKCCAADDKEACFAVEGPK.[L] #> 4 High [K].HLVDEPQNLIKQNCDQFEKLGEYGFQNALIVR.[Y] #> 5 High [R].RHPYFYAPELLYYANKYNGVFQECCQAEDKGACLLPK.[I] #> 6 High [K].AFDEKLFTFHADICTLPDTEKQIKK.[Q] #> Modifications Contaminant #> 1 3xCarbamidomethyl [C20; C21; C29] TRUE #> 2 3xCarbamidomethyl [C20; C21; C29]; 1xOxidation [M10] TRUE #> 3 3xCarbamidomethyl [C13; C14; C22]; 1xOxidation [M3] TRUE #> 4 1xCarbamidomethyl [C14] TRUE #> 5 3xCarbamidomethyl [C24; C25; C33] TRUE #> 6 1xCarbamidomethyl [C14] TRUE #> #.Protein.Groups #.Proteins #.PSMs Master.Protein.Accessions #> 1 1 2 15 ALBU_BOVIN #> 2 1 2 26 ALBU_BOVIN #> 3 1 2 14 ALBU_BOVIN #> 4 1 2 41 ALBU_BOVIN #> 5 1 2 37 ALBU_BOVIN #> 6 1 2 40 ALBU_BOVIN #> Positions.in.Master.Proteins Modifications.in.Master.Proteins #> 1 ALBU_BOVIN [562-597] NA #> 2 ALBU_BOVIN [562-597] NA #> 3 ALBU_BOVIN [569-597] NA #> 4 ALBU_BOVIN [402-433] NA #> 5 ALBU_BOVIN [168-204] NA #> 6 ALBU_BOVIN [524-548] NA #> #.Missed.Cleavages Theo..MH+.[Da] Abundance:.F3:.Sample Quan.Info #> 1 3 4107.88065 18692597.5 #> 2 3 4123.87556 87767162.0 #> 3 2 3324.46798 19803927.2 #> 4 2 3815.91737 204933705.0 #> 5 3 4513.12024 57012156.5 #> 6 3 2995.52337 183934556.7 #> Found.in.Sample:.[S3].F3:.Sample Confidence.(by.Search.Engine):.Sequest.HT #> 1 High High #> 2 High High #> 3 High High #> 4 High High #> 5 High High #> 6 High High #> XCorr.(by.Search.Engine):.Sequest.HT Top.Apex.RT.[min] #> 1 11.96 97.50 #> 2 10.91 90.09 #> 3 9.89 84.90 #> 4 9.75 91.84 #> 5 8.94 93.30 #> 6 8.90 75.40

The input protein fasta:

fasta_file <- system.file("extdata", "Proteomics", "MS_BSA_coverage.fasta", package = "ggcoverage" )

prepare track dataframe

protein_set <- Biostrings::readAAStringSet(fasta_file)

check the data

protein_set #> AAStringSet object of length 2: #> width seq names
#> [1] 607 MKWVTFISLLLLFSSAYSRGVFR...DDKEACFAVEGPKLVVSTQTALA sp|P02769|ALBU_BOVIN #> [2] 583 DTHKSEIAHRFKDLGEEHFKGLV...DDKEACFAVEGPKLVVSTQTALA decoy

Protein coverage

protein_coverage <- ggprotein( coverage.df = coverage_df, fasta.file = fasta_file, protein.id = "sp|P02769|ALBU_BOVIN", range.position = "out" )

protein_coverage

Add annotation

We can obtain features of the protein fromUniProt. For example, the above protein coverage plot shows that there is empty region in 1-24, and this empty region in UniProt is annotated as Signal peptide and Propeptide peptide. When the protein is mature and released extracellular, these peptides will be cleaved. This is the reason why there is empty region in 1-24.

protein feature obtained from UniProt

protein_feature_df <- data.frame( ProteinID = "sp|P02769|ALBU_BOVIN", start = c(1, 19, 25), end = c(18, 24, 607), Type = c("Signal", "Propeptide", "Chain") )

add annotation

protein_coverage + geom_feature( feature.df = protein_feature_df, feature.color = c("#4d81be", "#173b5e", "#6a521d") )

Code of Conduct

Please note that the ggcoverage project is released with aContributor Code of Conduct. By contributing to this project, you agree to abide by its terms.