Index file for the qcmetrics package vignette (original) (raw)

Microarray degradation

The Microarray degradation section has been removed since the packages it was depending on have been deprecated.

Proteomics raw data

To illustrate a simple QC analysis for proteomics data, we will download data set PXD00001 from the ProteomeXchange repository in the mzXML format (Pedrioli and others 2004). The MS2 spectra from that mass-spectrometry run are then read into Rand stored as an MSnExpexperiment using the readMSData function from the MSnbase package(Gatto and Lilley 2012).

library("RforProteomics")
msfile <- getPXD000001mzXML()
library("MSnbase")
exp <- readMSData(msfile, verbose = FALSE)

In the interest of time, this code chunk has been pre-computed and a subset (1 in 3) of the exp instance is distributed with the package. The data is loaded with

load(system.file("extdata/exp.rda", package = "qcmetrics"))

The QcMetrics will consist of 3 items, namely a chromatogram constructed with the MS2 spectra precursor’s intensities, a figure illustrating the precursor charges in the MS space and an m/z delta plot illustrating the suitability of MS2 spectra for identification (see ?plotMzDelta or (Foster et al. 2011)).

qc1 <- QcMetric(name = "Chromatogram")
x <- rtime(exp)
y <- precursorIntensity(exp)
o <- order(x)
qcdata(qc1, "x") <- x[o]
qcdata(qc1, "y") <- y[o]
plot(qc1) <- function(object, ...)
    plot(qcdata(object, "x"),
         qcdata(object, "y"),
         col = "darkgrey", type ="l",
         xlab = "retention time",
         ylab = "precursor intensity")
qc2 <- QcMetric(name = "MS space")
qcdata(qc2, "p2d") <- plot2d(exp, z = "charge", plot = FALSE)
plot(qc2) <- function(object) {
    require("ggplot2")
    print(qcdata(object, "p2d"))
}
qc3 <- QcMetric(name = "m/z delta plot")
qcdata(qc3, "pmz") <- plotMzDelta(exp, plot = FALSE,
                                  verbose = FALSE)
plot(qc3) <- function(object)
    suppressWarnings(print(qcdata(object, "pmz")))

Note that we do not store the raw data in any of the above instances, but always pre-compute the necessary data or plots that are then stored as qcdata. If the raw data was to be needed in multipleQcMetric instances, we could re-use the same qcdata _environment_to avoid unnecessary copies using qcdata(qc2) <- qcenv(qc1) and implement different views through custom plot methods.

Let’s now combine the three items into a QcMetrics object, decorate it with custom metadata using the MIAPE information from the MSnExpobject and generate a report.

protqcm <- QcMetrics(qcdata = list(qc1, qc2, qc3))
metadata(protqcm) <- list(
    data = "PXD000001",
    instrument = experimentData(exp)@instrumentModel,
    source = experimentData(exp)@ionSource,
    analyser = experimentData(exp)@analyser,
    detector = experimentData(exp)@detectorType,
    manufacurer = experimentData(exp)@instrumentManufacturer)

The status column of the summary table is empty as we have not set the QC items statuses yet.

qcReport(protqcm, reportname = "protqc")

Figure 1: Proteomics QC report

The complete pdf report is available with:

browseURL(example_reports("protqc"))

Processed N15 labelling data

In this section, we describe a set of N15 metabolic labelling QC metrics (Krijgsveld et al. 2003). The data is a phospho-enriched N15 labelled Arabidopsis thaliana sample prepared as described in(Groen et al. 2013). The data was processed with in-house tools and is available as an MSnSet instance. Briefly, MS2 spectra were search with the Mascot engine and identification scores adjusted with Mascot Percolator. Heavy and light pairs were then searched in the survey scans and N15 incorporation was estimated based on the peptide sequence and the isotopic envelope of the heavy member of the pair (the inc feature variable). Heavy and light peptides isotopic envelope areas were finally integrated to obtain unlabelled and N15 quantitation data. The psm object provides such data for PSMs (peptide spectrum matches) with a posterior error probability < 0.05 that can be uniquely matched to proteins.

We first load the MSnbase package (required to support the MSnSetdata structure) and example data that is distributed with theqcmetrics package. We will make use of the ggplot2 plotting package.

library("ggplot2")
library("MSnbase")
data(n15psm)
psm
## MSnSet (storageMode: lockedEnvironment)
## assayData: 1772 features, 2 samples 
##   element names: exprs 
## protocolData: none
## phenoData: none
## featureData
##   featureNames: 3 5 ... 4499 (1772 total)
##   fvarLabels: Protein_Accession Protein_Description ... inc (21 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 23681576 
## Annotation:  
## - - - Processing information - - -
## Subset [22540,2][1999,2] Tue Sep 17 01:34:09 2013 
## Removed features with more than 0 NAs: Tue Sep 17 01:34:09 2013 
## Dropped featureData's levels Tue Sep 17 01:34:09 2013 
##  MSnbase version: 1.9.7

The first QC item examines the N15 incorporation rate, available in the inc feature variable. We also defined a median incorporation rate threshold tr equal to 97.5 that is used to set the QC status.

## incorporation rate QC metric
qcinc <- QcMetric(name = "15N incorporation rate")
qcdata(qcinc, "inc") <- fData(psm)$inc
qcdata(qcinc, "tr") <- 97.5
status(qcinc) <- median(qcdata(qcinc, "inc")) > qcdata(qcinc, "tr")

Next, we implement a custom show method, that prints 5 summary values of the variable’s distribution.

show(qcinc) <- function(object) {
    qcshow(object, qcdata = FALSE)
    cat(" QC threshold:", qcdata(object, "tr"), "\n")
    cat(" Incorporation rate\n")
    print(summary(qcdata(object, "inc")))
    invisible(NULL)
}

We then define the metric’s plot function that represent the distribution of the PSM’s incorporation rates as a boxplot, shows all the individual rates as jittered dots and represents the trthreshold as a dotted red line.

plot(qcinc) <- function(object) {
    inc <- qcdata(object, "inc")
    tr <- qcdata(object, "tr")
    lab <- "Incorporation rate"
    dd <- data.frame(inc = qcdata(qcinc, "inc"))
    p <- ggplot(dd, aes(factor(""), inc)) +
        geom_jitter(colour = "#4582B370", size = 3) +
    geom_boxplot(fill = "#FFFFFFD0", colour = "#000000",
                 outlier.size = 0) +
    geom_hline(yintercept = tr, colour = "red",
               linetype = "dotted", size = 1) +
    labs(x = "", y = "Incorporation rate")
    p
}

N15 experiments of good quality are characterised by high incorporation rates, which allow to deconvolute the heavy and light peptide isotopic envelopes and accurate quantification.

The second metric inspects the log2 fold-changes of the PSMs, unique peptides with modifications, unique peptide sequences (not taking modifications into account) and proteins. These respective data sets are computed with the combineFeatures function (see?combineFeatures for details).

fData(psm)$modseq <- ## pep seq + PTM
    paste(fData(psm)$Peptide_Sequence,
          fData(psm)$Variable_Modifications, sep = "+")
pep <- combineFeatures(psm,
                       as.character(fData(psm)$Peptide_Sequence),
                       "median", verbose = FALSE)
modpep <- combineFeatures(psm,
                          fData(psm)$modseq,
                          "median", verbose = FALSE)
prot <- combineFeatures(psm,
                        as.character(fData(psm)$Protein_Accession),
                        "median", verbose = FALSE)

The log2 fold-changes for all the features are then computed and stored as QC data of our next QC item. We also store a pair of valuesexplfc that defined an interval in which we expect our median PSM log2 fold-change to be.

## calculate log fold-change
qclfc <- QcMetric(name = "Log2 fold-changes")
qcdata(qclfc, "lfc.psm") <-
    log2(exprs(psm)[,"unlabelled"] / exprs(psm)[, "N15"])
qcdata(qclfc, "lfc.pep") <-
    log2(exprs(pep)[,"unlabelled"] / exprs(pep)[, "N15"])
qcdata(qclfc, "lfc.modpep") <-
    log2(exprs(modpep)[,"unlabelled"] / exprs(modpep)[, "N15"])
qcdata(qclfc, "lfc.prot") <-
    log2(exprs(prot)[,"unlabelled"] / exprs(prot)[, "N15"])
qcdata(qclfc, "explfc") <- c(-0.5, 0.5)

status(qclfc) <-
    median(qcdata(qclfc, "lfc.psm")) > qcdata(qclfc, "explfc")[1] &
    median(qcdata(qclfc, "lfc.psm")) < qcdata(qclfc, "explfc")[2]

As previously, we provide a custom show method that displays summary values for the four fold-changes. The plot function illustrates the respective log2 fold-change densities and the expected median PSM fold-change range (red rectangle). The expected 0 log2 fold-change is shown as a dotted black vertical line and the observed median PSM value is shown as a blue dashed line.

show(qclfc) <- function(object) {
    qcshow(object, qcdata = FALSE) ## default
    cat(" QC thresholds:", qcdata(object, "explfc"), "\n")
    cat(" * PSM log2 fold-changes\n")
    print(summary(qcdata(object, "lfc.psm")))
    cat(" * Modified peptide log2 fold-changes\n")
    print(summary(qcdata(object, "lfc.modpep")))
    cat(" * Peptide log2 fold-changes\n")
    print(summary(qcdata(object, "lfc.pep")))
    cat(" * Protein log2 fold-changes\n")
    print(summary(qcdata(object, "lfc.prot")))
    invisible(NULL)
}
plot(qclfc) <- function(object) {
    x <- qcdata(object, "explfc")
    plot(density(qcdata(object, "lfc.psm")),
         main = "", sub = "", col = "red",
         ylab = "", lwd = 2,
         xlab = expression(log[2]~fold-change))
    lines(density(qcdata(object, "lfc.modpep")),
          col = "steelblue", lwd = 2)
    lines(density(qcdata(object, "lfc.pep")),
          col = "blue", lwd = 2)
    lines(density(qcdata(object, "lfc.prot")),
          col = "orange")
    abline(h = 0, col = "grey")
    abline(v = 0, lty = "dotted")
    rect(x[1], -1, x[2], 1, col = "#EE000030",
         border = NA)
    abline(v = median(qcdata(object, "lfc.psm")),
           lty = "dashed", col = "blue")
    legend("topright",
           c("PSM", "Peptides", "Modified peptides", "Proteins"),
           col = c("red", "steelblue", "blue", "orange"), lwd = 2,
           bty = "n")
}

A good quality experiment is expected to have a tight distribution centred around 0. Major deviations would indicate incomplete incorporation, errors in the respective amounts of light and heavy material used, and a wide distribution would reflect large variability in the data.

Our last QC item inspects the number of features that have been identified in the experiment. We also investigate how many peptides (with or without considering the modification) have been observed at the PSM level and the number of unique peptides per protein. Here, we do not specify any expected values as the number of observed features is experiment specific; the QC status is left as NA.

## number of features
qcnb <- QcMetric(name = "Number of features")
qcdata(qcnb, "count") <- c(
    PSM = nrow(psm),
    ModPep = nrow(modpep),
    Pep = nrow(pep),
    Prot = nrow(prot))
qcdata(qcnb, "peptab") <-
    table(fData(psm)$Peptide_Sequence)
qcdata(qcnb, "modpeptab") <-
    table(fData(psm)$modseq)
qcdata(qcnb, "upep.per.prot") <-
    fData(psm)$Number_Of_Unique_Peptides

The counts are displayed by the new show and plotted as bar charts by the plot methods.

show(qcnb) <- function(object) {
    qcshow(object, qcdata = FALSE)
    print(qcdata(object, "count"))
}
plot(qcnb) <- function(object) {
    par(mar = c(5, 4, 2, 1))
    layout(matrix(c(1, 2, 1, 3, 1, 4), ncol = 3))
    barplot(qcdata(object, "count"), horiz = TRUE, las = 2)
    barplot(table(qcdata(object, "modpeptab")),
            xlab = "Modified peptides")
    barplot(table(qcdata(object, "peptab")),
            xlab = "Peptides")
    barplot(table(qcdata(object, "upep.per.prot")),
            xlab = "Unique peptides per protein ")
}

In the code chunk below, we combine the 3 QC items into a QcMetricsinstance and generate a report using meta data extracted from thepsm MSnSet instance.

n15qcm <- QcMetrics(qcdata = list(qcinc, qclfc, qcnb))
qcReport(n15qcm, reportname = "n15qcreport",
         title = expinfo(experimentData(psm))["title"],
         author = expinfo(experimentData(psm))["contact"],
         clean = FALSE)

Once an appropriate set of quality metrics has been identified, the generation of the QcMetrics instances can be wrapped up for automation.

We provide such a wrapper function for this examples: the n15qcfunction fully automates the above pipeline. The names of the feature variable columns and the thresholds for the two first QC items are provided as arguments. In case no report name is given, a custom title with date and time is used, to avoid overwriting existing reports.

Figure 2: N15 QC report

The complete pdf report is available with

browseURL(example_reports("n15qc"))