Chapter 8 Quantitative proteomics data analysis | Omics Data Analysis (original) (raw)

Learning Objectives

The goals of this chapter are to provide a real-life example of step-by-step quantitative proteomics data analysis.

Introduction

Mass spectrometry-based quantitative proteomics data can be representated as a matrix of quantitative values for features (PSMs, peptides, proteins) arranged along the rows, measured for a set of samples, arranged along the columns. The is a common representation for any quantitative data set. As already done inthisandpreviouscourse, we will be using the SummarizedExperiment class:

Figure 8.1: Schematic representation of the anatomy of a SummarizedExperiment object. (Figure taken from the SummarizedExperiment package vignette.)

Schematic representation of the anatomy of a `SummarizedExperiment` object. (Figure taken from the `SummarizedExperiment` package vignette.)

QFeatures

As we have already discussed in the previous chapter, even though mass spectrometers acquire data for spectra/peptides, the biological entities of interest remain the proteins. As part of the data processing, we are thus required to aggregate low-level quantitative features into higher level data.

Figure 8.2: Conceptual representation of a QFeatures object and the aggregative relation between different assays.

Conceptual representation of a `QFeatures` object and the aggregative relation between different assays.

We are going to start to familiarise ourselves with the QFeaturesclass implemented in the QFeaturespackage. Let’s start by loading the tidyverse and QFeatures packages.

The `QFeatures` package.Figure 8.3: The QFeatures package.

library("tidyverse")
library("ggplot2")
library("QFeatures")
library("limma")

Next, we load the feat1 test data, which is composed of single_assay_ of class SummarizedExperiment composed of 10 rows and 2 columns.

## An instance of class QFeatures containing 1 assays:
##  [1] psms: SummarizedExperiment with 10 rows and 2 columns

► Question

Perform the following to familiarise yourselves with the QFeatures class:

► Solution

## DataFrame with 2 rows and 1 column
##        Group
##    <integer>
## S1         1
## S2         2
## class: SummarizedExperiment 
## dim: 10 2 
## metadata(0):
## assays(1): ''
## rownames(10): PSM1 PSM2 ... PSM9 PSM10
## rowData names(5): Sequence Protein Var location pval
## colnames(2): S1 S2
## colData names(0):
## class: SummarizedExperiment 
## dim: 10 2 
## metadata(0):
## assays(1): ''
## rownames(10): PSM1 PSM2 ... PSM9 PSM10
## rowData names(5): Sequence Protein Var location pval
## colnames(2): S1 S2
## colData names(0):
##       S1 S2
## PSM1   1 11
## PSM2   2 12
## PSM3   3 13
## PSM4   4 14
## PSM5   5 15
## PSM6   6 16
## PSM7   7 17
## PSM8   8 18
## PSM9   9 19
## PSM10 10 20
## DataFrame with 10 rows and 5 columns
##            Sequence     Protein       Var      location      pval
##         <character> <character> <integer>   <character> <numeric>
## PSM1       SYGFNAAR       ProtA         1 Mitochondr...     0.084
## PSM2       SYGFNAAR       ProtA         2 Mitochondr...     0.077
## PSM3       SYGFNAAR       ProtA         3 Mitochondr...     0.063
## PSM4       ELGNDAYK       ProtA         4 Mitochondr...     0.073
## PSM5       ELGNDAYK       ProtA         5 Mitochondr...     0.012
## PSM6       ELGNDAYK       ProtA         6 Mitochondr...     0.011
## PSM7  IAEESNFPFI...       ProtB         7       unknown     0.075
## PSM8  IAEESNFPFI...       ProtB         8       unknown     0.038
## PSM9  IAEESNFPFI...       ProtB         9       unknown     0.028
## PSM10 IAEESNFPFI...       ProtB        10       unknown     0.097

Feature aggregation

The central functionality of the QFeatures infrastructure is the aggregation of features into higher-level features while retaining the link between the different levels. This can be done with theaggregateFeatures()function.

The call below will

feat1 <- aggregateFeatures(feat1, i = "psms",
                           fcol = "Sequence",
                           name = "peptides",
                           fun = colMeans)
feat1
## An instance of class QFeatures containing 2 assays:
##  [1] psms: SummarizedExperiment with 10 rows and 2 columns 
##  [2] peptides: SummarizedExperiment with 3 rows and 2 columns

► Question

► Solution

## SYGFNAAR
colMeans(assay(feat1[[1]])[1:3, ])
## S1 S2 
##  2 12
assay(feat1[[2]])["SYGFNAAR", ]
## S1 S2 
##  2 12
## ELGNDAYK
colMeans(assay(feat1[[1]])[4:6, ])
## S1 S2 
##  5 15
assay(feat1[[2]])["ELGNDAYK", ]
## S1 S2 
##  5 15
## IAEESNFPFIK
colMeans(assay(feat1[[1]])[7:10, ])
##   S1   S2 
##  8.5 18.5
assay(feat1[[2]])["IAEESNFPFIK", ]
##   S1   S2 
##  8.5 18.5
## DataFrame with 3 rows and 4 columns
##                  Sequence     Protein      location        .n
##               <character> <character>   <character> <integer>
## ELGNDAYK         ELGNDAYK       ProtA Mitochondr...         3
## IAEESNFPFIK IAEESNFPFI...       ProtB       unknown         4
## SYGFNAAR         SYGFNAAR       ProtA Mitochondr...         3

► Question

Aggregate the peptide-level data into a new protein-level assay using the colMedians() aggregation function.

► Solution

feat1 <- aggregateFeatures(feat1, i = "peptides",
                           fcol = "Protein",
                           name = "proteins",
                           fun = colMedians)
feat1
## An instance of class QFeatures containing 3 assays:
##  [1] psms: SummarizedExperiment with 10 rows and 2 columns 
##  [2] peptides: SummarizedExperiment with 3 rows and 2 columns 
##  [3] proteins: SummarizedExperiment with 2 rows and 2 columns
assay(feat1[["proteins"]])
##        S1   S2
## ProtA 3.5 13.5
## ProtB 8.5 18.5

Subsetting and filtering

The link between the assays becomes apparent when we now subset the assays for protein A as shown below or using the subsetByFeature()function. This creates a new instance of class QFeatures containing assays with the expression data for protein, its peptides and their PSMs.

## An instance of class QFeatures containing 3 assays:
##  [1] psms: SummarizedExperiment with 6 rows and 2 columns 
##  [2] peptides: SummarizedExperiment with 2 rows and 2 columns 
##  [3] proteins: SummarizedExperiment with 1 rows and 2 columns

The filterFeatures() function can be used to filter rows the assays composing a QFeatures object using the row data variables. We can for example retain rows that have a pval < 0.05, which would only keep rows in the psms assay because the pval is only relevant for that assay.

filterFeatures(feat1, ~ pval < 0.05)
## 'pval' found in 1 out of 3 assay(s)
## No filter applied to the following assay(s) because one or more filtering variables are missing in the rowData: peptides, proteins.
## You can control whether to remove or keep the features using the 'keep' argument (see '?filterFeature').
## An instance of class QFeatures containing 3 assays:
##  [1] psms: SummarizedExperiment with 4 rows and 2 columns 
##  [2] peptides: SummarizedExperiment with 0 rows and 2 columns 
##  [3] proteins: SummarizedExperiment with 0 rows and 2 columns

On the other hand, if we filter assay rows for those that localise to the mitochondrion, we retain the relevant protein, peptides and PSMs.

filterFeatures(feat1, ~ location == "Mitochondrion")
## 'location' found in 3 out of 3 assay(s)
## An instance of class QFeatures containing 3 assays:
##  [1] psms: SummarizedExperiment with 6 rows and 2 columns 
##  [2] peptides: SummarizedExperiment with 2 rows and 2 columns 
##  [3] proteins: SummarizedExperiment with 1 rows and 2 columns

► Question

Filter rows that do not localise to the mitochondrion.

► Solution

filterFeatures(feat1, ~ location != "Mitochondrion")
## 'location' found in 3 out of 3 assay(s)
## An instance of class QFeatures containing 3 assays:
##  [1] psms: SummarizedExperiment with 4 rows and 2 columns 
##  [2] peptides: SummarizedExperiment with 1 rows and 2 columns 
##  [3] proteins: SummarizedExperiment with 1 rows and 2 columns

You can refer to the Quantitative features for mass spectrometry datavignette and the QFeature manual pagefor more details about the class.

Analysis pipeline

Quantitative proteomics data processing is composed of the following steps:

The CPTAC data

The CPTAC spike-in study 6 (⊕Paulovich et al. 2010Paulovich, Amanda G, Dean Billheimer, Amy-Joan L Ham, Lorenzo Vega-Montoto, Paul A Rudnick, David L Tabb, Pei Wang, et al. 2010. “Interlaboratory Study Characterizing a Yeast Performance Standard for Benchmarking LC-MS Platform Performance.” Mol. Cell. Proteomics 9 (2): 242–54.) combines the Sigma UPS1 standard containing 48 different human proteins that are spiked in at 5 different concentrations (conditions A to E) into a constant yeast protein background. The sample were acquired in triplicate on different instruments in different labs. We are going to start with a subset of the CPTAC study 6 containing conditions A and B for a single lab.

Figure 8.4: The CPTAC spike-in study design (credit Lieven Clement, statOmics, Ghent University).

The CPTAC spike-in study design (credit Lieven Clement, statOmics, Ghent University).

The peptide-level data, as processed by MaxQuant (⊕Cox and Mann 2008Cox, J, and M Mann. 2008. “MaxQuant Enables High Peptide Identification Rates, Individualized p.p.b.-Range Mass Accuracies and Proteome-Wide Protein Quantification.” Nat Biotechnol 26 (12): 1367–72. https://doi.org/10.1038/nbt.1511.) is available in the msdata package:

basename(f <- msdata::quant(pattern = "cptac", full.names = TRUE))
## [1] "cptac_a_b_peptides.txt"

From the names of the columns, we see that the quantitative columns, starting with "Intensity." (note the dot!) are at positions 56 to 61.

##  [1] "Sequence"                 "N.term.cleavage.window"  
##  [3] "C.term.cleavage.window"   "Amino.acid.before"       
##  [5] "First.amino.acid"         "Second.amino.acid"       
##  [7] "Second.last.amino.acid"   "Last.amino.acid"         
##  [9] "Amino.acid.after"         "A.Count"                 
## [11] "R.Count"                  "N.Count"                 
## [13] "D.Count"                  "C.Count"                 
## [15] "Q.Count"                  "E.Count"                 
## [17] "G.Count"                  "H.Count"                 
## [19] "I.Count"                  "L.Count"                 
## [21] "K.Count"                  "M.Count"                 
## [23] "F.Count"                  "P.Count"                 
## [25] "S.Count"                  "T.Count"                 
## [27] "W.Count"                  "Y.Count"                 
## [29] "V.Count"                  "U.Count"                 
## [31] "Length"                   "Missed.cleavages"        
## [33] "Mass"                     "Proteins"                
## [35] "Leading.razor.protein"    "Start.position"          
## [37] "End.position"             "Unique..Groups."         
## [39] "Unique..Proteins."        "Charges"                 
## [41] "PEP"                      "Score"                   
## [43] "Identification.type.6A_7" "Identification.type.6A_8"
## [45] "Identification.type.6A_9" "Identification.type.6B_7"
## [47] "Identification.type.6B_8" "Identification.type.6B_9"
## [49] "Experiment.6A_7"          "Experiment.6A_8"         
## [51] "Experiment.6A_9"          "Experiment.6B_7"         
## [53] "Experiment.6B_8"          "Experiment.6B_9"         
## [55] "Intensity"                "Intensity.6A_7"          
## [57] "Intensity.6A_8"           "Intensity.6A_9"          
## [59] "Intensity.6B_7"           "Intensity.6B_8"          
## [61] "Intensity.6B_9"           "Reverse"                 
## [63] "Potential.contaminant"    "id"                      
## [65] "Protein.group.IDs"        "Mod..peptide.IDs"        
## [67] "Evidence.IDs"             "MS.MS.IDs"               
## [69] "Best.MS.MS"               "Oxidation..M..site.IDs"  
## [71] "MS.MS.Count"
(i <- grep("Intensity\\.", names(read.delim(f))))
## [1] 56 57 58 59 60 61

We now read these data using the readSummarizedExperimentfunction. This peptide-level expression data will be imported into R as an instance of class SummarizedExperiment. We also use thefnames argument to set the row-names of the peptides assay to the peptide sequences and specify that the file is a tab-separated table.

cptac_se <- readSummarizedExperiment(f, quantCols = i, fnames = "Sequence", sep = "\t")
cptac_se
## class: SummarizedExperiment 
## dim: 11466 6 
## metadata(0):
## assays(1): ''
## rownames(11466): AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNR
##   YYTVFDRDNNRVGFAEAAR
## rowData names(65): Sequence N.term.cleavage.window ...
##   Oxidation..M..site.IDs MS.MS.Count
## colnames(6): Intensity.6A_7 Intensity.6A_8 ... Intensity.6B_8
##   Intensity.6B_9
## colData names(0):

Before proceeding, we are going to clean up the sample names and annotate the experiment:

colnames(cptac_se) <- sub("I.+\\.", "", colnames(cptac_se))
cptac_se$condition <- sub("_[7-9]", "", colnames(cptac_se))
cptac_se$id <- sub("^.+_", "", colnames(cptac_se))
colData(cptac_se)
## DataFrame with 6 rows and 2 columns
##        condition          id
##      <character> <character>
## 6A_7          6A           7
## 6A_8          6A           8
## 6A_9          6A           9
## 6B_7          6B           7
## 6B_8          6B           8
## 6B_9          6B           9

Let’s also keep only a subset of

keep_var <- c("Sequence", "Proteins", "Leading.razor.protein", "PEP",
              "Score", "Reverse", "Potential.contaminant")

rowData(cptac_se) <- rowData(cptac_se)[, keep_var]

Missing values

Missing values can be highly frequent in proteomics. These exist two reasons supporting the existence of missing values, namely biological or technical.

  1. Values that are missing due to the absence (or extremely low contentration) of a protein are observed for biological reasons, and their pattern aren’t random. A protein missing due to the suppression of its expression will not be missing at random: it will be missing in the condition in which it was suppressed, and be present in the condition where it is expressed.
  2. Due to it’s data-dependent acquisition, mass spectrometry isn’t capable to assaying all peptides in a sample. Peptides that are less abundant than some of their co-eluting ions, peptides that do not ionise well or peptides that do not get identified might be sporadically missing in the final quantitation table, despite their presence in the biological samples. Their absence patterns arerandom in such cases.

Often, third party software that produce quantiative data use zeros instead of properly reporting missing values. We can use thezeroIsNA() function to replace the 0 by NA values in ourcptac_se object and then explore the missing data patterns across columns and rows.

cptac_se <- zeroIsNA(cptac_se)
nNA(cptac_se)
## $nNA
## DataFrame with 1 row and 2 columns
##         nNA       pNA
##   <integer> <numeric>
## 1     31130  0.452497
## 
## $nNArows
## DataFrame with 11466 rows and 3 columns
##                name       nNA       pNA
##         <character> <integer> <numeric>
## 1     AAAAGAGGAG...         4  0.666667
## 2         AAAALAGGK         0  0.000000
## 3        AAAALAGGKK         0  0.000000
## 4     AAADALSDLE...         0  0.000000
## 5     AAADALSDLE...         0  0.000000
## ...             ...       ...       ...
## 11462 YYSIYDLGNN...         6  1.000000
## 11463 YYTFNGPNYN...         3  0.500000
## 11464    YYTITEVATR         4  0.666667
## 11465 YYTVFDRDNN...         6  1.000000
## 11466 YYTVFDRDNN...         6  1.000000
## 
## $nNAcols
## DataFrame with 6 rows and 3 columns
##          name       nNA       pNA
##   <character> <integer> <numeric>
## 1        6A_7      4743  0.413658
## 2        6A_8      5483  0.478196
## 3        6A_9      5320  0.463980
## 4        6B_7      4721  0.411739
## 5        6B_8      5563  0.485174
## 6        6B_9      5300  0.462236

Figure 8.5: Distribution of missing value (white). Peptides row with more missing values are moved towards the top of the figure.

Distribution of missing value (white). Peptides row with more missing values are moved towards the top of the figure.

► Question

► Solution

barplot(nNA(cptac_se)$nNAcols$pNA)

table(nNA(cptac_se)$nNArows$nNA)
## 
##    0    1    2    3    4    5    6 
## 4059  990  884  717  934  807 3075
## remove rows that have 4 or more NAs out of 6
cptac_se <- filterNA(cptac_se, pNA = 4/6)

Imputation

Imputation is the technique of replacing missing data with probable values. This can be done with impute() method. As we have discussed above, there are however two types of missing values in mass spectrometry-based proteomics, namely data missing at random (MAR), and data missing not at random (MNAR). These two types of missing data need to be imputed with different types of imputation methods (⊕Lazar et al. 2016Lazar, C, L Gatto, M Ferro, C Bruley, and T Burger. 2016. “Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.” J Proteome Res 15 (4): 1116–25. https://doi.org/10.1021/acs.jproteome.5b00981.).

Figure 8.6: Mixed imputation method. Black cells represent presence of quantitation values and light grey corresponds to missing data. The two groups of interest are depicted in green and blue along the heatmap columns. Two classes of proteins are annotated on the left: yellow are proteins with randomly occurring missing values (if any) while proteins in brown are candidates for non-random missing value imputation.

Mixed imputation method. Black cells represent presence of quantitation values and light grey corresponds to missing data. The two groups of interest are depicted in green and blue along the heatmap columns. Two classes of proteins are annotated on the left: yellow are proteins with randomly occurring missing values (if any) while proteins in brown are candidates for non-random missing value imputation.

When downstream analyses permit, it might thus be safer not to impute data and deal explicitly with missing values. This is possible when performing hypothesis tests, but not to perform a principal component analysis.

Identification quality control

As discussed in the previous chapter, PSMs are deemed relevant after comparison against hist from a decoy database. The origin of these hits is recorded with + in the Reverse variable:

table(rowData(cptac_se)$Reverse)
## 
##         + 
## 7572   12

Similarly, a proteomics experiment is also searched against a database of contaminants:

table(rowData(cptac_se)$Potential.contaminant)
## 
##         + 
## 7558   26

► Question

► Solution

rowData(cptac_se) %>%
    as_tibble() %>%
    ggplot(aes(x = Score, colour = Reverse)) +
    geom_density()

rowData(cptac_se) %>%
    as_tibble() %>%
    ggplot(aes(x = PEP, colour = Reverse)) +
    geom_density()

Creating the QFeatures data

We can now create our QFeatures object using theSummarizedExperiment as show below.

cptac <- QFeatures(list(peptides = cptac_se))

We should also assign the QFeatures column data with theSummarizedExperiment slot.

colData(cptac) <- colData(cptac_se)

Note that it is also possible to directly create a QFeatures object with the readQFeatures() function and the same arguments as thereadSummarizedExperiment() used above. In addition, most functions used above and below work on single SummarizedExperiment objects or assays within a QFeatures object.

Filtering out contaminants and reverse hits

► Question

Using the filterFeatures() function, filter out the reverse and contaminant hits.

► Solution

cptac <-
    cptac %>%
    filterFeatures(~ Reverse != "+") %>%
    filterFeatures(~ Potential.contaminant != "+")
## 'Reverse' found in 1 out of 1 assay(s)
## 'Potential.contaminant' found in 1 out of 1 assay(s)

Log-transformation and normalisation

The two code chunks below log-transform and normalise using the assayi as input and adding a new one names as defined by name.

cptac <- logTransform(cptac, i = "peptides",
                      name = "log_peptides")
cptac <- normalize(cptac, i = "log_peptides",
                   name = "lognorm_peptides",
                   method = "diff.median")
par(mfrow = c(1, 3))
limma::plotDensities(assay(cptac[["peptides"]]))
limma::plotDensities(assay(cptac[["log_peptides"]]))
limma::plotDensities(assay(cptac[["lognorm_peptides"]]))

Three peptide level assays: raw data, log transformed and normalised.

Figure 8.7: Three peptide level assays: raw data, log transformed and normalised.

Aggregation

Below, we are going to use median aggregation, as a first attempt. This is however not the best choice, as we will see later.

cptac <-
    aggregateFeatures(cptac,
                      "lognorm_peptides",
                      name = "proteins_med",
                      fcol = "Leading.razor.protein",
                      fun = colMedians,
                      na.rm = TRUE)

Looking at the .n row variable computed during the aggregation, we see that most proteins result of the aggregation of 5 peptides or less, while very few proteins are accounted for by tens of peptides.

table(rowData(cptac[["proteins_med"]])$.n)
## 
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
## 328 234 162 138  85  72  66  49  46  34  25  28  21  12  17   9   5   4  13   5 
##  21  22  23  24  25  26  28  29  30  31  32  35  38  39  42  43  51  52  62 
##   7   6   4   4   3   3   1   3   1   1   3   1   2   1   1   1   1   1   1

Principal component analysis

library("factoextra")
library("patchwork")

pca_pep <-
    cptac[["lognorm_peptides"]] %>%
    filterNA() %>%
    assay() %>%
    t() %>%
    prcomp(scale = TRUE, center = TRUE) %>%
    fviz_pca_ind(habillage = cptac$condition, title = "Peptides")

pca_prot <-
    cptac[["proteins_med"]] %>%
    filterNA() %>%
    assay() %>%
    t() %>%
    prcomp(scale = TRUE, center = TRUE) %>%
    fviz_pca_ind(habillage = cptac$condition,
                 title = "Proteins (median aggregation)")

Peptide and protein level PCA analyses.

Figure 8.8: Peptide and protein level PCA analyses.

► Question

Interpret the two PCA plots above.

Visualisation

Below, we use the longFormat() function to extract the quantitative and row data in a long format, that can be directly reused by the tidyverse tools.

longFormat(cptac["P02787ups|TRFE_HUMAN_UPS", ,
                 c("lognorm_peptides", "proteins_med")]) %>%
    as_tibble() %>%
    mutate(condition = ifelse(grepl("A", colname), "A", "B")) %>%
    ggplot(aes(x = colname, y = value, colour = rowname, shape = condition)) +
    geom_point(size = 3) +
    geom_line(aes(group = rowname)) +
    facet_grid(~ assay) +
    ggtitle("P02787ups|TRFE_HUMAN_UPS")

Peptide and protein expression profile.

Figure 8.9: Peptide and protein expression profile.

Statistical analysis

Below, we are going to perform our statistical analysis on the protein data.

prots <- cptac[["proteins_med"]]
colData(prots) <- colData(cptac)

We have seen in chapter 2 how to use linear models and have applied the same prinicples in chapter 4 on count data (based on a negative binomial distribution). The limma package is the precursor package that enables the consistent application of linear models to normally distributed omics data in general, and microarrays in particuar1515 The name of the package refers to Linear Models for Microarray Data..

The limma package also implements an empirical Bayes method that borrows information across features to estimate the standard error1616 alike the shinkage gene-wise dispersion estimates see inDESeq2. and calculate (so called moderate) t statistics. This approach is demonstrably more powerful than a standard t-tests when the number of samples is lot.

The code chunk below illstrated how to set up the model, fit it, and apply the empirical Bayes moderation.

library("limma")
design <- model.matrix(~ prots$condition)
fit <- lmFit(assay(prots), design)
## Warning: Partial NA coefficients for 23 probe(s)

Finally, the topTable() function is used the extract the results for the coefficient of interest. You can look at the column names of the coefficients table to get their names.

colnames(coefficients(fit))
## [1] "(Intercept)"       "prots$condition6B"
res <-
    topTable(fit, coef = "prots$condition6B", number = Inf) %>%
    rownames_to_column("protein") %>%
    as_tibble() %>%
    mutate(TP = grepl("ups", protein))

► Question

Note the warning about partial NA coefficients for 23 probes. Where could these come from?

► Solution

na_coefs <-
    filter(res, is.na(t)) %>%
    pull(protein)
assay(prots[na_coefs, ])
##                               6A_7     6A_8     6A_9     6B_7     6B_8     6B_9
## P00167ups|CYB5_HUMAN_UPS       NaN      NaN      NaN 17.14246 15.90173 16.79995
## P01112ups|RASH_HUMAN_UPS       NaN      NaN      NaN 16.37003      NaN 16.36121
## P05413ups|FABPH_HUMAN_UPS      NaN      NaN      NaN 14.58457      NaN 14.03232
## P08758ups|ANXA5_HUMAN_UPS      NaN      NaN      NaN 15.12913 15.91627 15.83355
## sp|P06704|CDC31_YEAST          NaN      NaN      NaN 16.72331 15.80479 16.33862
## sp|P32608|RTG2_YEAST           NaN      NaN      NaN      NaN 13.48761 15.13571
## sp|P32769|HBS1_YEAST           NaN 16.54689 17.19290      NaN      NaN      NaN
## sp|P34217|PIN4_YEAST           NaN      NaN      NaN 17.08866 17.79839 17.72409
## sp|P34237|CASP_YEAST           NaN      NaN      NaN 16.36200 16.27000 16.23528
## sp|P38166|SFT2_YEAST      16.34269 16.85422      NaN      NaN      NaN      NaN
## sp|P40056|GET2_YEAST           NaN 16.83923 16.52006      NaN      NaN      NaN
## sp|P40533|TED1_YEAST           NaN      NaN      NaN 15.87733      NaN 16.17555
## sp|P46965|SPC1_YEAST           NaN 14.50215 14.28928      NaN      NaN      NaN
## sp|P48363|PFD3_YEAST           NaN      NaN      NaN 17.73603      NaN 17.41425
## sp|P53044|UFD1_YEAST           NaN      NaN      NaN 15.53273 16.41615      NaN
## sp|P53091|MCM6_YEAST           NaN 16.67714 16.23045      NaN      NaN      NaN
##  [ reached getOption("max.print") -- omitted 7 rows ]
vp <-
    res %>%
    ggplot(aes(x = logFC, y = -log10(adj.P.Val))) +
    geom_point(aes(colour = TP)) +
    geom_vline(xintercept = c(-1, 1)) +
    geom_hline(yintercept = -log10(0.05)) +
    scale_color_manual(values = c("black","red"))
library("plotly")
ggplotly(vp)

Using the pipeline described above, we would would identify a single differentially expressed protein at an 5 percent FDR but miss out the other 32 expected spike-in proteins. We can assess our results in terms of true/false postitves/negatives:

As shown below, it is possible to substantially improve these results using robust summarisation, i.e robust regression with M-estimation using Huber weights, as described in section 2.7 in (⊕Sticker et al. 2019Sticker, Adriaan, Ludger Goeminne, Lennart Martens, and Lieven Clement. 2019. “Robust Summarization and Inference in Proteome-Wide Label-Free Quantification.” bioRxiv. https://doi.org/10.1101/668863.).

Figure 8.10: Aggregation using robust summarisation.

Aggregation using robust summarisation.

Project

You are provided with the larger CPTAC data including a third condition (namely C) and a two additional lab, tallying now 27 samples.

LTQ-Orbitrap_86 LTQ-OrbitrapO_65 LTQ-OrbitrapW_56
6A 3 3 3
6B 3 3 3
6C 3 3 3

The full design is shown below and is available here.

TRUE id condition lab previous
6A_1 1 6A LTQ-Orbitrap_86 new
6A_2 2 6A LTQ-Orbitrap_86 new
6A_3 3 6A LTQ-Orbitrap_86 new
6A_4 4 6A LTQ-OrbitrapO_65 new
6A_5 5 6A LTQ-OrbitrapO_65 new
6A_6 6 6A LTQ-OrbitrapO_65 new
6A_7 7 6A LTQ-OrbitrapW_56
6A_8 8 6A LTQ-OrbitrapW_56
6A_9 9 6A LTQ-OrbitrapW_56
6B_1 1 6B LTQ-Orbitrap_86 new
6B_2 2 6B LTQ-Orbitrap_86 new
6B_3 3 6B LTQ-Orbitrap_86 new
6B_4 4 6B LTQ-OrbitrapO_65 new
6B_5 5 6B LTQ-OrbitrapO_65 new
6B_6 6 6B LTQ-OrbitrapO_65 new
6B_7 7 6B LTQ-OrbitrapW_56
6B_8 8 6B LTQ-OrbitrapW_56
6B_9 9 6B LTQ-OrbitrapW_56
6C_1 1 6C LTQ-Orbitrap_86 new
6C_2 2 6C LTQ-Orbitrap_86 new
6C_3 3 6C LTQ-Orbitrap_86 new
6C_4 4 6C LTQ-OrbitrapO_65 new
6C_5 5 6C LTQ-OrbitrapO_65 new
6C_6 6 6C LTQ-OrbitrapO_65 new
6C_7 7 6C LTQ-OrbitrapW_56 new
6C_8 8 6C LTQ-OrbitrapW_56 new
6C_9 9 6C LTQ-OrbitrapW_56 new