Chapter 8 Quantitative proteomics data analysis | Omics Data Analysis (original) (raw)
Learning Objectives
The goals of this chapter are to provide a real-life example of step-by-step quantitative proteomics data analysis.
Introduction
Mass spectrometry-based quantitative proteomics data can be representated as a matrix of quantitative values for features (PSMs, peptides, proteins) arranged along the rows, measured for a set of samples, arranged along the columns. The is a common representation for any quantitative data set. As already done inthisandpreviouscourse, we will be using the SummarizedExperiment
class:
Figure 8.1: Schematic representation of the anatomy of a SummarizedExperiment
object. (Figure taken from the SummarizedExperiment
package vignette.)
- The sample (columns) metadata can be access with the
colData()
function. - The features (rows) metadata can be access with the
rowData()
column. - If the features represent ranges along genomic coordinates, these can be accessed with
rowRanges()
- Additional metadata describing the overall experiment can be accessed with
metadata()
. - The quantiative data can be accessed with
assay()
. assays()
returns a list of matrix-like assays.
QFeatures
As we have already discussed in the previous chapter, even though mass spectrometers acquire data for spectra/peptides, the biological entities of interest remain the proteins. As part of the data processing, we are thus required to aggregate low-level quantitative features into higher level data.
Figure 8.2: Conceptual representation of a QFeatures
object and the aggregative relation between different assays.
We are going to start to familiarise ourselves with the QFeatures
class implemented in the QFeaturespackage. Let’s start by loading the tidyverse
and QFeatures
packages.
Figure 8.3: The
QFeatures
package.
library("tidyverse")
library("ggplot2")
library("QFeatures")
library("limma")
Next, we load the feat1
test data, which is composed of single_assay_ of class SummarizedExperiment
composed of 10 rows and 2 columns.
## An instance of class QFeatures containing 1 assays:
## [1] psms: SummarizedExperiment with 10 rows and 2 columns
► Question
Perform the following to familiarise yourselves with the QFeatures
class:
- Extract the sample metadata using the
colData()
accessor (like you have previously done withSummarizedExperiment
objects). - Extract the first (and only) assay composing this
QFeaures
data using the[[
operator (as you have done to extract elements of a list) by using the assay’s index or name. - Extract the
psms
assay’s row data and quantitative values.
► Solution
## DataFrame with 2 rows and 1 column
## Group
## <integer>
## S1 1
## S2 2
## class: SummarizedExperiment
## dim: 10 2
## metadata(0):
## assays(1): ''
## rownames(10): PSM1 PSM2 ... PSM9 PSM10
## rowData names(5): Sequence Protein Var location pval
## colnames(2): S1 S2
## colData names(0):
## class: SummarizedExperiment
## dim: 10 2
## metadata(0):
## assays(1): ''
## rownames(10): PSM1 PSM2 ... PSM9 PSM10
## rowData names(5): Sequence Protein Var location pval
## colnames(2): S1 S2
## colData names(0):
## S1 S2
## PSM1 1 11
## PSM2 2 12
## PSM3 3 13
## PSM4 4 14
## PSM5 5 15
## PSM6 6 16
## PSM7 7 17
## PSM8 8 18
## PSM9 9 19
## PSM10 10 20
## DataFrame with 10 rows and 5 columns
## Sequence Protein Var location pval
## <character> <character> <integer> <character> <numeric>
## PSM1 SYGFNAAR ProtA 1 Mitochondr... 0.084
## PSM2 SYGFNAAR ProtA 2 Mitochondr... 0.077
## PSM3 SYGFNAAR ProtA 3 Mitochondr... 0.063
## PSM4 ELGNDAYK ProtA 4 Mitochondr... 0.073
## PSM5 ELGNDAYK ProtA 5 Mitochondr... 0.012
## PSM6 ELGNDAYK ProtA 6 Mitochondr... 0.011
## PSM7 IAEESNFPFI... ProtB 7 unknown 0.075
## PSM8 IAEESNFPFI... ProtB 8 unknown 0.038
## PSM9 IAEESNFPFI... ProtB 9 unknown 0.028
## PSM10 IAEESNFPFI... ProtB 10 unknown 0.097
Feature aggregation
The central functionality of the QFeatures
infrastructure is the aggregation of features into higher-level features while retaining the link between the different levels. This can be done with theaggregateFeatures()function.
The call below will
- operate on the
psms
assay of thefeat1
objects; - aggregate the rows the assay following the grouping defined in the
peptides
row data variables; - perform aggregation using the
colMeans()
function; - create a new assay named
peptides
and add it to thefeat1
object.
feat1 <- aggregateFeatures(feat1, i = "psms",
fcol = "Sequence",
name = "peptides",
fun = colMeans)
feat1
## An instance of class QFeatures containing 2 assays:
## [1] psms: SummarizedExperiment with 10 rows and 2 columns
## [2] peptides: SummarizedExperiment with 3 rows and 2 columns
► Question
- Convince yourself that you understand the effect of feature aggregation.
- Repeat the calculations manually.
- Observe the content of the new assay’s row data.
► Solution
## SYGFNAAR
colMeans(assay(feat1[[1]])[1:3, ])
## S1 S2
## 2 12
assay(feat1[[2]])["SYGFNAAR", ]
## S1 S2
## 2 12
## ELGNDAYK
colMeans(assay(feat1[[1]])[4:6, ])
## S1 S2
## 5 15
assay(feat1[[2]])["ELGNDAYK", ]
## S1 S2
## 5 15
## IAEESNFPFIK
colMeans(assay(feat1[[1]])[7:10, ])
## S1 S2
## 8.5 18.5
assay(feat1[[2]])["IAEESNFPFIK", ]
## S1 S2
## 8.5 18.5
## DataFrame with 3 rows and 4 columns
## Sequence Protein location .n
## <character> <character> <character> <integer>
## ELGNDAYK ELGNDAYK ProtA Mitochondr... 3
## IAEESNFPFIK IAEESNFPFI... ProtB unknown 4
## SYGFNAAR SYGFNAAR ProtA Mitochondr... 3
► Question
Aggregate the peptide-level data into a new protein-level assay using the colMedians()
aggregation function.
► Solution
feat1 <- aggregateFeatures(feat1, i = "peptides",
fcol = "Protein",
name = "proteins",
fun = colMedians)
feat1
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 10 rows and 2 columns
## [2] peptides: SummarizedExperiment with 3 rows and 2 columns
## [3] proteins: SummarizedExperiment with 2 rows and 2 columns
assay(feat1[["proteins"]])
## S1 S2
## ProtA 3.5 13.5
## ProtB 8.5 18.5
Subsetting and filtering
The link between the assays becomes apparent when we now subset the assays for protein A as shown below or using the subsetByFeature()
function. This creates a new instance of class QFeatures
containing assays with the expression data for protein, its peptides and their PSMs.
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 6 rows and 2 columns
## [2] peptides: SummarizedExperiment with 2 rows and 2 columns
## [3] proteins: SummarizedExperiment with 1 rows and 2 columns
The filterFeatures()
function can be used to filter rows the assays composing a QFeatures
object using the row data variables. We can for example retain rows that have a pval
< 0.05, which would only keep rows in the psms
assay because the pval
is only relevant for that assay.
filterFeatures(feat1, ~ pval < 0.05)
## 'pval' found in 1 out of 3 assay(s)
## No filter applied to the following assay(s) because one or more filtering variables are missing in the rowData: peptides, proteins.
## You can control whether to remove or keep the features using the 'keep' argument (see '?filterFeature').
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 4 rows and 2 columns
## [2] peptides: SummarizedExperiment with 0 rows and 2 columns
## [3] proteins: SummarizedExperiment with 0 rows and 2 columns
On the other hand, if we filter assay rows for those that localise to the mitochondrion, we retain the relevant protein, peptides and PSMs.
filterFeatures(feat1, ~ location == "Mitochondrion")
## 'location' found in 3 out of 3 assay(s)
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 6 rows and 2 columns
## [2] peptides: SummarizedExperiment with 2 rows and 2 columns
## [3] proteins: SummarizedExperiment with 1 rows and 2 columns
► Question
Filter rows that do not localise to the mitochondrion.
► Solution
filterFeatures(feat1, ~ location != "Mitochondrion")
## 'location' found in 3 out of 3 assay(s)
## An instance of class QFeatures containing 3 assays:
## [1] psms: SummarizedExperiment with 4 rows and 2 columns
## [2] peptides: SummarizedExperiment with 1 rows and 2 columns
## [3] proteins: SummarizedExperiment with 1 rows and 2 columns
You can refer to the Quantitative features for mass spectrometry datavignette and the QFeature
manual pagefor more details about the class.
Analysis pipeline
Quantitative proteomics data processing is composed of the following steps:
- Data import
- Exploratory data analysis (PCA)
- Missing data management (filtering and/or imputation)
- Data cleaning
- Transformation and normalisation
- Aggregation
- Downstream analysis
The CPTAC data
The CPTAC spike-in study 6 (⊕Paulovich et al. 2010Paulovich, Amanda G, Dean Billheimer, Amy-Joan L Ham, Lorenzo Vega-Montoto, Paul A Rudnick, David L Tabb, Pei Wang, et al. 2010. “Interlaboratory Study Characterizing a Yeast Performance Standard for Benchmarking LC-MS Platform Performance.” Mol. Cell. Proteomics 9 (2): 242–54.) combines the Sigma UPS1 standard containing 48 different human proteins that are spiked in at 5 different concentrations (conditions A to E) into a constant yeast protein background. The sample were acquired in triplicate on different instruments in different labs. We are going to start with a subset of the CPTAC study 6 containing conditions A and B for a single lab.
Figure 8.4: The CPTAC spike-in study design (credit Lieven Clement, statOmics, Ghent University).
The peptide-level data, as processed by MaxQuant (⊕Cox and Mann 2008Cox, J, and M Mann. 2008. “MaxQuant Enables High Peptide Identification Rates, Individualized p.p.b.-Range Mass Accuracies and Proteome-Wide Protein Quantification.” Nat Biotechnol 26 (12): 1367–72. https://doi.org/10.1038/nbt.1511.) is available in the msdata
package:
basename(f <- msdata::quant(pattern = "cptac", full.names = TRUE))
## [1] "cptac_a_b_peptides.txt"
From the names of the columns, we see that the quantitative columns, starting with "Intensity."
(note the dot!) are at positions 56 to 61.
## [1] "Sequence" "N.term.cleavage.window"
## [3] "C.term.cleavage.window" "Amino.acid.before"
## [5] "First.amino.acid" "Second.amino.acid"
## [7] "Second.last.amino.acid" "Last.amino.acid"
## [9] "Amino.acid.after" "A.Count"
## [11] "R.Count" "N.Count"
## [13] "D.Count" "C.Count"
## [15] "Q.Count" "E.Count"
## [17] "G.Count" "H.Count"
## [19] "I.Count" "L.Count"
## [21] "K.Count" "M.Count"
## [23] "F.Count" "P.Count"
## [25] "S.Count" "T.Count"
## [27] "W.Count" "Y.Count"
## [29] "V.Count" "U.Count"
## [31] "Length" "Missed.cleavages"
## [33] "Mass" "Proteins"
## [35] "Leading.razor.protein" "Start.position"
## [37] "End.position" "Unique..Groups."
## [39] "Unique..Proteins." "Charges"
## [41] "PEP" "Score"
## [43] "Identification.type.6A_7" "Identification.type.6A_8"
## [45] "Identification.type.6A_9" "Identification.type.6B_7"
## [47] "Identification.type.6B_8" "Identification.type.6B_9"
## [49] "Experiment.6A_7" "Experiment.6A_8"
## [51] "Experiment.6A_9" "Experiment.6B_7"
## [53] "Experiment.6B_8" "Experiment.6B_9"
## [55] "Intensity" "Intensity.6A_7"
## [57] "Intensity.6A_8" "Intensity.6A_9"
## [59] "Intensity.6B_7" "Intensity.6B_8"
## [61] "Intensity.6B_9" "Reverse"
## [63] "Potential.contaminant" "id"
## [65] "Protein.group.IDs" "Mod..peptide.IDs"
## [67] "Evidence.IDs" "MS.MS.IDs"
## [69] "Best.MS.MS" "Oxidation..M..site.IDs"
## [71] "MS.MS.Count"
(i <- grep("Intensity\\.", names(read.delim(f))))
## [1] 56 57 58 59 60 61
We now read these data using the readSummarizedExperiment
function. This peptide-level expression data will be imported into R as an instance of class SummarizedExperiment
. We also use thefnames
argument to set the row-names of the peptides
assay to the peptide sequences and specify that the file is a tab-separated table.
cptac_se <- readSummarizedExperiment(f, quantCols = i, fnames = "Sequence", sep = "\t")
cptac_se
## class: SummarizedExperiment
## dim: 11466 6
## metadata(0):
## assays(1): ''
## rownames(11466): AAAAGAGGAGDSGDAVTK AAAALAGGK ... YYTVFDRDNNR
## YYTVFDRDNNRVGFAEAAR
## rowData names(65): Sequence N.term.cleavage.window ...
## Oxidation..M..site.IDs MS.MS.Count
## colnames(6): Intensity.6A_7 Intensity.6A_8 ... Intensity.6B_8
## Intensity.6B_9
## colData names(0):
Before proceeding, we are going to clean up the sample names and annotate the experiment:
colnames(cptac_se) <- sub("I.+\\.", "", colnames(cptac_se))
cptac_se$condition <- sub("_[7-9]", "", colnames(cptac_se))
cptac_se$id <- sub("^.+_", "", colnames(cptac_se))
colData(cptac_se)
## DataFrame with 6 rows and 2 columns
## condition id
## <character> <character>
## 6A_7 6A 7
## 6A_8 6A 8
## 6A_9 6A 9
## 6B_7 6B 7
## 6B_8 6B 8
## 6B_9 6B 9
Let’s also keep only a subset of
keep_var <- c("Sequence", "Proteins", "Leading.razor.protein", "PEP",
"Score", "Reverse", "Potential.contaminant")
rowData(cptac_se) <- rowData(cptac_se)[, keep_var]
Missing values
Missing values can be highly frequent in proteomics. These exist two reasons supporting the existence of missing values, namely biological or technical.
- Values that are missing due to the absence (or extremely low contentration) of a protein are observed for biological reasons, and their pattern aren’t random. A protein missing due to the suppression of its expression will not be missing at random: it will be missing in the condition in which it was suppressed, and be present in the condition where it is expressed.
- Due to it’s data-dependent acquisition, mass spectrometry isn’t capable to assaying all peptides in a sample. Peptides that are less abundant than some of their co-eluting ions, peptides that do not ionise well or peptides that do not get identified might be sporadically missing in the final quantitation table, despite their presence in the biological samples. Their absence patterns arerandom in such cases.
Often, third party software that produce quantiative data use zeros instead of properly reporting missing values. We can use thezeroIsNA()
function to replace the 0
by NA
values in ourcptac_se
object and then explore the missing data patterns across columns and rows.
cptac_se <- zeroIsNA(cptac_se)
nNA(cptac_se)
## $nNA
## DataFrame with 1 row and 2 columns
## nNA pNA
## <integer> <numeric>
## 1 31130 0.452497
##
## $nNArows
## DataFrame with 11466 rows and 3 columns
## name nNA pNA
## <character> <integer> <numeric>
## 1 AAAAGAGGAG... 4 0.666667
## 2 AAAALAGGK 0 0.000000
## 3 AAAALAGGKK 0 0.000000
## 4 AAADALSDLE... 0 0.000000
## 5 AAADALSDLE... 0 0.000000
## ... ... ... ...
## 11462 YYSIYDLGNN... 6 1.000000
## 11463 YYTFNGPNYN... 3 0.500000
## 11464 YYTITEVATR 4 0.666667
## 11465 YYTVFDRDNN... 6 1.000000
## 11466 YYTVFDRDNN... 6 1.000000
##
## $nNAcols
## DataFrame with 6 rows and 3 columns
## name nNA pNA
## <character> <integer> <numeric>
## 1 6A_7 4743 0.413658
## 2 6A_8 5483 0.478196
## 3 6A_9 5320 0.463980
## 4 6B_7 4721 0.411739
## 5 6B_8 5563 0.485174
## 6 6B_9 5300 0.462236
Figure 8.5: Distribution of missing value (white). Peptides row with more missing values are moved towards the top of the figure.
► Question
- Explore the number or proportion of missing values across peptides and samples of the
cptac_se
data. - Remove row that have too many missing values. You can do this by hand or using the
filterNA()
function.
► Solution
barplot(nNA(cptac_se)$nNAcols$pNA)
table(nNA(cptac_se)$nNArows$nNA)
##
## 0 1 2 3 4 5 6
## 4059 990 884 717 934 807 3075
## remove rows that have 4 or more NAs out of 6
cptac_se <- filterNA(cptac_se, pNA = 4/6)
Imputation
Imputation is the technique of replacing missing data with probable values. This can be done with impute()
method. As we have discussed above, there are however two types of missing values in mass spectrometry-based proteomics, namely data missing at random (MAR), and data missing not at random (MNAR). These two types of missing data need to be imputed with different types of imputation methods (⊕Lazar et al. 2016Lazar, C, L Gatto, M Ferro, C Bruley, and T Burger. 2016. “Accounting for the Multiple Natures of Missing Values in Label-Free Quantitative Proteomics Data Sets to Compare Imputation Strategies.” J Proteome Res 15 (4): 1116–25. https://doi.org/10.1021/acs.jproteome.5b00981.).
Figure 8.6: Mixed imputation method. Black cells represent presence of quantitation values and light grey corresponds to missing data. The two groups of interest are depicted in green and blue along the heatmap columns. Two classes of proteins are annotated on the left: yellow are proteins with randomly occurring missing values (if any) while proteins in brown are candidates for non-random missing value imputation.
When downstream analyses permit, it might thus be safer not to impute data and deal explicitly with missing values. This is possible when performing hypothesis tests, but not to perform a principal component analysis.
Identification quality control
As discussed in the previous chapter, PSMs are deemed relevant after comparison against hist from a decoy database. The origin of these hits is recorded with +
in the Reverse
variable:
table(rowData(cptac_se)$Reverse)
##
## +
## 7572 12
Similarly, a proteomics experiment is also searched against a database of contaminants:
table(rowData(cptac_se)$Potential.contaminant)
##
## +
## 7558 26
► Question
- Visualise the score distributions from forward and reverse hits and interpret the figure.
- Do the same with the posterior probability score (PEP).
► Solution
rowData(cptac_se) %>%
as_tibble() %>%
ggplot(aes(x = Score, colour = Reverse)) +
geom_density()
rowData(cptac_se) %>%
as_tibble() %>%
ggplot(aes(x = PEP, colour = Reverse)) +
geom_density()
Creating the QFeatures data
We can now create our QFeatures
object using theSummarizedExperiment
as show below.
cptac <- QFeatures(list(peptides = cptac_se))
We should also assign the QFeatures
column data with theSummarizedExperiment
slot.
colData(cptac) <- colData(cptac_se)
Note that it is also possible to directly create a QFeatures
object with the readQFeatures()
function and the same arguments as thereadSummarizedExperiment()
used above. In addition, most functions used above and below work on single SummarizedExperiment
objects or assays within a QFeatures
object.
Filtering out contaminants and reverse hits
► Question
Using the filterFeatures()
function, filter out the reverse and contaminant hits.
► Solution
cptac <-
cptac %>%
filterFeatures(~ Reverse != "+") %>%
filterFeatures(~ Potential.contaminant != "+")
## 'Reverse' found in 1 out of 1 assay(s)
## 'Potential.contaminant' found in 1 out of 1 assay(s)
Log-transformation and normalisation
The two code chunks below log-transform and normalise using the assayi
as input and adding a new one names as defined by name
.
cptac <- logTransform(cptac, i = "peptides",
name = "log_peptides")
cptac <- normalize(cptac, i = "log_peptides",
name = "lognorm_peptides",
method = "diff.median")
par(mfrow = c(1, 3))
limma::plotDensities(assay(cptac[["peptides"]]))
limma::plotDensities(assay(cptac[["log_peptides"]]))
limma::plotDensities(assay(cptac[["lognorm_peptides"]]))
Figure 8.7: Three peptide level assays: raw data, log transformed and normalised.
Aggregation
Below, we are going to use median aggregation, as a first attempt. This is however not the best choice, as we will see later.
cptac <-
aggregateFeatures(cptac,
"lognorm_peptides",
name = "proteins_med",
fcol = "Leading.razor.protein",
fun = colMedians,
na.rm = TRUE)
Looking at the .n
row variable computed during the aggregation, we see that most proteins result of the aggregation of 5 peptides or less, while very few proteins are accounted for by tens of peptides.
table(rowData(cptac[["proteins_med"]])$.n)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 328 234 162 138 85 72 66 49 46 34 25 28 21 12 17 9 5 4 13 5
## 21 22 23 24 25 26 28 29 30 31 32 35 38 39 42 43 51 52 62
## 7 6 4 4 3 3 1 3 1 1 3 1 2 1 1 1 1 1 1
Principal component analysis
library("factoextra")
library("patchwork")
pca_pep <-
cptac[["lognorm_peptides"]] %>%
filterNA() %>%
assay() %>%
t() %>%
prcomp(scale = TRUE, center = TRUE) %>%
fviz_pca_ind(habillage = cptac$condition, title = "Peptides")
pca_prot <-
cptac[["proteins_med"]] %>%
filterNA() %>%
assay() %>%
t() %>%
prcomp(scale = TRUE, center = TRUE) %>%
fviz_pca_ind(habillage = cptac$condition,
title = "Proteins (median aggregation)")
Figure 8.8: Peptide and protein level PCA analyses.
► Question
Interpret the two PCA plots above.
Visualisation
Below, we use the longFormat()
function to extract the quantitative and row data in a long format, that can be directly reused by the tidyverse tools.
longFormat(cptac["P02787ups|TRFE_HUMAN_UPS", ,
c("lognorm_peptides", "proteins_med")]) %>%
as_tibble() %>%
mutate(condition = ifelse(grepl("A", colname), "A", "B")) %>%
ggplot(aes(x = colname, y = value, colour = rowname, shape = condition)) +
geom_point(size = 3) +
geom_line(aes(group = rowname)) +
facet_grid(~ assay) +
ggtitle("P02787ups|TRFE_HUMAN_UPS")
Figure 8.9: Peptide and protein expression profile.
Statistical analysis
Below, we are going to perform our statistical analysis on the protein data.
prots <- cptac[["proteins_med"]]
colData(prots) <- colData(cptac)
We have seen in chapter 2 how to use linear models and have applied the same prinicples in chapter 4 on count data (based on a negative binomial distribution). The limma package is the precursor package that enables the consistent application of linear models to normally distributed omics data in general, and microarrays in particuar1515 The name of the package refers to Linear Models for Microarray Data..
The limma
package also implements an empirical Bayes method that borrows information across features to estimate the standard error1616 alike the shinkage gene-wise dispersion estimates see inDESeq2
. and calculate (so called moderate) t statistics. This approach is demonstrably more powerful than a standard t-tests when the number of samples is lot.
The code chunk below illstrated how to set up the model, fit it, and apply the empirical Bayes moderation.
library("limma")
design <- model.matrix(~ prots$condition)
fit <- lmFit(assay(prots), design)
## Warning: Partial NA coefficients for 23 probe(s)
Finally, the topTable()
function is used the extract the results for the coefficient of interest. You can look at the column names of the coefficients table to get their names.
colnames(coefficients(fit))
## [1] "(Intercept)" "prots$condition6B"
res <-
topTable(fit, coef = "prots$condition6B", number = Inf) %>%
rownames_to_column("protein") %>%
as_tibble() %>%
mutate(TP = grepl("ups", protein))
► Question
Note the warning about partial NA
coefficients for 23 probes. Where could these come from?
► Solution
na_coefs <-
filter(res, is.na(t)) %>%
pull(protein)
assay(prots[na_coefs, ])
## 6A_7 6A_8 6A_9 6B_7 6B_8 6B_9
## P00167ups|CYB5_HUMAN_UPS NaN NaN NaN 17.14246 15.90173 16.79995
## P01112ups|RASH_HUMAN_UPS NaN NaN NaN 16.37003 NaN 16.36121
## P05413ups|FABPH_HUMAN_UPS NaN NaN NaN 14.58457 NaN 14.03232
## P08758ups|ANXA5_HUMAN_UPS NaN NaN NaN 15.12913 15.91627 15.83355
## sp|P06704|CDC31_YEAST NaN NaN NaN 16.72331 15.80479 16.33862
## sp|P32608|RTG2_YEAST NaN NaN NaN NaN 13.48761 15.13571
## sp|P32769|HBS1_YEAST NaN 16.54689 17.19290 NaN NaN NaN
## sp|P34217|PIN4_YEAST NaN NaN NaN 17.08866 17.79839 17.72409
## sp|P34237|CASP_YEAST NaN NaN NaN 16.36200 16.27000 16.23528
## sp|P38166|SFT2_YEAST 16.34269 16.85422 NaN NaN NaN NaN
## sp|P40056|GET2_YEAST NaN 16.83923 16.52006 NaN NaN NaN
## sp|P40533|TED1_YEAST NaN NaN NaN 15.87733 NaN 16.17555
## sp|P46965|SPC1_YEAST NaN 14.50215 14.28928 NaN NaN NaN
## sp|P48363|PFD3_YEAST NaN NaN NaN 17.73603 NaN 17.41425
## sp|P53044|UFD1_YEAST NaN NaN NaN 15.53273 16.41615 NaN
## sp|P53091|MCM6_YEAST NaN 16.67714 16.23045 NaN NaN NaN
## [ reached getOption("max.print") -- omitted 7 rows ]
vp <-
res %>%
ggplot(aes(x = logFC, y = -log10(adj.P.Val))) +
geom_point(aes(colour = TP)) +
geom_vline(xintercept = c(-1, 1)) +
geom_hline(yintercept = -log10(0.05)) +
scale_color_manual(values = c("black","red"))
library("plotly")
ggplotly(vp)
Using the pipeline described above, we would would identify a single differentially expressed protein at an 5 percent FDR but miss out the other 32 expected spike-in proteins. We can assess our results in terms of true/false postitves/negatives:
- True positives: 1
- False positives: 0
- True negatives: 1342
- False negatives: 32
As shown below, it is possible to substantially improve these results using robust summarisation, i.e robust regression with M-estimation using Huber weights, as described in section 2.7 in (⊕Sticker et al. 2019Sticker, Adriaan, Ludger Goeminne, Lennart Martens, and Lieven Clement. 2019. “Robust Summarization and Inference in Proteome-Wide Label-Free Quantification.” bioRxiv. https://doi.org/10.1101/668863.).
Figure 8.10: Aggregation using robust summarisation.
- True positives: 21
- False positives: 2
- True negatives: 1340
- False negatives: 12
Project
You are provided with the larger CPTAC data including a third condition (namely C) and a two additional lab, tallying now 27 samples.
LTQ-Orbitrap_86 | LTQ-OrbitrapO_65 | LTQ-OrbitrapW_56 | |
---|---|---|---|
6A | 3 | 3 | 3 |
6B | 3 | 3 | 3 |
6C | 3 | 3 | 3 |
The full design is shown below and is available here.
TRUE | id | condition | lab | previous |
---|---|---|---|---|
6A_1 | 1 | 6A | LTQ-Orbitrap_86 | new |
6A_2 | 2 | 6A | LTQ-Orbitrap_86 | new |
6A_3 | 3 | 6A | LTQ-Orbitrap_86 | new |
6A_4 | 4 | 6A | LTQ-OrbitrapO_65 | new |
6A_5 | 5 | 6A | LTQ-OrbitrapO_65 | new |
6A_6 | 6 | 6A | LTQ-OrbitrapO_65 | new |
6A_7 | 7 | 6A | LTQ-OrbitrapW_56 | |
6A_8 | 8 | 6A | LTQ-OrbitrapW_56 | |
6A_9 | 9 | 6A | LTQ-OrbitrapW_56 | |
6B_1 | 1 | 6B | LTQ-Orbitrap_86 | new |
6B_2 | 2 | 6B | LTQ-Orbitrap_86 | new |
6B_3 | 3 | 6B | LTQ-Orbitrap_86 | new |
6B_4 | 4 | 6B | LTQ-OrbitrapO_65 | new |
6B_5 | 5 | 6B | LTQ-OrbitrapO_65 | new |
6B_6 | 6 | 6B | LTQ-OrbitrapO_65 | new |
6B_7 | 7 | 6B | LTQ-OrbitrapW_56 | |
6B_8 | 8 | 6B | LTQ-OrbitrapW_56 | |
6B_9 | 9 | 6B | LTQ-OrbitrapW_56 | |
6C_1 | 1 | 6C | LTQ-Orbitrap_86 | new |
6C_2 | 2 | 6C | LTQ-Orbitrap_86 | new |
6C_3 | 3 | 6C | LTQ-Orbitrap_86 | new |
6C_4 | 4 | 6C | LTQ-OrbitrapO_65 | new |
6C_5 | 5 | 6C | LTQ-OrbitrapO_65 | new |
6C_6 | 6 | 6C | LTQ-OrbitrapO_65 | new |
6C_7 | 7 | 6C | LTQ-OrbitrapW_56 | new |
6C_8 | 8 | 6C | LTQ-OrbitrapW_56 | new |
6C_9 | 9 | 6C | LTQ-OrbitrapW_56 | new |