MSblender - Marcotte Lab (original) (raw)

MSblender is a statistical tool for merging database search results from multiple database search engines for peptide identification based on a multivariate modeling approach. We presented this work at RECOMB-CP 2011 on March, 2011, and published in the Journal of Proteomics Research in April, 2011 (see MSblender#Citation for details).

Prerequisites

(We have tested our code under Mac OSX (10.5 Leopard) and Ubuntu Linux (10.04 and later). We don't support MS Windows platforms yet.) To run MSblender, you should install the following programs/packages on the machine.

Also, you need to have search engine results to run MSblender. All searches should be conducted with same concatenated database (target + decoy). The current script recognizes 'xf_' prefix at protein ID as a decoy sequence, but you can easily modify this at 'make-msblender_in.py'.

Installation

How to use

MSblender works in three stages: pre-processing, modelling and post-processing.

Pre-processing

First MSblender converts the set of peptide mass spectral search engine results into a unified tab-delimited text file called the 'hit_list' format. Then it transforms this 'hit_list' into an MSblender modelling program input file. You can see a 'test' dataset and its output at http://www.marcottelab.org/users/MSblender/test/.

Currently, MSblender supports the following search engine results (and scores).

For example, you can convert an X!Tandem pepxml file to logE_hit_score as shown below:

$ ../src/MSblender-20110130/pre/tandem_pepxml-to-logE_hit_list.py test.tandem_k.pepxml Write test.tandem_k.logE_hit_list ...

The hit_list file generated by this program looks like this:

pepxml: test.tandem_k.pepxml

#Spectrum_id Charge PrecursorMz MassDiff Peptide Protein MissedCleavages Score(-log10[E-value]) MSups_5ul.07228.07228.4 4 689.596425 0.004000 SLLSNVEGDNAVPMQHNNRPTQPLK CAH1_HUMAN_UPS|P00915|5000|50000|260 0 1.795880 MSups_5ul.11647.11647.2 2 592.839650 0.000000 ADGLAVIGVLMK CAH1_HUMAN_UPS|P00915|5000|50000|260 0 1.148742 MSups_5ul.06405.06405.2 2 524.279350 0.003000 DLFNAIATGK CATA_HUMAN_UPS|P04040|5000|5000|526 0 0.327902 ....

Some search engines report multiple PSMs for a single spectrum (mainly because of different charge state estimations). For example, in the default setting, MyriMatch reports all best hits for both +2 and +3 charge states, so the total number of PSMs is almost twice that of other search engine results. To correct for these effects, you can choose 'the best' PSM per each spectrum based on the score you defined. And 'select-best-PSM.py' is the script for that.

$ wc test.myrimatch.mvh_hit_list 10888 87099 1168772 test.myrimatch.mvh_hit_list $ ../src/MSblender-20110130/pre/select-best-PSM.py test.myrimatch.mvh_hit_list $ wc test.myrimatch.mvh_hit_list_best 5516 44123 598964 test.myrimatch.mvh_hit_list_best

Then, you can compile multiple 'hit_list' files into msblender input file. You need to have a text conf file as below:

InsPect test.inspect.MQscore_hit_list_best MyriMatch test.myrimatch.mvh_hit_list_best SEQUEST test.sequest.xcorr_hit_list_best X!Tandem test.tandem_k.logE_hit_list_best

Then, run 'make-msblender_in.py' script. It will make .msblender_in file and .prot_list.

$ ../src/MSblender-20110130/pre/make-msblender_in.py msblender.conf

'msblender.msblender_in' looks like this:

sp_pep_id decoy InsPect_score MyriMatch_score SEQUEST_score X!Tandem_score MSups_5ul.00439.00439.3.ASLSNTPSIGQ 0 0.031000 NA NA NA MSups_5ul.00439.00439.3.LDELRDEGK 0 NA 18.090108 0.914975 -0.832509 MSups_5ul.00444.00444.1.GQFVK 1 NA 2.598828 NA NA MSups_5ul.00446.00446.3.LDELRDEGK 0 NA 13.341218 0.930569 -0.579784 MSups_5ul.00461.00461.3.ADDKETCFAEEGKK 0 NA 16.846330 1.834260 -0.770852 ...

And 'msblender.prot_list' looks like this:

MSups_5ul.00439.3.ASLSNTPSIGQ XXX.CATG_HUMAN_UPS|P08311|5000|5 MSups_5ul.00439.3.LDELRDEGK ALBU_HUMAN_UPS|P02768|5000|50000|584 MSups_5ul.00444.1.GQFVK xf_KCRM_HUMAN_UPS|P06732|5000|500 MSups_5ul.00446.3.LDELRDEGK ALBU_HUMAN_UPS|P02768|5000|50000|584 MSups_5ul.00461.3.ADDKETCFAEEGKK ALBU_HUMAN_UPS|P02768|5000|50000|584

Multivariate Modeling

Feed 'msbledner_in' file to the 'msblender' executive file under 'src/' directory as below:

$ ~/git/MSblender/src/msblender msblender.msblender_in .... 29 93483.809008 0.5133 TRUE 0.514 3.328 28.868 2.830 0.898

1.795 20.493 0.985 1.516 20.493 502.115 24.678 37.461 0.985 24.678 1.670 2.143 1.516 37.461 2.143 3.484

0.486 4.182 51.858 3.870 2.507

1.630 14.148 0.815 1.226 14.148 380.431 17.345 26.407 0.815 17.345 1.302 1.646 1.226 26.407 1.646 2.589

30 93484.230685 0.5133 TRUE 0.486 4.182 51.858 3.870 2.507

1.630 14.148 0.815 1.226 14.148 380.431 17.345 26.407 0.815 17.345 1.302 1.646 1.226 26.407 1.646 2.589

0.514 3.328 28.868 2.830 0.898

1.795 20.493 0.985 1.516 20.493 502.115 24.678 37.461 0.985 24.678 1.670 2.143 1.516 37.461 2.143 3.484 $

The program will terminate when it converges. If the number of iterations reaches your initial setting (here, 100), run the script again allowing for more iterations.

Now you should see an output file named 'test.msblender_in.msblender_out' in the same directory. The file looks like this:

Spectrum Decoy InsPect_score MyriMatch_score SEQUEST_score X!Tandem_score mvScore MSups_5ul.00439.00439.3.ASLSNTPSIGQ F 0.03 0.006 MSups_5ul.00439.00439.3.LDELRDEGK F 18.09 0.91 -0.83 1.000 MSups_5ul.00444.00444.1.GQFVK D 2.60 0.085 MSups_5ul.00446.00446.3.LDELRDEGK F 13.34 0.93 -0.58 1.000 MSups_5ul.00461.00461.3.ADDKETCFAEEGKK F 16.85 1.83 -0.77 1.000 MSups_5ul.00590.00590.2.AAFTECCQAADK F 4.80 34.62 3.17 1.39 1.000 ...

Post-processing

'make-spcount.py' script at 'post/' directory can estimate numbers of spectral counts (and peptides) at a given FDR cutoff, based on the mvscore model.

$ ~/git/MSblender/post/make-spcount.py msblender.msblender_in.msblender_out msblender.prot_list 0.01 Read msblender.prot_list ... Done Read msblender.msblender_in.msblender_out ... Done $

'.spcount_FDR0010', a output with spectral counts per protein, looks like this:

#ProtID TotalCount MSups_5ul ALBU_HUMAN_UPS|P02768|5000|50000|584 1300.00 1300.00 ANT3_HUMAN_UPS|P01008|5000|50|432 4.00 4.00 ANXA5_HUMAN_UPS|P08758|5000|0.5|319 1.00 1.00 CAH1_HUMAN_UPS|P00915|5000|50000|260 556.50 556.50 CAH2_HUMAN_UPS|P00918|5000|50000|259 391.00 391.00

And, '.pep_list_FDR0010' file looks like this:

ALBU_HUMAN_UPS|P02768|5000|50000|584 MSups_5ul ETYGEMADCCAKQEPERNECFLQHK 4.00 ALBU_HUMAN_UPS|P02768|5000|50000|584 MSups_5ul CCAAADPHECYAK 2.00 ALBU_HUMAN_UPS|P02768|5000|50000|584 MSups_5ul YLYEIARR 2.00 ALBU_HUMAN_UPS|P02768|5000|50000|584 MSups_5ul SLHTLFGDKLCTVATLR 90.00

Citation

See also