Open Mass Spectrometry Search Algorithm
Related papers
Journal of Analytical Chemistry, 2015
High-throughput proteomics technologies are gaining popularity in different areas of life sciences. One of the main objectives of proteomics is characterization of the proteins in biological samples using liquid chromatography/mass spectrometry analysis of the corresponding proteolytic peptide mixtures. Both the complexity and the scale of experimental data obtained even from a single experimental run require specialized bioinformatic tools for automated data mining. One of the most important tools is a so-called proteomics search engine used for identification of proteins present in a sample by comparing experimental and theoretical tandem mass spectra. The latter are generated for the proteolytic peptides derived from a protein database. Peptide identifications obtained with the search engine are then scored according to the probability of a correct peptide-spectrum match. The purpose of this work was to perform a comparison of different search algorithms using data acquired for complex protein mixtures, including both annotated protein standards and clinical samples. The comparison was performed for three popular search engines: commercially available Mascot, as well as open source X!Tandem and OMSSA. It was shown that the search engine OMSSA identifies in general a smaller number of proteins, while X!Tandem and Mascot deliver similar performance. We found no compelling reasons for using the commercial search engine instead of its open source competitor.
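The comparison of experimental and theoretical tandem mass spectra described above can be illustrated with a minimal sketch. It assumes an unmodified peptide, singly charged b- and y-ions only, and a fixed fragment tolerance; the function names `fragment_ions` and `count_matches` are hypothetical and do not reflect how Mascot, X!Tandem, or OMSSA score matches.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565   # mass of H2O, added to y-ions
PROTON = 1.007276   # mass of the charge-carrying proton


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for an unmodified peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # N-terminal prefixes b1 .. b(n-1)
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # C-terminal suffixes y1 .. y(n-1)
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


def count_matches(observed_mz, theoretical_mz, tol=0.5):
    """Count theoretical peaks that have an observed peak within +/- tol (assumed tolerance)."""
    return sum(
        any(abs(obs - theo) <= tol for obs in observed_mz)
        for theo in theoretical_mz
    )
```

A real search engine turns such matched-peak counts into a probability-based score; the sketch stops at counting.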
Journal of Proteomics & Bioinformatics, 2010
The availability of different scoring schemes and filter settings of protein database search algorithms has greatly expanded the number of search methods for identifying candidate peptides from MS/MS spectra. We have previously shown that consensus-based methods that combine three search algorithms yield higher sensitivity and specificity compared to the use of a single search engine (individual method). We hypothesized that the union of four search engines (Sequest, Mascot, X!Tandem and Phenyx) can further enhance sensitivity and specificity. ROC plots were generated to measure the sensitivity and specificity of 5460 consensus methods derived from the same dataset. We found that Mascot outperformed individual methods for sensitivity and specificity, while Phenyx performed the worst. The union consensus methods generally produced much higher sensitivity, while the intersection consensus methods gave much higher specificity. The union methods from four search algorithms modestly improved sensitivity, but not specificity, compared to union methods that used three search engines. This suggests that a strategy based on a specific combination of search algorithms, instead of merely 'as many search engines as possible', may be a key strategy for success with peptide identification. Lastly, we provide strategies for optimizing sensitivity or specificity of peptide identification in MS/MS spectra for different user-specific conditions.
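The trade-off between union and intersection consensus can be sketched in a few lines. This is an illustration under simplifying assumptions: each engine's output is reduced to a set of accepted identification IDs, a ground-truth set is known, and at least one engine is supplied. The names `consensus` and `sensitivity_specificity` are hypothetical, not taken from the paper.

```python
def consensus(engine_hits, mode="union"):
    """Combine per-engine sets of accepted identifications (needs >= 1 engine)."""
    sets = [set(hits) for hits in engine_hits.values()]
    if mode == "union":
        return set().union(*sets)
    if mode == "intersection":
        return set.intersection(*sets)
    raise ValueError("mode must be 'union' or 'intersection'")


def sensitivity_specificity(accepted, true_ids, all_ids):
    """Return (sensitivity, specificity) of an accepted set against known truth."""
    negatives = all_ids - true_ids
    tp = len(accepted & true_ids)
    fn = len(true_ids - accepted)
    fp = len(accepted & negatives)
    tn = len(negatives - accepted)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Toy data (invented for illustration only).
hits = {
    "engine_a": {"s1", "s2", "s3"},
    "engine_b": {"s2", "s3", "s4"},
    "engine_c": {"s3", "s5"},
}
truth = {"s1", "s2", "s3"}
universe = {"s1", "s2", "s3", "s4", "s5", "s6"}

union_result = sensitivity_specificity(consensus(hits, "union"), truth, universe)
inter_result = sensitivity_specificity(consensus(hits, "intersection"), truth, universe)
```

On this toy data the union recovers every true identification but admits both false ones, while the intersection admits none but keeps only one true hit, mirroring the pattern the abstract reports.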
Database searching in mass spectrometry based proteomics
Current …, 2012
Bottom-up proteomics (mass spectrometry analysis of peptides obtained by proteolysis and separated by liquid chromatography, LC-MS/MS) is one of the most frequently used techniques for identifying and characterizing proteins in biological samples. A key element of the analysis is database searching, in which the mass spectra of the peptides are compared with a database of theoretically computed (or experimental) peptide spectra. Here we discuss the main computational approaches to spectrum database searching and the statistical analysis of the results.
Analysis of Peptide MS/MS Spectra from Large-Scale Proteomics Experiments Using Spectrum Libraries
Analytical Chemistry, 2006
A widespread proteomics procedure for characterizing a complex mixture of proteins combines tandem mass spectrometry and database search software to yield mass spectra with identified peptide sequences. The same peptides are often detected in multiple experiments, and once they have been identified, the respective spectra can be used for future identifications. We present a method for collecting previously identified tandem mass spectra into a reference library that is used to identify new spectra. Query spectra are compared to references in the library to find the ones that are most similar. A dot product metric is used to measure the degree of similarity. With our largest library, the search of a query set finds 91% of the spectrum identifications and 93.7% of the protein identifications that could be made with a SEQUEST database search. A second experiment demonstrates that queries acquired on an LCQ ion trap mass spectrometer can be identified with a library of references acquired on an LTQ ion trap mass spectrometer. The dot product similarity score provides good separation of correct and incorrect identifications.
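A dot product similarity between a query spectrum and library references can be sketched as follows. The binning width, the absence of intensity scaling, and the function names (`bin_spectrum`, `dot_product`, `best_match`) are assumptions made for illustration; the abstract does not specify how the spectra were preprocessed.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin_width is an assumed value)."""
    binned = {}
    for mz, intensity in peaks:
        index = int(mz / bin_width)
        binned[index] = binned.get(index, 0.0) + intensity
    return binned


def dot_product(query, reference, bin_width=1.0):
    """Normalized dot product (cosine similarity) of two peak lists, in [0, 1]."""
    q = bin_spectrum(query, bin_width)
    r = bin_spectrum(reference, bin_width)
    numerator = sum(value * r.get(index, 0.0) for index, value in q.items())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in r.values())
    )
    return numerator / norm if norm else 0.0


def best_match(query, library):
    """Return the name of the library entry most similar to the query."""
    return max(library.items(), key=lambda item: dot_product(query, item[1]))[0]
```

Each spectrum is a list of `(m/z, intensity)` pairs and `library` maps a peptide label to its reference spectrum; a production tool would also filter candidates by precursor mass before scoring.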
A Novel Algorithm for Validating Peptide Identification from a Shotgun Proteomics Search Engine
Journal of Proteome Research, 2013
Liquid chromatography coupled with tandem mass spectrometry has revolutionized the proteomics analysis of complexes, cells, and tissues. In a typical proteomic analysis, the tandem mass spectra from an LC/MS/MS experiment are assigned to a peptide by a search engine that compares the experimental MS/MS peptide data to theoretical peptide sequences in a protein database. The peptide-spectrum matches are then used to infer a list of identified proteins in the original sample. However, the search engines often fail to distinguish between correct and incorrect peptide assignments. In this study, we designed and implemented a novel algorithm called De-Noise to reduce the number of incorrect peptide matches and maximize the number of correct peptides at a fixed false discovery rate using a minimal number of scoring outputs from the SEQUEST search engine. The novel algorithm uses a three-step process: data cleaning, data refining through an SVM-based decision function, and a final data refining step based on proteolytic peptide patterns. Using proteomics data generated on different types of mass spectrometers, we optimized the De-Noise algorithm based on the resolution and mass accuracy of the mass spectrometer employed in the LC/MS/MS experiment. Our results demonstrate that De-Noise improves peptide identification compared to other methods used to process the peptide sequence matches assigned by SEQUEST. Because De-Noise uses a limited number of scoring attributes, it can be easily implemented with other search engines.
Mass spectrometry-based shotgun proteomics approaches are currently considered the technology of choice for large-scale proteogenomics due to high throughput, good availability and relative ease of use. Protein mixtures are first digested with a protease, e.g. trypsin, and the resultant peptides are analyzed using liquid chromatography-tandem mass spectrometry. Proteins and peptides are identified from the resultant tandem mass spectra by de novo interpretation of the spectra or by searching databases of putative sequences. Since these data represent the expressed proteins in the sample, they can be used to infer novel proteogenomic features when mapped to the genome. However, high-throughput mass spectrometry instruments can readily generate hundreds of thousands, perhaps millions, of spectra, and the size of genomic databases, such as six-frame translated genome databases, is enormous. Therefore, computational demands are very high, and there is potential inaccuracy in peptide identification due to the large search space. These issues are considered the main challenges that limit the utilization of this approach. In this review, we highlight the efforts of the proteomics and bioinformatics communities to develop methods, algorithms and software tools that facilitate peptide sequence identification from databases in large-scale proteogenomic studies.
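To show why six-frame translated databases grow so large, here is a minimal sketch of six-frame translation. It assumes the standard genetic code and uppercase or lowercase A/C/G/T input; the function names are hypothetical, and codons containing any other symbol are rendered as `X`.

```python
from itertools import product

# Standard genetic code, built in TCAG order; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Return translations for frames +1, +2, +3 and -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Every nucleotide sequence yields six candidate protein sequences, most of which are never expressed, which is one source of the enlarged search space the review describes.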
Proteomics, 2007
A notable inefficiency of shotgun proteomics experiments is the repeated rediscovery of the same identifiable peptides by sequence database searching methods, which often are time-consuming and error-prone. A more precise and efficient method, in which previously observed and identified peptide MS/MS spectra are catalogued and condensed into searchable spectral libraries to allow new identifications by spectral matching, is seen as a promising alternative. To that end, an open-source, functionally complete, high-throughput and readily extensible MS/MS spectral searching tool, SpectraST, was developed. A high-quality spectral library was constructed by combining the high-confidence identifications of millions of spectra taken from various data repositories and searched using four sequence search engines. The resulting library consists of over 30 000 spectra for Saccharomyces cerevisiae. Using this library, SpectraST vastly outperforms the sequence search engine SEQUEST in terms of speed and the ability to discriminate good and bad hits. A unique advantage of SpectraST is its full integration into the popular Trans Proteomic Pipeline suite of software, which facilitates user adoption and provides important functionalities such as peptide and protein probability assignment, quantification, and data visualization. This method of spectral library searching is especially suited for targeted proteomics applications, offering superior performance to traditional sequence searching.
Enhancing Peptide Identification Confidence by Combining Search Methods
Journal of Proteome Research, 2008
Confident peptide identification is one of the most important components in mass-spectrometry-based proteomics. We propose a method to properly combine the results from different database search methods to enhance the accuracy of peptide identifications. The database search methods included in our analysis are SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X! Tandem (v2007.07.01.2), OMSSA (v2.0) and RAId_DbS. Using two data sets, one collected in profile mode and one collected in centroid mode, we tested the search performance of all 21 combinations of two search methods as well as all 35 possible combinations of three search methods. The results obtained from our study suggest that properly combining search methods does improve retrieval accuracy. In addition to performance results, we also describe the theoretical framework which in principle allows one to combine many independent scoring methods including de novo sequencing and spectral library searches. The correlations among different methods are also investigated in terms of common true positives, common false positives, and a global analysis. We find that the average correlation strength, between any pairwise combination of the seven methods studied, is usually smaller than the associated standard error. This indicates only weak correlation may be present among different methods and validates our approach in combining the search results. The usefulness of our approach is further confirmed by showing that the average cumulative number of false positive peptides agrees reasonably well with the combined E-value. The data related to this study are freely available upon request.
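The counts of 21 pairs and 35 triples follow directly from choosing 2 or 3 of the seven listed search methods. A short sketch that enumerates them is shown below; it only lists the combinations and does not implement the paper's E-value combination framework.

```python
from itertools import combinations

# The seven search methods named in the abstract (version numbers omitted).
METHODS = ["SEQUEST", "ProbID", "InsPecT", "Mascot", "X! Tandem", "OMSSA", "RAId_DbS"]

pairs = list(combinations(METHODS, 2))    # every unordered pair of methods
triples = list(combinations(METHODS, 3))  # every unordered triple of methods

print(len(pairs), len(triples))
```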
Molecular & Cellular Proteomics, 2008
Tandem mass spectrometry-based proteomics is currently in great need of computational methods that facilitate the elimination of likely false positives in peptide and protein identification. In the last few years, a number of new peptide identification programs have been described, but scores or other significance measures reported by these programs cannot always be directly translated into an easy-to-interpret error rate measurement such as the false discovery rate. In this work we used generalized lambda distributions to model frequency distributions of database search scores computed by MASCOT, X!TANDEM with k-score plug-in, OMSSA, and InsPecT. From these distributions, we could successfully estimate p values and false discovery rates with high accuracy. From the set of peptide assignments reported by any of these engines, we also defined a generic protein scoring scheme that enabled accurate estimation of protein-level p values by simulation of random score distributions that was also found to yield good estimates of protein-level false discovery rate. The performance of these methods was evaluated by searching four freely available data sets ranging from 40,000 to 285,000 MS/MS spectra.
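Once p values are available, one common way to turn them into false discovery rate estimates is the Benjamini-Hochberg procedure. The sketch below uses that procedure as a stand-in; it is not the generalized lambda distribution modelling described in the abstract, and the function name `benjamini_hochberg` is chosen here for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values), in input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * n
    running_min = 1.0
    # Walk from the largest p value down so each q value is monotone in rank.
    for rank in range(n, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * n / rank)
        q_values[index] = running_min
    return q_values


def accept_at_fdr(p_values, fdr=0.05):
    """Return indices of identifications whose q value is at or below the FDR cutoff."""
    q_values = benjamini_hochberg(p_values)
    return [i for i, q in enumerate(q_values) if q <= fdr]
```

For example, `accept_at_fdr(p_values, fdr=0.05)` keeps only those peptide assignments whose adjusted p value does not exceed 5%.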