Finding Motif for Subcellular Localization
Supporting Website for the Paper:
Discriminative Motif Finding for Subcellular Localization based on Profile Hidden Markov Models
Tien-ho Lin, Robert F. Murphy, and Ziv Bar-Joseph
Abstract
Knowing the subcellular location of proteins is important for understanding their functions. Many methods have been described to predict subcellular location from sequence information. However, most of these methods either rely on global sequence properties or use a set of known protein targeting motifs to predict protein localization. Here we develop and test a novel method that identifies potential targeting motifs using a discriminative approach based on hidden Markov models (discriminative HMMs). These models search for motifs that are present in a compartment but absent in other, nearby compartments by utilizing a hierarchical structure that mimics the protein sorting mechanism. We show that both discriminative motif finding and the hierarchical structure improve localization prediction on a benchmark dataset of yeast proteins. The motifs identified can be mapped to known targeting motifs, and they are more conserved than the average protein sequence. Using our motif-based predictions, we can identify what we believe are annotation errors in public databases for the location of some of the proteins.
Prediction of all yeast proteins
- Click here to download the predicted localization of all yeast proteins.
- Sequences of 6,782 proteins were downloaded from SwissProt.
- The file is in plain text format (a tab-separated table) and is straightforward to read. The confidence of every protein-location pair is provided.
- Prediction method: Based on a dataset of 1,521 proteins with verified localization in SwissProt, we extract discriminative HMM motifs for each location and train an SVM classifier using the motifs as features. Both procedures are performed at every split of a protein sorting tree structure. Confidence is estimated using SVM margins. See the paper for details and a comparison to other methods.
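
The idea of classifying at every split of a sorting tree can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tree shape, the compartment names, the motif names (motif_A, motif_B), and the hand-set linear weights are hypothetical and are not taken from the paper or the software package. In the actual method the decision functions are trained SVMs over discriminative HMM motif features. The sketch only returns the raw margin at each split; how margins are turned into a final confidence is described in the paper and is not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """A node of a hypothetical protein sorting tree."""
    name: str
    # Linear decision function over motif features; leaves have weights=None.
    weights: Optional[Dict[str, float]] = None
    bias: float = 0.0
    left: Optional["Node"] = None    # branch taken when the margin is negative
    right: Optional["Node"] = None   # branch taken when the margin is >= 0


def margin(node: Node, features: Dict[str, float]) -> float:
    """Signed decision value w.x + b at one split (stand-in for an SVM margin)."""
    return sum(w * features.get(m, 0.0) for m, w in node.weights.items()) + node.bias


def predict(root: Node, features: Dict[str, float]) -> Tuple[str, List[float]]:
    """Walk from the root to a leaf, recording the margin at each split."""
    node, margins = root, []
    while node.weights is not None:
        m = margin(node, features)
        margins.append(m)
        node = node.right if m >= 0 else node.left
    return node.name, margins


# Illustrative two-level tree with made-up weights (not the paper's tree).
intracellular = Node(
    "intracellular",
    weights={"motif_B": 2.0},
    bias=-1.0,
    left=Node("cytosol"),
    right=Node("nucleus"),
)
root = Node(
    "root",
    weights={"motif_A": 1.0},
    bias=-0.5,
    left=intracellular,
    right=Node("secretory"),
)

# Feature values stand for motif match scores of a single protein.
location, margins = predict(root, {"motif_B": 1.0})
print(location, margins)
```

Because each split only has to separate neighboring branches, the motifs used at a split can be specific to that decision, which is the intuition behind searching for motifs that are present in one compartment but absent in nearby ones.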
Software download
- Click here to download software and dataset (34MB).
- See README in the package for documentation.
Please email Tien-ho Lin with any questions.