Feature Set Downloading:
- [new] Since the related positive reference sets and feature sources have been updated rapidly over the years, sharing only the extracted feature files or partial prediction scores is no longer good enough. Thus, here I share the code and related files used to generate our feature set. Download (both summary and detailed!)
The general framework and the code should be quite useful. You could also look for more recent versions of the related evidence sets to improve on these results.
- Here I share the derived feature sets (the detailed version with 162 features) to save others' time if they are also interested in this problem.
> Feature Details in the data set
- Details about each feature are available at this link.
Group Index | # of features | Dataset | Attribute Property | Data Position in the set |
---|---|---|---|---|
1 | 20 | Gene Expression | Real value: [-1, 1] | 1-20 |
2 | 21 | GO Molecular Function | {1, 0} | 21 - 41 |
3 | 33 | GO Biological Process | {1, 0} | 42 - 74 |
4 | 23 | GO Component | {1, 0} | 75 - 97 |
5 | 1 | Protein Expression | Real value - Non negative | 98 |
6 | 1 | Essentiality | {2 , 1, 0} | 99 |
7 | 1 | HMS_PCI Mass * | { 1, 0} | 100 |
8 | 1 | TAP Mass * | { 1, 0} | 101 |
9 | 1 | Y2H | { 1, 0} | 102 |
10 | 1 | Synthetic Lethal | { 1, 0} | 103 |
11 | 1 | GeneNeighborhood / Gene Fusion / Gene Co-occur | { 1, 0} | 104 |
12 | 1 | Sequence Similarity | Real value - Non negative | 105 |
13 | 4 | Homology based PPI | Discrete: Non-negative (Most 0, 1) | 106 - 109 |
14 | 1 | Domain-Domain Interaction | Real value between [0, 1] | 110 |
15 | 16 | Protein-DNA TF group binding | Non-negative discrete, most 0 | 111 - 126 |
16 | 25 | MIPS Protein Class | { 1, 0} | 127 - 151 |
17 | 11 | MIPS Mutant Phenotype | { 1, 0} | 152 - 162 |
* Matrix model for co-complex and co-pathway prediction. Spoke model for direct PPI prediction.
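The positions in the table above can be turned into column slices for pulling one feature group out of a 162-value feature row. The following is only a minimal sketch: the names `FEATURE_GROUPS`, `covered_positions`, and `group_slice` are hypothetical, the row is assumed to be an in-memory list in the column order of the table, and group 17 is assumed to start at position 152 so that the groups do not overlap and match the listed feature counts. Parsing the shared files is left out because the file format is described only in the read-me files.

```python
# (group index, dataset, first position, last position); positions are
# 1-based and inclusive, transcribed from the table above.
# Assumption: group 17 starts at 152 (11 features, no overlap with group 16).
FEATURE_GROUPS = [
    (1, "Gene Expression", 1, 20),
    (2, "GO Molecular Function", 21, 41),
    (3, "GO Biological Process", 42, 74),
    (4, "GO Component", 75, 97),
    (5, "Protein Expression", 98, 98),
    (6, "Essentiality", 99, 99),
    (7, "HMS_PCI Mass", 100, 100),
    (8, "TAP Mass", 101, 101),
    (9, "Y2H", 102, 102),
    (10, "Synthetic Lethal", 103, 103),
    (11, "GeneNeighborhood / Gene Fusion / Gene Co-occur", 104, 104),
    (12, "Sequence Similarity", 105, 105),
    (13, "Homology based PPI", 106, 109),
    (14, "Domain-Domain Interaction", 110, 110),
    (15, "Protein-DNA TF group binding", 111, 126),
    (16, "MIPS Protein Class", 127, 151),
    (17, "MIPS Mutant Phenotype", 152, 162),
]


def covered_positions(groups=FEATURE_GROUPS):
    """Check that the groups are contiguous from position 1 and
    return the total number of positions they cover."""
    expected_first = 1
    for index, _name, first, last in groups:
        if first != expected_first:
            raise ValueError(
                "group %d starts at %d, expected %d" % (index, first, expected_first)
            )
        expected_first = last + 1
    return expected_first - 1


def group_slice(group_index, groups=FEATURE_GROUPS):
    """Return a 0-based Python slice selecting the columns of one group."""
    for index, _name, first, last in groups:
        if index == group_index:
            return slice(first - 1, last)
    raise KeyError(group_index)


if __name__ == "__main__":
    # Stand-in row whose values equal their own 1-based positions.
    row = list(range(1, 163))
    print(covered_positions())    # 162
    print(row[group_slice(13)])   # [106, 107, 108, 109]
```

Keeping the layout in one table-like list makes it easy to check against the published positions and to drop or keep whole evidence groups when comparing feature subsets.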
> Shared data sets
- Yeast Protein ORF list
- File format read me
- For physical Interaction Task in Detailed feature type
- Positive Set PPI list ( from DIP database )
- Positive Set feature set
- Random Negative Set Protein Pairs list (subset size ~230,000)
- Random Negative Set Protein Pairs Feature
- File format read me
- For co-complex Task in Detailed feature type
- Positive Set PPI list ( from MIPS database Complex catalogue )
- Positive Set feature set
- Random Negative Set Protein Pairs list (subset size ~230,000)
- Random Negative Set Protein Pairs Feature
- File format read me
- For co-pathway Task in Detailed feature type
- Positive Set PPI list ( from KEGG database pathway )
- Positive Set feature set
- Random Negative Set Protein Pairs list (subset size ~230,000)
- Random Negative Set Protein Pairs Feature
- File format read me
- Note: The co-pathway relation is an extremely simplified version of the pairwise protein-protein relationships within a pathway (see the paper for the reference work on this task). The main purpose of using this task here is to allow a comparison with the co-complex and physical interaction tasks. For future research, it is necessary to investigate protein interactions within pathways in a more detailed fashion.
> Note
- "-100" in the feature sets means a missing value in that position!
- For details about the gold standard positive sets shared above, please check the "Gold Standard datasets" section in the paper.
- The negative data set I put here for each task is just a random subset containing ~230,000 yeast protein-protein pairs that are not in the positive PPI set of that specific task.
- In the paper, we assume the size ratio between the positive examples and the negative examples is roughly 1:600 (estimated based on experimental data) when building the train-test sets.
- This ratio is still questionable and needs further discussion.
- If you happen to know a better approach than the strategy I used above, it would be greatly appreciated if you could contact me.
- If you notice any mistakes in the data, please contact me as soon as possible. Thanks in advance!
- FAQ page
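Two of the notes above translate directly into preprocessing steps: treating -100 as a missing value, and drawing negatives at roughly 1:600 against the positives. The sketch below is an illustration under stated assumptions, not the procedure used in the paper: `mask_missing` and `sample_negatives` are hypothetical helper names, feature rows and pair lists are assumed to be already loaded into Python lists, and uniform sampling without replacement is an assumed way of realizing the ratio. Since every feature in the table is either non-negative or lies in [-1, 1], -100 cannot collide with a real value.

```python
import random

MISSING = -100  # sentinel for a missing value in the shared feature sets


def mask_missing(row, missing=MISSING):
    """Return a copy of one feature row with the sentinel replaced by None."""
    return [None if value == missing else value for value in row]


def sample_negatives(num_positives, negative_pool, ratio=600, seed=0):
    """Draw ratio * num_positives negative pairs without replacement.

    Assumption: uniform random sampling from the pool; the ratio of 600
    follows the rough 1:600 positive-to-negative estimate in the notes.
    """
    needed = num_positives * ratio
    if needed > len(negative_pool):
        raise ValueError(
            "pool of %d pairs is too small for %d negatives"
            % (len(negative_pool), needed)
        )
    rng = random.Random(seed)  # fixed seed so a split can be reproduced
    return rng.sample(negative_pool, needed)


if __name__ == "__main__":
    print(mask_missing([0.5, -100, 1, 0]))  # [0.5, None, 1, 0]
    # Stand-in pool of made-up pair labels, only to show the call shape.
    pool = [("P%d" % i, "Q%d" % i) for i in range(1200)]
    print(len(sample_negatives(2, pool)))   # 1200
```

With a pool of ~230,000 negative pairs, this ratio supports at most about 383 positives per sampled set, so larger positive sets would need a larger negative pool or a different ratio.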