Greedily building protein networks with confidence

Abstract

Motivation: With genome sequences complete for human and model organisms, it is essential to understand how individual genes and proteins are organized into biological networks. Much of the organization is revealed by proteomics experiments that now generate torrents of data. Extracting relevant complexes and pathways from high-throughput proteomics data sets has posed a challenge, however, and new methods to identify and extract networks are essential. We focus on the problem of building pathways starting from known proteins of interest.

Results: We have developed an efficient, greedy algorithm, SEEDY, that extracts biologically relevant networks from protein–protein interaction data, building out from selected seed proteins. The algorithm relies on our previous study establishing statistical confidence levels for interactions generated by two-hybrid screens and inferred from mass spectrometric identification of protein complexes. We demonstrate the ability to extract known yeast complexes from high-throughput protein interaction data with a tunable parameter that governs the trade-off between sensitivity and selectivity. DNA damage repair pathways are presented as a detailed example. We highlight the ability to join heterogeneous data sets, in this case protein–protein interactions and genetic interactions, and the appearance of cross-talk between pathways caused by re-use of shared components.
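The greedy expansion from seed proteins can be illustrated with a short sketch. The abstract does not spell out SEEDY's actual procedure, so everything here is an assumption made for illustration: the function name `grow_network`, the scoring of a path as the product of its edge confidences, and the use of a confidence cutoff (`min_confidence`) as the single tunable parameter. With a binary heap, this kind of best-first search runs in O[(E + V) log V] time, which matches the running time quoted for SEEDY. The protein names and confidence values are toy data.

```python
import heapq
from collections import defaultdict


def grow_network(edges, seeds, min_confidence):
    """Best-first expansion from seed proteins (illustrative, not the SEEDY source).

    edges: iterable of (protein_a, protein_b, confidence), confidence in (0, 1].
    Returns {protein: best path confidence from any seed} for every protein
    whose best path confidence is at least min_confidence.
    """
    # Undirected adjacency list: protein -> [(neighbor, edge confidence), ...]
    graph = defaultdict(list)
    for a, b, conf in edges:
        graph[a].append((b, conf))
        graph[b].append((a, conf))

    best = {}
    # heapq is a min-heap, so store negated confidences to pop the largest first.
    heap = [(-1.0, seed) for seed in seeds]
    heapq.heapify(heap)
    while heap:
        neg_conf, protein = heapq.heappop(heap)
        if protein in best:
            continue  # already reached along a path at least as confident
        best[protein] = -neg_conf
        for neighbor, edge_conf in graph[protein]:
            path_conf = -neg_conf * edge_conf
            # Assumed stopping rule: drop paths that fall below the cutoff.
            if neighbor not in best and path_conf >= min_confidence:
                heapq.heappush(heap, (-path_conf, neighbor))
    return best


# Toy interaction list (hypothetical proteins and confidences).
toy_edges = [("A", "B", 0.9), ("B", "C", 0.8), ("A", "C", 0.5), ("C", "D", 0.3)]
network = grow_network(toy_edges, seeds=["A"], min_confidence=0.5)
print(sorted(network))  # ['A', 'B', 'C']
```

Lowering `min_confidence` admits weaker paths and yields a larger network (here, a cutoff of 0.2 would also pull in D), which is one way a single parameter could trade sensitivity against selectivity.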

Significance and comparison: SEEDY is fast, with running time O[(E + V) log V] for V proteins and E interactions; a single adjustable parameter controls the size of the pathways that are generated; and an associated P-value indicates the statistical confidence that the pathways are enriched for proteins with a coherent function. Previous approaches have focused on extracting sub-networks by identifying motifs enriched in known biological networks. SEEDY provides the complementary ability to perform a directed search based on proteins of interest.
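An enrichment P-value of the kind mentioned above is often computed with a hypergeometric tail test. The abstract does not say which test SEEDY uses, so the sketch below is only one plausible choice; the function name `enrichment_p_value` and its parameters are hypothetical.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Upper-tail hypergeometric probability P(X >= hits).

    population: total number of proteins in the interaction data
    annotated:  proteins in the population carrying the function of interest
    sample:     proteins in the extracted pathway
    hits:       proteins in the pathway carrying that function
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    # Sum the probability of drawing k annotated proteins for every k >= hits.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy numbers: 4 of 10 proteins are annotated; a 3-protein pathway contains 2.
p = enrichment_p_value(population=10, annotated=4, sample=3, hits=2)
print(p)
```

A small P-value means that a random set of proteins of the same size would rarely contain as many proteins sharing the function.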

Availability: SEEDY software (Perl source), data tables and confidence score models (R source) are freely available from the author.

Present Address: Department of Biomedical Engineering, Johns Hopkins University, 3400 N. Charles St, Baltimore, MD 21218, USA

Author notes

CuraGen Corporation, 555 Long Wharf Drive, New Haven, CT 06511, USA