
Genome-wide inference of protein interaction sites: lessons from the yeast high-quality negative protein-protein interaction dataset.
MedLine Citation:
PMID:  18281313     Owner:  NLM     Status:  MEDLINE    
Abstract/OtherAbstract:
High-throughput studies of protein interactions may have produced, experimentally and computationally, the most comprehensive protein-protein interaction datasets in the completely sequenced genomes. These datasets provide an opportunity, on a proteome scale, to discover the underlying protein interaction patterns. Here, we propose an approach to discovering motif pairs at interaction sites (often 3-8 residues) that are essential for understanding protein functions and helpful for the rational design of protein engineering and folding experiments. A gold standard positive (interacting) dataset and a gold standard negative (non-interacting) dataset were mined to infer the interacting motif pairs that are significantly overrepresented in the positive dataset compared to the negative dataset. Four negative datasets assembled by different strategies were evaluated and the one with the best performance was used as the gold standard negatives for further analysis. Meanwhile, to assess the efficiency of our method in detecting potential interacting motif pairs, we compared it with previously developed approaches and found that our method achieved the highest prediction accuracy. In addition, many uncharacterized motif pairs of interest were found to be functional with experimental evidence in other species. This investigation demonstrates the important effects of a high-quality negative dataset on the performance of such statistical inference.
Authors:
Jie Guo; Xiaomei Wu; Da-Yong Zhang; Kui Lin
Publication Detail:
Type:  Evaluation Studies; Journal Article; Research Support, Non-U.S. Gov't     Date:  2008-02-14
Journal Detail:
Title:  Nucleic acids research     Volume:  36     ISSN:  1362-4962     ISO Abbreviation:  Nucleic Acids Res.     Publication Date:  2008 Apr 
Date Detail:
Created Date:  2008-04-04     Completed Date:  2008-05-08     Revised Date:  2009-11-18    
Medline Journal Info:
Nlm Unique ID:  0411011     Medline TA:  Nucleic Acids Res     Country:  England    
Other Details:
Languages:  eng     Pagination:  2002-11     Citation Subset:  IM    
Affiliation:
MOE Key Laboratory for Biodiversity Science and Ecological Engineering and College of Life Sciences, Beijing Normal University, Beijing 100875, China.
MeSH Terms
Descriptor/Qualifier:
Data Interpretation, Statistical
Databases, Protein
Genomics*
Protein Interaction Domains and Motifs*
Protein Interaction Mapping* / standards
Reference Standards
Yeasts / genetics*

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): Nucleic Acids Res
Journal ID (publisher-id): nar
Journal ID (hwp): nar
ISSN: 0305-1048
ISSN: 1362-4962
Publisher: Oxford University Press
Article Information
© 2008 The Author(s)
creative-commons: This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received Day: 10 Month: 9 Year: 2007
Revision Received Day: 17 Month: 12 Year: 2007
Accepted Day: 10 Month: 1 Year: 2008
collection publication date: Month: 4 Year: 2008
Print publication date: Month: 4 Year: 2008
Electronic publication date: Month: 4 Year: 2008
Volume: 36 Issue: 6
First Page: 2002 Last Page: 2011
ID: 2346601
DOI: 10.1093/nar/gkn016
Publisher Id: gkn016
PubMed Id: 18281313

Genome-wide inference of protein interaction sites: lessons from the yeast high-quality negative protein–protein interaction dataset
Jie Guo
Xiaomei Wu
Da-Yong Zhang
Kui Lin*
MOE Key Laboratory for Biodiversity Science and Ecological Engineering and College of Life Sciences, Beijing Normal University, Beijing 100875, China
Correspondence: *To whom correspondence should be addressed. +86 10 58805045; +86 10 58807721; linkui@bnu.edu.cn

INTRODUCTION

With the advent of high-throughput technologies such as yeast two-hybrid assays (1–5), and the development of various computational methods, either by integrating the vast amount of biological information contained in the genomic datasets (6,7) or by mining from an existing knowledgebase (8,9), rich data resources of interacting proteins have been produced and stored in publicly accessible databases (10–13). Constructing a map of protein–protein interactions is essential not only from a theoretical stance of studying cellular behavior and the machinery of a proteome, but also in the light of potential practical applications such as new drug design (14,15). By intensive analysis and comparison of protein-interaction networks, many studies have emerged to investigate the large-scale biological properties buried in the networks from functional and evolutionary aspects (16), for instance, protein function annotation (17) and interaction interface identification (18). To date, a variety of statistical data analysis techniques have been applied to address these issues, the capability of which depends largely on the accuracy of the protein-interaction dataset (positives), and equally importantly, the non-interaction dataset (negatives).

Currently, high-quality positive datasets have been assembled by combining multiple interaction datasets or integrating additional genomic evidence (19,20). However, the data collected by those methods are far from complete compared with the vast number of possible interactions (21). A further complication is how to define and assemble a high-quality negative dataset for a statistical analysis system. Negative datasets have a strong effect on the performance of comparative statistical analyses, especially in machine-learning algorithms. The problems induced by lacking negatives cannot be addressed by fine-tuning parameters or finding better statistical methods (22). Currently, the two main strategies employed in the literature for assembling negative examples are selection of protein pairs from separate cellular compartments (22) and random selection of protein pairs (23–25). Each strategy has its own limitation. Two proteins localizing to different cellular components could still interact with each other (e.g. in the nucleus and cytoplasm, respectively). Negative examples selected at random can often be contaminated with positive ones because the known protein-interaction network is incomplete.

To date, protein–protein interaction data do not provide explicit information about the specific regions of the proteins involved in binding or docking. These specific regions, in general only a subset of residues or very short and specific sequence segments (often 3–8 residues) within both interacting proteins, are critical for the highly specific recognition at the contact interface (referred to as the interaction or binding sites) (26–28). Such binding sites are implicated in many fundamental biological processes, including phosphorylation, modification and disease pathways, especially in signaling networks (29–31). Therefore, accurate identification of such interaction sites is essential for understanding protein function, and helpful for the rational design of protein engineering and folding experiments (32–34). Many highly efficient computational methods have been developed to assist the discovery of potential binding sites, especially through mining those protein-interaction datasets produced by high-throughput techniques on a genome-wide scale. In the past few years, most efforts for the prediction of interaction-site pairs were concentrated on finding interaction correlations between domain pairs by statistical analyses (35–43). Nonetheless, it is well known that the actual interaction sites directly responsible for protein binding are probably smaller than the whole domains, and are just subregions of the interacting domains. Recently, several studies have used protein–protein interactions in conjunction with prior biological knowledge to yield a set of putative interacting motif pairs. Li and Li used protein–protein interactions and protein complexes derived from the Protein Data Bank (PDB) to identify stable and significant binding motif pairs that occur in protein-interaction datasets at a frequency unexpected by chance (44). Later, Li et al. mined all-versus-all interaction subnetworks to discover motif pairs at interaction sites on a proteome-wide scale (45). Tan et al. proposed a novel algorithm, D-MOTIF, to infer correlated motifs from interaction data (46). Yu et al. applied the AdaBoost algorithm to predict motif pairs from known interactions and putative non-interacting protein pairs (47). Wang et al. proposed a modified model inspired by Deng et al. (36) and Riley et al. (37) to predict interacting motif/domain pairs, and in particular, the specific binding regions involved in a certain protein interaction (43).

In this study, we focused on identifying motif pairs at interaction sites expected to mediate protein–protein interactions by mining both gold standard positives (GSPs) and gold standard negatives (GSNs) in yeast. Because protein-interaction sites are more conserved than the rest of the protein surface (48), we used short linear peptide motifs to represent the interaction sites (often 3–8 residues) where protein interactions take place. The linear motifs conform to particular sequence patterns indicative of a particular function. Currently, there are several motif databases such as the Eukaryotic Linear Motif (ELM) database (49), PROSITE (50), ScanSite (51) and Minimotif Miner (MnM) (52). Of these, MnM is a newly published motif database with a broad functional spectrum, and its contents were compiled by searching the literature or exploring other public databases including PROSITE, ELM and Peptide Cutter. All motifs in MnM have been published and validated with experimental evidence. Because of its high quality, the motifs in MnM were used to annotate the yeast proteins in our study.

The GSP dataset was generated by measuring the relationship strength (including the functional association or the localization proximity) between two different proteins using a relative specificity similarity (RSS) method. This was achieved by exploring the information buried in the Gene Ontology (GO) and GO annotations in our previous study (8,9). The reconstructed yeast protein–protein interaction map was shown to have a high confidence level when validated using the widely used evaluation dataset compiled from MIPS (53). Four negative datasets were generated by different methods, including a dataset of randomly selected protein pairs, a dataset of protein pairs with different cellular sublocalizations, and two datasets generated with different confidence levels based on the RSS method designed in (9). Furthermore, the quality of the four negative datasets was evaluated and compared. Of these, the one with the best performance was considered as the GSN dataset. To identify putative interacting motif pairs that are statistically overrepresented in their occurrence in the GSPs compared to the GSNs, two distinct statistical tests, the exact binomial test and Fisher's exact test, were integrated. The performance of the predicted results was validated by mapping the inferred motif pairs to three widely used datasets including iPfam (54), DOMINO (55) and the Yeast Core subset in DIP (56). Moreover, we also compared our method with previously developed methods, and found that our method outperformed the others in terms of prediction precision and coverage. These results demonstrate that, by incorporating a high-quality negative dataset, our method is capable of identifying the interacting motif pairs mediating protein–protein interactions.


MATERIALS AND METHODS
Motif assignments

The motif definitions were drawn from the MnM database. The MnM motif database (the release of Jun 13, 2007) compiles 611 distinct motifs involved in a broad range of functions, such as posttranslational modifications; binding to proteins, nucleic acids or small molecules; protein trafficking; and so on. Information on the subcellular localizations of a motif is also provided, and was utilized as a criterion to filter out false positive motif assignments in this study. We simply specified that if a motif and a protein localize to different subcellular compartments, the assignment of the motif to the protein is abandoned. Proteins left without any motif assignment were also discarded. The filtering process is as follows. First, both the proteins observed in the GSPs and GSNs and the motifs in MnM were annotated with one or more GO cellular compartment (CC) terms. A motif was then assigned to a protein only if there was a path between one CC term of the protein and one CC term of the motif.
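The compartment filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "a path between" two CC terms means that one term is the other or an ancestor of it in the GO graph, and the small term hierarchy, `PARENTS`, and all function names are hypothetical.

```python
from collections import deque

# Toy child -> parents map for a few cellular-compartment terms (illustrative only).
PARENTS = {
    "nucleolus": ["nucleus"],
    "nucleus": ["intracellular"],
    "cytoplasm": ["intracellular"],
    "mitochondrion": ["cytoplasm"],
    "intracellular": ["cell"],
    "cell": [],
}

def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term reachable by following parent links."""
    seen = {term}
    queue = deque([term])
    while queue:
        for parent in parents.get(queue.popleft(), []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

def connected(term_a, term_b, parents=PARENTS):
    """Assumed reading of 'a path': one term equals or is an ancestor of the other."""
    return term_a in ancestors(term_b, parents) or term_b in ancestors(term_a, parents)

def keep_assignment(protein_terms, motif_terms, parents=PARENTS):
    """Keep a motif assignment if any protein CC term connects to any motif CC term."""
    return any(connected(p, m, parents) for p in protein_terms for m in motif_terms)
```

Under this reading, a nucleolar protein keeps a motif annotated to the nucleus but loses one annotated only to the mitochondrion.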

Positive- and negative-interaction datasets

In our previous work (9), we reconstructed a map of potential protein–protein interactions by fully exploring the information contained in the Biological Process (BP) and CC annotations of GO for the yeast genome. The premises of our method were: (a) interacting proteins often function in the same biological process and (b) interacting proteins should exist in close proximity. This was achieved by comparing the relative specificity similarity (RSS) of pairs of GO terms assigned to the two proteins within a GO DAG. The RSS is a new metric of semantic similarity used to score the degree of the functional association or localization proximity between two different proteins. The RSS values for CC and BP ontologies are denoted as RSSCC and RSSBP. We created a GSP dataset using protein pairs with values of RSSCC >0.80 and RSSBP >0.80 based on a new release of GO (the March 2006 release) and the GO annotations derived from SGD (submitted on March 31, 2006), which is now stored in the SPIDer database (8). To improve the quality of the GSP dataset, here we used the more stringent criterion of RSSCC >0.85 and RSSBP >0.85 (referred to as WGSPs). After motif assignments using a cellular compartment filter (as described earlier), the WGSP dataset consisted of 46 031 interacting protein pairs encompassing 2678 proteins. To assess how likely the protein pairs in the WGSP dataset were to interact physically, we created a high-quality validation dataset, called 'valid experimental interactions' (VEIs). VEIs combine the binary interactions from the MIPS complexes, the MIPS small-scale physical interactions, and the integrated interactions from de Lichtenberg et al. (57). There were 12 345 unique binary interactions among 1905 proteins in the VEIs. The MIPS complexes and the MIPS physical interactions are often used as, or as part of, the 'gold standard positives' to validate various prediction methods (19,58,59) and are also used to assess high-throughput interaction datasets (60,61). The WGSP dataset covered about 81% of the VEIs, indicating that WGSPs had a high confidence level. Thereafter, we simply used GSPs to refer to this new GSP dataset (WGSPs).

Four negative datasets assembled by different strategies were constructed in this study. (i) RGSNs: random pairs of proteins that are not known to interact. (ii) SGSNs: as described in (19), the protein pairs in SGSNs were selected from lists of proteins in separate subcellular compartments (cytoplasm, nucleus, mitochondrion and exocytic network) (62) according to the yeast localization data in GO (the details of the construction of SGSNs are available in the Supplementary Materials). According to the distribution of RSS values in the CC and BP ontologies, the RSSCC and RSSBP values were roughly divided into three confidence levels, high (H), medium (M) and low (L) confidence (see Supplementary Materials Figures S1 and S2). The other two negative datasets, W1GSNs and W2GSNs, were then created based on different combinations of RSSCC and RSSBP. (iii) W1GSNs: protein pairs that have both RSSCC and RSSBP values with low confidence levels, namely the ones localizing in different cellular components and involved in weakly related or unrelated biological processes. (iv) W2GSNs: protein pairs that have RSSBP values with low confidence level and RSSCC values with medium or low confidence level. In contrast to W1GSNs, W2GSNs had a larger size because they include protein pairs localizing in relatively close cellular components (RSSCC value with medium confidence) but involved in weakly related or unrelated biological processes (RSSBP value with low confidence level). Because the number of randomly selected protein pairs is very large, the size of RGSNs was simply chosen to be equal to that of W2GSNs. After motif assignments using the cellular compartment filter, W1GSNs, W2GSNs, RGSNs and SGSNs retained 66 183, 596 669, 645 009 and 3 815 110 protein pairs, respectively. For fair comparison, the four negative datasets were created from the same protein set that comprised 3654 proteins.

Statistical analysis

To measure the overrepresentation of the occurrence of motif pairs in positives compared to negatives, two distinct statistical models for counting the occurrence of motif pairs were adopted. Furthermore, the problem of multiple testing was taken into account in the process of statistical analysis.

One-tailed exact binomial test

The exact binomial test uses the binomial distribution model to compare the rate of the observed occurrence of a motif pair to the expected rate. The motif pairs both significantly overrepresented in the GSPs and significantly underrepresented in the GSNs were determined to be putative interacting motif pairs. Thus, using the R statistics package, two P-values were calculated for a given motif pair, one corresponding to the statistical significance in the GSPs and the other in the GSNs. Three basic parameters are required for the exact binomial test: the number of successes, the number of trials and the hypothesized probability of success. For a motif pair Mij in protein pair dataset I of size N encompassing n proteins, the three parameters respectively correspond to Xij (the observed number of protein pairs containing Mij, where one protein contains the motif i and its partner contains the other motif j), N (the size of the protein pair dataset I) and Efij (the expected frequency of protein pairs containing Mij). Efij was calculated as $Ef_{ij} = S_{ij} / \binom{n}{2}$, where $\binom{n}{2} = n(n-1)/2$ is the size of the universe of protein pairs collected from the n proteins (homo-pairs were excluded) and Sij is the number of all the protein pairs containing Mij in the universe. The exact binomial test is performed to evaluate significant differences in the rate of the occurrence of motif pairs, and thus is particularly good at detecting increased prevalence of common motif pairs.
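The two one-tailed binomial P-values can be sketched in a few lines. The study itself used the R statistics package; the Python version below is only an illustration under stated assumptions: the function names are hypothetical, the expected frequency is taken as Sij divided by the number of unordered protein pairs, and 0 < p < 1 is required because the probabilities are evaluated in log space.

```python
from math import comb, exp, lgamma, log, log1p

def expected_frequency(s_ij, n_proteins):
    """Ef_ij: pairs containing the motif pair divided by all unordered protein pairs."""
    return s_ij / comb(n_proteins, 2)

def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes in n trials (requires 0 < p < 1)."""
    log_coeff = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coeff + k * log(p) + (n - k) * log1p(-p))

def p_overrepresented(x, n, p):
    """One-tailed P-value P(X >= x), as would be used for the GSPs."""
    return sum(binom_pmf(k, n, p) for k in range(x, n + 1))

def p_underrepresented(x, n, p):
    """One-tailed P-value P(X <= x), as would be used for the GSNs."""
    return sum(binom_pmf(k, n, p) for k in range(0, x + 1))
```

Here `x` plays the role of Xij, `n` of N and `p` of Efij. Summing the tail term by term is slow for datasets with tens of thousands of pairs, so a statistics library would normally be preferred in practice.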

One-tailed Fisher's exact test

In contrast to the exact binomial test, Fisher's exact test uses a hypergeometric distribution model to compare the proportion of protein pairs containing a motif pair in the GSPs to that in the GSNs, and therefore is good at detecting rare motif pairs that occur less frequently in interacting protein pairs. For Fisher's exact test, a 2 × 2 contingency table of frequencies is created for each motif pair, in which the two rows represent the GSPs and GSNs, respectively, and the two columns represent the numbers of protein pairs containing the given motif pair and the ones not containing the motif pair, respectively. Using the R statistics package, each motif pair is assigned a P-value.
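As an illustration of the contingency-table layout described above, the one-tailed Fisher P-value can be computed directly from the hypergeometric distribution. This is a sketch with a hypothetical function name, not the R call used in the study; the argument order (a, b for the GSP row, c, d for the GSN row) is an assumption made for the example.

```python
from math import comb

def fisher_one_tailed(a, b, c, d):
    """
    One-tailed Fisher's exact test for enrichment in the first row.

    Table layout (rows: GSPs, GSNs; columns: with motif pair, without motif pair):
        a  b
        c  d
    Returns the probability of observing a or more pairs with the motif pair
    in the first row, given the fixed row and column totals.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    k_max = min(row1, col1)
    # Sum hypergeometric numerators over all tables at least as extreme as observed.
    numerator = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1))
    return numerator / comb(row1 + row2, col1)
```

For the small table with rows (3, 1) and (1, 3), the sum has two terms, 16 and 1, out of 70 possible tables, giving 17/70. Exact integer arithmetic keeps the result precise but becomes slow for very large tables.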

Multiple testing problem

The q-value method proposed by Storey (63,64) was employed to control the false discovery rate (FDR). The q-value measures the expected proportion of false positives incurred when a test is called significant. Similar to a P-value, the q-value can be considered a measure of statistical significance. We used QVALUE software, which takes a list of P-values resulting from the simultaneous tests as input and estimates their q-values (63). The q-value can be calculated for each test and ranked in ascending order. In practice, a cutoff for null hypothesis rejection was set to 0.05 to ensure a 5% FDR.
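The idea behind the q-value can be sketched as follows. This is a simplified variant and not the QVALUE software: it estimates the proportion of true null hypotheses at a single fixed tuning value (the `lam` parameter, set here to 0.5 as an assumption), whereas the published method smooths the estimate over a range of values. The function name is hypothetical.

```python
def q_values(p_values, lam=0.5):
    """Estimate q-values from P-values with a fixed-lambda estimate of pi0."""
    m = len(p_values)
    # pi0: estimated proportion of true nulls, from the P-values above lam.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    if pi0 == 0.0:
        pi0 = 1.0  # assumption: fall back to the conservative choice when no P-value exceeds lam
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so each q-value is the minimum
    # estimated FDR over all thresholds at or above that P-value.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q
```

Motif pairs would then be called significant when their q-value falls below the chosen cutoff.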

Validation datasets

Currently, comprehensive interacting motif pair data do not yet exist and are difficult to assemble. Fortunately, there are several high-quality databases of interaction sites, such as iPfam, DOMINO and the Saccharomyces cerevisiae core subset in DIP (Yeast Core). Here, we defined a pair of sequence segments with exact start and end positions to represent an interaction-site pair. iPfam is a popular database of domain–domain interactions derived from the protein complexes in PDB (54). It contains 3020 domain–domain interactions (version 20). DOMINO is a database of domain–peptide interactions storing more than 3900 annotated interactions with experimentally verified evidence (55), from which only the segment pairs with both exactly annotated start and end positions were used in this study. In addition, a high-confidence Yeast Core dataset of protein interactions in DIP generated by merging several high-quality subsets from experimental and computational validation (56) was used. The sequences of proteins composing the interacting protein pairs could be regarded as the maximal potential interaction regions. The core dataset (the release of 7 January 2007) contains 17 420 protein–protein interactions encompassing 4909 proteins. Note that for iPfam and DOMINO, only the segment pairs in S. cerevisiae were chosen.

We defined a motif as mapped to a sequence segment if at least one instance of the motif is nested within the segment. We then defined a motif pair as mapped to a segment pair if both members of the motif pair can be mapped to those of the segment pair. Finally, after motif assignment using the cellular compartment filter, the numbers of segment pairs for the iPfam, DOMINO and Yeast Core datasets were 351, 392 and 12 680, respectively.
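The mapping rule can be illustrated with a short sketch. The data layout is hypothetical: motif instances are stored per (protein, motif) as inclusive start and end positions, a segment is a (protein, start, end) tuple, and the example assumes that the order of the two segments in a pair does not matter.

```python
# Hypothetical motif instances: (protein, motif) -> list of (start, end) positions.
INSTANCES = {
    ("P1", "mA"): [(10, 14)],
    ("P2", "mB"): [(100, 105), (300, 304)],
}

def motif_maps(motif, segment, instances):
    """True if at least one instance of the motif lies entirely within the segment."""
    protein, seg_start, seg_end = segment
    return any(
        seg_start <= start and end <= seg_end
        for start, end in instances.get((protein, motif), [])
    )

def pair_maps(motif_pair, segment_pair, instances):
    """True if both motifs map to the two segments, in either orientation (an assumption)."""
    motif_i, motif_j = motif_pair
    seg_a, seg_b = segment_pair
    direct = motif_maps(motif_i, seg_a, instances) and motif_maps(motif_j, seg_b, instances)
    swapped = motif_maps(motif_i, seg_b, instances) and motif_maps(motif_j, seg_a, instances)
    return direct or swapped
```

With these toy instances, the pair (mA, mB) maps to the segment pair (P1: 5 to 20, P2: 90 to 110), but not if the first segment starts at position 11, because the mA instance would no longer be fully contained.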

The validation of the inferred motif pairs was performed by estimating their positive predictive values (PPVs) and sensitivities (SNs). The PPV was calculated as TP/(TP + FP), where true positives (TP) and false positives (FP) were estimated with respect to each validation dataset. As negative datasets of motif pairs do not yet exist, we simply defined PPV as the proportion of the inferred motif pairs overlapping with each validation dataset. The SN, calculated as TP/P (P being the size of the validation dataset), was simply defined as the proportion of the segment pairs in each validation dataset overlapping with the inferred motif pairs.
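Under these working definitions, both measures reduce to overlap proportions. A minimal sketch follows; `maps` stands for any predicate deciding whether a motif pair maps to a segment pair, and all names are hypothetical.

```python
def ppv_and_sn(inferred_pairs, validation_pairs, maps):
    """
    PPV: share of inferred motif pairs that map to at least one validation segment pair.
    SN:  share of validation segment pairs hit by at least one inferred motif pair.
    """
    mapped_motif_pairs = sum(
        any(maps(mp, sp) for sp in validation_pairs) for mp in inferred_pairs
    )
    covered_segment_pairs = sum(
        any(maps(mp, sp) for mp in inferred_pairs) for sp in validation_pairs
    )
    ppv = mapped_motif_pairs / len(inferred_pairs)
    sn = covered_segment_pairs / len(validation_pairs)
    return ppv, sn
```

For example, if two of four inferred motif pairs map to the validation set and they cover two of its three segment pairs, the function returns a PPV of 0.5 and an SN of 2/3.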

Randomizing simulation

Obviously, a good prediction system should contain more inferred motif pairs mapped to the validation datasets than expected at random. For evaluating the enrichment of our inferred motif pairs in the validation datasets, we attached a measure of statistical significance to the overlaps. As the distribution of the overlaps is unknown, we estimated the significance by randomizing the simulation process. To do so, for each validation dataset, we randomly generated 1000 datasets of segment pairs collected from the segments composing the validation dataset. The size of each randomly generated dataset is the same as that of the validation dataset. Both the PPVs and SNs of the inferred motif pairs with the validation datasets were assigned empirical P-values. The empirical P-value was calculated as the proportion of the simulated datasets with an equal or larger PPV (or SN) than the observed one.
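The randomization step can be sketched as below. How the random segment pairs were drawn is not specified in detail, so pairing two distinct segments sampled uniformly is an assumption here, and the function names are hypothetical.

```python
import random

def random_segment_pairs(segments, size, rng):
    """Draw `size` random pairs of two distinct segments (an assumed sampling scheme)."""
    return [tuple(rng.sample(segments, 2)) for _ in range(size)]

def empirical_p_value(observed, simulated):
    """Proportion of simulated statistics that are equal to or larger than the observed one."""
    return sum(value >= observed for value in simulated) / len(simulated)
```

In use, the PPV (or SN) would be recomputed against each of the 1000 random datasets and the resulting list passed as `simulated`. Seeding the generator, for example with `random.Random(0)`, makes the simulation repeatable.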


RESULTS
Assessment of four negative datasets

To evaluate the effect of the four negative datasets (RGSNs, SGSNs, W1GSNs and W2GSNs) on identifying interacting motif pairs, we compared their respective inferred motif pairs with interaction-site pairs derived from the three reference databases, iPfam, DOMINO and the Yeast Core in DIP. We used the exact binomial test to predict the putative interacting motif pairs using each negative dataset in turn. It should be noted that, because the interaction sites in Yeast Core were roughly defined as the whole protein sequences, iPfam and DOMINO have more accurate definitions of interaction sites than Yeast Core. Therefore, the evaluation of the different negative datasets (and the assessments thereafter) depended mainly on the validation results derived from iPfam and DOMINO, while the validation results from DIP can be considered as auxiliary evidence.

The numbers of motif pairs inferred using RGSNs, SGSNs, W1GSNs and W2GSNs were 38, 4593, 3684 and 1762, respectively. Tables 1 and 2 list validation result statistics of the four negative datasets. Surprisingly, only a small number of motif pairs were predicted using RGSNs, far fewer than with the other negative datasets. Although the PPVs of RGSNs were highest (Table 1), the SNs were much lower than those of the other datasets (Table 2). Moreover, the SN with DOMINO and the PPVs with iPfam and Yeast Core were not significant. The reason may be that, because it takes no biological information into account, the random selection method is more likely than the other methods to choose positive examples or protein pairs with attributes similar to those of positives (e.g. close proximity or related biological processes).

Figure 1 shows a comparison of the PPVs and SNs (defined in the 'Materials and Methods' section) of SGSNs, W1GSNs and W2GSNs. We observed that W1GSNs generally outperformed SGSNs and W2GSNs both in terms of PPVs and SNs, and SGSNs came second. The superior performance of W1GSNs over SGSNs is due to the stricter generation criteria of W1GSNs (involved in both different biological processes and different cellular compartments) compared with SGSNs (only considering different cellular compartments). Compared with W1GSNs (or SGSNs), the lower performance of W2GSNs may be due to the inclusion of the protein pairs with medium-confidence RSSCC, implying that localization proximity has a major effect on the capability of negative datasets in identifying interacting motif pairs. Comparison of the four negative datasets based on the motif pairs inferred by Fisher's exact test was also performed. We obtained similar results, with W1GSNs generally performing best (see Supplementary Materials Tables S2a and S2b, Figure S3). Finally, considering that W1GSNs were generated using the most stringent criteria and produced in the same system as WGSPs, we chose W1GSNs as the GSNs for predicting interacting motif pairs. Thereafter, we simply used GSNs to refer to W1GSNs.

In addition, we found that for the four negative datasets, the PPV values of Yeast Core were not significant. A plausible explanation may be that as the definition of interaction sites of Yeast Core is rather general (the whole protein sequence), while linear motifs are short and less specific in contrast to domains, it would lead to frequent nonfunctional (or random) motif assignments along proteins; consequently a number of motif pairs may appear randomly in the simulation datasets of Yeast Core, which would make the validation results of Yeast Core non-significant.

Inference of putative protein interacting motif pairs

We implemented both the exact binomial test and Fisher's exact test to assess the statistical significance of the overrepresentation of co-occurring motif pairs in the GSPs compared to the GSNs. For the exact binomial test with q-value <0.01, 3684 putative interacting motif pairs both significantly overrepresented in the GSPs and underrepresented in the GSNs (referred to as the EBT dataset) were detected. For Fisher's exact test with q-value <0.01, 33 341 putative interacting motif pairs (denoted as the FET dataset) were obtained. In total, 3665 motif pairs overlapped between EBT and FET, whereas 29 695 were inferred solely by Fisher's exact test and 19 were inferred solely by the exact binomial test. Thereafter, these three groups of inferred motif pairs are denoted as A, B and C, respectively (Figure 2).

We found that FET contained a much larger number of motif pairs than EBT. The reason is that Fisher's exact test can detect both common and rare motif pairs and is therefore more sensitive (higher SN) than the exact binomial test, as shown in Tables 2 and 3. Although the PPVs of FET were higher than those of EBT, the empirical P-values for the PPVs with iPfam and DOMINO were less significant than those of EBT; for iPfam in particular, the result was not significant (Tables 1 and 3). These results indicate that the higher sensitivity of FET may be accompanied by more false positives. As shown in Tables 1 and 3, we noted that the PPVs for iPfam and DOMINO were at a low level compared with the Yeast Core dataset. A plausible explanation may be that these two datasets are relatively incomplete, because the domain–domain interactions in iPfam are observed in the protein complexes with known 3D structures, and DOMINO only collects interactions with experimentally verified evidence. As expected, only a small fraction of biologically occurring interaction-site pairs was sampled.

Assembly of an interacting motif pair dataset with high confidence

As the exact binomial test does well in detecting common motif pairs, and Fisher's exact test is effective for detecting rare motif pairs, the two interacting motif pair datasets inferred by the distinct statistical methods were combined. Before doing this, we assigned each motif pair one evidence type for each of the three validation databases, iPfam, DOMINO and Yeast Core, to which it can be mapped. According to the number of evidence types, motif pairs can be divided into four groups: no evidence (evi0), exactly one evidence type (evi1), two evidence types (evi2) and three evidence types (evi3). Intuitively, the larger the number of evidence types a motif pair has, the greater the confidence level for the motif pair.
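The evidence grouping amounts to counting, for each motif pair, how many of the three validation datasets it maps to. A small sketch with hypothetical names and toy input follows.

```python
from collections import Counter

VALIDATION_DATASETS = {"iPfam", "DOMINO", "Yeast Core"}

def evidence_group(mapped_datasets):
    """Return 'evi0' to 'evi3' according to how many validation datasets support the pair."""
    return "evi%d" % len(set(mapped_datasets) & VALIDATION_DATASETS)

def group_sizes(evidence_by_pair):
    """Count motif pairs per evidence group; input maps motif pair -> supporting datasets."""
    return Counter(evidence_group(datasets) for datasets in evidence_by_pair.values())
```

A pair supported by iPfam and Yeast Core falls into evi2, while a pair with no mapping falls into evi0.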

First, as 99.48% (3665 out of 3684) of the motif pairs in EBT were also predicted by the FET method, we assembled EBT into the final interacting motif pair dataset. To increase the coverage of the interacting motif pair dataset, the motif pairs solely inferred from Fisher's exact test (group B) were also considered as a candidate set (Figure 2). Because of the propensity of FET to contain more false positives as described earlier, we were interested in those motif pairs that appear underrepresented with significance in the GSNs and overrepresented in the GSPs (but without significance), where the statistical significance was according to the q-values derived from the exact binomial test. We defined a motif pair Mij as 'overrepresented in the GSPs but without significance' if Nobsij > Nexpij and q-value ≥ 0.01, where Nobsij is the observed number of protein pairs containing Mij and Nexpij is the expected number of protein pairs containing Mij. Nexpij is calculated as Efij × N, where Efij is the expected frequency of protein pairs containing Mij (defined in the 'Materials and Methods' section) and N is the size of the protein pair dataset. As a result, 5731 motif pairs (denoted as filtered group B) were extracted and assembled into the final interacting motif pair dataset.

Therefore, two groups of motif pairs, EBT and filtered group B, were incorporated into the set of high-confidence interacting motif pairs [denoted as Interacting Motif Pairs (IMP)]. IMP contained 9415 motif pairs in total. We found that only 2.25% (212/9415) of motif pairs in IMP had no evidence (Figure 3), and IMP covered 96.01, 78.57 and 98.51% of iPfam, DOMINO and Yeast Core, respectively.

Ranking the inferred motif pairs with high confidence

We ranked the motif pairs according to the q-values from the exact binomial test in GSPs and the q-values from the Fisher's exact test in ascending order, respectively. Then we compared the performance of the two ranking methods. The q-value of the Fisher's exact test had the better performance and was used as our ranking scheme (see Supplementary Materials Figure S4). The reason may be that some specific and rare motif pairs that are likely to be true interacting motif pairs would be assigned a higher rank in the Fisher's exact test. We expected that motif pairs with more evidence types, which are more likely to interact with each other, should rank higher. This can be determined by analyzing the distribution of the frequency of the motif pairs in IMP with different evidence types. As shown in Figure 4, at the various rank cutoffs, the numbers of motif pairs within evi0 and evi3 were counted. Intuitively, the motif pairs within evi0 and evi3 can be regarded as the least and the most reliable, respectively. The motif pairs within evi0 appear seldom among the top ranks, of which the highest rank is 2972; while the majority of the motif pairs within evi3 are top-ranked (i.e. about 50% are among the top 216).
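The rank-cutoff analysis can be sketched in a few lines. The sketch assumes a mapping from motif pair to q-value and a mapping from motif pair to evidence group; all names and values below are hypothetical toy data, not results from the study.

```python
def rank_by_q_value(q_values):
    """Return motif pairs sorted by ascending q-value (rank 1 = smallest q-value)."""
    return sorted(q_values, key=q_values.get)

def count_group_at_cutoffs(ranked_pairs, group_of, group, cutoffs):
    """For each rank cutoff k, count the pairs of `group` among the top k."""
    return {
        k: sum(1 for p in ranked_pairs[:k] if group_of[p] == group)
        for k in cutoffs
    }

# Toy data (hypothetical identifiers and q-values).
q_values = {"a": 1e-8, "b": 1e-5, "c": 1e-3, "d": 5e-3}
group_of = {"a": "evi3", "b": "evi3", "c": "evi1", "d": "evi0"}

ranked = rank_by_q_value(q_values)
evi3_counts = count_group_at_cutoffs(ranked, group_of, "evi3", [1, 2, 4])
evi0_counts = count_group_at_cutoffs(ranked, group_of, "evi0", [1, 2, 4])
```

A ranking scheme behaves as expected when the evi3 counts rise quickly at small cutoffs while the evi0 counts stay at zero until late ranks.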

Interesting motif pairs inferred with no validation evidence

Because of the incompleteness of the validation datasets, motif pairs inferred without evidence might truly bind to each other to mediate protein–protein interactions. To find out whether there is evidence that indirectly supports their binding function, we mapped the 212 motif pairs without evidence in IMP to datasets of interaction-site pairs in other species. To this end, three high-confidence datasets were adopted: iPfam, DOMINO and the dataset compiled by the Pawson lab (http://pawsonlab.mshri.on.ca/). As a result, 93 motif pairs were mapped to interaction-site pairs of other species. For instance, the motif pair composed of the motif FGRA [MnM:PBMDNA00004A] and the motif Y-D/K/N-H/F/R-P/V/L [MnM:PBMSH200020B] occurs in the physical interacting regions of a DNA-binding protein in Methanopyrus kandleri [PDBID:1f1e] (Figure 5). Another instance is that the motif V/Y/F-?-I/V/A> [MnM:PBMPDZ00002A], a PDZ Class II binding motif, was predicted to interact with the motif CPV [MnM:PBMMHL00001A], which occurs in a PDZ domain in Mus musculus (65).


DISCUSSION
Comparison with previously developed methods

MLE (36), DPEA (37) and a simple association method were compared with our method. The measure of the simple association method is defined as the fraction of interacting protein pairs among all of the protein pairs containing a given motif pair Mij. The measure of the MLE method is the estimated probability of an interacting motif pair, Pr(Mij = 1), obtained by using the expectation maximization (EM) algorithm to maximize the likelihood of observing a given protein interaction network. We calculated it as described by Deng et al. (36). We also used an extended measure of MLE provided by Lee et al. (41), the expected number of occurrences of motif pairs. It is defined as Nij × Pr(Mij = 1), where Nij is the number of all protein pairs containing the motif pair Mij. The DPEA method is based on computing an E-value, which measures how much disallowing the given motif pair reduces the likelihood of a protein–protein interaction network (37). The four measures are referred to as Frequency, Probability, Expectation and E-value, respectively. The measure of our method is referred to as Qvalue. The power of the different methods was evaluated by plotting their PPV values against the top percent rank in the validation datasets iPfam and DOMINO (Figure 6). We observed that the Qvalue outperformed the other methods in both validation datasets. The E-value and Expectation had similar performance and came second, and the Probability and Frequency performed the worst, as was also observed in (37,40,41). A plot similar to Figure 6, depicting SN versus the top percent rank, is available as Supplementary Materials Figure S5; similar results were obtained.
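Two of the compared measures, and the PPV-versus-rank evaluation, are simple enough to sketch. The sketch below is illustrative only: Pr(Mij = 1) is taken as a given input rather than estimated by EM, the E-value of DPEA is not implemented, and the function names and toy data are hypothetical.

```python
def frequency_score(n_interacting, n_total):
    """'Frequency': fraction of interacting pairs among all pairs containing Mij."""
    return n_interacting / n_total

def expectation_score(n_total, prob_interacting):
    """'Expectation': Nij x Pr(Mij = 1), with the probability supplied by the caller."""
    return n_total * prob_interacting

def ppv_at_top_percent(ranked_pairs, validated, percent):
    """PPV among the top `percent` of a ranked list of motif pairs."""
    k = max(1, round(len(ranked_pairs) * percent / 100))
    top = ranked_pairs[:k]
    return sum(1 for p in top if p in validated) / k

# Toy ranked list (best first) and a toy validation set, for illustration only.
ranked = list("abcdefghij")
validated = {"a", "b", "d", "j"}
curve = {pct: ppv_at_top_percent(ranked, validated, pct) for pct in (20, 50, 100)}
```

Plotting `curve` for each measure against the same validation set gives the kind of comparison described above, where a better measure keeps a higher PPV at small top-percent cutoffs.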

Our Qvalue method is an association method. The dominance of the Qvalue method over the others could be attributed to two main reasons. First, compared with the simple association method, Frequency, the Qvalue method uses more stringent statistical tests to find motif pairs with significant occurrence. Second, an advantage of the complicated methods (MLE and DPEA) is that they take into account the mutual impact of multiple motif pairs coexisting in an interacting protein pair on the interaction of the protein pair. However, in contrast to domains, motif assignments may introduce much more noise because of the lower specificity of linear motifs, so the advantage of the complicated methods in considering the mutual effect of multiple motif pairs may be impaired. Moreover, the complicated methods have so many parameters to be tuned that they are more likely to be affected by this noise. In such a situation, a simple model with strict statistical analysis may be more suitable. However, we should note that because we have not compared our method with these complicated methods in predicting domain–domain interaction, the results can only suggest that our method performs better than the complicated methods when identifying interacting motif pairs.

The effectiveness of a high-quality negative dataset on inference performance

We took the exact binomial test as an example to investigate the effectiveness of a high-quality negative dataset. A serious problem underlying methods of inferring interacting motif pairs is that promiscuous motif pairs are scored highly because of the frequency of their occurrence, not because of the specific topology of the network (37). We wondered whether, by using a high-quality negative dataset, the overprediction of promiscuous interactions could be controlled. This is based on the assumption that, by incorporating high-quality negatives, some false positives could be removed by eliminating the motif pairs significantly overrepresented in both the GSPs and GSNs, and that these eliminated motif pairs usually occur promiscuously in many if not most interacting proteins. In total, there were 5101 motif pairs overrepresented in the GSPs regardless of their occurrences in the GSNs (called ‘Background’), and 1417 (about 27%) were eliminated by the GSNs (called ‘Eliminated’). We tested this assumption by comparing the promiscuity of motif pairs among the three datasets, ‘Background’, ‘Inferred’ (EBT) and ‘Eliminated’. As shown in Figure 7, the promiscuity of ‘Inferred’ was significantly lower than that of ‘Eliminated’ (Mann–Whitney U-test, P-value < 2.2e-16) and that of ‘Background’ (P-value < 2.2e-16), and the promiscuity of ‘Eliminated’ was significantly higher than that of ‘Background’ (P-value < 2.2e-16), suggesting that the data eliminated by the GSNs contain the most promiscuous interactions. In addition, we mapped these eliminated motif pairs to the validation datasets (see Supplementary Materials Table S3). We found that the PPVs and SNs of these eliminated motif pairs were much lower than those of the motif pairs mined from both the GSPs and GSNs (EBT dataset, Tables 1 and 2), indicating that these eliminated motif pairs may contain many false positives (for details see Supplementary Materials). These results suggest that a high-quality negative dataset has a large effect on decreasing motif pairs with promiscuous interactions, and plays a critical role in the inference of interacting motif pairs with high confidence.
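The promiscuity comparison can be sketched as follows, using the promiscuity measure given in the legend of Figure 7 (#Pairsobserved/#Pairspossible). The sketch computes only the Mann–Whitney U statistic by direct pair counting; it does not compute a P-value (a statistics library routine such as `scipy.stats.mannwhitneyu` could be used for that). The function names and toy values are hypothetical.

```python
def promiscuity(n_observed_pairs, n_possible_pairs):
    """Observed interacting protein pairs containing the motif pair,
    divided by all possible protein pairs containing it."""
    return n_observed_pairs / n_possible_pairs

def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y, ties counting 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Toy promiscuity values (hypothetical), one per motif pair.
inferred = [0.01, 0.02, 0.03]
eliminated = [0.10, 0.20, 0.02]

u_eliminated = mann_whitney_u(eliminated, inferred)
u_inferred = mann_whitney_u(inferred, eliminated)
```

The two U statistics always sum to the product of the sample sizes; a U for ‘Eliminated’ close to that product indicates that the eliminated motif pairs tend to be more promiscuous than the inferred ones.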

Caveats on our method

There are several underlying limitations in our approach. (i) In contrast to domains, linear motifs are difficult to detect experimentally or computationally because of their short length and some degree of degeneracy. Existing motif databases are therefore far from comprehensive, and relying on these predefined patterns, although it makes motif pair mining in large interaction networks feasible, restricts the motif search space. (ii) Another problem is non-functional false positive assignments, which is a serious consideration in motif assignments. In this study, we used information regarding subcellular components to filter out putative false positive assignments, but the effectiveness of such a strategy may still be limited. We expect to integrate other information, such as species information and evolutionary conservation, to reduce false positive rates in our future work. (iii) As our work was based only on S. cerevisiae, some motif pairs specific to other species or those appearing rarely in yeast could not be detected by our method. In the future, our interacting motif pair mining method will therefore be extended to other organisms, which should greatly improve both the accuracy and the coverage of our prediction system.

Finally, we should note that the statistical significance used in our method is not equivalent to biological function. Not every protein with one motif of our inferred interacting motif pair is expected to interact with another protein with the other motif of the pair. The inferred motif pairs may indirectly mediate protein interactions, or help shape the structure of proteins. In any case, the motif pairs predicted by our method can be used to direct new experimental interaction screens, in both yeast and other species, through which the search space of putative interacting protein pairs would be greatly reduced.


SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.


ACKNOWLEDGEMENTS

We thank three anonymous reviewers for their invaluable comments. This research was supported by NSFC (Grants 30770445 and 30571037) and by Beijing Normal University. Funding to pay the Open Access publication charges for this article was provided by Beijing Normal University.

Conflict of interest statement. None declared.


REFERENCES
1. Li S, Armstrong CM, Bertin N, Ge H, Milstein S, Boxem M, Vidalain PO, Han JD, Chesneau A, Hao T, et al. A map of the interactome network of the metazoan C. elegans. Science 2004;303:540–543. [pmid: 14704431]
2. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, et al. A protein interaction map of Drosophila melanogaster. Science 2003;302:1727–1736. [pmid: 14605208]
3. Rain JC, Selig L, De Reuse H, Battaglia V, Reverdy C, Simon S, Lenzen G, Petel F, Wojcik J, Schachter V, et al. The protein-protein interaction map of Helicobacter pylori. Nature 2001;409:211–215. [pmid: 11196647]
4. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 2000;403:623–627. [pmid: 10688190]
5. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA 2001;98:4569–4574. [pmid: 11283351]
6. Xia Y, Yu H, Jansen R, Seringhaus M, Baxter S, Greenbaum D, Zhao H, Gerstein M. Analyzing cellular biochemistry in terms of molecular networks. Annu. Rev. Biochem. 2004;73:1051–1087. [pmid: 15189167]
7. Valencia A, Pazos F. Computational methods for the prediction of protein interactions. Curr. Opin. Struct. Biol. 2002;12:368–373. [pmid: 12127457]
8. Wu X, Zhu L, Guo J, Fu C, Zhou H, Dong D, Li Z, Zhang DY, Lin K. SPIDer: Saccharomyces protein-protein interaction database. BMC Bioinformatics 2006;7(Suppl 5):S16. [pmid: 17254300]
9. Wu X, Zhu L, Guo J, Zhang DY, Lin K. Prediction of yeast protein-protein interaction network: insights from the gene ontology and annotations. Nucleic Acids Res. 2006;34:2137–2150. [pmid: 16641319]
10. Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW. BIND - the biomolecular interaction network database. Nucleic Acids Res. 2001;29:242–245. [pmid: 11125103]
11. Xenarios I, Rice DW, Salwinski L, Baron MK, Marcotte EM, Eisenberg D. DIP: the database of interacting proteins. Nucleic Acids Res. 2000;28:289–291. [pmid: 10592249]
12. Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, et al. IntAct: an open source molecular interaction database. Nucleic Acids Res. 2004;32:D452–D455. [pmid: 14681455]
13. Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G. MINT: a Molecular INTeraction database. FEBS Lett. 2002;513:135–140. [pmid: 11911893]
14. Eisenberg D, Marcotte EM, Xenarios I, Yeates TO. Protein function in the post-genomic era. Nature 2000;405:823–826. [pmid: 10866208]
15. Hartwell LH, Hopfield JJ, Leibler S, Murray AW. From molecular to modular cell biology. Nature 1999;402:C47–C52. [pmid: 10591225]
16. Jeong H, Mason SP, Barabasi AL, Oltvai ZN. Lethality and centrality in protein networks. Nature 2001;411:41–42. [pmid: 11333967]
17. Vazquez A, Flammini A, Maritan A, Vespignani A. Global protein function prediction from protein-protein interaction networks. Nat Biotechnol. 2003;21:697–700. [pmid: 12740586]
18. Kim WK, Henschel A, Winter C, Schroeder M. The many faces of protein-protein interactions: a compendium of interface geometry. PLoS Comput. Biol. 2006;2:e124. [pmid: 17009862]
19. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003;302:449–453. [pmid: 14564010]
20. Sprinzak E, Sattath S, Margalit H. How reliable are experimental protein-protein interaction data? J. Mol. Biol. 2003;327:919–923. [pmid: 12662919]
21. Hart GT, Ramani A, Marcotte E. How complete are current yeast and human protein-interaction networks? Genome Biol. 2006;7:120. [pmid: 17147767]
22. Jansen R, Gerstein M. Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Curr. Opin. Microbiol. 2004;7:535–545. [pmid: 15451510]
23. Chen XW, Liu M. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 2005;21:4394–4400. [pmid: 16234318]
24. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H. Predicting protein-protein interactions based only on sequences information. Proc. Natl Acad. Sci. USA 2007;104:4337–4341. [pmid: 17360525]
25. Ben-Hur A, Noble WS. Kernel methods for predicting protein-protein interactions. Bioinformatics 2005;21(Suppl 1):i38–i46. [pmid: 15961482]
26. Forrest JC, Campbell JA, Schelling P, Stehle T, Dermody TS. Structure-function analysis of reovirus binding to junctional adhesion molecule 1. Implications for the mechanism of reovirus attachment. J. Biol. Chem. 2003;278:48434–48444. [pmid: 12966102]
27. Zhang Y, Rassa JC, deObaldia ME, Albritton LM, Ross SR. Identification of the receptor binding domain of the mouse mammary tumor virus envelope protein. J. Virol. 2003;77:10468–10478. [pmid: 12970432]
28. Kim WK, Henschel A, Winter C, Schroeder M. The many faces of protein-protein interactions: a compendium of interface geometry. Plos Comput. Biol. 2006;2:1151–1164.
29. Mayer BJ. SH3 domains: complexity in moderation. J. Cell Sci. 2001;114:1253–1263. [pmid: 11256992]
30. Neduva V, Russell RB. Linear motifs: evolutionary interaction switches. FEBS Lett. 2005;579:3342–3345. [pmid: 15943979]
31. Yaffe MB. Phosphotyrosine-binding domains in signal transduction. Nat. Rev. Mol. Cell. Biol. 2002;3:177–186. [pmid: 11994738]
32. Lichtarge O, Sowa ME, Philippi A. Evolutionary traces of functional surfaces along G protein signaling pathway. Methods Enzymol. 2002;344:536–556. [pmid: 11771409]
33. Loregian A, Palu G. Disruption of protein-protein interactions: towards new targets for chemotherapy. J. Cell Physiol. 2005;204:750–762. [pmid: 15880642]
34. Arkin MR, Randal M, DeLano WL, Hyde J, Luong TN, Oslob JD, Raphael DR, Taylor L, Wang J, McDowell RS, et al. Binding of small molecules to an adaptive protein-protein interface. Proc. Natl Acad. Sci. USA 2003;100:1603–1608. [pmid: 12582206]
35. Sprinzak E, Margalit H. Correlated sequence-signatures as markers of protein-protein interaction. J. Mol. Biol. 2001;311:681–692. [pmid: 11518523]
36. Deng M, Mehta S, Sun F, Chen T. Inferring domain-domain interactions from protein-protein interactions. Genome Res. 2002;12:1540–1548. [pmid: 12368246]
37. Riley R, Lee C, Sabatti C, Eisenberg D. Inferring protein domain interactions from databases of interacting proteins. Genome Biol. 2005;6:R89. [pmid: 16207360]
38. Nye TM, Berzuini C, Gilks WR, Babu MM, Teichmann SA. Statistical analysis of domains in interacting protein pairs. Bioinformatics 2005;21:993–1001. [pmid: 15509600]
39. Ng S-K, Zhang Z, Tan S-H. Integrative approach for computationally inferring protein domain interactions. Bioinformatics 2003;19:923–929. [pmid: 12761053]
40. Guimaraes K, Jothi R, Zotenko E, Przytycka T. Predicting domain-domain interactions using a parsimony approach. Genome Biol. 2006;7:R104. [pmid: 17094802]
41. Lee H, Deng M, Sun F, Chen T. An integrated approach to the prediction of domain-domain interactions. BMC Bioinformatics 2006;7:269. [pmid: 16725050]
42. Chen X-W, Liu M. Prediction of protein-protein interactions using random decision forest framework. Bioinformatics 2005;21:4394–4400. [pmid: 16234318]
43. Wang H, Segal E, Ben-Hur A, Li QR, Vidal M, Koller D. InSite: a computational method for identifying protein-protein interaction binding sites on a proteome-wide scale. Genome Biol. 2007;8:R192. [pmid: 17868464]
44. Li H, Li J. Discovery of stable and significant binding motif pairs from PDB complexes and protein interaction datasets. Bioinformatics 2005;21:314–324. [pmid: 15374856]
45. Li H, Li J, Wong L. Discovering motif pairs at interaction sites from protein sequences on a proteome-wide scale. Bioinformatics 2006;22:989–996. [pmid: 16446278]
46. Tan SH, Hugo W, Sung WK, Ng SK. A correlated motif approach for finding short linear motifs from protein interaction networks. BMC Bioinformatics 2006;7:502. [pmid: 17107624]
47. Yu H, Qian MP, Deng MH. Using a Stochastic AdaBoost algorithm to discover interactome motif pairs from sequences. Lecture Notes in Comput. Sci 2006;4115:622–630.
48. Caffrey DR, Somaroo S, Hughes JD, Mintseris J, Huang ES. Are protein-protein interfaces more conserved in sequence than the rest of the protein surface? Protein Sci 2004;13:190–202. [pmid: 14691234]
49. Puntervoll P, Linding R, Gemund C, Chabanis-Davidson S, Mattingsdal M, Cameron S, Martin DMA, Ausiello G, Brannetti B, Costantini A, et al. ELM server: a new resource for investigating short functional sites in modular eukaryotic proteins. Nucleic Acids Res. 2003;31:3625–3630. [pmid: 12824381]
50. Falquet L, Pagni M, Bucher P, Hulo N, Sigrist CJ, Hofmann K, Bairoch A. The PROSITE database, its status in 2002. Nucleic Acids Res. 2002;30:235–238. [pmid: 11752303]
51. Obenauer JC, Cantley LC, Yaffe MB. Scansite 2.0: Proteome-wide prediction of cell signaling interactions using short sequence motifs. Nucleic Acids Res. 2003;31:3635–3641. [pmid: 12824383]
52. Balla S, Thapar V, Verma S, Luong T, Faghri T, Huang CH, Rajasekaran S, del Campo JJ, Shinn JH, Mohler WA, et al. Minimotif Miner: a tool for investigating protein function. Nat. Methods 2006;3:175–177. [pmid: 16489333]
53. Mewes HW, Amid C, Arnold R, Frishman D, Guldener U, Mannhaupt G, Munsterkotter M, Pagel P, Strack N, Stumpflen V, et al. MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res. 2004;32:D41–D44. [pmid: 14681354]
54. Finn RD, Marshall M, Bateman A. iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics 2005;21:410–412. [pmid: 15353450]
55. Ceol A, Chatr-aryamontri A, Santonico E, Sacco R, Castagnoli L, Cesareni G. DOMINO: a database of domain-peptide interactions. Nucleic Acids Res. 2007;35:D557–560. [pmid: 17135199]
56. Deane CM, Salwinski L, Xenarios I, Eisenberg D. Protein interactions: two methods for assessment of the reliability of high-throughput observations. Mol. Cell. Proteomics. 2002:M100037–MCP100200.
57. de Lichtenberg U, Jensen LJ, Brunak S, Bork P. Dynamic complex formation during the yeast cell cycle. Science 2005;307:724–727. [pmid: 15692050]
58. Lu LJ, Xia Y, Paccanaro A, Yu H, Gerstein M. Assessing the limits of genomic data integration for predicting protein networks. Genome Res. 2005;15:945–953. [pmid: 15998909]
59. Patil A, Nakamura H. Filtering high-throughput protein-protein interaction data using a combination of genomic features. BMC Bioinformatics 2005;6:100. [pmid: 15833142]
60. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002;417:399–403. [pmid: 12000970]
61. Edwards AM, Kus B, Jansen R, Greenbaum D, Greenblatt J, Gerstein M. Bridging structural biology and genomics: assessing protein interaction data with known complexes. Trends Genet. 2002;18:529–536. [pmid: 12350343]
62. Kumar A, Agarwal S, Heyman JA, Matson S, Heidtman M, Piccirillo S, Umansky L, Drawid A, Jansen R, et al. Subcellular localization of the yeast proteome. Genes Dev. 2002;16:707–719. [pmid: 11914276]
63. Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA 2003;100:9440–9445. [pmid: 12883005]
64. Storey JD. A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B Stat. Methodol. 2002;64:479–498.
65. Hamazaki Y, Itoh M, Sasaki H, Furuse M, Tsukita S. Multi-PDZ domain protein 1 (MUPP1) is concentrated at tight junctions through its possible interaction with claudin-1 and junctional adhesion molecule. J. Biol. Chem. 2002;277:455–461. [pmid: 11689568]

Figures

[Figure ID: F1]
Figure 1. 

Comparison of the (A) positive predictive values [PPVs, defined as TP/(TP + FP)] and (B) sensitivities (SNs) of the three negative datasets W1GSNs, W2GSNs and SGSNs. W1GSNs and W2GSNs were generated using our previously described RSS method (9). The RSS is a new metric of semantic similarity used to score the degree of the functional association or localization proximity between two different proteins. W1GSNs comprised protein pairs with low-confidence RSSBP and low-confidence RSSCC; W2GSNs comprised the protein pairs with low-confidence RSSBP and low- or median-confidence RSSCC values. SGSNs were generated using the method of selecting protein pairs with different subcellular localizations.



[Figure ID: F2]
Figure 2. 

A Venn diagram of the numbers of motif pairs inferred by the exact binomial test and the Fisher's exact test. These motif pairs can be divided into three data groups: (A) the intersection between the dataset inferred by the exact binomial test (EBT) and the dataset inferred by the Fisher's exact test (FET); (B) the portion inferred only by the Fisher's exact test and (C) the portion inferred only by the exact binomial test.



[Figure ID: F3]
Figure 3. 

The distribution of the inferred motif pairs (IMPs) with different numbers of evidence types.



[Figure ID: F4]
Figure 4. 

The distribution of the inferred motif pairs in different cumulative ranks. The number of the inferred motif pairs is plotted against their cumulative ranks. At the various rank cutoffs, the numbers of the motif pairs with no evidence (evi0) and with three evidence types (evi3) were counted, respectively.



[Figure ID: F5]
Figure 5. 

Three-dimensional structure of a motif pair without evidence in yeast but found in the physical interacting regions of a DNA-binding protein in Methanopyrus kandleri [PDB:1f1e].



[Figure ID: F6]
Figure 6. 

The relationship between the top percent rank and the positive predictive value (PPV) estimated by iPfam (A) and DOMINO (B). Five measures of prediction methods were assessed. ‘Frequency’ is the measure of a simple association method that scores the fraction of the interacting protein pairs among all of the protein pairs containing a given motif pair. ‘Probability’ is the measure of the MLE method to score the probability of an interacting motif pair (36). ‘Expectation’ is an extended measure of the MLE provided by Lee et al. (41) that scores the expected number of occurrences of motif pairs. ‘E-value’ is the measure of the DPEA method that measures how disallowing the given motif pair reduces the likelihood of a protein–protein interaction network (37). The measure of our method is referred to as the ‘Qvalue’, which is calculated using the Fisher's exact test.



[Figure ID: F7]
Figure 7. 

Comparison of the promiscuity of the motif pairs among the three datasets: ‘Inferred’, the motif pairs that were both significantly overrepresented in the GSPs and underrepresented in the GSNs; ‘Eliminated’, the motif pairs that were significantly overrepresented in the GSPs but did not satisfy the criterion of ‘significantly underrepresented in the GSNs’; and ‘Background’, the motif pairs that were significantly overrepresented in the GSPs. The promiscuity of a motif pair was measured by #Pairsobserved/#Pairspossible, where #Pairsobserved is the number of the observed interacting protein pairs containing the motif pair, and #Pairspossible is the number of all the possible protein pairs containing the motif pair.



Tables
[TableWrap ID: T1] Table 1. 

Positive predictive values (PPVs) of the motif pairs inferred by the exact binomial test from the four negative datasets


Validation dataset   W1GSNs                SGSNs                 W2GSNs                RGSNs
                     PPV (%)    P-value    PPV (%)    P-value    PPV (%)    P-value    PPV (%)    P-value
DOMINO               15.61      0.006      13.52      0.013      9.36       0.018      78.95      0.025
iPfam                14.69      0.032      12.61      0.014      11.80      0.012      81.58      0.957
Yeast Core           98.24      0.930      95.17      0.085      96.54      0.059      100.00     1.000

PPV was calculated as TP/(TP+FP), where true positives (TP) and false positives (FP) were estimated with respect to each validation dataset. Here, PPV was defined as the proportion of the inferred motif pairs overlapping with each validation dataset.


[TableWrap ID: T2] Table 2. 

Sensitivities (SNs) of the motif pairs inferred by the exact binomial test from the four negative datasets


Validation dataset   W1GSNs                SGSNs                 W2GSNs                RGSNs
                     SN (%)     P-value    SN (%)     P-value    SN (%)     P-value    SN (%)     P-value
DOMINO               75.77      <0.001     75.77      <0.001     5.87       1.000      1.79       0.750
iPfam                95.16      <0.001     95.16      <0.001     60.40      <0.001     45.30      <0.001
Yeast Core           97.93      <0.001     98.23      <0.001     87.82      <0.001     53.24      <0.001

The SN, calculated as TP/P (P being the size of validation dataset), was simply defined as the proportion of the segment pairs in each validation dataset overlapping with the inferred motif pairs.
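The two validation measures used in Tables 1 and 2 can be sketched as set overlaps. This is a simplification under a stated assumption: inferred motif pairs and validation entries are treated as directly comparable identifiers, whereas the study maps motif pairs onto the segment pairs of each validation dataset. The function names and toy data are hypothetical.

```python
def ppv(inferred, validation):
    """TP / (TP + FP): proportion of inferred pairs found in the validation set."""
    inferred = set(inferred)
    return len(inferred & set(validation)) / len(inferred)

def sensitivity(inferred, validation):
    """TP / P: proportion of the validation set recovered by the inferred pairs."""
    validation = set(validation)
    return len(set(inferred) & validation) / len(validation)

# Toy identifiers (hypothetical), for illustration only.
inferred = {"a", "b", "c", "d"}
validation = {"a", "b", "e"}

toy_ppv = ppv(inferred, validation)         # 2 of 4 inferred pairs are validated
toy_sn = sensitivity(inferred, validation)  # 2 of 3 validation entries are recovered
```

The two measures pull in opposite directions: inferring more motif pairs tends to raise SN and lower PPV, which is why both tables are reported for each negative dataset.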


[TableWrap ID: T3] Table 3. 

Statistical analysis of the validation results of the motif pairs inferred by the Fisher's exact test


Validation dataset   PPV (%)    P-value (a)    SN (%)     P-value (b)
DOMINO               16.13      0.027          89.54      <0.001
iPfam                31.90      0.190          99.15      <0.001
Yeast Core           98.41      0.506          99.95      <0.001

(a) The empirical P-value for the PPVs with the validation datasets.

(b) The empirical P-value for the SNs with the validation datasets.



Article Categories:
  • Computational Biology

