Document Detail

Classification and analysis of regulatory pathways using graph property, biochemical and physicochemical property, and functional property.
MedLine Citation:
PMID:  21980418     Owner:  NLM     Status:  MEDLINE    
Abstract/OtherAbstract:
Given a regulatory pathway system consisting of a set of proteins, can we predict which pathway class it belongs to? Such a problem is closely related to the biological function of the pathway in cells and hence is quite fundamental and essential in systems biology and proteomics. This is also an extremely difficult and challenging problem due to its complexity. To address this problem, a novel approach was developed that can be used to predict query pathways among the following six functional categories: (i) "Metabolism", (ii) "Genetic Information Processing", (iii) "Environmental Information Processing", (iv) "Cellular Processes", (v) "Organismal Systems", and (vi) "Human Diseases". The prediction method was established through the following procedures: (i) according to the general form of pseudo amino acid composition (PseAAC), each of the pathways concerned is formulated as a 5570-D (dimensional) vector; (ii) each of the components in the 5570-D vector was derived by a series of feature extractions from the pathway system according to its graphic property, biochemical and physicochemical property, as well as functional property; (iii) the minimum redundancy maximum relevance (mRMR) method was adopted to operate the prediction. A cross-validation by the jackknife test on a benchmark dataset consisting of 146 regulatory pathways indicated that an overall success rate of 78.8% was achieved by our method in identifying query pathways among the above six classes, indicating the outcome is quite promising and encouraging. To the best of our knowledge, the current study represents the first effort in attempting to identify the type of a pathway system or its biological function. It is anticipated that our report may stimulate a series of follow-up investigations in this new and challenging area.
Authors:
Tao Huang; Lei Chen; Yu-Dong Cai; Kuo-Chen Chou
Publication Detail:
Type:  Journal Article; Research Support, Non-U.S. Gov't     Date:  2011-09-28
Journal Detail:
Title:  PloS one     Volume:  6     ISSN:  1932-6203     ISO Abbreviation:  PLoS ONE     Publication Date:  2011  
Date Detail:
Created Date:  2011-10-07     Completed Date:  2012-03-05     Revised Date:  2013-05-23    
Medline Journal Info:
Nlm Unique ID:  101285081     Medline TA:  PLoS One     Country:  United States    
Other Details:
Languages:  eng     Pagination:  e25297     Citation Subset:  IM    
Affiliation:
Institute of Systems Biology, Shanghai University, Shanghai, People's Republic of China.
MeSH Terms
Descriptor/Qualifier:
Animals
Computational Biology
Humans
Signal Transduction*
Systems Biology

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): PLoS One
Journal ID (publisher-id): plos
Journal ID (pmc): plosone
ISSN: 1932-6203
Publisher: Public Library of Science, San Francisco, USA
Article Information
Huang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Received Day: 9 Month: 3 Year: 2011
Accepted Day: 31 Month: 8 Year: 2011
collection publication date: Year: 2011
Electronic publication date: Day: 28 Month: 9 Year: 2011
Volume: 6 Issue: 9
E-location ID: e25297
ID: 3182212
PubMed Id: 21980418
Publisher Id: PONE-D-11-04586
DOI: 10.1371/journal.pone.0025297

Classification and Analysis of Regulatory Pathways Using Graph Property, Biochemical and Physicochemical Property, and Functional Property
Alternate Title: Classification and Analysis of Regulatory Pathways
Tao Huang 1,2,3
Lei Chen 4
Yu-Dong Cai 1,5,*
Kuo-Chen Chou 5
Cathal Seoighe (Role: Editor)
1Institute of Systems Biology, Shanghai University, Shanghai, People's Republic of China
2Key Laboratory of Systems Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, People's Republic of China
3Shanghai Center for Bioinformation Technology, Shanghai, People's Republic of China
4College of Information Engineering, Shanghai Maritime University, Shanghai, People's Republic of China
5Gordon Life Science Institute, San Diego, California, United States of America
National University of Ireland Galway, Ireland
Correspondence: * E-mail: cai_yud@yahoo.com.cn
Contributed by footnote: Conceived and designed the experiments: Y-DC. Performed the experiments: TH LC. Analyzed the data: TH. Contributed reagents/materials/analysis tools: LC. Wrote the paper: TH LC KC-C.

Introduction

During the past decade, much information on different organisms has been accumulated at both the genetic and metabolic levels; meanwhile, many specific databases, such as KEGG/LIGAND [1], [2], [3], [4], ENZYME [5], BRENDA [6], EcoCyc and MetaCyc [7], [8], have been developed. However, biologically meaningful pathways, such as the regulatory pathway and metabolic pathway, are still poorly understood. As one of the most important pathways in systems biology, the regulatory pathway includes two kinds of interactions: direct protein–protein interactions (such as physical binding and phosphorylation) and indirect protein–protein interactions (such as the relations between transcription factors and downstream gene products) [2].

KEGG (Kyoto Encyclopedia of Genes and Genomes) [1], [2], [3], [4] is a collection of online databases for dealing with genomes, enzymatic pathways, and biological chemicals. KEGG contains five main databases [4]: (i) KEGG Atlas, (ii) KEGG Pathway, (iii) KEGG Genes, (iv) KEGG Ligand, and (v) KEGG BRITE. The KEGG BRITE database (http://www.genome.jp/kegg/brite.html) includes some known regulatory pathways. It is an ontology database for representing functional hierarchies of various biological objects. The database also includes molecules, cells, organisms, diseases and drugs, as well as the relationships among them [9], [10]. In this database, experimental knowledge is collected and diagramed as pathways, i.e. smaller networks of specific function. Several visualization tools have been developed to view and analyze the global networks through web interfaces [11], [12], [13].

According to the data in KEGG BRITE, regulatory pathways are classified into six pathway classes. Since pathways of different classes represent different biological functions, developing a successful classifier to identify the pathway class is very useful in systems biology. Some efforts have been made in this regard. Dale et al. [14] tried to predict whether a metabolic pathway is present or absent in an organism. In our previous work [15], we developed a model to predict whether a regulatory pathway can be formed for a system consisting of a certain number of different proteins. However, predicting the biological function of a regulatory pathway is still an untouched problem. It is a big challenge in both systems biology and proteomics because this kind of information is very hard to recover and transform into data that can be processed by computers. The purpose of this study is not to achieve a high accuracy, but to analyze some features, which may provide useful information for characterizing a meaningful regulatory pathway.

To realize this, some feature selection methods, such as the minimum redundancy maximum relevance [16] and incremental feature selection approaches, were employed to analyze the relevant features, while Nearest Neighbor Algorithm (NNA) [17], [18], Sequential Minimal Optimization (SMO) [19], [20] and Bayesian network (BayesNet) [21] were used to classify the pathways. Finally, the jackknife cross-validation [22] was adopted to evaluate the prediction performance. As a result, 49 features were selected as the optimal features and the overall accuracy by using these features was 78.8%.

It was suggested by analyzing the optimized features that the biochemical and physicochemical property and the functional property are important in determining the biological function of each regulatory pathway. Although this work represents the first attempt to predict the classification of regulatory pathways and is still quite preliminary, we believe that our exploration can stimulate a series of follow-up studies in this area, which is important to both systems biology and proteomics.

According to a recent review [23], to establish a really useful statistical predictor for a protein system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the attribute to be predicted; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe how to deal with these steps one by one.


Materials and Methods
Benchmark dataset

We downloaded the human KGML (KEGG XML) files from the KEGG FTP site (ftp://ftp.genome.jp/pub/kegg/xml) in April 2009. We reduced the original data by the following two steps: (i) remove proteins without GO information or biochemical and physicochemical properties in each pathway; (ii) exclude pathways with fewer than three proteins. As a result, 146 regulatory pathways were obtained. According to the data in KEGG BRITE (http://www.genome.jp/kegg/brite.html), these pathways belong to the following six functional categories: (i) Metabolism, (ii) Genetic Information Processing, (iii) Environmental Information Processing, (iv) Cellular Processes, (v) Organismal Systems, and (vi) Human Diseases. Shown in Table 1 is the distribution of the six classes of regulatory pathways in this study.
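
The two reduction steps can be sketched as follows. This is only an illustration under assumptions: the function name `filter_pathways`, the dictionary layout, and the protein identifiers are hypothetical, and the set `annotated` stands for proteins that have both GO information and the biochemical and physicochemical properties.

```python
def filter_pathways(pathways, annotated, min_size=3):
    """Drop unannotated proteins, then drop pathways left with too few proteins.

    pathways  : dict mapping a pathway name to a list of protein ids (hypothetical layout)
    annotated : set of protein ids that have the required annotations
    min_size  : minimum number of proteins a pathway must retain
    """
    kept = {}
    for name, proteins in pathways.items():
        # Step (i): remove proteins lacking the required information.
        usable = [p for p in proteins if p in annotated]
        # Step (ii): exclude pathways with fewer than min_size proteins.
        if len(usable) >= min_size:
            kept[name] = usable
    return kept


toy_pathways = {"a": ["p1", "p2", "p3", "p4"], "b": ["p1", "p5", "p6"]}
toy_annotated = {"p1", "p2", "p3", "p5"}
filtered = filter_pathways(toy_pathways, toy_annotated)
```

In the toy data, pathway "a" keeps three proteins and is retained, while pathway "b" keeps only two and is excluded.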

Features construction

To develop a powerful predictor for classifying a protein system or pathway consisting of a set of proteins, one of the keys is to formulate the protein system with an effective mathematical expression that can truly reflect its intrinsic correlation with the attribute to be predicted [23]. In this regard, we can utilize the concept of pseudo amino acid composition (PseAAC) [24]. For a brief introduction about Chou's PseAAC, visit the Wikipedia web-page at http://en.wikipedia.org/wiki/Pseudo_amino_acid_composition. Ever since the concept of PseAAC was introduced, it has been widely used to study various problems in proteins and protein-related systems (see, e.g., [25], [26], [27], [28], [29], [30], [31], [32], [33], [34]). For various different modes of PseAAC, see [35]. Actually, the general form of PseAAC can be formulated as (see Eq.6 of [23]):


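$$P = [\psi_1 \; \psi_2 \; \cdots \; \psi_u \; \cdots \; \psi_\Omega]^{\mathbf{T}} \tag{1}$$
where $\mathbf{T}$ is a transpose operator, while the subscript $\Omega$ is an integer and its value as well as the components $\psi_1$, $\psi_2$, … will depend on how to extract the desired information from the amino acid sequence of $P$. Likewise, a pathway consisting of a set of proteins can also be generally formulated as a vector $\mathbb{P}$ with $\Omega$ components; i.e.,

$$\mathbb{P} = [\Psi_1 \; \Psi_2 \; \cdots \; \Psi_\Omega]^{\mathbf{T}} \tag{2}$$
where $\Psi_1$ represents the 1st feature of the pathway, $\Psi_2$ the 2nd feature, and so forth. Below, let us elaborate how to define $\Omega$ as well as the components in Eq.2.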

1. Graph property

Graphic approaches are deemed as useful tools to study complex biological systems as they can provide intuitive insights and the overall structure property, as indicated by various studies on a series of important biological topics [36], [37], [38], [39], [40], [41], [42], [43], [44], [45], [46], [47], [48]. To use the graphic approach for the current study, each regulatory pathway was represented as a graph, where the vertices represent proteins and the arcs represent the relations between the corresponding proteins. In fact, it is a directed graph or digraph [38], [39]. This is because the relation between two proteins is directional; i.e., one protein, say P1, can regulate another protein, say P2, while P2 cannot always regulate P1. In this paper, we extracted 88 graph features from each directed graph that represents a regulatory pathway. Most of the graph features were derived in [49], [50], [51], [52], [53] where, however, the graphs are undirected. In this study, we extended them into directed graphs. The features of our directed graphs can be briefed as follows.

  1. Graph size and graph density. Let G = (V, E) be a pathway graph, where V denotes the vertex set and E the arc set. The graph size is the number of vertices in the graph. |E|max = |V|^2 is the theoretical maximum number of arcs in G with |V| vertices. The graph density is calculated by |E|/|E|max [49].
  2. Degree statistics. The in-degree (out-degree) of a vertex is the number of its in-neighbors (out-neighbors). The mean, variance, median, and maximum of in-degree and out-degree, respectively, were taken as features in this feature group [50].
  3. Edge weight statistics. Let G = (V, w(E)) be a weighted pathway graph where each arc is weighted by a weight w in the range of [0,1]. The symbol e is called a missing edge if w(e) = 0. In this study, the mean and variance of the arc weights were considered as features, including two different cases (with and without missing edges) [49].
  4. Topological change. Let G = (V, w(E)) be a weighted pathway graph. This group of features is to measure the topological changes when different cutoffs of the weights are applied to the graph. The weight cutoffs included 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 and 0.8. Topology changes were defined as the change rate of the number of arcs in subgraphs under two consecutive cutoffs.
  5. Degree correlation. Let G = (V, E) be a pathway graph with V = {u1,u2,…,un}. For each vertex ui, calculate the average number of arcs of its in-neighbors and out-neighbors, respectively. Considered as features in this study were the mean, variance and maximum of the two kinds of property, respectively [51].
  6. Clustering. Let G = (V, E) be a pathway graph with V = {u1,u2,…,un}. For each vertex ui, calculate the graph density of the subgraph induced by its in-neighbors and out-neighbors, respectively. Take the mean, variance and maximum of the two kinds of property [50], respectively, as the features for the current study.
  7. Topological. Let G = (V, E) be a pathway graph with V = {u1,u2,…,un}. Define four functions as follows: (i) in-in(ui, uj) for the number of both in-neighbors of ui and in-neighbors of uj; (ii) in-out(ui, uj) for the number of both in-neighbors of ui and out-neighbors of uj; (iii) out-in(ui, uj) for the number of both out-neighbors of ui and in-neighbors of uj; (iv) out-out(ui, uj) for the number of both out-neighbors of ui and out-neighbors of uj. For each vertex ui, calculate the four values Ti1, Ti2, Ti3, and Ti4 as follows: (i) Ti1 is the mean of in-in(ui, uj)/ni1; (ii) Ti2 the mean of in-out(ui, uj)/ni1; (iii) Ti3 the mean of out-in(ui, uj)/ni2; (iv) Ti4 the mean of out-out(ui, uj)/ni1. In the above, ni1 and ni2 are the number of in-neighbors and out-neighbors of ui, respectively. Take the mean, variance and maximum of Ti1, Ti2, Ti3, and Ti4, respectively, as the features [51] for the current study.
  8. Singular values. Let A be the adjacency matrix of the pathway graph. Take the three largest singular values [49] as the features for this study.
  9. Local density change. Let G = (V, E) be a pathway graph with V = {u1,u2,…,un}. For each vertex ui, let N_in(ui) and N_out(ui) be its in-neighbors and out-neighbors, respectively. Here we only introduce how to extract features from the out-neighbors of each vertex under the cutoff w, which may be 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and 0.9. Construct a weighted undirected complete graph Ki on the vertices N_out(ui), where the weight of each edge is calculated by Eq. 4 in Section 2 “Gene ontology”. Extract a spanning subgraph Gi(w) of Ki consisting of the edges whose weights are greater than w. Calculate Li(w) = 2|E(Gi(w))|/(l(l−1)), where l = |N_out(ui)| (Li(w) = 0 if l≤1). Take the mean and maximum of L1(w), L2(w),…, Ln(w) under cutoff w as the features for the current study.
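
As a small illustration of the first two feature groups above (graph size, graph density and degree statistics), the following sketch computes them for a toy directed graph. The function name and the toy arcs are hypothetical, arcs are assumed to be unique ordered pairs, and the population variance is used here as an assumption because the text does not say which variance estimator was applied.

```python
from statistics import mean, median, pvariance


def graph_size_density_degrees(vertices, arcs):
    """Return graph size, density |E|/|V|^2 and in/out-degree statistics."""
    in_deg = {v: 0 for v in vertices}
    out_deg = {v: 0 for v in vertices}
    for src, dst in arcs:  # an arc (src, dst) means src regulates dst
        out_deg[src] += 1
        in_deg[dst] += 1

    def stats(values):
        values = list(values)
        return {
            "mean": mean(values),
            "variance": pvariance(values),  # assumption: population variance
            "median": median(values),
            "max": max(values),
        }

    n = len(vertices)
    return {
        "size": n,
        "density": len(arcs) / (n * n),  # |E|max = |V|^2 as in item 1
        "in": stats(in_deg.values()),
        "out": stats(out_deg.values()),
    }


example = graph_size_density_degrees(
    ["P1", "P2", "P3"], [("P1", "P2"), ("P1", "P3"), ("P2", "P3")]
)
```

For this toy graph the size is 3 and the density is 3/9, since three of the nine theoretically possible arcs are present.
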
2. Gene ontology

As mentioned before, some features need the arc weight to evaluate the relation between two proteins. Thus, we used the information from gene ontology consortium (GO) [54] to represent each of the proteins concerned and evaluate its relation with the other proteins. “Ontology” is a specification of a conceptualization and refers to the subject of existence. GO is established according to the following three criteria: molecular function, biological process, and cellular component. Using GO information to represent protein samples can catch their core features [23] as proved by significantly enhancing the success rate in predicting their subcellular localization [55], [56], [57]. The GO approach has also been used to study protein-protein interactions [58], [59]. Here, using the similar method as in [52], each protein sample can be formulated as a 5218-D vector:


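$$P = [p_1, p_2, \ldots, p_{5218}]^{\mathbf{T}} \tag{3}$$
where pi = 1 if the sample hit the i-th GO number; otherwise, pi = 0. The interaction between Pi and Pj, i.e. the weight of the arc between the two proteins, is defined by

$$w(P_i, P_j) = \frac{P_i \cdot P_j}{\|P_i\| \, \|P_j\|} \tag{4}$$
where Pi · Pj is the dot product of Pi and Pj, and ∥Pi∥ and ∥Pj∥ are their moduli.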
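
Because the GO vectors are binary, the arc weight reduces to a computation on sets: the dot product is the number of shared GO numbers and each modulus is the square root of the number of GO numbers hit. The sketch below uses that shortcut; the function name and GO identifiers are hypothetical, and returning 0 for a protein without any GO hit is an assumption made only to avoid division by zero.

```python
import math


def arc_weight(go_a, go_b):
    """Cosine similarity of two binary GO vectors, each given as a set of GO ids."""
    if not go_a or not go_b:
        return 0.0  # assumption: no GO annotation means no measurable relation
    shared = len(go_a & go_b)  # dot product of the two 0/1 vectors
    return shared / math.sqrt(len(go_a) * len(go_b))


w_partial = arc_weight({"GO:1", "GO:2"}, {"GO:2", "GO:3"})
w_identical = arc_weight({"GO:1", "GO:2"}, {"GO:1", "GO:2"})
```

Two proteins with identical GO annotations receive weight 1, and proteins with no shared GO number receive weight 0, which corresponds to a missing edge in the weighted graph.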

3. Biochemical and physicochemical property

Besides the graph property, the biological property of each pathway is also indispensable for characterizing meaningful regulatory pathways. In this study, the biochemical and physicochemical properties, which have been used to study various biological problems [60], [61], [62], were employed to represent the biological property of each pathway. These properties included hydrophobicity, normalized van der Waals volume, polarity, polarizability, secondary structure, solvent accessibility, and amino acid compositions. For a regulatory pathway involving n proteins, both the mean and maximum values of their biological properties were taken as the features of the pathway, as detailed below.

  1. Hydrophobicity, normalized van der Waals volume, polarity and polarizability: 42 features can be extracted from each of these properties [63], [64], respectively. Here we only describe how to obtain the features from the hydrophobicity property, while features from the other properties can be obtained in a similar way. Each amino acid is substituted by one of the three letters, polar (P), neutral (N) and hydrophobic (H). Given a protein sequence, use P, N or H to substitute each amino acid in the sequence, and the sequence thus obtained is called a protein pseudo-sequence. Composition (C) is the percentage of P, N and H in the whole pseudo-sequence. Transition (T) is the changing frequency between any two characters. Distribution (D) is the sequence segment (in percentage) of the pseudo-sequence which is needed to contain the first, 25%, 50%, 75% and the last of the Ps, Ns and Hs, respectively. In conclusion, there are three, three, and fifteen properties for (C), (T) and (D), respectively. Accordingly, we have 3+3+15 = 21 features for the “mean” category, 21 features for the “maximum” category, and hence a total of 21×2 = 42 features by considering the “hydrophobicity” property alone. Similarly, we also have 42 features by considering each of the other three properties, i.e., the “normalized van der Waals volume”, “polarity”, and “polarizability”. Thus, we have a total of 42×4 = 168 features by considering the above four properties.
  2. Secondary structure: according to the secondary structural propensity of amino acids, each protein sequence can also be coded with three letters [65], [66]. Thus, like the case in considering hydrophobicity, we also have 21×2 = 42 features by considering the “secondary structure” property (or propensity).
  3. Solvent accessibility: ACCpro [67] can be used to predict each amino acid as hidden (H) or exposed (E) to solvent. Then the protein sequence is coded with letters H and E. Use composition (C) for H, transition (T) between H and E, and five distributions (D) for H in this property. Thus we have (1+1+5)×2 = 14 features by considering the “solvent accessibility” property.
  4. Amino acid compositions: it contains 20 components with each representing the percentage of each amino acid in a protein sequence [68]. Thus, we have 20 features for the “mean” category, and 20 features for the “maximum” category. Totally, we have 20×2 = 40 features for a pathway system by considering the amino acid composition.
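
The composition, transition and distribution descriptors described in item 1 can be sketched as below for a single pseudo-sequence, followed by the mean and maximum over the proteins of a pathway. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the input is assumed to be already encoded with the letters P, N and H and to have at least two characters, and the rule for locating the 25%, 50% and 75% occurrences is one of several conventions in use.

```python
from itertools import combinations


def ctd_features(pseudo, letters="PNH"):
    """Return 3 composition + 3 transition + 15 distribution values (in percent)."""
    length = len(pseudo)
    # Composition: percentage of each letter in the pseudo-sequence.
    composition = [100.0 * pseudo.count(c) / length for c in letters]
    # Transition: percentage of adjacent positions switching between two letters.
    pairs = list(zip(pseudo, pseudo[1:]))
    transition = [
        100.0 * sum(1 for a, b in pairs if {a, b} == {x, y}) / len(pairs)
        for x, y in combinations(letters, 2)
    ]
    # Distribution: position (percent of length) of the first, 25%, 50%, 75%
    # and last occurrence of each letter.
    distribution = []
    for c in letters:
        positions = [i + 1 for i, ch in enumerate(pseudo) if ch == c]
        for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
            if not positions:
                distribution.append(0.0)  # assumption: absent letter contributes 0
                continue
            rank = max(1, int(fraction * len(positions)))
            distribution.append(100.0 * positions[rank - 1] / length)
    return composition + transition + distribution


def pathway_level(protein_vectors):
    """Concatenate the per-feature mean and maximum over all proteins of a pathway."""
    columns = list(zip(*protein_vectors))
    means = [sum(col) / len(col) for col in columns]
    maxima = [max(col) for col in columns]
    return means + maxima


single = ctd_features("PPNHHNPH")
pathway = pathway_level([ctd_features("PPNHHNPH"), ctd_features("HNPPNH")])
```

Each protein yields 21 values, and the pathway-level vector has 42 entries, matching the 21×2 count given for one property.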

Shown in Table 2 is a breakdown of the 264 features for a pathway system by considering its biochemical and physicochemical properties. Before taking the mean and maximum values of each property into account, the following equations were used to adjust them according to a standard scale [61]:


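$$U_{ij} = \frac{u_{ij} - u_j}{T_j} \tag{5}$$
where $u_{ij}$ is the original value of the j-th feature for the i-th sample and $U_{ij}$ its adjusted value, $T_j$ is the standard deviation of the j-th feature and $u_j$ the mean value of the j-th feature.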
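
The adjustment to a standard scale can be sketched column by column as follows. The function name is hypothetical; using the population standard deviation and mapping a constant feature to 0 are assumptions, since the text does not specify either choice.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Subtract each feature's mean and divide by its standard deviation."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]  # assumption: population std. dev.
    return [
        # assumption: a feature with zero spread is mapped to 0.0
        [(v - c) / s if s > 0 else 0.0 for v, c, s in zip(row, centers, spreads)]
        for row in rows
    ]


scaled = standardize_columns([[1.0, 5.0], [3.0, 5.0]])
```

After this step every non-constant feature has mean 0 and unit standard deviation, so properties measured on different scales become comparable before the mean and maximum are taken.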

4. Functional property

The last category of features is about the functional property of each regulatory pathway. The gene ontology enrichment score of pathway i on gene ontology item j was defined as the −log10 of the hypergeometric test p value [15], [69], [70], [71] of proteins in pathway i and can be computed by the following equation:


$$S_{ij} = -\log_{10}\left( \sum_{k=m}^{n} \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}} \right) \tag{6}$$
where N is the number of overall proteins in KEGG of human, M is the number of proteins annotated to gene ontology item j, n is the number of proteins in pathway i, and m is the number of proteins in pathway i that are annotated to gene ontology item j. The larger the enrichment score of one gene ontology item, the more overrepresented this item is. There were a total of 5,218 gene ontology (GO) enrichment score features.
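
A minimal sketch of the enrichment score as the −log10 of the upper tail of the hypergeometric distribution is given below. The function name and the toy numbers are hypothetical; the sketch assumes the p value does not underflow to zero, which would make the logarithm undefined.

```python
from math import comb, log10


def enrichment_score(N, M, n, m):
    """-log10 of P(X >= m) for X ~ Hypergeometric(N, M, n).

    N: all proteins, M: proteins annotated to the GO item,
    n: proteins in the pathway, m: pathway proteins annotated to the GO item.
    """
    upper = min(n, M)  # terms beyond min(n, M) are zero anyway
    tail = sum(comb(M, k) * comb(N - M, n - k) for k in range(m, upper + 1))
    p_value = tail / comb(N, n)
    return -log10(p_value)


score_two_hits = enrichment_score(10, 5, 2, 2)
score_one_hit = enrichment_score(10, 5, 2, 1)
```

With m = 0 the tail covers the whole distribution, so the p value is 1 and the score is 0; more annotated proteins in the pathway give a smaller p value and a larger score.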

5. Representation of each pathway

It follows from the description in Section 1 “Graph property”, 3 “Biochemical and physicochemical property” and 4 “Functional property” that the total number of features was 88 + 264 + 5218 = 5570, as summarized in Table 3. Thus, according to Eq.2, each of the 146 pathway samples in the benchmark dataset (Table S1) will be represented by a 5570-D vector.

mRMR method

Minimum Redundancy Maximum Relevance (mRMR), first proposed by Peng et al. [16], was employed in this study, as it is established according to two criteria: Max-Relevance and Min-Redundancy. Max-Relevance guarantees that the features contributing most to the classification will be selected, while Min-Redundancy guarantees that features whose classification ability has already been covered by the selected features will be excluded. With the mRMR program, we can obtain two feature lists: the MaxRel features list and the mRMR features list. The MaxRel features list sorts features only according to the Max-Relevance criterion, while the mRMR features list is obtained in terms of both Max-Relevance and Min-Redundancy. Thus, for a feature set Ω with N features, the mRMR program will execute N rounds, and a feature with maximum relevance and minimum redundancy is selected in each round. Finally, we can obtain an ordered feature list, i.e., the mRMR features list:

$$F = [f_0, f_1, \ldots, f_{N-1}] \tag{7}$$
For a detailed description of the mRMR method, please refer to Peng et al.'s paper [16]. The mRMR method has been widely utilized to tackle various biological problems [45], [52], [72], [73], [74], [75], [76] and is deemed a powerful and useful tool for extracting important information from complex systems. The mRMR program developed by Peng et al. [16] is available at http://penglab.janelia.org/proj/mRMR/.
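
The round-by-round selection can be illustrated with a small greedy sketch. It is not the mRMR program itself: the function names are hypothetical, the features are assumed to be already discretized, and the score used is relevance minus mean redundancy (the difference form of the criterion), both measured by mutual information.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences of equal length."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def mrmr_order(columns, labels):
    """Return feature indices ordered by greedy max-relevance, min-redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    remaining = list(range(len(columns)))
    ordered = []

    def score(j):
        if not ordered:
            return relevance[j]  # first round: relevance only
        redundancy = sum(
            mutual_information(columns[j], columns[s]) for s in ordered
        ) / len(ordered)
        return relevance[j] - redundancy

    while remaining:  # one feature is selected in each round
        best = max(remaining, key=score)
        ordered.append(best)
        remaining.remove(best)
    return ordered


toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],  # informative feature
    [0, 0, 0, 1, 1, 1, 1, 1],  # exact duplicate of the first feature
    [0, 1, 0, 1, 0, 1, 0, 1],  # uninformative but not redundant
]
toy_order = mrmr_order(toy_columns, toy_labels)
```

In the toy data the duplicated feature is ranked last even though its relevance equals that of the first feature, because its information is already covered by a selected feature.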

Prediction model

In this study, we tried three prediction methods: Nearest Neighbor Algorithm (NNA), Sequential Minimal Optimization (SMO) and Bayesian network (BayesNet). NNA using cosine similarity as “nearness” [15], [61], [62], [71], [77] was implemented with an in-house script. The NNA program can be downloaded from http://pcal.biosino.org/NNA.html. SMO and BayesNet were implemented in Weka (Waikato Environment for Knowledge Analysis) [78]. Weka, which was developed by the University of Waikato in New Zealand, is software collecting a variety of state-of-the-art machine learning algorithms and data preprocessing tools. It provides extensive support for the whole process of experimental data mining, including preparing the input data, evaluating learning schemes statistically, and visualizing the input data and the result of learning [78]. Weka can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/.

1. Nearest Neighbor Algorithm (NNA)

Nearest Neighbor Algorithm (NNA) [17], [18], which has been widely used in bioinformatics and computational biology [15], [59], [60], [72], [79], [80], was adopted to predict the pathway class of each query pathway. The “nearness” is calculated as below


$$D(\mathbb{P}_1, \mathbb{P}_2) = 1 - \frac{\mathbb{P}_1 \cdot \mathbb{P}_2}{\|\mathbb{P}_1\| \, \|\mathbb{P}_2\|} \tag{8}$$
where $\mathbb{P}_1$ and $\mathbb{P}_2$ are two vectors representing two pathways, $\mathbb{P}_1 \cdot \mathbb{P}_2$ is their dot product, and $\|\mathbb{P}_1\|$ and $\|\mathbb{P}_2\|$ are the moduli of $\mathbb{P}_1$ and $\mathbb{P}_2$. The smaller the $D(\mathbb{P}_1, \mathbb{P}_2)$, the more similar the two pathways are [55]. In NNA, suppose there are m training pathways, each of which belongs to exactly one pathway class, and a query pathway needs to be classified into one pathway class. The distances between each of the m training pathways and the query pathway can be calculated, and the nearest neighbor of the query pathway is found. If the nearest neighbor belongs to the i-th pathway class, the query pathway is classified into the i-th pathway class. For an intuitive illustration of how NNA works, see Fig.5 of [23].
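
The nearest neighbor rule with the cosine-based distance can be sketched as below. The function names and toy vectors are hypothetical, and the sketch assumes that no feature vector is all zeros, since the distance is undefined in that case.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest_neighbor_class(query, training):
    """Return the class of the training pathway closest to the query.

    training: list of (feature_vector, class_label) pairs.
    """
    _, label = min(training, key=lambda item: cosine_distance(query, item[0]))
    return label


toy_training = [([1.0, 0.0], "Metabolism"), ([0.0, 1.0], "Human Diseases")]
predicted = nearest_neighbor_class([0.9, 0.1], toy_training)
```

The query vector points almost in the same direction as the first training vector, so it is assigned to that vector's class.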

2. SMO

SMO implements John Platt's sequential minimal optimization algorithm for training a support vector classifier using polynomial or Gaussian kernels [19], [20]. All attributes are processed before using SMO to make prediction, for example nominal attributes are transformed into binary ones, and attributes are normalized [78].

3. BayesNet

BayesNet learns Bayesian networks under the assumptions that all attributes are nominal (in particular, numeric ones should be prediscretized) and that there are no missing values. Two different algorithms are used to estimate the conditional probability tables of the network [78], and several search algorithms are implemented for local score metrics, such as K2 [81], Hill Climbing [82], TAN [83], [84] and so on. A more detailed description of this classifier in Weka can be found in [21].

Jackknife test to examine the quality of the current prediction method

In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling test, and jackknife test [85]. However, of the three test methods, the jackknife test is deemed the most objective [56]. The reasons are as follows. (i) For the independent dataset test, although all the proteins used to test the predictor are outside the training dataset used to train it so as to exclude the “memory” effect or bias, the way the independent proteins are selected to test the predictor could be quite arbitrary unless the number of independent proteins is sufficiently large. This kind of arbitrariness might result in completely different conclusions. For instance, a predictor achieving a higher success rate than another predictor for a given independent testing dataset might fail to do so when tested on another independent testing dataset [85]. (ii) For the subsampling test, the concrete procedure usually used in the literature is 5-fold, 7-fold or 10-fold cross-validation. The problem with this kind of subsampling test is that the number of possible selections in dividing a benchmark dataset is an astronomical figure even for a very simple dataset, as demonstrated by Eqs.28–30 in [23]. Therefore, in any actual subsampling cross-validation test, only an extremely small fraction of the possible selections is taken into account. Since different selections can lead to different results even for the same benchmark dataset and the same predictor, the subsampling test cannot avoid the arbitrariness either. A test method unable to yield a unique outcome cannot be deemed a good one. (iii) In the jackknife test, all the proteins in the benchmark dataset will be singled out one-by-one and tested by the predictor trained by the remaining protein samples.
During the process of jackknifing, both the training dataset and testing dataset are actually open, and each protein sample will be in turn moved between the two. The jackknife test can exclude the “memory” effect. Also, the arbitrariness problem as mentioned above for the independent dataset test and subsampling test can be avoided because the outcome obtained by the jackknife cross-validation is always unique for a given benchmark dataset. Accordingly, the jackknife test has been increasingly and widely used by those investigators with strong math background to examine the quality of various predictors (see, e.g., [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [86], [87], [88], [89], [90]). In view of this, here the jackknife test was also used to examine the quality of the current predictor in identifying the pathway class.
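
The jackknife procedure can be sketched as a leave-one-out loop around any classifier. The function names are hypothetical, and the one-dimensional nearest neighbor rule used in the example is only a stand-in so that the sketch runs on its own.

```python
def jackknife_accuracy(samples, labels, classify):
    """Hold out each sample once, train on the rest, and return the success rate.

    classify(query, training) must return a predicted label, where training
    is a list of (sample, label) pairs.
    """
    correct = 0
    for i in range(len(samples)):
        training = [
            (s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i
        ]
        if classify(samples[i], training) == labels[i]:
            correct += 1
    return correct / len(samples)


def toy_classifier(query, training):
    """Stand-in classifier: label of the closest training value."""
    return min(training, key=lambda item: abs(item[0] - query))[1]


toy_accuracy = jackknife_accuracy(
    [0.0, 0.1, 1.0, 1.1], ["a", "a", "b", "b"], toy_classifier
)
```

Because every sample is held out exactly once and there is no random partitioning, repeating the procedure on the same benchmark dataset gives the same value, which is the uniqueness property discussed above.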

Incremental feature selection (IFS)

As described in Section “mRMR method”, the mRMR features list F = [f0, f1,…,fN−1] can be obtained by the mRMR program. Denote the i-th feature set by Fi = { f0, f1,…,fi} (0 ≤ i ≤ N−1). For each i (0 ≤ i ≤ N−1), execute NNA, SMO and BayesNet with the features in Fi; then the overall accuracy of the classification (ACC), defined by “the number of correctly predicted pathways”/“the total number of pathways” and evaluated by the jackknife test, was obtained. As a result, we can plot a curve named the IFS curve with ACC as its y-axis and the index i of Fi as its x-axis.
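
The IFS loop can be sketched as follows. The function name is hypothetical, and `evaluate` stands for any routine that returns the jackknife ACC of a classifier restricted to the given features; in the example it is replaced by a lookup of made-up toy values.

```python
def incremental_feature_selection(ranked, evaluate):
    """Evaluate nested feature sets F_0, F_1, ... and return the curve and best set.

    ranked   : feature ids in mRMR order
    evaluate : callable mapping a list of feature ids to an accuracy
    """
    curve = []
    for i in range(len(ranked)):
        subset = ranked[: i + 1]  # F_i = {f_0, ..., f_i}
        curve.append(evaluate(subset))
    # On ties, max returns the first maximum, i.e. the smallest feature set.
    best_index = max(range(len(curve)), key=curve.__getitem__)
    return curve, ranked[: best_index + 1]


toy_scores = {1: 0.5, 2: 0.8, 3: 0.7}  # made-up accuracies for illustration only
toy_curve, toy_best = incremental_feature_selection(
    ["f0", "f1", "f2"], lambda subset: toy_scores[len(subset)]
)
```

The list `toy_curve` corresponds to the y-values of the IFS curve, and `toy_best` is the feature set at its peak.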


Results and Discussion
Results of mRMR

The mRMR program was obtained from http://penglab.janelia.org/proj/mRMR. It was run with default parameters, and two feature lists were obtained: (i) the MaxRel features list; (ii) the mRMR features list (see Table S2).

The MaxRel features list was obtained by sorting the features according to their contribution to the classification. We investigated the most relevant 1% of the features (55 in total), and Table 4 shows the distribution of these features. Of these, 32 (32/55, 58.18%) come from the biochemical and physicochemical property and 23 (23/55, 41.82%) come from the functional property. This indicates that, among the adopted features, the biochemical and physicochemical property of each pathway contributes most to the classification, and the functional property also makes an important contribution. It is striking that none of the graph property features was among the most relevant 1%, although such features were considered important factors in forming some biologically meaningful systems, such as protein complexes [45], [53]. In this study, we are concerned only with classifying a regulatory pathway into the correct pathway class, not with analyzing which features are more important for forming a regulatory pathway. At this stage, the graph property may not be very important, while the biological and functional properties are more important for determining the biological function of each pathway.

Results of IFS

Shown in Figure 1 are the IFS curves of NNA, SMO and BayesNet. The highest ACC value of IFS is 78.8%, obtained using 49 features and the SMO model (see Table 5 for details of the 49 features). The detailed IFS data can be found in Table S3.

Figure 2 shows the distribution of the optimized 49 features. Of these, 25 (25/49, 51.0%) were from the biochemical and physicochemical property and 24 (24/49, 49.0%) were from the functional property, while no graph property feature was selected into the optimized feature set. These results support the same conclusion as described in Section “Results of mRMR”.

Analysis of optimal features for pathway classification

Table 5 and Figure 2 show that the biochemical and physicochemical properties and the Gene Ontology functional properties were important for pathway classification.

Within the selected 25 biochemical and physicochemical properties, there were 6 secondary structure features, 6 amino acid composition features, 3 solvent accessibility features, 3 polarity features, 3 hydrophobicity features, 2 normalized van der Waals volume features and 2 polarizability features. Secondary structure and amino acid composition features were therefore more prominent among the selected features than the other biochemical and physicochemical properties. The correct secondary structure of a protein is essential to its function. Structurally incorrect proteins are associated with many different kinds of disease, such as Alzheimer's disease, Huntington's disease and Parkinson's disease [91]. In the KEGG pathway classification, there are 28 disease pathways. Some of them, such as neurodegenerative disease pathways and cancer pathways, are caused by or associated with protein misfolding [91]. Amino acid composition has been used to explain many biological phenomena, such as translation rate [62] and the metabolic stability of proteins [61], and it is closely related to protein synthesis and degradation [62], [70]. In the KEGG pathway classification, there are 73 metabolism pathways, and the amino acid composition features may affect these metabolism pathways.
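The breakdown above can be checked directly against the feature names in Table 5. The following sketch tallies the 49 names by their prefix; the names are copied from Table 5 (GO terms are reduced to their identifiers), while the grouping dictionary and helper function are illustrative assumptions.

```python
from collections import Counter

# Feature names in the order of Table 5; GO terms reduced to their identifiers.
OPTIMAL_FEATURES = """
secondary_structure_composition_P_max solvent_accessibility_composition_H_mean
solvent_accessibility_distribution_H.0.75_max GO:0043627 GO:0045121
secondary_structure_distribution_H.0.25_max AA_composition_S_mean
secondary_structure_distribution_N.0.25_max VanDerWaal_composition_P_max
GO:0043330 VanDerWaal_distribution_H.0.75_max AA_composition_T_max
AA_composition_D_max secondary_structure_distribution_H.0.5_max GO:0048519
GO:0002687 secondary_structure_composition_P_mean polarity_composition_N_max
GO:0042088 polarity_transition_NH_max AA_composition_S_max GO:0042063
polarizability_distribution_P.0.75_max GO:0090068 GO:0014829
secondary_structure_distribution_H.0.75_max AA_composition_Q_mean GO:0030225
GO:0046661 hydrophobicity_composition_N_max
solvent_accessibility_distribution_H.0.0_max polarity_distribution_P.0.5_max
polarizability_distribution_H.0.75_max GO:0031594 GO:0031330
AA_composition_P_max GO:0042953 GO:0048523 GO:0030217 GO:0007517 GO:0009913
GO:0042177 GO:0048641 hydrophobicity_distribution_N.0.75_max
hydrophobicity_distribution_H.0.75_max GO:0022408 GO:0048608 GO:0045638
GO:0006897
""".split()

# Illustrative mapping from name prefix to feature group.
PREFIX_TO_GROUP = {
    "secondary_structure_": "secondary structure",
    "AA_composition_": "amino acid composition",
    "solvent_accessibility_": "solvent accessibility",
    "polarity_": "polarity",
    "hydrophobicity_": "hydrophobicity",
    "VanDerWaal_": "normalized van der Waals volume",
    "polarizability_": "polarizability",
    "GO:": "Gene Ontology",
}


def group_of(name: str) -> str:
    """Return the feature group whose prefix matches the feature name."""
    for prefix, group in PREFIX_TO_GROUP.items():
        if name.startswith(prefix):
            return group
    raise ValueError(f"unknown feature prefix: {name}")


counts = Counter(group_of(name) for name in OPTIMAL_FEATURES)
for group, n in counts.most_common():
    print(f"{group}: {n}")
```

The tally reproduces the 25 biochemical and physicochemical features and the 24 Gene Ontology features reported in Figure 2.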

To investigate the association between KEGG pathway classes and the GO terms among the optimized features, we calculated hypergeometric test p values, which are shown in Table 6. As the table shows, “Metabolism” pathways were associated with the GO term “GO:0043627 response to estrogen stimulus” and “Genetic Information Processing” pathways with the GO term “GO:0045121 membrane raft”, whereas “Environmental Information Processing”, “Cellular Processes”, “Organismal Systems” and “Human Diseases” pathways were each associated with many GO terms among the optimized features. Some associations are obvious and well known, such as those between “Environmental Information Processing” pathways and “GO:0043627 response to estrogen stimulus”; between “Cellular Processes” pathways and “GO:0048519 negative regulation of biological process” and “GO:0048523 negative regulation of cellular process”; between “Organismal Systems” pathways and “GO:0030217 T cell differentiation”, “GO:0030225 macrophage differentiation”, etc.; and between “Human Diseases” pathways and “GO:0048519 negative regulation of biological process”, “GO:0048523 negative regulation of cellular process” and “GO:0042063 gliogenesis”. The relationship between “Metabolism” pathways and “GO:0043627 response to estrogen stimulus” may be indirect. Estrogen can induce dramatic cellular changes, such as apoptosis and carcinogenesis [92], [93], and during these changes the metabolism pathways change as well. “Genetic Information Processing” pathways include many biological processes, such as transcription, translation, folding, sorting, degradation, replication and repair. All these steps require the translocation of large molecules, which needs the assistance of membrane systems. Membrane rafts are involved in biosynthetic traffic, endocytosis and signal transduction [94].
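For readers unfamiliar with the hypergeometric test, a one-sided enrichment p value can be computed as in the sketch below. The exact setup behind Table 6 (background set, counting unit, tail) is not stated in this section, so the upper-tail form, the parameter names and the example numbers are assumptions for illustration only.

```python
from math import comb


def hypergeom_upper_tail(population: int, annotated: int,
                         sample: int, overlap: int) -> float:
    """P(X >= overlap) when `sample` items are drawn without replacement
    from `population` items, of which `annotated` carry the GO term.

    Hypothetical reading: `population` is the background protein set,
    `sample` is the set of proteins in one pathway class, and `overlap`
    is the number of those proteins annotated with the GO term.
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    favorable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


# Toy numbers (not from the paper): 10 proteins, 4 annotated, 3 drawn, 2 overlap.
p = hypergeom_upper_tail(population=10, annotated=4, sample=3, overlap=2)
print(round(p, 6))
```

Under this upper-tail form, an overlap of zero always gives a p value of 1, and a small p value means the overlap is larger than expected by chance.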

Combining the 25 biochemical and physicochemical properties with the 24 Gene Ontology functional properties, most KEGG pathways (78.8% in the jackknife test) can be correctly classified with reasonable biological meaning. The prediction model can be used to classify a new pathway into the existing pathway function groups. This amounts to predicting the function of new pathways, which is one of the ultimate goals of biological research.

We have analyzed 5570 features extracted from each known regulatory pathway in humans. Of the 5570 features, 88 were derived from the graph property, 264 from the biochemical and physicochemical property of proteins, and 5218 from the functional property. Subsequently, the mRMR method and the IFS technique were employed to analyze and identify the important features. Nearest neighbor algorithm and jackknife test were utilized to evaluate the accuracy of the classifier. As a result, 49 features were identified as important for classifying the pathway groups according to their biological functions. These findings might provide useful insights and stimulate in-depth investigation into such an important and challenging problem.


Supporting Information

Table S1

The pathway benchmark dataset. It contains 146 pathways classified into six classes or groups according to their biological functions.

(XLS)



Table S2

Two lists obtained by mRMR program.

(PDF)



Table S3

The IFS results for NNA, SMO and BayesNet.

(XLS)




Notes

Competing Interests: The authors have declared that no competing interests exist.

Funding: This work was supported by grants from the National Basic Research Program of China (2011CB510102, 2011CB510101). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

The authors would like to take this opportunity to express our gratitude to the editor and the anonymous reviewer for their constructive comments, which were very helpful in strengthening the presentation of this study.


References
1. Kanehisa M (1997) A database for post-genome analysis. Trends in genetics: TIG 133753769287494
2. Kanehisa M, Goto S (2000) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic acids research 28273010592173
3. Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, et al. (1999) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic acids research 2729349847135
4. Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M (2004) The KEGG resource for deciphering the genome. Nucleic acids research 32D277D28014681412
5. Bairoch A (1994) The ENZYME data bank. Nucleic acids research 22362636277937072
6. Schomburg I, Chang A, Hofmann O, Ebeling C, Ehrentreich F, et al. (2002) BRENDA: a resource for enzyme data and metabolic information. Trends in biochemical sciences 27545611796225
7. Schomburg I, Chang A, Schomburg D (2002) BRENDA, enzyme data and metabolic information. Nucleic acids research 30474911752250
8. Krieger C, Zhang P, Mueller L, Wang A, Paley S, et al. (2004) MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic acids research 32D438D44214681452
9. Kanehisa M, Araki M, Goto S, Hattori M, Hirakawa M, et al. (2008) KEGG for linking genomes to life and the environment. Nucleic Acids Res 36D48048418077471
10. Klukas C, Schreiber F (2007) Dynamic exploration and editing of KEGG pathway diagrams. Bioinformatics 2334435017142815
11. Caspi R, Foerster H, Fulcher CA, Hopkinson R, Ingraham J, et al. (2006) MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res 34D51151616381923
12. Caspi R, Foerster H, Fulcher CA, Kaipa P, Krummenacker M, et al. (2008) The MetaCyc Database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res 36D62363117965431
13. Pharkya P, Nikolaev EV, Maranas CD (2003) Review of the BRENDA Database. Metab Eng 5717312850129
14. Dale JM, Popescu L, Karp PD (2010) Machine learning methods for metabolic pathway prediction. BMC Bioinformatics 111520064214
15. Chen L, Huang T, Shi XH, Cai YD, Chou KC (2010) Analysis of protein pathway networks using hybrid properties. Molecules 158177819221076385
16. Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence 1226123816119262
17. Salzberg S, Cost S (1992) Predicting protein secondary structure with a nearest-neighbor algorithm. Journal of molecular biology 2273713741404357
18. Denoeux T (1995) A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Transactions on Systems Man and Cybernetics 25804813
19. Platt J (1998) Fast training of support vector machines using sequential minimal optimization. Cambridge, MA: MIT Press
20. Keerthi SS, Shevade SK, Bhattacharyya C, Murthy KRK (2001) Improvements to Platt's SMO algorithm for SVM classifier design. Neural Computation 13637649
21. Bouckaert RR (2004) Bayesian network classifiers in Weka. Department of Computer Science, University of Waikato, New Zealand.
22. Chou KC, Zhang CT (1995) Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology 30275349
23. Chou KC (2011) Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review). Journal of Theoretical Biology 27323624721168420
24. Chou KC (2001) Prediction of protein cellular attributes using pseudo amino acid composition. PROTEINS: Structure, Function, and Genetics (Erratum: ibid, 2001, Vol44, 60) 43246255
25. Mohabatkar H (2010) Prediction of cyclin proteins using Chou's pseudo amino acid composition. Protein & Peptide Letters 171207121420450487
26. Esmaeili M, Mohabatkar H, Mohsenzadeh S (2010) Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. Journal of Theoretical Biology 26320320919961864
27. Zeng YH, Guo YZ, Xiao RQ, Yang L, Yu LZ, et al. (2009) Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. Journal of Theoretical Biology 25936637219341746
28. Chen C, Chen L, Zou X, Cai P (2009) Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine. Protein & Peptide Letters 16273119149669
29. Ding H, Luo L, Lin H (2009) Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Protein & Peptide Letters 1635135519356130
30. Georgiou DN, Karakasidis TE, Nieto JJ, Torres A (2009) Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition. Journal of Theoretical Biology 257172619056401
31. Mohabatkar H, Mohammad Beigi M, Esmaeili A (2011) Prediction of GABA(A) receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine. Journal of Theoretical Biology 281182321536049
32. Yu L, Guo Y, Li Y, Li G, Li M, et al. (2010) SecretP: Identifying bacterial secreted proteins by fusing new features into Chou's pseudo-amino acid composition. Journal of Theoretical Biology 2671620691704
33. Gu Q, Ding YS, Zhang TL (2010) Prediction of G-Protein-Coupled Receptor Classes in Low Homology Using Chou's Pseudo Amino Acid Composition with Approximate Entropy and Hydrophobicity Patterns. Protein & Peptide Letters 1755956719594431
34. Qiu JD, Huang JH, Shi SP, Liang RP (2010) Using the concept of Chou's pseudo amino acid composition to predict enzyme family classes: an approach with support vector machine based on discrete wavelet transform. Protein & Peptide Letters 1771572219961429
35. Chou KC (2009) Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics 6262274
36. Chou K (1980) A new schematic method in enzyme kinetics. European Journal of Biochemistry 1131951987460947
37. Zhou GP, Deng MH (1984) An extension of Chou's graphical rules for deriving enzyme kinetic equations to system involving parallel reaction pathways. Biochemical Journal 2221691766477507
38. Chou KC (1989) Graphic rules in steady and non-steady enzyme kinetics. Journal of Biological Chemistry 26412074120792745429
39. Chou K (1990) Review: Applications of graph theory to enzyme kinetics and protein folding kinetics: Steady and non-steady-state systems. Biophysical chemistry 351242183882
40. Andraos J (2008) Kinetic plasticity and the determination of product ratios for kinetic schemes leading to multiple products without rate laws: new methods based on directed graphs. Canadian Journal of Chemistry 86342357
41. Chou K (2010) Graphic rule for drug metabolism systems. Current Drug Metabolism 1136937820446902
42. Althaus I, Chou J, Gonzales A, Deibel M, Chou K, et al. (1993) Steady-state kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-87201E. Journal of Biological Chemistry 268611961247681060
43. Althaus I, Gonzales A, Chou J, Romero D, Deibel M, et al. (1993) The quinoline U-78036 is a potent inhibitor of HIV-1 reverse transcriptase. Journal of Biological Chemistry 26814875148807686907
44. Althaus I, Chou J, Gonzales A, Deibel M, Chou K, et al. (1993) Kinetic studies with the non-nucleoside HIV-1 reverse transcriptase inhibitor U-88204E. Biochemistry 32654865547687145
45. Chen C, Chen L, Zou X, Cai P (2009) Prediction of Protein Secondary Structure Content by Using the Concept of Chou's Pseudo Amino Acid Composition and Support Vector Machine. Protein and Peptide Letters 16273119149669
46. Chou KC, Zhang CT, Maggiora GM (1997) Disposition of amphiphilic helices in heteropolar environments. PROTEINS: Structure, Function, and Genetics 2899108
47. Zhou GP (2011) The disposition of the LZCC protein residues in wenxiang diagram provides new insights into the protein-protein interaction mechanism. Journal of Theoretical Biology 28414214821718705
48. Wu ZC, Xiao X, Chou KC (2010) 2D-MH: A web-server for generating graphic representation of protein sequences based on the physicochemical properties of their constituent amino acids. J Theor Biol 267293420696175
49. Chakrabarti D (2005) Tools for large graph mining. Carnegie Mellon University
50. Barabasi A, Oltvai Z (2004) Network biology: understanding the cell's functional organization. Nature Reviews Genetics 5101113
51. Stelzl U, Worm U, Lalowski M, Haenig C, Brembeck F, et al. (2005) A human protein-protein interaction network: a resource for annotating the proteome. Cell 12295796816169070
52. Chen L, Lu L, Feng KR, Li WJ, Song J, et al. (2009) Multiple Classifier Integration for the Prediction of Protein Structural Classes. Journal of Computational Chemistry 302248225419274708
53. Qi Y, Balem F, Faloutsos C, Klein-Seetharaman J, Bar-Joseph Z (2008) Protein complex identification by supervised graph local clustering. Bioinformatics 24i250i25818586722
54. Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, et al. (2003) The gene ontology annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Research 1366267212654719
55. Chou K, Shen H (2007) Recent progress in protein subcellular location prediction. Analytical Biochemistry 37011617698024
56. Chou KC, Shen HB (2008) Cell-PLoc: A package of Web servers for predicting subcellular localization of proteins in various organisms (updated version: Cell-PLoc 2.0: An improved package of web-servers for predicting subcellular localization of proteins in various organisms, Natural Science, 2010, 2, 1090–1103). Nature Protocols 3153162
57. Chou KC, Wu ZC, Xiao X (2011) iLoc-Euk: A Multi-Label Classifier for Predicting the Subcellular Localization of Singleplex and Multiplex Eukaryotic Proteins. PLoS One 6e1825821483473
58. Chou K, Cai Y (2006) Predicting Protein-Protein interactions from sequences in a hybridization space. J Proteome Res 531632216457597
59. Chen L, Shi XH, Kong XY, Zeng ZB, Cai YD (2009) Identifying Protein Complexes Using Hybrid Properties. Journal of Proteome Research 85212521819764809
60. Chen L, Feng KY, Cai YD, Chou KC, Li HP (2010) Predicting the network of substrate-enzyme-product triads by combining compound similarity and functional domain composition. BMC bioinformatics 1129320513238
61. Huang T, Shi XH, Wang P, He Z, Feng KY, et al. (2010) Analysis and prediction of the metabolic stability of proteins based on their sequential features, subcellular locations and interaction networks. PLoS ONE 5e1097220532046
62. Huang T, Wan S, Xu Z, Zheng Y, Feng KY, et al. (2011) Analysis and prediction of translation rate based on sequence and functional features of the mRNA. PLoS ONE 6e1603621253596
63. Dubchak I, Muchnik I, Holbrook S, Kim S (1995) Prediction of protein folding class using global description of amino acid sequence. Proceedings of the National Academy of Sciences of the United States of America 92870087047568000
64. Dubchak I, Muchnik I, Mayor C, Dralyuk I, Kim S (1999) Recognition of a protein fold in the context of the SCOP classification. Proteins: Structure, Function, and Bioinformatics 35401407
65. Frishman D, Argos P (1997) Seventy-five percent accuracy in protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics 27329335
66. Cheng J, Randall A, Sweredoski M, Baldi P (2005) SCRATCH: a protein structure and structural feature prediction server. Nucleic acids research 33W72W7615980571
67. Pollastri G, Baldi P, Fariselli P, Casadio R (2002) Prediction of coordination number and relative solvent accessibility in proteins. Proteins: Structure, Function, and Bioinformatics 47142153
68. Chou KC (1995) A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Structure, Function & Genetics 21319344
69. Carmona-Saez P, Chagoyen M, Tirado F, Carazo JM, Pascual-Montano A (2007) GENECODIS: a web-based tool for finding significant concurrent annotations in gene lists. Genome Biol 8R317204154
70. Huang T, Wang P, Ye ZQ, Xu H, He Z, et al. (2010) Prediction of Deleterious Non-Synonymous SNPs Based on Protein Interaction Network and Hybrid Properties. PLoS ONE 5e1190020689580
71. Huang T, Xu Z, Chen L, Cai YD, Kong X (2011) Computational Analysis of HIV-1 Resistance Based on Gene Expression Profiles and the Virus-Host Interaction Network. PLoS ONE 6e1729121394196
72. He Z, Zhang J, Shi X, Hu L, Kong X, et al. (2010) Predicting Drug-Target Interaction Networks Based on Functional Groups and Biological Features. PLoS ONE 5e960320300175
73. Cai Y, Lu L (2008) Predicting n-terminal acetylation based on feature selection method. Biochemical and biophysical research communications 37286286518533108
74. Cai Y, Lu L, Chen L, He J (2010) Predicting subcellular location of proteins using integrated-algorithm method. Molecular Diversity 1455155819662505
75. Lu L, Niu B, Zhao J, Liu L, Lu W, et al. (2009) GalNAc-transferase specificity prediction based on feature selection method. Peptides 3035936418955094
76. Lu L, Shi X, Li S, Xie Z, Feng Y, et al. (2010) Protein sumoylation sites prediction based on two-stage feature selection. Molecular Diversity 14818619472067
77. Huang T, Cui W, Hu L, Feng K, Li YX, et al. (2009) Prediction of pharmacological and xenobiotic responses to drugs based on time course gene expression profiles. PLoS ONE 4e812619956587
78. Witten IH, Frank E (2005) Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann Pub
79. Chen L, Qian ZL, Fen KY, Cai YD (2010) Prediction of Interactiveness Between Small Molecules and Enzymes by Combining Gene Ontology and Compound Similarity. Journal of Computational Chemistry 311766177620033913
80. Cai Y, Chou K (2003) Nearest neighbour algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition. Biochemical and biophysical research communications 30540741112745090
81. Cooper GF, Herskovits E (1992) A Bayesian method for the induction of probabilistic networks from data. Machine learning 9309347
82. Buntine W (1996) A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering 8195210
83. Cheng J, Greiner R (1999) Comparing Bayesian network classifiers. Proceedings UAI 101107
84. Friedman N, Geiger D, Goldszmidt M (1997) Bayesian network classifiers. Machine learning 29131163
85. Chou KC, Zhang CT (1995) Review: Prediction of protein structural classes. Critical Reviews in Biochemistry and Molecular Biology 302753497587280
86. Lin H (2008) The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. Journal of Theoretical Biology 25235035618355838
87. Xiao X, Wu ZC, Chou KC (2011) A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites. PLoS One 6e2059221698097
88. Zhang GY, Fang BS (2008) Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou's amphiphilic pseudo amino acid composition. Journal of Theoretical Biology 25331031518471832
89. Zhou XB, Chen C, Li ZC, Zou XY (2007) Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. Journal of Theoretical Biology 24854655117628605
90. Gao CF, Qiu ZX, Wu XJ, Tian FW, Zhang H, et al. (2011) A Novel Fuzzy Fisher Classifier for Signal Peptide Prediction. Protein Peptide Letters 18831838
91. Chiti F, Dobson CM (2006) Protein misfolding, functional amyloid, and human disease. Annu Rev Biochem 7533336616756495
92. Lobanova YS, Scherbakov AM, Shatskaya VA, Krasil'nikov MA (2007) Mechanism of estrogen-induced apoptosis in breast cancer cells: role of the NF-kappaB signaling pathway. Biochemistry (Mosc) 7232032717447886
93. Chang M (2011) Dual roles of estrogen metabolism in mammary carcinogenesis. BMB Rep 4442343421777512
94. Chazal N, Gerlier D (2003) Virus entry, assembly, budding, and membrane rafts. Microbiol Mol Biol Rev 67226237 12794191

Figures

[Figure ID: pone-0025297-g001]
doi: 10.1371/journal.pone.0025297.g001.
Figure 1  The IFS curve.

The highest ACC value of IFS is 78.8%, obtained with 49 features and the SMO model.



[Figure ID: pone-0025297-g002]
doi: 10.1371/journal.pone.0025297.g002.
Figure 2  Distribution of the optimized 49 features.

Of the 49 optimized features, 25 (25/49, 51.0%) were from the biochemical and physicochemical property and 24 (24/49, 49.0%) were from the functional property, while no graph property feature was selected into the optimized feature set.



Tables
[TableWrap ID: pone-0025297-t001] doi: 10.1371/journal.pone.0025297.t001.
Table 1  The distribution of the 146 regulatory pathways.
Pathway class | Number of pathways
Metabolism 73
Genetic Information Processing 2
Environmental Information Processing 15
Cellular Processes 9
Organismal Systems 19
Human Diseases 28
Total 146

[TableWrap ID: pone-0025297-t002] doi: 10.1371/journal.pone.0025297.t002.
Table 2  A breakdown of the 264 features for a pathway system by considering its biochemical and physicochemical properties.
Properties | C | T | D | Mean category | Maximum category | Pathway system
Hydrophobicity | 3 | 3 | 15 | 21 | 21 | 42
Normalized van der Waals volume | 3 | 3 | 15 | 21 | 21 | 42
Polarity | 3 | 3 | 15 | 21 | 21 | 42
Polarizability | 3 | 3 | 15 | 21 | 21 | 42
Secondary structure | 3 | 3 | 15 | 21 | 21 | 42
Solvent accessibility | 1 | 1 | 5 | 7 | 7 | 14
Amino acid composition | 20 | N/A | N/A | 20 | 20 | 40
Total | 36 | 16 | 80 | 132 | 132 | 264

[TableWrap ID: pone-0025297-t003] doi: 10.1371/journal.pone.0025297.t003.
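Table 3  A breakdown of the 5570 features.
Category | Group name | Number of features
Graph property | Graph size and graph density | 2
Graph property | Degree statistics | 8
Graph property | Edge weight statistics | 4
Graph property | Topological change | 7
Graph property | Degree correlation | 6
Graph property | Clustering | 6
Graph property | Topological | 12
Graph property | Singular values | 3
Graph property | Local density change | 40
Biochemical and physicochemical property | Amino acid compositions | 40
Biochemical and physicochemical property | Hydrophobicity, normalized van der Waals volume, polarity and polarizability | 168
Biochemical and physicochemical property | Solvent accessibility | 14
Biochemical and physicochemical property | Secondary structure | 42
Functional property | Gene ontology enrichment score | 5218
Total | N/A | 5570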

[TableWrap ID: pone-0025297-t004] doi: 10.1371/journal.pone.0025297.t004.
Table 4  The distribution of the most relevant 55 features.
Category Number of features
Graph property 0
Biochemical and physicochemical property 32
Functional property 23
Total 55

[TableWrap ID: pone-0025297-t005] doi: 10.1371/journal.pone.0025297.t005.
Table 5  The 49 optimized features.
Order Feature name
1 secondary_structure_composition_P_max
2 solvent_accessibility_composition_H_mean
3 solvent_accessibility_distribution_H.0.75_max
4 GO:0043627 response to estrogen stimulus
5 GO:0045121 membrane raft
6 secondary_structure_distribution_H.0.25_max
7 AA_composition_S_mean
8 secondary_structure_distribution_N.0.25_max
9 VanDerWaal_composition_P_max
10 GO:0043330 response to exogenous dsRNA
11 VanDerWaal_distribution_H.0.75_max
12 AA_composition_T_max
13 AA_composition_D_max
14 secondary_structure_distribution_H.0.5_max
15 GO:0048519 negative regulation of biological process
16 GO:0002687 positive regulation of leukocyte migration
17 secondary_structure_composition_P_mean
18 polarity_composition_N_max
19 GO:0042088 T-helper 1 type immune response
20 polarity_transition_NH_max
21 AA_composition_S_max
22 GO:0042063 gliogenesis
23 polarizability_distribution_P.0.75_max
24 GO:0090068 positive regulation of cell cycle process
25 GO:0014829 vascular smooth muscle contraction
26 secondary_structure_distribution_H.0.75_max
27 AA_composition_Q_mean
28 GO:0030225 macrophage differentiation
29 GO:0046661 male sex differentiation
30 hydrophobicity_composition_N_max
31 solvent_accessibility_distribution_H.0.0_max
32 polarity_distribution_P.0.5_max
33 polarizability_distribution_H.0.75_max
34 GO:0031594 neuromuscular junction
35 GO:0031330 negative regulation of cellular catabolic process
36 AA_composition_P_max
37 GO:0042953 lipoprotein transport
38 GO:0048523 negative regulation of cellular process
39 GO:0030217 T cell differentiation
40 GO:0007517 muscle organ development
41 GO:0009913 epidermal cell differentiation
42 GO:0042177 negative regulation of protein catabolic process
43 GO:0048641 regulation of skeletal muscle tissue development
44 hydrophobicity_distribution_N.0.75_max
45 hydrophobicity_distribution_H.0.75_max
46 GO:0022408 negative regulation of cell-cell adhesion
47 GO:0048608 reproductive structure development
48 GO:0045638 negative regulation of myeloid cell differentiation
49 GO:0006897 endocytosis

[TableWrap ID: pone-0025297-t006] doi: 10.1371/journal.pone.0025297.t006.
Table 6  Hypergeometric test of overlap between KEGG pathway classes and GO terms in optimized features.
GO term | Metabolism | Genetic Information Processing | Environmental Information Processing | Cellular Processes | Organismal Systems | Human Diseases
GO:0043627 response to estrogen stimulus | 0.032588 | 1 | 5.15E-16 | 1.86E-08 | 0.004826 | 2.30E-19
GO:0045121 membrane raft | 0.681728 | 0.018851 | 2.68E-13 | 7.52E-15 | 1.09E-22 | 8.64E-15
GO:0043330 response to exogenous dsRNA | 1 | 1 | 0.106165 | 0.003522 | 0.000117 | 0.001727
GO:0048519 negative regulation of biological process | 1 | 1 | 1.86E-59 | 8.01E-39 | 4.20E-12 | 1.90E-51
GO:0002687 positive regulation of leukocyte migration | 1 | 1 | 2.11E-09 | 0.001789 | 0.013702 | 0.000707
GO:0042088 T-helper 1 type immune response | 1 | 1 | 3.50E-06 | 0.471266 | 0.094723 | 0.001178
GO:0042063 gliogenesis | 0.993714 | 1 | 5.20E-11 | 1.30E-05 | 0.019525 | 1.32E-13
GO:0090068 positive regulation of cell cycle process | 0.911776 | 1 | 9.12E-08 | 3.49E-06 | 0.024096 | 3.29E-08
GO:0014829 vascular smooth muscle contraction | 1 | 1 | 0.000189 | 0.049965 | 0.023416 | 0.002415
GO:0030225 macrophage differentiation | 1 | 1 | 0.003204 | 0.022913 | 0.00372 | 0.001178
GO:0046661 male sex differentiation | 0.664515 | 1 | 4.00E-10 | 0.036323 | 0.938207 | 3.85E-07
GO:0031594 neuromuscular junction | 1 | 1 | 0.001106 | 4.49E-06 | 1.97E-05 | 0.00224
GO:0031330 negative regulation of cellular catabolic process | 1 | 1 | 0.006858 | 0.527536 | 0.137844 | 0.00224
GO:0042953 lipoprotein transport | 1 | 1 | 0.127363 | 0.312566 | 0.023416 | 0.031663
GO:0048523 negative regulation of cellular process | 0.999997 | 1 | 1.89E-56 | 1.93E-38 | 1.57E-08 | 4.91E-50
GO:0030217 T cell differentiation | 0.957773 | 1 | 1.26E-16 | 0.023685 | 0.000397 | 1.82E-10
GO:0007517 muscle organ development | 0.998366 | 1 | 6.32E-12 | 6.49E-09 | 0.32379 | 2.38E-09
GO:0009913 epidermal cell differentiation | 1 | 1 | 0.123185 | 0.55964 | 0.968491 | 0.395449
GO:0042177 negative regulation of protein catabolic process | 1 | 1 | 0.019214 | 0.002942 | 0.021538 | 0.001178
GO:0048641 regulation of skeletal muscle tissue development | 1 | 1 | 5.03E-05 | 0.001284 | 0.447341 | 2.50E-06
GO:0022408 negative regulation of cell-cell adhesion | 1 | 1 | 0.015685 | 0.040951 | 0.017213 | 0.001727
GO:0048608 reproductive structure development | 0.431739 | 1 | 2.90E-16 | 0.036125 | 0.271969 | 4.81E-12
GO:0045638 negative regulation of myeloid cell differentiation | 1 | 1 | 0.032936 | 0.289118 | 0.009817 | 1.09E-06
GO:0006897 endocytosis | 0.995474 | 1 | 0.000121 | 0.012134 | 0.09916 | 0.006247


Article Categories:
  • Research Article
Article Categories:
  • Biology
    • Biochemistry
      • Metabolism
        • Metabolic Pathways
    • Computational Biology
      • Genomics
        • Genome Analysis Tools
          • Genetic Networks
      • Metabolic Networks
      • Regulatory Networks
      • Signaling Networks
    • Genetics
      • Gene Networks
    • Genomics
      • Genome Analysis Tools
        • Genetic Networks
    • Molecular Cell Biology
      • Signal Transduction

