Recovering genetic regulatory networks from chromatin immunoprecipitation and steadystate microarray data.  
Jump to Full Text  
MedLine Citation:

PMID: 18584039 Owner: NLM Status: PubMednotMEDLINE 
Abstract/OtherAbstract:

Recent advances in highthroughput DNA microarrays and chromatin immunoprecipitation (ChIP) assays have enabled the learning of the structure and functionality of genetic regulatory networks. In light of these heterogeneous data sets, this paper proposes a novel approach for reconstruction of genetic regulatory networks based on the posterior probabilities of gene regulations. Built within the framework of Bayesian statistics and computational Monte Carlo techniques, the proposed approach prevents the dichotomy of classifying gene interactions as either being connected or disconnected, thereby it reduces significantly the inference errors. Simulation results corroborate the superior performance of the proposed approach relative to the existing stateoftheart algorithms. A genetic regulatory network for Saccharomyces cerevisiae is inferred based on the published real data sets, and biological meaningful results are discussed. 
Authors:

Wentao Zhao; Erchin Serpedin; Edward R Dougherty 
Related Documents
:

18488999  What's left in asymmetry? 25343659  Productivity losses because of morbidity attributable to fetal alcohol spectrum disorde... 18712309  Integrating functional genomics data. 21131149  An empirical estimate of the precision of likelihood ratios from a forensicvoicecompa... 14630649  Mgraph: graphical models for microarray data analysis. 19002249  Evolving synaptic plasticity with an evolutionary cellular development model. 22617489  Computational modeling of epilepsy for an experimental neurologist. 14625609  Development of a riskadjusted capitation model based on principal inpatient diagnoses ... 21960719  Mz5: space and timeefficient storage of mass spectrometry data sets. 
Publication Detail:

Type: Journal Article 
Journal Detail:

Title: EURASIP journal on bioinformatics & systems biology Volume:  ISSN: 16874145 ISO Abbreviation: EURASIP J Bioinform Syst Biol Publication Date: 2008 
Date Detail:

Created Date: 20080627 Completed Date: 20100628 Revised Date: 20120426 
Medline Journal Info:

Nlm Unique ID: 101263720 Medline TA: EURASIP J Bioinform Syst Biol Country: United States 
Other Details:

Languages: eng Pagination: 248747 Citation Subset:  
Affiliation:

Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843, USA. 
Export Citation:

APA/MLA Format Download EndNote Download BibTex 
MeSH Terms  
Descriptor/Qualifier:


Comments/Corrections 
Full Text  
Journal Information Journal ID (nlmta): EURASIP J Bioinform Syst Biol Journal ID (publisherid): BSB ISSN: 16874145 ISSN: 16874153 Publisher: Hindawi Publishing Corporation 
Article Information Download PDF Copyright ? 2008 Wentao Zhao et al. openaccess: This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Received Day: 28 Month: 11 Year: 2007 Accepted Day: 20 Month: 5 Year: 2008 Print publication date: Year: 2008 Electronic publication date: Day: 19 Month: 6 Year: 2008 Volume: 2008Elocation ID: 248747 ID: 2435223 DOI: 10.1155/2008/248747 PubMed Id: 18584039 
Recovering Genetic Regulatory Networks from Chromatin Immunoprecipitation and SteadyState Microarray Data  
Wentao Zhao^{1}  
Erchin Serpedin^{1}*  
Edward R. Dougherty^{2}  
^{1}Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843, USA 

^{2}The Translational Genomics Research Institute (TGen), 400 North Fifth Street, Suite 1600, Phoenix, AZ 85004, USA 

Correspondence: *Erchin Serpedin: serpedin@ece.tamu.edu [other] Recommended by Z. Wang 
Currently, one of the most important research problems in molecular biology and bioinformatics consists of finding out the mechanisms that govern the gene regulations, which are considered to play fundamental roles in the operation of all processes taking place in living cells. Learning the structure and machinery of gene regulations opens up the possibility for understanding and controlling the functioning of organisms at the molecular level, and for designing intelligent therapies and drugs. In a biological process such as cell cycle or environmental response, a gene's product, the protein, can serve as a transcription factor of a target gene by binding to the target gene's regulatory region on chromatin and affect its transcription. The protein can also influence another gene's expression in subsequent stages, for example, through splicing or translation. Alternatively, these proteingene relationships can be viewed as genegene interactions, and are modeled in general as genetic regulatory networks.
Recent years have witnessed a number of different frameworks for modeling genetic regulatory networks, ranging from finescale modeling at the molecular level in terms of partial differential equations and stochastic equations, to large scale modeling at the gene and proteinlevel in terms of Boolean and probabilistic Boolean networks, and (dynamic) Bayesian networks; see, for example, [^{1}?^{6}] and their toolboxes [^{7}?^{9}]. The small scale modeling techniques are used to capture the detailed biochemical aspects of molecular interactions and are in general very computational demanding. On the other side, the largescale models provide a global vision of the interactions among the constituent elements of genetic regulatory networks and are generally represented in terms of graphs.
In the middle of 1990s, the birth of DNA microarrays equipped the industry with the capability to measure simultaneously the concentration of genomewide mRNA expressions. The gene expression data produced thereafter by gene chips have attracted extensive research on the inference of genetic regulatory networks based on various network models [^{10}?^{18}]. There are two types of DNA microarray data sets: time series (or time dependent) and time independent (also called steadystate or single point time series) data sets. In general, the timeindependent gene expression profiles are capable of recovering steadystate attractors, but fail to recover the direct and oriented (temporal regulating) relationships. On the other side, time series data sets can improve the inference greatly in contrast to timeindependent data sets [^{13}]. However, the financial costs, ethnical concerns, and implementation issues prevent collecting beneficial time series data. Recent statistics show that about 70% of published data are time independent [^{19}]. Therefore, the steadystate analysis is highly valuable despite the difficulty of making accurate inference of temporal relationships.
Inference of gene regulatory networks based solely on the information provided by microarray data is limited by a number of factors: number of available microarrays, quality of data samples, experimental noise, and errors (crosshybridizations). It is also known that posttranscriptional modifications and transcripts that are present at low levels are generally not detectable by microarrays. Since the gene activity is measured by the mRNA level, the underlying assumption is that there is a significant correlation between the mRNA level and the amount of protein associated with mRNA. However, the magnitude of such a correlation varies significantly depending on the type of protein involved. Therefore, a combined approach which, besides gene expression data, exploits additional data sources is likely to enhance the inference process.
The advent of in vivo chromatin immunoprecipitation (ChIP) assays has enabled to test whether a protein acting as a transcription factor binds to a specific DNA segment. Hence, ChIP assays serve as a promising mechanism to examine the regulatory relationships. In ChIP experiments, the protein is immobilized on the chromatin, then the chromatin is broken into DNA fragments, and the DNAprotein complexes are immunoprecipitated by using antibodies corresponding to the tested protein. Afterwards the DNA bound by the protein in question can be isolated and identified by using a cDNA microarray chip. The whole process is also called a ChIPchip experiment, and inherits several disadvantages. The protein to be tested has to possess a specific antibody, which might not be synthesized, discovered, or exist. In addition, the transcriptional regulation is a complex process that is expressed in several different aspects. The binding of the transcription factor to the promoter region of the target gene is the most pristine mode. Especially for eukaryotic organisms, some regulatory bindings take place at a region far away from the regulated gene. This fact makes the binding information questionable for determining the regulation relationships. Furthermore, the experimental results are represented by pvalues and the determination of the binding relationship is achieved through threshold comparison. However, the selection of the pvalue threshold introduces a dilemma. A high threshold not only identifies the most probable binding relationships but also might miss many true relationships with lower pvalues, while a low threshold infers more relationships, among which more might be false alarms. A good tradeoff is not easy to make. Besides, the cost factor has also to be considered. Generally, ChIPchip experiments are very expensive and testing thousands of proteins is not affordable.
A combination of both steadystate microarray data and ChIPchip data might help in making more accurate inferences. Intuitively, these two different types of data complement the shortcomings of each other. This motivates us to propose a Bayesian approach to analyze jointly both data sets and to establish a confidence measure of gene interactions. The proposed scheme possesses six key features which make it different from the existing algorithms. First, gene expression data in steadystate are considered, while time course data are used in other works like [^{11}, ^{13}, ^{20}]. Second, most of the current schemes recover a unique genetic network represented by a graph which best fits the observed data in a certain metric, while the proposed approach determines the posterior probabilities for all genepair interactions and avoids to make a dichotomous decision that classifies each gene interaction as being either connected or disconnected. The proposed approach can be easily transformed into a dichotomous scheme by only preserving the highly probable gene interactions. Third, the underlying structural model is assumed to be a directed cyclic graph, which allows cycles (feedback loops) and directed acyclic graphs are treated as special cases. This contrasts to Bayesian networks, which are directed acyclic graphs. Feedback loops are a common network motif in biological processes and their function is to yield the necessary redundancy and stability for the system [^{1}]. Therefore, methods based on Bayesian networks, for example, [^{21}?^{23}], lose their validity in the inference of cyclic graphs. Fourth, the proposed approach assumes continuousvalued variables, and this prevents the information loss incurred by data quantization. This represents an advantage compared with the discretevalued networks such as [^{21}?^{23}]. Fifth, the proposed connectivity score is oriented and has a very clear meaning, in the sense of posterior probabilities, while the existing scores based on the mutual information [^{14}, ^{18}, ^{24}] are vague and lack orientation information. Sixth, in the proposed approach the system kinetics are assumed to be nonlinear, while linear models are commonly utilized for the purpose of simplification [^{12}, ^{15}]. Besides, the proposed scheme establishes a general framework whose components can be customized to fit the nature of the underlying biological system.
The rest of the paper is organized as follows. Section 2 discusses the graphical model and system dynamics that govern the genetic expressions. Section 3 translates the pvalues of ChIPchip experiments into regulation probabilities and formulates the inference algorithm through Bayesian analysis. In Section 4, the proposed algorithm and other three schemes are simulated on a set of artificial networks. Performance comparisons illustrate that the proposed algorithm exceeds in terms of several metrics. The robustness of kinetics model is also discussed via simulations. Realistic data sets are exploited in the proposed inference framework and a genetic network is presented to account for the genetic response to environmental changes. Finally, Section 5 concludes the paper with remarks on possible future works.
Genetic regulatory networks can be represented by a parameterized graph (G, ?), where G and ? stand for the graph structure and parameter set, respectively. The graph structure qualitatively explains the direct gene interactions, while the parameter set quantitatively describes the system kinetics.
The graph G(V, E) is employed to map gene interactions at transcriptional level, where V denotes the set of vertices (genes) and E stands for the set of edges (regulation relationships). If gene X regulates gene Y, graphically such a relation is represented in terms of an oriented edge X? Y, where X is a parent of Y and Y is considered a child of X. All genes that present incidence edges with gene X represent the set of parental genes of X, and are compactly denoted in terms of the notation ?_{X}. If two genes X and Y interact with each other but the regulation orientation cannot be determined, an undirected edge is laid between the two genes as XY, which means both orientations are possible. A sequence of consecutiveoriented edges constitutes a directed path. If there is no directed path which starts and ends at the same vertex, in other words, the graph contains no loops, the graph is called a directed acyclic graph (DAG). DAGs lie at the basis of Bayesian networks, which are commonly employed to model causal relationships [^{25}].
General directed graphs (with possibly cycles) will serve as our structural model since they are consistent with the features exhibited by biological systems, in which loops account for system redundancy and stability. Given the graph structure G, the parent set ? is specified for any gene X. For conciseness, the subscript X associated with some variables is omitted in the analysis procedure when the context has clearly specified the gene in question. Next, we discuss the system kinetics and parameters defined in ?.
The system kinetics represents the dynamics that governs the gene mRNA concentrations in terms of genegene interactions. It can be modeled by a set of differential equations (DEs). A simplified form is a set of linear DEs. However, we accept the more complex model which was employed previously by [^{16}, ^{17}] since it is much more realistic and accounts for the expression saturation. Given a gene X, its parent set ? can be further partitioned into two disjoint subsets: the activator set A and the repressor set R, that is, ? = A ? R and A?R=?. The kinetics of gene X can be explained by the following differential equation:
(1)
dxdt=??x+?+?i=1Aai?i1+?i=1Aai?i+?j=1Rrj?j, 
As the response to environmental changes or incitations, a mature biological system always converges to a certain steadystate, in which all genes stay in equilibrium and do not change their expressions. In this context, the periodic processes, for example, cell cycle and circadian rhythm, are excluded from our research interest. By setting dx/dt = 0 and ? = 1, the observation y of the steadystate gene expression for gene X can be expressed as
(2)
y?=??+?i=1Aai?i1+?i=1Aai?i+?j=1Rrj?j?+??. 
Given a parent structure ? for gene X, the parameters in ? can be summarized as follows.
 For each parent ? ? ?, a binary variable is demanded to specify whether the parent is an activator or repressor, that is, 1_{A}(?), where 1 is the indicator function and it assumes the value 1 when ? ? A, and 0 otherwise. It can be modeled by a Bernoulli random variable with known success probability ?.
 For each activator a ? A and repressor r ? R, it is assumed that the regulating factors ?, ? ? Gamma(?, ?), where ?, ? are known.
 The baseline parameter ? is usually known.
 The noise ? ? N(0, ?^{2}), where ?^{2} can be set to a specific value or estimated.
It is worth to note that the choice of nonlinear differential equation and parameter priors does not influence the flow of analysis. Our scheme stands for a general framework and the detailed parameters can be easily customized to fit different scenarios. There are various mathematical models for system kinetics, such as [^{27}?^{29}]. The kinetics in 1 is chosen as our dynamic model because it possess the property of saturation, a key idea of MichaelisMenten kinetics [^{29}]. Besides, it is fairly simple and it also takes account of most other biological properties. Therefore, in the simulation of the real data set, we are assuming the proposed kinetics is true.
Consider a system composed of n genes indexed by {1, 2,?, n}. ChIPchip experiments can be conducted to examine whether gene i's corresponding protein binds gene j's regulatory region. Usually this regulatory sequence is a promoter region which is located within 600 base pairs upstream of the coding region of gene j. The experimental results are represented in terms of pvalues. In the first step, it is necessary to translate the pvalue p into the probability of existence of a regulation relationship from gene i to gene j, which is denoted as 𝒫(i?j ? p). This probability will act as the prior knowledge to integrate gene expression data.
The pvalue is within the range of [0,1]. After studying the properties of the microarray data, Allison proposed to exploit mixed Beta distribution to model the pvalue [^{30}]. If the transcription factor i regulates gene j, it is assumed that the ChIPchip experiment produces a pvalue p which conforms to a Beta distribution with parameters (?, ?),
(3)
f(p???i?j)?=?p??1(1?p)??1B(?,?), 
(4)
f(?p???i?j)=p??1(1?p)??1B(?,?). 
(5)
𝒫(i?j???p)=?B(?,??)p??1(1?p)??1?B(?,??)p??1(1?p)??1+(1??)B(?,??)p??1(1?p)??1. 
(6)
𝒫(i?j???p)=?p??1(1?p)??1?p??1(1?p)??1?+?(1??)B(?,?). 
The determination of ? and ? depends on the experimental knowledge of the accuracy of selecting pvalue thresholds. In the first step, a pvalue threshold p_{t} is imposed, then the validity of all bindings with pvalues less than p_{t} is corroborated by biological experiments. In this way, we can gain knowledge of the probability 𝒫(i?j ? p < p_{t}), which can be written in the form of
(7)
𝒫(i?j???p?<?pt)=?𝒫(p?<?pt???i?j)?𝒫(p?<?pt???i?j)?+?(1????)𝒫(p?<?pt???i?j)=??0ptp??1(1?p)??1dp??0ptp??1(1?p)??1dp?+?pt(1??)B(?,??). 
Some works in the literature, for example, [^{33}], have made the observation that at a pvalue threshold of 0.001, the frequency of false positives is 6%?10%, that is, 𝒫(i ? j ? p < p_{t}) ? [6%, 10%]. Taking into account these special points, we can determine the pair (?, ?) in a small range. In our case, ??0.1 and ? ? 100. Finally, a table can be set up to map the pvalue into the edge existence probability, which can be computed only once. It is an overhead for the computational system but it does not assume much computational resource in the runtime.
Assume that m observations of expression vector are obtained and stored in matrix D^{n?m}. Next, we develop a computational approach to establish the posterior probability of the regulation i?j, that is, the probability of the existence of the edge i?j, which is represented by 𝒫(i?j ? D, p). This posterior can be obtained through integration over the whole parental gene set and parameter space for gene j:
(8)
𝒫(i?j???D,p)=??j??jf(i?j,??j,??j???D,?p)d?j=??j??j1?j(i)f(?j,??j???D,?p)d?j, 
(9)
f(?j,??j???D,?p)=f(D????j,??j,?p)f(?j,??j???p)f(D???p)=f(D????j,??j)f(?j,??j???p)f(D)=f(D????j,??j)f(?j,??j???p)??j??jf(D????j,??j)f(?j,??j???p)d?j=f(Dj???Dj?,??j,?j)f(?j,??j???p)??j??jf(Dj???Dj?,??j,??j)f(?j,??j???p)d?j, 
(10)
𝒫(i?j???D,?p)=??j??j1?j(i)f(Dj???Dj?,??j,??j)f(?j,??j???p)d?j??j??jf(Dj???Dj?,??j,??j)f(?j,??j???p)d?j. 
(11)
𝒫(i?j???D,p)???j,??j1?j(i)f(Dj???Dj?,??j,?j)??j,??jf(Dj???Dj?,??j,?j). 
Assuming that the selection of a parent as an activator is performed in an independent manner, and that the selection of the regulation factor value is also performed independently, the model probability density f(?, ? ? p) can be further expanded by using the chain rule
(12)
f(?,????p)=f(?????)𝒫(????p)=?i=1A[?f(?i)]?j=1R[(1??)f(?j)]𝒫(????p). 
Equation (12) conveys the idea that the random samples of graphical models can be sequentially created and processed. First the network structure is created based on the binding probability of gene regulation obtained in the ChIPchip experiment, then each parent is randomly assigned to represent an activator or repressor, and finally regulation factors are generated.
Our computational procedure can be briefly formulated in terms of Algorithm 1, where the Matlab coding convention is used to write the pseudocode. There exist n genes in the system. An n ? n matrix is created to represent the pvalues produced in the ChIPchip experiment. We collect m steadystate gene expression samples. The output entry C_{ij} stands for 𝒫(i?j ? D, p), and M denotes the number of MonteCarlo iterations. Lines 1 and 2 deal with the ChIPchip experimental data and translate pvalues into the binding probabilities by using (5). The results are stored in matrix B. Lines 3 and 4 perform the preprocessing of the gene expression data. Let Y_{(1)}, Y_{(2)}, Y_{(3)},?, Y_{(m?2)}, Y_{(m?1)}, Y_{(m)} be the values of a specific gene expression in ascending order. The smallest two values, Y_{(1)}, Y_{(2)}, and the largest two values, Y_{(m?1)}, Y_{(m)}, are treated as outliers and discarded. The dynamic range is defined as R = Y_{(m?2)} ? Y_{(3)}. The gene expressions are normalized as follows: the smallest two samples are assigned the null value and the largest two samples are assigned the unit value; the intermediary samples Y_{(i)} are normalized as (Y_{(i)} ? Y_{(3)})/R; if there is a missing sample, it is recovered through interpolation by gene's mean expression. Lines 12 through 16 implement the numerator of (11), and Line 17 computes the denominator of (11).
The algorithm can be easily reorganized into a parallel form so that we can exploit efficiently the distributed computational resources. The entries of output matrix C represent the posterior probabilities of regulation relationships between any two genes. It is directional (asymmetrical), and it possesses a clear probabilistic meaning compared with other vague connectivity metrics, for example, mutual information. It grants the biologists the flexibility first to examine the most significant interactions, then to proceed with less evidenced edges. Therefore, it is advantageous relative to a purely dichotomous scheme, in which genes are treated as being either connected or disconnected. A probability threshold can be imposed to change the algorithm into a dichotomous classifier. Since the posterior probability has a universal meaning, this threshold can be easily selected, usually within the range of [0.3?0.9]. A tradeoff has also to be made for different performance metrics.
The simulation consists of two parts. In the first part, artificial networks are created and the performance of the proposed algorithm is compared with other representative algorithms available in the literature, namely the relevance network (RN) method [^{14}], ChowLiu algorithm [^{24}], and ARACNE [^{18}]. In the second part, the algorithm is tested on the real Saccharomyces cerevisiae (budding yeast) data set and a biologically meaningful genetic network is inferred for the genetic response to environmental changes.
The proposed algorithm is compared with other three algorithms to evaluate its capability of recovering genetic networks based on gene expression data alone. The relevance network (RN) model [^{14}] represents a robust inference method based on gene expression profiles. In the first step, it computes the mutual information between any two genes X and Y, denoted as I(X; Y). Then it suggests two genes X and Y to be relevant if their mutual information assumes a larger value than a prespecified threshold and it lays down an undirected edge as XY. Hence, RN measures the significance of gene interactions in terms of mutual information between the gene expressions and produces an undirected cyclic graph. ChowLiu algorithm [^{24}] approaches the inference problem by finding the maximum spanning tree in which the edge weights stand for the mutual information. However, it loses validity if the underlying model is a cyclic graph. In addition, when the graph is densely connected, this scheme might falsely miss too many edges. ARACNE algorithm [^{18}] exploits the data processing inequality (DPI). It starts with a fully connected graph and a predefined mutual information threshold. Whenever the mutual information between two genes X and Y, that is, I(X; Y), is less than a threshold, it disconnects the two genes. Next, in the preliminary graph if there exists Z so that I(X; Y) < min(I(X; Z),I(Y; Z)), then it disconnects X and Y. In our simulations, we resort to an already available but efficient Matlab toolbox [^{34}] to estimate the mutual information.
Before making performance comparisons, we define inference errors and performance metrics. Because RN, ChowLiu, and ARACNE algorithms all construct undirected graphs, we have to disregard the orientation information inferred by the proposed algorithm. The synthetic and inferred graphs are represented by G(V, E) and G^(V,E^), respectively. The two graphs share the same set of vertices but differ in the set of edges.
There are two types of inference errors. The type1 errors are false positives (FP) and are also called false alarms. If the inference algorithm determines an interaction for two vertices X and Y in the inferred graph, denoted as X?Y?E^, but there is no such edge in the synthetic graph, that is, XY ? E, then an FP is produced. The number of FPs, represented by N_{FP}, can be counted as follows:
(13)
NFP=??X,Y((X?Y?E^)?(X?Y?E)), 
(14)
NFN=??X,Y((X?Y?E)?(X?Y?E^)). 
Correct inference can also be divided into two categories. If X?Y?E^ and XY ? E, the correctness is defined as a true positive (TP). Its summation, annotated by N_{TP}, is
(15)
NTP=??X,Y((X?Y?E^)?(X?Y?E)). 
(16)
NTN=??X,Y((X?Y?E^)?(X?Y?E)). 
Different performance metrics are proposed in the literature. Three most popular of them are considered here. The first metric, referred to as the Hamming distance, is the summation of all the inference errors and is given by
(17)
Hamming??distance?=?NFP?+?NFN. 
The second metric is called the sensitivity, and is defined as
(18)
Sensitivity?=?NTPNTP+NFN. 
(19)
Specificity?=?NTNNTN+NFP. 
A set of artificial networks are created based on the system dynamic equation (1). Each network has 30 vertices and 60 oriented edges. Such a network scale is selected for the consideration of the computational resources and the biological network that we are going to infer. The steadystate data are sampled by emulating the gene knockout experiment. A gene's expression is mandatorily forced to 0 while all other genes are free to change their expressions. The initial values of the system are randomly generated. When the system converges to the equilibrium, a Gaussian noise N(0, 0.03) is added and a few samples are obtained. All genes are shut down one by one. An extra in silico experiment is performed and no genes are shut down. These samples correspond to the wild type strain.
Different numbers of steadystate samples were generated based on the adopted system kinetics. The transcription factor is assumed to be an activator or repressor with equal probability, that is, ? = .5. The baseline parameter ? = 0.5 and the gamma parameters of regulation factors are (? = 16, ? = 0.0625) so that the regulation factor has a unit mean. ChowLiu algorithm creates a spanning tree; therefore, it preserves only 29 edges, while the original synthetic network possesses 30 vertices and 60 edges. In order to make comparisons, we tune the parameters for the other three schemes so that the number of inferred edges is around 30. For the RN method, we keep the 30 edges with the highest mutual information. For ARACNE, the mutual information threshold is adjusted. In our proposed algorithm, the posterior probability thresholds are changed in the range of [0.3, 0.9] so that approximately 30 edges are obtained. It has to be noted that RN, ARACNE, and ChowLiu algorithms only preserve interactions but disregard the interaction orientation. Therefore, in order to make consistent comparisons, we have to sacrifice the orientation information offered by the proposed algorithm. Besides, these three schemes have no capability of processing ChIPchip data. Therefore, we have to configure the proposed algorithm such that any two nodes are associated with a small prior probability of connection (0.1). This reflects the fact that the connection between two arbitrary nodes in the graph is very unlikely, but not impossible. This also exemplifies how the algorithm works in the absence of the ChIPchip data.
Figure 1(a) compares the performance in terms of Hamming distance for the four schemes assuming different sample sizes. The proposed method provides much better inference accuracy because it achieves the lowest Hamming distance. Larger sample size rewards a better inference precision. ChowLiu's algorithm and ARACNE do not perform well. This can be attributed to the assumption of the network. Our synthetic networks actually are cyclic networks in order to reflect the real world scenario. However, cycles in the network ruin the inference precisions of ChowLiu and ARACNE. Figure 1(c) illustrates the impact of sample size on the sensitivity. The proposed scheme outperforms the other three schemes. The sensitivities of all algorithms are less than 0.5. This is mainly due to the constraint that we pose on the number of inferred edges, that is, 30 edges. If we relax the posterior probability threshold, the sensitivity will be improved by sacrificing the specificity. Figure 1(e) depicts specificity for all schemes. All of them have high specificities, which are all greater than 0.90. The proposed scheme still exceeds. This high specificity is mainly due to the stringent constraint posed on the number of inferred edges. When considering the orientation of the edges, we find that 90% true positives inferred by the proposed algorithm are actually oriented correctly. This represents a big advantage of the proposed algorithm compared with the other three schemes.
In the previous simulations, the proposed inference algorithm assumes the system dynamic as depicted by (1). Actually, for different biological processes, there exist various mathematical models which achieve tradeoffs between the sophistication of the underlying molecular reaction and the simplification of the formula description (see [^{27}, ^{29}] for model comparisons). Savageau [^{28}] proposed an alternative mathematical model to account for the gene control and various forms of coupling among elementary gene circuits. This model can be denoted as
(20)
dxdt?=??A?i=1Aai?i??R?j=1Rrj?j, 
Although the proposed inference framework can ?plug and play" with different models, it is still necessary to examine its robustness against the underlying model. We evaluate this model dependence by the following steps: configure the model as 13 and create a set of synthetic data, then apply the proposed algorithm based on the dynamic equation (1), finally determine the performance metrics for different algorithms and compare the results with those in the previous section.
The simulation results are plotted in Figures 1(b), 1(d), and 1(f). Each figure corresponds to a different performance metric. All algorithms exhibit different values for performance values. This shows that the inference is dependent on the particular data sets and their underlying model. Compared with other three schemes, the proposed algorithm still achieves good performance in terms of three metrics. However, the advantage of the proposed algorithm are not significant now. ARACNE, ChowLiu, and relevance method do not degenerate much. This attributes mainly to the nonparametric property of these three schemes. The persistent good performance of the proposed algorithm is due to the fact that both dynamic models have to convey the basic properties of the gene interaction kinetics, such as the activation and repression effects and the coupling of the circuitry.
Saccharomyces cerevisiae (yeast) has been extensively studied in the literature of molecular biology because it is a unicellular eukaryotic organism, which shares similar cell structure with plants and animals. Also, yeast presents a short life cycle, which makes the experiments to be easily conducted. Lee et al. [^{33}] performed the ChIPchip experiment, in which 141 transcription factors were tested for binding intergenetic regions corresponding to 6270 genes. The gene expression data were published by Mnaimneh et al. [^{35}], who created promoter shutoff strains for 2/3 of all essential genes. The data set contains 215 steadystate cDNA microarray samples. The model parameters are assumed the same as artificial networks.
The intracellular signalling pathway in response to environmental changes has been conserved through evolution. Therefore, a study of this biological subsystem on the Saccharomyces cerevisiae might help to decipher the cell survival mechanism of other organisms. We select 30 genes which are annotated to participate in the stress response process. The given ChIPchip experiment did not provide full prior knowledge between any two genes (nodes in the graph). We believe that, among these genes, there are some genes whose protein products may also serve as transcription factors. Therefore, if the binding between two genes was not tested in the ChIPchip experiment, a small probability value 0.1 is assigned as the prior knowledge. The proposed inference algorithm leads to the genetic network illustrated in Figure 2.
The inferred genetic regulatory network shows strong proneness toward a scalefree network instead of a random network. Some genes possess especially high degree of connectivity. Three hub genes CIN5, HSF1, MSN4 already connect with more than 60% of all selected genes. Each of them has a connectivity degree not less than 8 while on average each gene in the network is connected with no more than 4 genes. These hub genes constitute the backbone of the network and they are potential control targets. This scalefree property is advantageous in maintaining the system robustness because a failure in one subsystem will not be propagated to the whole body.
Multiple works, for example [^{36}], have identified MSN4 and MSN2 as two of the most important genes in the response to environmental changes. A recent work [^{37}] recognized the functionality of another crucial gene HSF1, which is a heat shock transcription factor and functions in a different domain than the one corresponding to MSN2/4. Our inferred network confers this experimental result by showing that HSF1 and MSN2/4 regulate different set of genes except a weak connectivity between HSF1 and MSN4. MSN2/4 are not conserved in humans, while HSF genes have been preserved for various organisms such as Drosophila melanogaster, chickens, and mammals. Therefore, a study of the HSF1 pathway opens up the possibility of understanding the mechanism that governs the survival of normal cells under austere conditions.
CIN5 (YAP4) and YAP6 are two genes that play key roles in controlling the resistance to drugs, for example, cisplatin [^{38}]. CAD1(YAP2), CIN5, YAP1, and YAP6 share a structure motif called basic leucine zipper (bZIP) and they are located closely in the network. However, they are not neighboring the other two bZIP genes: YAP5 and YAP7. It is hypothesized that although they have similar molecular structures, their biological functionalities are in distinct domains.
Several edges, discovered by imposing a stringent pvalue threshold 0.001 to the location data, were persevered in our inferred network. Actually, these connections constitute a small portion of the proposed network, and they are CIN5?MSN1, CIN5?YAP6, CIN5?ROX1, YAP1?YAP6, MAC1?CUP9, CUP9?YAP6, and HAL9?MSN4. Various evidences are found to corroborate the recovered interactions, which can not be obtained by employing a stringent pvalue for the location data. For example, YAP5 is recovered to directly regulate STE50. This regulation relationship has also been reported in the work of Horak [^{39}]. The relationship between MSN2 and SCH9 is studied in [^{40}] in the context of extending the life span.
It is worthwhile to note that gene expression data mainly provide statistical relationships among genes, while location data offer physical binding interactions at the molecular level. By combining the two data sources, we are aiming to refine the inferred network to be biologically more meaningful. However, it also runs at a risk of confusing statistical regulatory relationships with real binding interactions. When such a case occurs, the proposed algorithm is capable of constraining the interacting genes within the same biological process and common functional relationships. A related discussion about the meaning of inferred network can also be found in [^{41}].
A novel algorithm is proposed to recover the genetic regulatory networks in the light of knowledge of transcriptional kinetics, ChIPchip, and gene microarray data. The analysis is based on the Bayesian methodology and Monte Carlo techniques. The proposed scheme is useful to compensate the shortcomings of the utilization of only one data set alone. Our in silico experiments corroborate that the algorithm outperforms in specificity, sensitivity and Hamming distance relative to three stateoftheart schemes. A budding yeast genetic regulatory network is proposed to account for the stress response.
There are possible extensions to our current scheme. An analysis of the error estimation is desired for the Monte Carlo simulation in order to determine the appropriate number of iterations. Several other knowledge sources are to be integrated into the current framework. For example proteinprotein interactions are useful to identify cobinding regulations. Genome sequencing data have been utilized to find regulatory motifs. Protein structure knowledge can be exploited to categorize the proteins and find similar functionality. A crossspecies research is also highly desirable since similar regulation mechanisms are expected to be conserved. If a gene is conserved in both humans and mice, then the knowledge of the gene pathway in the mouse will be an excellent reference for the study of human genetic diseases. We expect a global distributed framework, in which each data source acts as a separate component and its absence does not interfere with the whole computational process.
This work was supported by the National Cancer Institute (CA90301) and the National Science Foundation (ECS0355227 and CCF0514644).
References
1.  Kauffman SA. Metabolic stability and epigenesist in randomly constructed genetic netsJournal of Theoretical Biology 1969;22(3):437–467. [pmid: 5803332] 
2.  Murphy K,Mia S. Modelling gene expression data using dynamic Bayesian networks1999Berkeley, Calif, USA: Computer Science Division, University of California; Tech. Rep. 
3.  Sebastiani P,Abad MM,Ramoni M. Bayesian networksThe Data Mining and Knowledge Discovery Handbook 2005New York, NY, USA: Springer; :193–230. 
4.  Shmulevich I,Dougherty ER,Kim S,Zhang W. Probabilistic Boolean networks: a rulebased uncertainty model for gene regulatory networksBioinformatics 2002;18(2):261–274. [pmid: 11847074] 
5.  Tabus I,Astola J. On the use of MDL principle in gene expression predictionJournal on Applied Signal Processing 2001;2001(4):297–303. 
6.  ?ktem H,Pearson R,YliHarja O,Nicorici D,Egiazarian K,Astola J,et al. A computational model for simulating continuous time Boolean networksIn: Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '02)October 2002Raleigh, NC, USA 
7.  Murphy K. Bayes Net Toolbox for Matlab http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html. 
8.  Leray P. Structure learning toolbox of Bayesian Networks http://bnt.insarouen.fr/ajouts.html. 
9.  Friedman N,Elidan G. Bayesian Network learning tool http://www.cs.huji.ac.il/labs/compbio/LibB/programs.html#LearnBayes. 
10.  Rao A,Hero AO,States DJ,Engel JD. Manifold embedding for understanding mechanisms of transcriptional regulationIn: Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '06)May 2006College Station, Tex, USA:3–4. 
11.  Liang S,Fuhrman S,Somogyi R. REVEAL, a general reverse engineering algorithm for inference of genetic network architecturesIn: Proceedings of the Pacific Symposium on Biocomputing (PSB '98)January 1998Maui, Hawaii, USA:18–29. 
12.  Luna IT,Yin Y,Huang Y,Padillo DPR,Perez MCC,Wang Y. Uncovering gene regulatory networks using variational Bayes variable selectionIn: Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '06)May 2006College Station, Tex, USA:111–112. 
13.  Zhao W,Serpedin E,Dougherty ER. Inferring gene regulatory networks from time series data using the minimum description length principleBioinformatics 2006;22(17):2129–2135. [pmid: 16845143] 
14.  Butte AJ,Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurementsIn: Proceedings of the Pacific Symposium on Biocomputing (PSB '00)January 2000Honolulu, Hawaii, USA:418–429. 
15.  Rogers S,Girolami M. A Bayesian regression approach to the inference of regulatory networks from gene expression dataBioinformatics 2005;21(14):3131–3137. [pmid: 15879452] 
16.  Rice JJ,Tu Y,Stolovitzky G. Reconstructing biological networks using conditional correlation analysisBioinformatics 2005;21(6):765–773. [pmid: 15486043] 
17.  Yeung MKS,Tegn?r J,Collins JJ. Reverse engineering gene networks using singular value decomposition and robust regressionProceedings of the National Academy of Sciences of the United States of America 2002;99(9):6163–6168. [pmid: 11983907] 
18.  Margolin AA,Nemenman I,Basso K,et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular contextBMC Bioinformatics 2006;7(supplement 1, S7):1–15. [pmid: 16393334] 
19.  Simon I,Siegfried Z,Ernst J,BarJoseph Z. Combined static and dynamic analysis for determining the quality of timeseries expression profilesNature Biotechnology 2005;23(12):1503–1508. [pmid: 16333294] 
20.  Bernard A,Hartemink AJ. Informative structure priors: joint learning of dynamic regulatory networks from multiple types of dataIn: Proceedings of the Pacific Symposium on Biocomputing (PSB '05)January 2005The Big Island of Hawaii, Hawaii, USA:459–470. 
21.  Hartemink AJ,Gifford DK,Jaakkola TS,Young RA. Combining location and expression data for principled discovery of genetic regulatory network modelsIn: Proceedings of the Pacific Symposium on Biocomputing (PSB '02)January 2002Lihue, Hawaii, USA:437–449. 
22.  Cooper GF,Herskovits E. A Bayesian method for the induction of probabilistic networks from dataMachine Learning 1992;9(4):309–347. 
23.  Heckerman D,Geiger D,Chickering DM. Learning Bayesian networks: the combination of knowledge and statistical dataMachine Learning 1995;20(3):197–243. 
24.  Chow C,Liu C. Approximating discrete probability distributions with dependence treesIEEE Transaction on Information Theory 1968;14(3):462–467. 
25.  Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988San Francisco, Calif, USA: Morgan Kaufmann; 
26.  Friedman N,Cai L,Xie XS. Linking stochastic dynamics to population distribution: an analytical framework of gene expressionPhysical Review Letters 2006;97(16):4 pages. Article ID 168302. 
27.  Wessels LF,van Someren EP,Reinders MJ. A comparison of genetic network modelsIn: Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01)January 2001The Big Island of Hawaii, Hawaii, USA:508–519. 
28.  Savageau MA. Rules for the evolution of gene circuitryIn: Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB '98)January 1998Maui, Hawaii, USA:54–65. 
29.  EdelsteinKeshet L. Mathematical Models in Biology. 1988New York, NY, USA: Random House; 
30.  Allison DB,Gadbury GL,Heo M,et al. A mixture model approach for the analysis of microarray gene expression dataComputational Statistics & Data Analysis 2002;39(1):1–20. 
31.  Guelzim N,Bottani S,Bourgine P,K?p?s F. Topological and causal structure of the yeast transcriptional regulatory networkNature Genetics 2002;31(1):60–63. [pmid: 11967534] 
32.  Giot L,Bader JS,Brouwer C,et al. A protein interaction map of Drosophila melanogasterScience 2003;302(5651):1727–1736. [pmid: 14605208] 
33.  Lee TI,Rinaldi NJ,Robert F,et al. Transcriptional regulatory networks in Saccharomyces cerevisiaeScience 2002;298(5594):799–804. [pmid: 12399584] 
34.  Ihler A. Kernel density estimation software http://www.ics.uci.edu/~ihler/code/. 
35.  Mnaimneh S,Davierwala AP,Haynes J,et al. Exploration of essential gene functions via titratable promoter allelesCell 2004;118(1):31–44. [pmid: 15242642] 
36.  Gasch AP,Spellman PT,Kao CM,et al. Genomic expression programs in the response of yeast cells to environmental changesMolecular Biology of the Cell 2000;11(12):4241–4257. [pmid: 11102521] 
37.  Eastmond DL,Nelson HCM. Genomewide analysis reveals new roles for the activation domains of the Saccharomyces cerevisiae heat shock transcription factor (Hsf1) during the transient heat shock responseJournal of Biological Chemistry 2006;281(43):32909–32921. [pmid: 16926161] 
38.  Furuchi T,Ishikawa H,Miura N,et al. Two nuclear proteins, Cin5 and Ydr259c, confer resistance to cisplatin in Saccharomyces cerevisiaeMolecular Pharmacology 2001;59(3):470–474. [pmid: 11179441] 
39.  Horak CE,Luscombe NM,Qian J,et al. Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiaeGenes & Development 2002;16(23):3017–3033. [pmid: 12464632] 
40.  Fabrizio P,Pozza F,Pletcher SD,Gendron CM,Longo VD. Regulation of longevity and stress resistance by Sch9 in yeastScience 2001;292(5515):288–290. [pmid: 11292860] 
41.  Hartemink AJ. Reverse engineering gene regulatory networksNature Biotechnology 2005;23(5):554–555. [pmid: 15877071] 
Article Categories:

Previous Document: Microglia and Astrocyte Activation by TollLike Receptor Ligands: Modulation by PPARgamma Agonists.
Next Document: A Novel Design of 4Class BCI Using Two Binary Classifiers and Parallel Mental Tasks.