Document Detail

Recovering genetic regulatory networks from chromatin immunoprecipitation and steady-state microarray data.
Jump to Full Text
MedLine Citation:
PMID:  18584039     Owner:  NLM     Status:  PubMed-not-MEDLINE    
Abstract/OtherAbstract:
Recent advances in high-throughput DNA microarrays and chromatin immunoprecipitation (ChIP) assays have enabled the learning of the structure and functionality of genetic regulatory networks. In light of these heterogeneous data sets, this paper proposes a novel approach for reconstruction of genetic regulatory networks based on the posterior probabilities of gene regulations. Built within the framework of Bayesian statistics and computational Monte Carlo techniques, the proposed approach prevents the dichotomy of classifying gene interactions as either being connected or disconnected, thereby it reduces significantly the inference errors. Simulation results corroborate the superior performance of the proposed approach relative to the existing state-of-the-art algorithms. A genetic regulatory network for Saccharomyces cerevisiae is inferred based on the published real data sets, and biological meaningful results are discussed.
Authors:
Wentao Zhao; Erchin Serpedin; Edward R Dougherty
Related Documents :
18488999 - What's left in asymmetry?
25343659 - Productivity losses because of morbidity attributable to fetal alcohol spectrum disorde...
18712309 - Integrating functional genomics data.
21131149 - An empirical estimate of the precision of likelihood ratios from a forensic-voice-compa...
14630649 - Mgraph: graphical models for microarray data analysis.
19002249 - Evolving synaptic plasticity with an evolutionary cellular development model.
22617489 - Computational modeling of epilepsy for an experimental neurologist.
14625609 - Development of a risk-adjusted capitation model based on principal inpatient diagnoses ...
21960719 - Mz5: space- and time-efficient storage of mass spectrometry data sets.
Publication Detail:
Type:  Journal Article    
Journal Detail:
Title:  EURASIP journal on bioinformatics & systems biology     Volume:  -     ISSN:  1687-4145     ISO Abbreviation:  EURASIP J Bioinform Syst Biol     Publication Date:  2008  
Date Detail:
Created Date:  2008-06-27     Completed Date:  2010-06-28     Revised Date:  2012-04-26    
Medline Journal Info:
Nlm Unique ID:  101263720     Medline TA:  EURASIP J Bioinform Syst Biol     Country:  United States    
Other Details:
Languages:  eng     Pagination:  248747     Citation Subset:  -    
Affiliation:
Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843, USA.
Export Citation:
APA/MLA Format     Download EndNote     Download BibTex
MeSH Terms
Descriptor/Qualifier:
Comments/Corrections

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): EURASIP J Bioinform Syst Biol
Journal ID (publisher-id): BSB
ISSN: 1687-4145
ISSN: 1687-4153
Publisher: Hindawi Publishing Corporation
Article Information
Download PDF
Copyright ? 2008 Wentao Zhao et al.
open-access: This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received Day: 28 Month: 11 Year: 2007
Accepted Day: 20 Month: 5 Year: 2008
Print publication date: Year: 2008
Electronic publication date: Day: 19 Month: 6 Year: 2008
Volume: 2008E-location ID: 248747
ID: 2435223
DOI: 10.1155/2008/248747
PubMed Id: 18584039

Recovering Genetic Regulatory Networks from Chromatin Immunoprecipitation and Steady-State Microarray Data
Wentao Zhao1
Erchin Serpedin1*
Edward R. Dougherty2
1Electrical and Computer Engineering Department, Texas A&M University, College Station, TX 77843, USA
2The Translational Genomics Research Institute (TGen), 400 North Fifth Street, Suite 1600, Phoenix, AZ 85004, USA
Correspondence: *Erchin Serpedin: serpedin@ece.tamu.edu
[other] Recommended by Z. Wang

1. Introduction

Currently, one of the most important research problems in molecular biology and bioinformatics consists of finding out the mechanisms that govern the gene regulations, which are considered to play fundamental roles in the operation of all processes taking place in living cells. Learning the structure and machinery of gene regulations opens up the possibility for understanding and controlling the functioning of organisms at the molecular level, and for designing intelligent therapies and drugs. In a biological process such as cell cycle or environmental response, a gene's product, the protein, can serve as a transcription factor of a target gene by binding to the target gene's regulatory region on chromatin and affect its transcription. The protein can also influence another gene's expression in subsequent stages, for example, through splicing or translation. Alternatively, these protein-gene relationships can be viewed as gene-gene interactions, and are modeled in general as genetic regulatory networks.

Recent years have witnessed a number of different frameworks for modeling genetic regulatory networks, ranging from fine-scale modeling at the molecular level in terms of partial differential equations and stochastic equations, to large scale modeling at the gene and protein-level in terms of Boolean and probabilistic Boolean networks, and (dynamic) Bayesian networks; see, for example, [1?6] and their toolboxes [7?9]. The small scale modeling techniques are used to capture the detailed biochemical aspects of molecular interactions and are in general very computational demanding. On the other side, the large-scale models provide a global vision of the interactions among the constituent elements of genetic regulatory networks and are generally represented in terms of graphs.

In the middle of 1990s, the birth of DNA microarrays equipped the industry with the capability to measure simultaneously the concentration of genome-wide mRNA expressions. The gene expression data produced thereafter by gene chips have attracted extensive research on the inference of genetic regulatory networks based on various network models [10?18]. There are two types of DNA microarray data sets: time series (or time dependent) and time independent (also called steady-state or single point time series) data sets. In general, the time-independent gene expression profiles are capable of recovering steady-state attractors, but fail to recover the direct and oriented (temporal regulating) relationships. On the other side, time series data sets can improve the inference greatly in contrast to time-independent data sets [13]. However, the financial costs, ethnical concerns, and implementation issues prevent collecting beneficial time series data. Recent statistics show that about 70% of published data are time independent [19]. Therefore, the steady-state analysis is highly valuable despite the difficulty of making accurate inference of temporal relationships.

Inference of gene regulatory networks based solely on the information provided by microarray data is limited by a number of factors: number of available microarrays, quality of data samples, experimental noise, and errors (cross-hybridizations). It is also known that post-transcriptional modifications and transcripts that are present at low levels are generally not detectable by microarrays. Since the gene activity is measured by the mRNA level, the underlying assumption is that there is a significant correlation between the mRNA level and the amount of protein associated with mRNA. However, the magnitude of such a correlation varies significantly depending on the type of protein involved. Therefore, a combined approach which, besides gene expression data, exploits additional data sources is likely to enhance the inference process.

The advent of in vivo chromatin immunoprecipitation (ChIP) assays has enabled to test whether a protein acting as a transcription factor binds to a specific DNA segment. Hence, ChIP assays serve as a promising mechanism to examine the regulatory relationships. In ChIP experiments, the protein is immobilized on the chromatin, then the chromatin is broken into DNA fragments, and the DNA-protein complexes are immunoprecipitated by using antibodies corresponding to the tested protein. Afterwards the DNA bound by the protein in question can be isolated and identified by using a cDNA microarray chip. The whole process is also called a ChIP-chip experiment, and inherits several disadvantages. The protein to be tested has to possess a specific antibody, which might not be synthesized, discovered, or exist. In addition, the transcriptional regulation is a complex process that is expressed in several different aspects. The binding of the transcription factor to the promoter region of the target gene is the most pristine mode. Especially for eukaryotic organisms, some regulatory bindings take place at a region far away from the regulated gene. This fact makes the binding information questionable for determining the regulation relationships. Furthermore, the experimental results are represented by p-values and the determination of the binding relationship is achieved through threshold comparison. However, the selection of the p-value threshold introduces a dilemma. A high threshold not only identifies the most probable binding relationships but also might miss many true relationships with lower p-values, while a low threshold infers more relationships, among which more might be false alarms. A good tradeoff is not easy to make. Besides, the cost factor has also to be considered. Generally, ChIP-chip experiments are very expensive and testing thousands of proteins is not affordable.

A combination of both steady-state microarray data and ChIP-chip data might help in making more accurate inferences. Intuitively, these two different types of data complement the shortcomings of each other. This motivates us to propose a Bayesian approach to analyze jointly both data sets and to establish a confidence measure of gene interactions. The proposed scheme possesses six key features which make it different from the existing algorithms. First, gene expression data in steady-state are considered, while time course data are used in other works like [11, 13, 20]. Second, most of the current schemes recover a unique genetic network represented by a graph which best fits the observed data in a certain metric, while the proposed approach determines the posterior probabilities for all gene-pair interactions and avoids to make a dichotomous decision that classifies each gene interaction as being either connected or disconnected. The proposed approach can be easily transformed into a dichotomous scheme by only preserving the highly probable gene interactions. Third, the underlying structural model is assumed to be a directed cyclic graph, which allows cycles (feedback loops) and directed acyclic graphs are treated as special cases. This contrasts to Bayesian networks, which are directed acyclic graphs. Feedback loops are a common network motif in biological processes and their function is to yield the necessary redundancy and stability for the system [1]. Therefore, methods based on Bayesian networks, for example, [21?23], lose their validity in the inference of cyclic graphs. Fourth, the proposed approach assumes continuous-valued variables, and this prevents the information loss incurred by data quantization. This represents an advantage compared with the discrete-valued networks such as [21?23]. Fifth, the proposed connectivity score is oriented and has a very clear meaning, in the sense of posterior probabilities, while the existing scores based on the mutual information [14, 18, 24] are vague and lack orientation information. Sixth, in the proposed approach the system kinetics are assumed to be nonlinear, while linear models are commonly utilized for the purpose of simplification [12, 15]. Besides, the proposed scheme establishes a general framework whose components can be customized to fit the nature of the underlying biological system.

The rest of the paper is organized as follows. Section 2 discusses the graphical model and system dynamics that govern the genetic expressions. Section 3 translates the p-values of ChIP-chip experiments into regulation probabilities and formulates the inference algorithm through Bayesian analysis. In Section 4, the proposed algorithm and other three schemes are simulated on a set of artificial networks. Performance comparisons illustrate that the proposed algorithm exceeds in terms of several metrics. The robustness of kinetics model is also discussed via simulations. Realistic data sets are exploited in the proposed inference framework and a genetic network is presented to account for the genetic response to environmental changes. Finally, Section 5 concludes the paper with remarks on possible future works.


2. Methods

Genetic regulatory networks can be represented by a parameterized graph (G, ?), where G and ? stand for the graph structure and parameter set, respectively. The graph structure qualitatively explains the direct gene interactions, while the parameter set quantitatively describes the system kinetics.

2.1. Structural Model

The graph G(V, E) is employed to map gene interactions at transcriptional level, where V denotes the set of vertices (genes) and E stands for the set of edges (regulation relationships). If gene X regulates gene Y, graphically such a relation is represented in terms of an oriented edge X? Y, where X is a parent of Y and Y is considered a child of X. All genes that present incidence edges with gene X represent the set of parental genes of X, and are compactly denoted in terms of the notation ?X. If two genes X and Y interact with each other but the regulation orientation cannot be determined, an undirected edge is laid between the two genes as X-Y, which means both orientations are possible. A sequence of consecutive-oriented edges constitutes a directed path. If there is no directed path which starts and ends at the same vertex, in other words, the graph contains no loops, the graph is called a directed acyclic graph (DAG). DAGs lie at the basis of Bayesian networks, which are commonly employed to model causal relationships [25].

General directed graphs (with possibly cycles) will serve as our structural model since they are consistent with the features exhibited by biological systems, in which loops account for system redundancy and stability. Given the graph structure G, the parent set ? is specified for any gene X. For conciseness, the subscript X associated with some variables is omitted in the analysis procedure when the context has clearly specified the gene in question. Next, we discuss the system kinetics and parameters defined in ?.

2.2. System Kinetics

The system kinetics represents the dynamics that governs the gene mRNA concentrations in terms of gene-gene interactions. It can be modeled by a set of differential equations (DEs). A simplified form is a set of linear DEs. However, we accept the more complex model which was employed previously by [16, 17] since it is much more realistic and accounts for the expression saturation. Given a gene X, its parent set ? can be further partitioned into two disjoint subsets: the activator set A and the repressor set R, that is, ? = A ? R and A?R=?. The kinetics of gene X can be explained by the following differential equation:

[Formula ID: eq1]
(1) 
dxdt=??x+?+?i=1|A|ai?i1+?i=1|A|ai?i+?j=1|R|rj?j,
where x is the concentration of gene X's transcriptional product, namely, mRNA. In this paper, to simplify the exposition, the gene name and its expression are used interchangeably. The changing rate of gene X is controlled by its activating and repressing parents, denoted individually by ai ? A and rj ? R. ? and ? serve as the regulating factors corresponding to each activator and repressor. ? and ? assume positive values, and hence can be modeled by a gamma distribution with shape and scale parameters (?, ?). Here we can unbiasedly assume that the activators and repressers share the same gamma distribution for their regulation factors. Other light-tail distributions, such as Weibull and lognormal distributions, could also be employed. However, since gamma distribution is popular in modeling the reaction rate or molecular concentration [26], the gamma distribution is chosen here. ? stands for the gene degradation rate and the time scale can be properly chosen in order to normalize ? to the unit value (? = 1). ? represents the expression baseline rate, that is, the expression rate for X when there is neither activator nor repressor regulating the target gene X. Suppose that y represents the observation of x, then y assumes the form y = x + ?, where ? incorporates all noise sources and is modeled by an additive Gaussian random variable with zero mean and variance ?2.

As the response to environmental changes or incitations, a mature biological system always converges to a certain steady-state, in which all genes stay in equilibrium and do not change their expressions. In this context, the periodic processes, for example, cell cycle and circadian rhythm, are excluded from our research interest. By setting dx/dt = 0 and ? = 1, the observation y of the steady-state gene expression for gene X can be expressed as

[Formula ID: eq2]
(2) 
y?=??+?i=1|A|ai?i1+?i=1|A|ai?i+?j=1|R|rj?j?+??.

Given a parent structure ? for gene X, the parameters in ? can be summarized as follows.

  1. For each parent ? ? ?, a binary variable is demanded to specify whether the parent is an activator or repressor, that is, 1A(?), where 1 is the indicator function and it assumes the value 1 when ? ? A, and 0 otherwise. It can be modeled by a Bernoulli random variable with known success probability ?.
  2. For each activator a ? A and repressor r ? R, it is assumed that the regulating factors ?, ? ? Gamma(?, ?), where ?, ? are known.
  3. The baseline parameter ? is usually known.
  4. The noise ? ? N(0, ?2), where ?2 can be set to a specific value or estimated.

It is worth to note that the choice of nonlinear differential equation and parameter priors does not influence the flow of analysis. Our scheme stands for a general framework and the detailed parameters can be easily customized to fit different scenarios. There are various mathematical models for system kinetics, such as [27?29]. The kinetics in 1 is chosen as our dynamic model because it possess the property of saturation, a key idea of Michaelis-Menten kinetics [29]. Besides, it is fairly simple and it also takes account of most other biological properties. Therefore, in the simulation of the real data set, we are assuming the proposed kinetics is true.


3. Inference Method

Consider a system composed of n genes indexed by {1, 2,?, n}. ChIP-chip experiments can be conducted to examine whether gene i's corresponding protein binds gene j's regulatory region. Usually this regulatory sequence is a promoter region which is located within 600 base pairs upstream of the coding region of gene j. The experimental results are represented in terms of p-values. In the first step, it is necessary to translate the p-value p into the probability of existence of a regulation relationship from gene i to gene j, which is denoted as 𝒫(i?j ? p). This probability will act as the prior knowledge to integrate gene expression data.

3.1. Incorporating ChIP-Chip Data

The p-value is within the range of [0,1]. After studying the properties of the microarray data, Allison proposed to exploit mixed Beta distribution to model the p-value [30]. If the transcription factor i regulates gene j, it is assumed that the ChIP-chip experiment produces a p-value p which conforms to a Beta distribution with parameters (?, ?),

[Formula ID: eq3]
(3) 
f(p???i?j)?=?p??1(1?p)??1B(?,?),
where f(?) stands for the probability density function and B(?, ?) represents the beta function. On the other hand, if i does not regulate j, the p-value assumes a different Beta distribution with parameters (?, ?):
[Formula ID: eq4]
(4) 
f(?p???i?j)=p??1(1?p)??1B(?,?).
Based on the knowledge provided by established and verified genetic networks, one can infer a prior knowledge about the probability of connectivity between arbitrary genes, denoted as ?(i?j), for all i, j. Such statistics regarding the network connectivity can be found in the open literature, for example, the data sets for yeast [31], and Drosophila [32]. By applying Bayes theorem, we obtain
[Formula ID: eq5]
(5) 
𝒫(i?j???p)=?B(?,??)p??1(1?p)??1?B(?,??)p??1(1?p)??1+(1??)B(?,??)p??1(1?p)??1.
For simplicity, a uniform distribution can be alternatively employed to account for the p-value when i ? j. In this case, ? = 1, ? = 1, and (5) takes the form
[Formula ID: eq6]
(6) 
𝒫(i?j???p)=?p??1(1?p)??1?p??1(1?p)??1?+?(1??)B(?,?).

The determination of ? and ? depends on the experimental knowledge of the accuracy of selecting p-value thresholds. In the first step, a p-value threshold pt is imposed, then the validity of all bindings with p-values less than pt is corroborated by biological experiments. In this way, we can gain knowledge of the probability 𝒫(i?j ? p < pt), which can be written in the form of

[Formula ID: eq7]
(7) 
𝒫(i?j???p?<?pt)=?𝒫(p?<?pt???i?j)?𝒫(p?<?pt???i?j)?+?(1????)𝒫(p?<?pt???i?j)=??0ptp??1(1?p)??1dp??0ptp??1(1?p)??1dp?+?pt(1??)B(?,??).

Some works in the literature, for example, [33], have made the observation that at a p-value threshold of 0.001, the frequency of false positives is 6%?10%, that is, 𝒫(i ? j ? p < pt) ? [6%, 10%]. Taking into account these special points, we can determine the pair (?, ?) in a small range. In our case, ??0.1 and ? ? 100. Finally, a table can be set up to map the p-value into the edge existence probability, which can be computed only once. It is an overhead for the computational system but it does not assume much computational resource in the runtime.

3.2. Exploiting Steady-State Gene Expression Data

Assume that m observations of expression vector are obtained and stored in matrix Dn?m. Next, we develop a computational approach to establish the posterior probability of the regulation i?j, that is, the probability of the existence of the edge i?j, which is represented by 𝒫(i?j ? D, p). This posterior can be obtained through integration over the whole parental gene set and parameter space for gene j:

[Formula ID: eq8]
(8) 
𝒫(i?j???D,p)=??j??jf(i?j,??j,??j???D,?p)d?j=??j??j1?j(i)f(?j,??j???D,?p)d?j,
where the function 1?j(i) is the indicator function, which takes 1 if i ? ?j and 0 otherwise. Applying Bayes theorem, f(?j, ?j ? D, p) can be expressed as
[Formula ID: eq9]
(9) 
f(?j,??j???D,?p)=f(D????j,??j,?p)f(?j,??j???p)f(D???p)=f(D????j,??j)f(?j,??j???p)f(D)=f(D????j,??j)f(?j,??j???p)??j??jf(D????j,??j)f(?j,??j???p)d?j=f(Dj???Dj?,??j,?j)f(?j,??j???p)??j??jf(Dj???Dj?,??j,??j)f(?j,??j???p)d?j,
where Dj denotes the observations of gene Xj, and Dj? represents the collection of all the observations pertaining to all genes excluding those of gene Xj. f(?j, ?j ? p) denotes the probability density of the high-dimensional parental model given the observation of ChIP-chip data. f(Dj?Dj?,?j,?j) stands for the gene expression likelihood given the parental values and the graphical model. It is a Gaussian distribution with known variance and mean determined by the first part of (2). The second equality in (9) holds because we believe the ChIP-chip experiment and steady-state gene expression measurements are independent. By plugging (9) into (8), it can be inferred that
[Formula ID: eq10]
(10) 
𝒫(i?j???D,?p)=??j??j1?j(i)f(Dj???Dj?,??j,??j)f(?j,??j???p)d?j??j??jf(Dj???Dj?,??j,??j)f(?j,??j???p)d?j.
The integrations at the numerator and denominator of (10) cannot be generally performed in a closed-form expression. However, the Monte Carlo methods enable to numerically evaluate the posterior probabilities. We can generate Monte Carlo samples based on the model probability density f(?, ? ? p) and the integration can be obtained by averaging over these samples. Then the posterior probabilities can be estimated by
[Formula ID: eq11]
(11) 
𝒫(i?j???D,p)???j,??j1?j(i)f(Dj???Dj?,??j,?j)??j,??jf(Dj???Dj?,??j,?j).

Assuming that the selection of a parent as an activator is performed in an independent manner, and that the selection of the regulation factor value is also performed independently, the model probability density f(?, ? ? p) can be further expanded by using the chain rule

[Formula ID: eq12]
(12) 
f(?,????p)=f(?????)𝒫(????p)=?i=1|A|[?f(?i)]?j=1|R|[(1??)f(?j)]𝒫(????p).

Equation (12) conveys the idea that the random samples of graphical models can be sequentially created and processed. First the network structure is created based on the binding probability of gene regulation obtained in the ChIP-chip experiment, then each parent is randomly assigned to represent an activator or repressor, and finally regulation factors are generated.

3.3. Algorithm Formulation

Our computational procedure can be briefly formulated in terms of Algorithm 1, where the Matlab coding convention is used to write the pseudocode. There exist n genes in the system. An n ? n matrix is created to represent the p-values produced in the ChIP-chip experiment. We collect m steady-state gene expression samples. The output entry Cij stands for 𝒫(i?j ? D, p), and M denotes the number of Monte-Carlo iterations. Lines 1 and 2 deal with the ChIP-chip experimental data and translate p-values into the binding probabilities by using (5). The results are stored in matrix B. Lines 3 and 4 perform the preprocessing of the gene expression data. Let Y(1), Y(2), Y(3),?, Y(m?2), Y(m?1), Y(m) be the values of a specific gene expression in ascending order. The smallest two values, Y(1), Y(2), and the largest two values, Y(m?1), Y(m), are treated as outliers and discarded. The dynamic range is defined as R = Y(m?2) ? Y(3). The gene expressions are normalized as follows: the smallest two samples are assigned the null value and the largest two samples are assigned the unit value; the intermediary samples Y(i) are normalized as (Y(i) ? Y(3))/R; if there is a missing sample, it is recovered through interpolation by gene's mean expression. Lines 12 through 16 implement the numerator of (11), and Line 17 computes the denominator of (11).

The algorithm can be easily reorganized into a parallel form so that we can exploit efficiently the distributed computational resources. The entries of output matrix C represent the posterior probabilities of regulation relationships between any two genes. It is directional (asymmetrical), and it possesses a clear probabilistic meaning compared with other vague connectivity metrics, for example, mutual information. It grants the biologists the flexibility first to examine the most significant interactions, then to proceed with less evidenced edges. Therefore, it is advantageous relative to a purely dichotomous scheme, in which genes are treated as being either connected or disconnected. A probability threshold can be imposed to change the algorithm into a dichotomous classifier. Since the posterior probability has a universal meaning, this threshold can be easily selected, usually within the range of [0.3?0.9]. A tradeoff has also to be made for different performance metrics.


4. Results

The simulation consists of two parts. In the first part, artificial networks are created and the performance of the proposed algorithm is compared with other representative algorithms available in the literature, namely the relevance network (RN) method [14], Chow-Liu algorithm [24], and ARACNE [18]. In the second part, the algorithm is tested on the real Saccharomyces cerevisiae (budding yeast) data set and a biologically meaningful genetic network is inferred for the genetic response to environmental changes.

4.1. Simulation on Artificial Networks

The proposed algorithm is compared with other three algorithms to evaluate its capability of recovering genetic networks based on gene expression data alone. The relevance network (RN) model [14] represents a robust inference method based on gene expression profiles. In the first step, it computes the mutual information between any two genes X and Y, denoted as I(X; Y). Then it suggests two genes X and Y to be relevant if their mutual information assumes a larger value than a prespecified threshold and it lays down an undirected edge as X-Y. Hence, RN measures the significance of gene interactions in terms of mutual information between the gene expressions and produces an undirected cyclic graph. Chow-Liu algorithm [24] approaches the inference problem by finding the maximum spanning tree in which the edge weights stand for the mutual information. However, it loses validity if the underlying model is a cyclic graph. In addition, when the graph is densely connected, this scheme might falsely miss too many edges. ARACNE algorithm [18] exploits the data processing inequality (DPI). It starts with a fully connected graph and a predefined mutual information threshold. Whenever the mutual information between two genes X and Y, that is, I(X; Y), is less than a threshold, it disconnects the two genes. Next, in the preliminary graph if there exists Z so that I(X; Y) < min(I(X; Z),I(Y; Z)), then it disconnects X and Y. In our simulations, we resort to an already available but efficient Matlab toolbox [34] to estimate the mutual information.

4.1.1. Performance Definition

Before making performance comparisons, we define inference errors and performance metrics. Because RN, Chow-Liu, and ARACNE algorithms all construct undirected graphs, we have to disregard the orientation information inferred by the proposed algorithm. The synthetic and inferred graphs are represented by G(V, E) and G^(V,E^), respectively. The two graphs share the same set of vertices but differ in the set of edges.

There are two types of inference errors. The type-1 errors are false positives (FP) and are also called false alarms. If the inference algorithm determines an interaction for two vertices X and Y in the inferred graph, denoted as X?Y?E^, but there is no such edge in the synthetic graph, that is, X-Y ? E, then an FP is produced. The number of FPs, represented by NFP, can be counted as follows:

[Formula ID: am1.1]
(13) 
NFP=??X,Y((X?Y?E^)?(X?Y?E)),
where ? stands for the logic and operator. The type-2 errors are false negatives (FN) and also named misses. If the inference does not discover the connectivity X-Y which resides in the synthetic network, an FN is generated. The number of FNs, depicted by NFN, is obtained by
[Formula ID: am1.2]
(14) 
NFN=??X,Y((X?Y?E)?(X?Y?E^)).

Correct inference can also be divided into two categories. If X?Y?E^ and X-Y ? E, the correctness is defined as a true positive (TP). Its summation, annotated by NTP, is

[Formula ID: am1.3]
(15) 
NTP=??X,Y((X?Y?E^)?(X?Y?E)).
On the other hand, if X?Y?E^ and X-Y ? E, this correctness is called a true negative (TN). The number of TNs, represented by NTN, is defined as follows:
[Formula ID: am1.4]
(16) 
NTN=??X,Y((X?Y?E^)?(X?Y?E)).

Different performance metrics are proposed in the literature. Three most popular of them are considered here. The first metric, referred to as the Hamming distance, is the summation of all the inference errors and is given by

[Formula ID: am1.5]
(17) 
Hamming??distance?=?NFP?+?NFN.
The Hamming distance is widely accepted as a good measure of the distance between two graphs.

The second metric is called the sensitivity, and is defined as

[Formula ID: am1.6]
(18) 
Sensitivity?=?NTPNTP+NFN.
The sensitivity describes the inference algorithm ability to identify the regulation relationships among genes. The third metric is called the specificity, and it assumes the form
[Formula ID: am1.7]
(19) 
Specificity?=?NTNNTN+NFP.
The specificity represents the inference algorithm's capability to avoid falsely connecting two unrelated genes.

4.1.2. Simulation on the Proposed Kinetics

A set of artificial networks are created based on the system dynamic equation (1). Each network has 30 vertices and 60 oriented edges. Such a network scale is selected for the consideration of the computational resources and the biological network that we are going to infer. The steady-state data are sampled by emulating the gene knockout experiment. A gene's expression is mandatorily forced to 0 while all other genes are free to change their expressions. The initial values of the system are randomly generated. When the system converges to the equilibrium, a Gaussian noise N(0, 0.03) is added and a few samples are obtained. All genes are shut down one by one. An extra in silico experiment is performed and no genes are shut down. These samples correspond to the wild type strain.

Different numbers of steady-state samples were generated based on the adopted system kinetics. The transcription factor is assumed to be an activator or repressor with equal probability, that is, ? = .5. The baseline parameter ? = 0.5 and the gamma parameters of regulation factors are (? = 16, ? = 0.0625) so that the regulation factor has a unit mean. Chow-Liu algorithm creates a spanning tree; therefore, it preserves only 29 edges, while the original synthetic network possesses 30 vertices and 60 edges. In order to make comparisons, we tune the parameters for the other three schemes so that the number of inferred edges is around 30. For the RN method, we keep the 30 edges with the highest mutual information. For ARACNE, the mutual information threshold is adjusted. In our proposed algorithm, the posterior probability thresholds are changed in the range of [0.3, 0.9] so that approximately 30 edges are obtained. It has to be noted that RN, ARACNE, and Chow-Liu algorithms only preserve interactions but disregard the interaction orientation. Therefore, in order to make consistent comparisons, we have to sacrifice the orientation information offered by the proposed algorithm. Besides, these three schemes have no capability of processing ChIP-chip data. Therefore, we have to configure the proposed algorithm such that any two nodes are associated with a small prior probability of connection (0.1). This reflects the fact that the connection between two arbitrary nodes in the graph is very unlikely, but not impossible. This also exemplifies how the algorithm works in the absence of the ChIP-chip data.

Figure 1(a) compares the performance in terms of Hamming distance for the four schemes assuming different sample sizes. The proposed method provides much better inference accuracy because it achieves the lowest Hamming distance. Larger sample size rewards a better inference precision. Chow-Liu's algorithm and ARACNE do not perform well. This can be attributed to the assumption of the network. Our synthetic networks actually are cyclic networks in order to reflect the real world scenario. However, cycles in the network ruin the inference precisions of Chow-Liu and ARACNE. Figure 1(c) illustrates the impact of sample size on the sensitivity. The proposed scheme outperforms the other three schemes. The sensitivities of all algorithms are less than 0.5. This is mainly due to the constraint that we pose on the number of inferred edges, that is, 30 edges. If we relax the posterior probability threshold, the sensitivity will be improved by sacrificing the specificity. Figure 1(e) depicts specificity for all schemes. All of them have high specificities, which are all greater than 0.90. The proposed scheme still exceeds. This high specificity is mainly due to the stringent constraint posed on the number of inferred edges. When considering the orientation of the edges, we find that 90% true positives inferred by the proposed algorithm are actually oriented correctly. This represents a big advantage of the proposed algorithm compared with the other three schemes.

4.1.3. Robustness of Inference

In the previous simulations, the proposed inference algorithm assumes the system dynamic as depicted by (1). Actually, for different biological processes, there exist various mathematical models which achieve tradeoffs between the sophistication of the underlying molecular reaction and the simplification of the formula description (see [27, 29] for model comparisons). Savageau [28] proposed an alternative mathematical model to account for the gene control and various forms of coupling among elementary gene circuits. This model can be denoted as

[Formula ID: eq13]
(20) 
dxdt?=??A?i=1|A|ai?i??R?j=1|R|rj?j,
where two new symbols ?A and ?R are activation and degradation coefficients and all other symbols share the same meanings as in (1).

Although the proposed inference framework can ?plug and play" with different models, it is still necessary to examine its robustness against the underlying model. We evaluate this model dependence by the following steps: configure the model as 13 and create a set of synthetic data, then apply the proposed algorithm based on the dynamic equation (1), finally determine the performance metrics for different algorithms and compare the results with those in the previous section.

The simulation results are plotted in Figures 1(b), 1(d), and 1(f). Each figure corresponds to a different performance metric. All algorithms exhibit different values for performance values. This shows that the inference is dependent on the particular data sets and their underlying model. Compared with other three schemes, the proposed algorithm still achieves good performance in terms of three metrics. However, the advantage of the proposed algorithm are not significant now. ARACNE, Chow-Liu, and relevance method do not degenerate much. This attributes mainly to the nonparametric property of these three schemes. The persistent good performance of the proposed algorithm is due to the fact that both dynamic models have to convey the basic properties of the gene interaction kinetics, such as the activation and repression effects and the coupling of the circuitry.

4.2. Simulation on Saccharomyces Cerevisiae Data Sets

Saccharomyces cerevisiae (yeast) has been extensively studied in the literature of molecular biology because it is a unicellular eukaryotic organism, which shares similar cell structure with plants and animals. Also, yeast presents a short life cycle, which makes the experiments to be easily conducted. Lee et al. [33] performed the ChIP-chip experiment, in which 141 transcription factors were tested for binding intergenetic regions corresponding to 6270 genes. The gene expression data were published by Mnaimneh et al. [35], who created promoter shut-off strains for 2/3 of all essential genes. The data set contains 215 steady-state cDNA microarray samples. The model parameters are assumed the same as artificial networks.

The intracellular signalling pathway in response to environmental changes has been conserved through evolution. Therefore, a study of this biological subsystem on the Saccharomyces cerevisiae might help to decipher the cell survival mechanism of other organisms. We select 30 genes which are annotated to participate in the stress response process. The given ChIP-chip experiment did not provide full prior knowledge between any two genes (nodes in the graph). We believe that, among these genes, there are some genes whose protein products may also serve as transcription factors. Therefore, if the binding between two genes was not tested in the ChIP-chip experiment, a small probability value 0.1 is assigned as the prior knowledge. The proposed inference algorithm leads to the genetic network illustrated in Figure 2.

The inferred genetic regulatory network shows strong proneness toward a scale-free network instead of a random network. Some genes possess especially high degree of connectivity. Three hub genes CIN5, HSF1, MSN4 already connect with more than 60% of all selected genes. Each of them has a connectivity degree not less than 8 while on average each gene in the network is connected with no more than 4 genes. These hub genes constitute the backbone of the network and they are potential control targets. This scale-free property is advantageous in maintaining the system robustness because a failure in one subsystem will not be propagated to the whole body.

Multiple works, for example [36], have identified MSN4 and MSN2 as two of the most important genes in the response to environmental changes. A recent work [37] recognized the functionality of another crucial gene HSF1, which is a heat shock transcription factor and functions in a different domain than the one corresponding to MSN2/4. Our inferred network confers this experimental result by showing that HSF1 and MSN2/4 regulate different set of genes except a weak connectivity between HSF1 and MSN4. MSN2/4 are not conserved in humans, while HSF genes have been preserved for various organisms such as Drosophila melanogaster, chickens, and mammals. Therefore, a study of the HSF1 pathway opens up the possibility of understanding the mechanism that governs the survival of normal cells under austere conditions.

CIN5 (YAP4) and YAP6 are two genes that play key roles in controlling the resistance to drugs, for example, cisplatin [38]. CAD1(YAP2), CIN5, YAP1, and YAP6 share a structure motif called basic leucine zipper (bZIP) and they are located closely in the network. However, they are not neighboring the other two bZIP genes: YAP5 and YAP7. It is hypothesized that although they have similar molecular structures, their biological functionalities are in distinct domains.

Several edges, discovered by imposing a stringent p-value threshold 0.001 to the location data, were persevered in our inferred network. Actually, these connections constitute a small portion of the proposed network, and they are CIN5?MSN1, CIN5?YAP6, CIN5?ROX1, YAP1?YAP6, MAC1?CUP9, CUP9?YAP6, and HAL9?MSN4. Various evidences are found to corroborate the recovered interactions, which can not be obtained by employing a stringent p-value for the location data. For example, YAP5 is recovered to directly regulate STE50. This regulation relationship has also been reported in the work of Horak [39]. The relationship between MSN2 and SCH9 is studied in [40] in the context of extending the life span.

It is worthwhile to note that gene expression data mainly provide statistical relationships among genes, while location data offer physical binding interactions at the molecular level. By combining the two data sources, we are aiming to refine the inferred network to be biologically more meaningful. However, it also runs at a risk of confusing statistical regulatory relationships with real binding interactions. When such a case occurs, the proposed algorithm is capable of constraining the interacting genes within the same biological process and common functional relationships. A related discussion about the meaning of inferred network can also be found in [41].


5. Conclusions

A novel algorithm is proposed to recover the genetic regulatory networks in the light of knowledge of transcriptional kinetics, ChIP-chip, and gene microarray data. The analysis is based on the Bayesian methodology and Monte Carlo techniques. The proposed scheme is useful to compensate the shortcomings of the utilization of only one data set alone. Our in silico experiments corroborate that the algorithm outperforms in specificity, sensitivity and Hamming distance relative to three state-of-the-art schemes. A budding yeast genetic regulatory network is proposed to account for the stress response.

There are possible extensions to our current scheme. An analysis of the error estimation is desired for the Monte Carlo simulation in order to determine the appropriate number of iterations. Several other knowledge sources are to be integrated into the current framework. For example protein-protein interactions are useful to identify cobinding regulations. Genome sequencing data have been utilized to find regulatory motifs. Protein structure knowledge can be exploited to categorize the proteins and find similar functionality. A cross-species research is also highly desirable since similar regulation mechanisms are expected to be conserved. If a gene is conserved in both humans and mice, then the knowledge of the gene pathway in the mouse will be an excellent reference for the study of human genetic diseases. We expect a global distributed framework, in which each data source acts as a separate component and its absence does not interfere with the whole computational process.


Acknowledgments

This work was supported by the National Cancer Institute (CA-90301) and the National Science Foundation (ECS-0355227 and CCF-0514644).


References
1. Kauffman SA. Metabolic stability and epigenesist in randomly constructed genetic netsJournal of Theoretical Biology 1969;22(3):437–467. [pmid: 5803332]
2. Murphy K,Mia S. Modelling gene expression data using dynamic Bayesian networks1999Berkeley, Calif, USA: Computer Science Division, University of California; Tech. Rep.
3. Sebastiani P,Abad MM,Ramoni M. Bayesian networksThe Data Mining and Knowledge Discovery Handbook 2005New York, NY, USA: Springer; :193–230.
4. Shmulevich I,Dougherty ER,Kim S,Zhang W. Probabilistic Boolean networks: a rule-based uncertainty model for gene regulatory networksBioinformatics 2002;18(2):261–274. [pmid: 11847074]
5. Tabus I,Astola J. On the use of MDL principle in gene expression predictionJournal on Applied Signal Processing 2001;2001(4):297–303.
6. ?ktem H,Pearson R,Yli-Harja O,Nicorici D,Egiazarian K,Astola J,et al. A computational model for simulating continuous time Boolean networksIn: Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '02)October 2002Raleigh, NC, USA
7. Murphy K. Bayes Net Toolbox for Matlab http://www.cs.ubc.ca/~murphyk/Software/BNT/bnt.html.
8. Leray P. Structure learning toolbox of Bayesian Networks http://bnt.insa-rouen.fr/ajouts.html.
9. Friedman N,Elidan G. Bayesian Network learning tool http://www.cs.huji.ac.il/labs/compbio/LibB/programs.html#LearnBayes.
10. Rao A,Hero AO,States DJ,Engel JD. Manifold embedding for understanding mechanisms of transcriptional regulationIn: Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '06)May 2006College Station, Tex, USA:3–4.
11. Liang S,Fuhrman S,Somogyi R. REVEAL, a general reverse engineering algorithm for inference of genetic network architecturesIn: Proceedings of the Pacific Symposium on Biocomputing (PSB '98)January 1998Maui, Hawaii, USA:18–29.
12. Luna IT,Yin Y,Huang Y,Padillo DPR,Perez MCC,Wang Y. Uncovering gene regulatory networks using variational Bayes variable selectionIn: Proceedings of IEEE International Workshop on Genomic Signal Processing and Statistics (GENSIPS '06)May 2006College Station, Tex, USA:111–112.
13. Zhao W,Serpedin E,Dougherty ER. Inferring gene regulatory networks from time series data using the minimum description length principleBioinformatics 2006;22(17):2129–2135. [pmid: 16845143]
14. Butte AJ,Kohane IS. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurementsIn: Proceedings of the Pacific Symposium on Biocomputing (PSB '00)January 2000Honolulu, Hawaii, USA:418–429.
15. Rogers S,Girolami M. A Bayesian regression approach to the inference of regulatory networks from gene expression dataBioinformatics 2005;21(14):3131–3137. [pmid: 15879452]
16. Rice JJ,Tu Y,Stolovitzky G. Reconstructing biological networks using conditional correlation analysisBioinformatics 2005;21(6):765–773. [pmid: 15486043]
17. Yeung MKS,Tegn?r J,Collins JJ. Reverse engineering gene networks using singular value decomposition and robust regressionProceedings of the National Academy of Sciences of the United States of America 2002;99(9):6163–6168. [pmid: 11983907]
18. Margolin AA,Nemenman I,Basso K,et al. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular contextBMC Bioinformatics 2006;7(supplement 1, S7):1–15. [pmid: 16393334]
19. Simon I,Siegfried Z,Ernst J,Bar-Joseph Z. Combined static and dynamic analysis for determining the quality of time-series expression profilesNature Biotechnology 2005;23(12):1503–1508. [pmid: 16333294]
20. Bernard A,Hartemink AJ. Informative structure priors: joint learning of dynamic regulatory networks from multiple types of dataIn: Proceedings of the Pacific Symposium on Biocomputing (PSB '05)January 2005The Big Island of Hawaii, Hawaii, USA:459–470.
21. Hartemink AJ,Gifford DK,Jaakkola TS,Young RA. Combining location and expression data for principled discovery of genetic regulatory network modelsIn: Proceedings of the Pacific Symposium on Biocomputing (PSB '02)January 2002Lihue, Hawaii, USA:437–449.
22. Cooper GF,Herskovits E. A Bayesian method for the induction of probabilistic networks from dataMachine Learning 1992;9(4):309–347.
23. Heckerman D,Geiger D,Chickering DM. Learning Bayesian networks: the combination of knowledge and statistical dataMachine Learning 1995;20(3):197–243.
24. Chow C,Liu C. Approximating discrete probability distributions with dependence treesIEEE Transaction on Information Theory 1968;14(3):462–467.
25. Pearl J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. 1988San Francisco, Calif, USA: Morgan Kaufmann;
26. Friedman N,Cai L,Xie XS. Linking stochastic dynamics to population distribution: an analytical framework of gene expressionPhysical Review Letters 2006;97(16):4 pages. Article ID 168302.
27. Wessels LF,van Someren EP,Reinders MJ. A comparison of genetic network modelsIn: Proceedings of the 6th Pacific Symposium on Biocomputing (PSB '01)January 2001The Big Island of Hawaii, Hawaii, USA:508–519.
28. Savageau MA. Rules for the evolution of gene circuitryIn: Proceedings of the 3rd Pacific Symposium on Biocomputing (PSB '98)January 1998Maui, Hawaii, USA:54–65.
29. Edelstein-Keshet L. Mathematical Models in Biology. 1988New York, NY, USA: Random House;
30. Allison DB,Gadbury GL,Heo M,et al. A mixture model approach for the analysis of microarray gene expression dataComputational Statistics & Data Analysis 2002;39(1):1–20.
31. Guelzim N,Bottani S,Bourgine P,K?p?s F. Topological and causal structure of the yeast transcriptional regulatory networkNature Genetics 2002;31(1):60–63. [pmid: 11967534]
32. Giot L,Bader JS,Brouwer C,et al. A protein interaction map of Drosophila melanogasterScience 2003;302(5651):1727–1736. [pmid: 14605208]
33. Lee TI,Rinaldi NJ,Robert F,et al. Transcriptional regulatory networks in Saccharomyces cerevisiaeScience 2002;298(5594):799–804. [pmid: 12399584]
34. Ihler A. Kernel density estimation software http://www.ics.uci.edu/~ihler/code/.
35. Mnaimneh S,Davierwala AP,Haynes J,et al. Exploration of essential gene functions via titratable promoter allelesCell 2004;118(1):31–44. [pmid: 15242642]
36. Gasch AP,Spellman PT,Kao CM,et al. Genomic expression programs in the response of yeast cells to environmental changesMolecular Biology of the Cell 2000;11(12):4241–4257. [pmid: 11102521]
37. Eastmond DL,Nelson HCM. Genome-wide analysis reveals new roles for the activation domains of the Saccharomyces cerevisiae heat shock transcription factor (Hsf1) during the transient heat shock responseJournal of Biological Chemistry 2006;281(43):32909–32921. [pmid: 16926161]
38. Furuchi T,Ishikawa H,Miura N,et al. Two nuclear proteins, Cin5 and Ydr259c, confer resistance to cisplatin in Saccharomyces cerevisiaeMolecular Pharmacology 2001;59(3):470–474. [pmid: 11179441]
39. Horak CE,Luscombe NM,Qian J,et al. Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiaeGenes & Development 2002;16(23):3017–3033. [pmid: 12464632]
40. Fabrizio P,Pozza F,Pletcher SD,Gendron CM,Longo VD. Regulation of longevity and stress resistance by Sch9 in yeastScience 2001;292(5515):288–290. [pmid: 11292860]
41. Hartemink AJ. Reverse engineering gene regulatory networksNature Biotechnology 2005;23(5):554–555. [pmid: 15877071]

Article Categories:
  • Research Article


Previous Document:  Microglia and Astrocyte Activation by Toll-Like Receptor Ligands: Modulation by PPAR-gamma Agonists.
Next Document:  A Novel Design of 4-Class BCI Using Two Binary Classifiers and Parallel Mental Tasks.