Document Detail

Identification of CpG islands in DNA sequences using statistically optimal null filters.
Jump to Full Text
MedLine Citation:
PMID:  22931396     Owner:  NLM     Status:  PubMed-not-MEDLINE    
: CpG dinucleotide clusters also referred to as CpG islands (CGIs) are usually located in the promoter regions of genes in a deoxyribonucleic acid (DNA) sequence. CGIs play a crucial role in gene expression and cell differentiation, as such, they are normally used as gene markers. The earlier CGI identification methods used the rich CpG dinucleotide content in CGIs, as a characteristic measure to identify the locations of CGIs. The fact, that the probability of nucleotide G following nucleotide C in a CGI is greater as compared to a non-CGI, is employed by some of the recent methods. These methods use the difference in transition probabilities between subsequent nucleotides to distinguish between a CGI from a non-CGI. These transition probabilities vary with the data being analyzed and several of them have been reported in the literature sometimes leading to contradictory results. In this article, we propose a new and efficient scheme for identification of CGIs using statistically optimal null filters. We formulate a new CGI identification characteristic to reliably and efficiently identify CGIs in a given DNA sequence which is devoid of any ambiguities. Our proposed scheme combines maximum signal-to-noise ratio and least squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The proposed scheme is tested on a number of DNA sequences taken from human chromosomes 21 and 22, and proved to be highly reliable as well as efficient in identifying the CGIs.
Rajasekhar Kakumani; Omair Ahmad; Vijay Devabhaktuni
Related Documents :
10650916 - Suppression of apoptosis and induction of dna synthesis in vitro by the phthalate plast...
574926 - Natural changes in the cell division rate in normal tissues and the possible role of ad...
4028326 - Viscometric detection of liver dna fragmentation in rats treated with ten aromatic amin...
24272156 - Agroinfection and nucleotide sequence of cloned wheat dwarf virus dna.
21238426 - Study of interactions between dna and aflatoxin b1 using electrochemical and fluorescen...
1411506 - Interactions of small nuclear rna's with precursor messenger rna during in vitro splicing.
Publication Detail:
Type:  Journal Article     Date:  2012-08-29
Journal Detail:
Title:  EURASIP journal on bioinformatics & systems biology     Volume:  2012     ISSN:  1687-4153     ISO Abbreviation:  EURASIP J Bioinform Syst Biol     Publication Date:  2012  
Date Detail:
Created Date:  2013-02-13     Completed Date:  2013-02-14     Revised Date:  2013-02-15    
Medline Journal Info:
Nlm Unique ID:  101263720     Medline TA:  EURASIP J Bioinform Syst Biol     Country:  Germany    
Other Details:
Languages:  eng     Pagination:  12     Citation Subset:  -    
Department of Electrical and Computer Engineering, Concordia University, 1455 de Maisonneuve Blvd, West Montreal, QC H3G1M8, Canada.
Export Citation:
APA/MLA Format     Download EndNote     Download BibTex
MeSH Terms

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): EURASIP J Bioinform Syst Biol
Journal ID (iso-abbrev): EURASIP J Bioinform Syst Biol
ISSN: 1687-4145
ISSN: 1687-4153
Publisher: BioMed Central
Article Information
Download PDF
Copyright ©2012 Kakumani et al.; licensee Springer.
Received Day: 16 Month: 2 Year: 2012
Accepted Day: 23 Month: 7 Year: 2012
Print publication date: Year: 2012
Electronic publication date: Day: 29 Month: 8 Year: 2012
Volume: 2012 Issue: 1
First Page: 12 Last Page: 12
PubMed Id: 22931396
ID: 3570435
Publisher Id: 1687-4153-2012-12
DOI: 10.1186/1687-4153-2012-12

Identification of CpG islands in DNA sequences using statistically optimal null filters
Rajasekhar Kakumani1 Email:
Omair Ahmad1 Email:
Vijay Devabhaktuni2 Email:
1Department of Electrical and Computer Engineering, Concordia University, 1455 de Maisonneuve Blvd. West Montreal, QC H3G1M8, Canada
2Department of Electrical Engineering and Computer Science, University of Toledo, MS 308, 2801 W. Bancroft St., Toledo, OH 43606, USA


In the recent years, computational methods for processing and interpreting vast amount of genomic data, generated from genome sequencing, have gained a lot of scientific interest. Genomic sequences such as deoxyribonucleic acid (DNA) consist of biological instructions which are crucial for the development and normal functioning of almost all living organisms [1]. A DNA molecule has a complex double helix structure that involves two strands, consisting of alternating sugar and phosphate groups. Attached to these sugar groups of each DNA strand are one of the four chemical bases, namely, adenine (A), thymine (T), guanine (G), and cytosine (C). A unit comprising of base, sugar, and phosphate is referred to as a nucleotide. Hydrogen bonds between the nucleotides A and T (similarly between nucleotides G and C) from the opposite strands not only stabilize the DNA molecule, but also make the two strands complimentary. Nucleotides in a DNA strand exhibit short, recurring patterns (also called sequence motifs) that are presumed to have a biological function. Identification of these patterns helps in understanding the biological information hidden in a DNA sequence. A human DNA consists of about 3 billion nucleotides and completion of genome sequencing of numerous model organisms has further proliferated genomic databases. To completely decipher, the biological information in a DNA sequence is a daunting task and development of fast, efficient, and cost effective computational techniques for the same is a big challenge.

A sequence pattern that plays a crucial role in the analysis of genomes is CpG Island (CGI). A typical CGI consists high-frequency of CpG dinucleoetides, where ‘p’ refers to the phosphodiester bond between the adjacent nucleotides [1,2]. This bond is different from the hydrogen bond that exists between C and G across two strands in a DNA double helix. The length of a CGI varies from a few hundred to a few thousand base pairs (bp), but rarely exceeds 5000 bp. It is known that CpG Islands (CGIs) occur in and around the promoter regions of (50–60)% of human genes, including most housekeeping genes (the genes which are essential for general cell functions) [3]. Gene is a stretch of DNA sequence which has biological information for the synthesis of a protein. The promoter region in a gene regulates its functionality [4-7]. Due to the association of CGIs with promoters, CGIs play an important role in promoter prediction and consequently in the prediction of genes [8,9]. CGIs also contribute significantly in discovering the epigenetic causes of cancer. CGIs located in the promoter regions of certain tumor suppressor genes are normally unmethylated in healthy cells. DNA methylation is a biochemical modification resulting from addition of a methyl group to cytosine nucleotide (C). In cancer cells, CGIs usually undergo a dense hypermethylation leading to gene silencing as shown in Figure 1. Owing to this, they can be used as candidate regions for aberrant DNA methylation, for early detection of cancer [10-14]. For these reasons, identification of CGIs has become indispensable for genome analysis and annotation.

Despite their accuracy, experimental methods employed by biologists for identification of CGIs are extremely time-consuming, simply because of the enormity of genomic data. On the other hand, computational methods can be much more attractive for the identification of possible CGIs. The results obtained from computational methods can be used by biologists to validate and further enhance the accuracy of identified CGI locations. There are several computational methods [15-26] reported in the literature for identification of CGIs in DNA sequences. In one of the first computational attempts [15], a CGI is defined as a DNA segment fulfilling the following three conditions: (i) length of segment is at least 200 bp, (ii) G and C contents are ≥ 50%, and (iii) observed CpG to expected CpG ratio (o/e) is ≥ 0.6. Observed CpG is the number of CpG dinucleoetides in a segment and expected CpG is calculated by multiplying the number of ‘C’s and the number of ‘G’s in a segment and then dividing the product by length of the segment. This method however falsely identifies the other G and C rich motifs, e.g., Alu repeats, as CGIs. In subsequent methods, these three conditions were made more stringent in order to reduce false identification at the expense of missing some true CGIs [24]. Sophisticated methods utilizing two Markov chain models [27,28], one for CGIs and the other for non-CGIs, are proposed [2,25,26]. These two Markov models differ in their respective model parameters which characterize the difference in transition probabilities between successive nucleotides in CGIs and non-CGIs, respectively. In these methods, a DNA segment is defined as CGI, if the log-score [2] computed using Markov model for a CGI is greater than that computed using Markov model for a non-CGI. Consequently, the model parameters used for CGIs and non-CGIs play a crucial role in identifying the CGIs. However, different methods employing such models from time-to-time produce inconsistent results. Another criterion based on the physical distance distribution of CpG dinucleoetides in a DNA segment has also been proposed [23]. Methods based on this criterion are dependent on nucleotide composition of a DNA sequence being analyzed and suffer from low identification specificity.

Recently, digital signal processing (DSP)-based algorithms have gained popularity for the analysis of genomic sequences since they can be mapped to numerical sequences. Digital filters have successfully been employed for identification of protein coding regions (exons) in DNA sequences and hot-spots in protein sequences [29-33]. Digital filters have also been used for identification of CGIs with considerable success [25,26]. These methods are similar to Markov chain methods but use digital filters to compute weighted log-score to identify CGIs. The method proposed in [25] employs a bank of IIR low-pass filters (about 40 filters, each with different bandwidth) to identify the CGIs by looking at the weighted log-scores of all the filters together. The CGI identification sensitivity of this method is affected by the tradeoff between responsiveness of filter and stability of the output. Moreover, this method may become computationally demanding as it makes use of a large number of filters in the bank. Another DSP based algorithm in [26] employs an underlying multinomial statistical model [34] to estimate its Markov chain parameters followed by an FIR filter with Blackman window to compute the weighted log-score.

It is evident from above discussion that the CGI identification methods and more importantly the criteria used therein play a crucial role in identifying CGIs. As such, development of fast and efficient computational methods with highly reliable CGI identification criteria is a necessity. Statistically optimal null filters (SONF) have been proven for their ability to efficiently estimate short-duration signals embedded in noise [35]. In this article, we propose a new DSP algorithm for identification of CGIs using SONF which combines maximum signal-to-noise ratio and least squares optimization criteria to estimate the message signal, characterizing the CGI, embedded in noise. Normally, the CGI identification accuracy is a lot dependent on the Markov models used and sometimes produces contrasting results. Also, one of the main objectives of the article is to find a uniform yet effective alternative CGI identification measure replacing the current measure based on transition probabilities. In the proposed scheme, we have formulated a simple basis function to be used in SONF which characterizes the CGI. Our criterion is devoid of any ambiguities associated with the choice of transition probabilities used in some of the algorithms. The proposed scheme is tested on a large number of already annotated DNA sequences obtained from human chromosomes 21 and 22. It is shown that our scheme is simple to implement and yet able to identify CGIs reliably and efficiently.

The rest of the article is organized as follows: the following section briefly describes a few existing DSP-based algorithms for the identification of CGIs. In Section “Proposed scheme”, the proposed SONF-based scheme for identifying CGIs in DNA sequences is explained. Results obtained from the proposed scheme are depicted as well as tabulated in Section “Results and discussion”. Finally, “Conclusion” section concludes the article describing some of the significant features of the proposed scheme.

Related study

In this section, we give a brief review of some of the existing CGI identification methods as a preparatory groundwork for the method to be proposed in Section “Proposed scheme”.

Markov chain approach

In this method, a DNA sequence of length N, represented as X = {x(n),x(n + 1),…,x(n + N−1)} where each symbol x(n)∈{ACTG} is considered as a first-order Markov chain [27] due to its conditional independence property, i.e., the nucleotide occurring at the location (n−1) does not offer any information over and above that at n to predict the nucleotide occurring at (n + 1). In a CpG island, the probability of transition from the nucleotide base C to the base G is higher in comparison with that in a non-CGI. Let the probability of transition from a nucleotide β to a nucleotide γ in a CGI and a non-CGI be denoted as pβγ+ and pβγ− respectively. Tables 1 and 2 taken from [2] show the transition probabilities for CGI and non-CGI Markov models. These tables are derived from 48 putative CGIs in human DNA sequences. Each row in the tables contains transition probabilities from a specific nucleotide base to each of the four bases. These transition probabilities pβγ± are calculated using

[Formula ID: bmcM1]

where nβγ± is the number of dinucleoetides βγ in a DNA sequence. Naturally, every row in the tables adds up to unity. As expected, in Table 1, which corresponds to the CGI Markov model, the probability that a C is followed by a G is very high as compared with that in Table 2.

The CGIs in the DNA sequence X are identified by analyzing the windowed sequence Xn = {x(n),x(n + 1),…,x(n + L−1)} of length L, and those obtained by shifting the window by one position at a time. The probability of observing a windowed sequence assuming that it belongs to a CGI is given by

[Formula ID: bmcM2]

Similarly, the probability of observing this sequence assuming it belongs to a non-CpG island region is

[Formula ID: bmcM3]

If the value of P(Xn|CGI) > P(Xn|non-CGI), then, it is concluded that the DNA sequence Xn belongs to a CGI. Otherwise, it is more likely to be a non-CGI island. Alternatively, by formulating a log-likelihood ratio, given by

[Formula ID: bmcM4]

If S(n) > 0, the given DNA sequence is more likely to belong to a CGI, and if S(n) < 0 the sequence probably belongs to a non-CGI region.

IIR low-pass filter approach

Yoon and Vaidyanathan [25] have noted that the log-likelihood ratio given in (4) can be expressed as:

[Formula ID: bmcM5]

where y(n) is a sequence representing the log-likelihood ratio of a single transition given by

[Formula ID: bmcM6]

and, have(n) is a simple averaging filter defined as

[Formula ID: bmcM7]

Then, they proposed using a bank of M filters each having different bandwidth, instead of using simply one low-pass filter have(n). Specifically, the filter used in the kth (k = 0,…,M−1) channel has a transfer function given by

[Formula ID: bmcM8]

where 0 < α0 < α1 < ⋯ < αM−1 < 1. Since impulse response of a filter in the bank is havek=1−αkαkkun more recent inputs are given larger weights than the past ones in the averaging process of y(n). The filter bank consists of 40 channels (M = 40), and the filter parameter αk is chosen from 0.95 to 0.99 with an increment of 0.001. The log-likelihood ratio obtained from the output of the kth channel is given by

[Formula ID: bmcM9]

The values of Sk(n) obtained for all k and n are then used to obtain a two-level contour plot. The bands corresponding to Sk(n) > 0 determine the locations of CGIs.

In this method, the use of filter bank increases the computational overhead considerably. For fair comparison, instead of a bank on M filters, we have used one pole filter with optimized parameter α = 0.99 to compare with other methods (this reduces the number of computations considerably).

Multinomial statistical model

This method by Rushdi and Tuqan [26] differs from the previous method by the way the transition tables are obtained and the type of digital filter used to calculate the log-likelihood ratio. Instead of using (1) to obtain the transition probability tables, they are generated by comparing the frequency of each dinucleotide with the one expected under a multinomial model [34]. Transition probabilities pβγ± for the windowed sequence Xn are calculated using

[Formula ID: bmcM10]


[Formula ID: bmcM11]

This method uses a FIR digital filter with variable coefficients generated by Blackman window to calculate the log-likelihood ratio S(n) given in (4). The locations with S(n) greater than zero are the probable locations of CGIs.

All of the above-mentioned methods rely on the transition probability tables to calculate log-likelihood ratio used to identify CGIs. The methods [25,26] specifically vary by the way y(n), obtained from the respective transition tables, are averaged. It is shown later in Section “Results and discussion” that the choice of the transition tables may produces contrasting results. Hence, a more reliable and efficient scheme that is devoid of these transition tables is necessary for identifying CGIs.

Proposed scheme

In this study, we adopt the SONF approach, proposed in [35], to efficiently identify CGIs in DNA sequences. SONF is used for estimation of short duration signal, Sn = {s(m)}, embedded in noise Rn = {r(m)} by combining maximum signal-to-noise ratio and least squares optimization criteria. The implementation of the twofold optimization in SONF is shown in Figure 2, where an instantaneous matched filter (IMF) is first used to detect the presence of a short duration signal embedded in noise by maximizing the signal-to-noise ratio over variable-time observation interval m. The IMF output, In, is then scaled by a locally generated function n, by least squares (LS) optimization procedure, to obtain the signal Yn, an estimate of Sn. It has been shown that the SONF is equivalent to a Kalman filter with a much simpler implementation [35]. Also, SONF has the ability to track rapidly changing signals leading to more practical processing of short-duration signals [36,37]. Therefore, the proposed scheme is expected to perform better in situations even if the CGIs are of very short length of the order of 200 bp.

To be able to apply SONF approach to identify CGIs, the DNA sequence X, of length N, is first mapped to an appropriate binary numerical sequence XCG = {xCG(n)}. A sliding window of length L is used to evaluate if each of the windowed sequences, Xn = {xCG(m)}, where n = 1,2,…,NL + 1 and m = n,n + 1,…,n + L−1, belong to a CGI or not. Each of the windowed sequence Xn can be expressed as

[Formula ID: bmcM12]

where Sn = {s(m)} is a message signal corresponding to a CGI and Rn = {r(m)} is a residual signal. Sn and Rn are each of length L. Let Φ = {ϕ(m)} be a fixed binary basis sequence of length L having some characteristic property of CGI.

Now, the message signal corresponding to a CGI can be expressed as Sn = VnΦ, where Vn = {v(m)} and Φ are sequences each of length L. The sequence VnΦ is obtained by multiplying the corresponding elements of Vn and Φ. The sequence Vn is determined by minimizing Rn in least square sense. Let the message signal be denoted as Sn = {s(m)}. The objective of the proposed method is to choose the basis sequence such that Vn resulting from the optimization process has some discriminating feature of indicating whether the associated sequence Xn belongs to a CGI. The following subsections explain in detail the steps involved in identification of CGIs in a DNA sequence using SONF.

Numerical mapping of DNA sequences

As DNA sequences are alphabetical in nature, they need to be mapped to numerical sequences in order to employ the DSP techniques for DNA sequence analysis. There are several mapping techniques reported in the literature. One of the earliest and a popular mapping is that of Voss’s binary indicator sequences [38]. A DNA sequence X can be mapped to a set of four digital signals by forming four binary indicator sequences, namely, XA, XT, XG, and XC. In each of these binary indicator sequences, ’1’ represents the presence and ’0’ absence of the corresponding bases A, T, G, and C in X. For instance, considering a DNA sequence X = {ATCCGAAGTATAACGAA}, the binary indicator sequence corresponding to G, i.e., XG can be expressed as XG = {00001001000000100}. Indicator sequences for the remaining three nucleotides can be represented in a similar fashion.

The problem of CGI identification deals with G and C content in a DNA sequence. Hence, we define a new indicator sequence XCG = {xCG(n)}, which indicates the presence of the nucleotides C and G in the DNA sequence. For example, the binary indicator sequence XCG of the DNA sequence above is XCG = {00111001000001100}.

Choosing the basis sequence

In this study, we have noticed that the dinucleotides CC, CG, GC, and GG occur more frequently in a CGI as compared to a non-CGI. For this study, we have calculated the occurrence of these four dinucleotides in the sequence L44140 taken from the chromosome X of Homo sapiens. The sequence L44140 is of length 219447 bp and has 17 CGIs whose locations are obtained from [39]. Figure 3 depicts the relative occurrence of the above four dinucleotides as compared to the remaining dinucleotides (AA, AC, AG, AT, CA, CT, GA, GT, TC, TG, TT, and TA) in CGIs and non-CGIs of L44140. Here, the relative occurrence of a particular dinucleotide is equal to the number of times that dinucleotide occurs in the sequence divided by the sequence length. It is evident that the dinucleotides CC, CG, GC, and GG occur more frequently in CGIs whereas the other dinucleotides occur more frequently in non-CGIs. This observation can also be inferred from the transition probability tables (Tables 1 and 2) as the values of pβγ+ are greater than pβγ−, where β and γ are either G or C. In Figure 3, the darker bars corresponding to the dinucleotides CC, CG, GC, and GG are taller in CGIs, whereas the darker bars corresponding to the other dinucleotides are shorter. Hence, instead of just considering the difference in relative occurrence of CG, it is more productive to consider the relative occurrence of the dinucleotides CC, CG, GC, and GG to distinguish between a CGI and a non-CGI.

Moreover, we have studied the difference in gap sizes between the dinucleotides CC, CG, GC, and GG in CGIs and non-CGIs of L44140. The shortest possible gap is of size 0 when the dinucleotides are adjacent to each other. Figure 4 shows the relative occurrence of gaps of various sizes in a CGI and a non-CGI. Here, relative occurrence of a particular gap size is equal to the number of times that gap size occurs in the sequence divided by the sequence length. Obviously, the gap of size 0 occurs more frequently in a CGI as compared to that of a non-CGI. And, it is found that the gap size in a non-CGI can go up to 40 where as in CGIs the maximum gap size was found to be 19. It can also be seen that the gaps of sizes 0, 1, and 2 occur more frequently in a CGI and the gap sizes of 3 and greater occur more frequently in a non-CGI. A gap of size 2 is the largest gap which can distinguish between a CGI and a non-CGI.

Based on the above observations, the basis sequence which characterizes a CGI can be formulated as Φ = {1100110011…001100}. The 1’s in Φ represent either the nucleotide C or G. The 1’s always appear in pairs where each pair representing one of the dinucleotide CC, CG, GC, or GG. The 0’s in Φ form the gap between the dinucleotides. A gap size of 2 is chosen between the dinucleotides. This choice of Φ is also satisfies the basic criteria of a CGI, i.e., at least 50% of the nucleotide content in a CGI is due to C and G.

Now, in order to obtain the length of Φ (window size), we have analyzed CGIs and non-CGIs of different lengths for the relative occurrence of various gap sizes. Figure 5 shows the plot of Δ versus window size for various gap sizes. Here, Δ is the difference of relative occurrence of a particular gap in a CGI and a non-CGI for a fixed window length. It can be seen that Δ is maximum for gap size 0. As the window size increases Δ also increases before it reaches a steady value. Δ is negative for gap sizes of 3 and greater signifying that the gap sizes of 3 and higher are more probable in non-CGIs compared to CGIs. For the gap size 2, Δ stabilizes for window sizes greater than 200. Larger the window size, larger the number of computations, and hence in the proposed method we have used the length of Φ (window size) to be equal to 200.


The objective of IMF, which is the first stage of SONF shown in Figure 2, is to detect the presence of the waveform Φ in the input sequence Xn. IMF is an improvement over a matched filter, the difference being, in IMF optimal SNR is repeatedly calculated at every sample m, over an observation interval m∈[n,n + L−1]. IMF takes Xn and Φ as inputs and produces an output sequence In = {ι(m)} where

[Formula ID: bmcM13]

for m = n,n + 1,…,n + L−1. It can be seen that at each sample m, ι(m) is calculated over a varying interval i∈[n,m]. Note that, assuming ι(n−1) = 0, ι(m) can also be calculated using the recursive relation given by

[Formula ID: bmcM14]

The output ι(m) leads to an optimal detection of Φ at each sample m, and can be expressed as

[Formula ID: bmcM15]

where r0′(m) is the residual signal in IMF output, and c(m) is given by

[Formula ID: bmcM16]

The v(m)∈Vn in (15) is an unknown gain.

Least square optimization of the IMF output

The objective of the second stage in SONF is to determine a sequence = {λ(m)}, which when used to scale the IMF output In, produces the SONF output, Yn, such that Yn→VnΦ. Here, Yn is an element wise product of Vn and Φ. Yn is an estimate of Sn, which is the message signal corresponding to CGI.

Let us consider the suboptimal case in which a sample of the IMF output ι(m) in (15), when scaled by λ(m)=ϕ(m)/c(m), generates

[Formula ID: bmcM17]

where y(m) is an element of the SONF output, Yn. As we desire optimal null filtering, i.e., y(m) = s(m), the residual element, r0(m), needs to be entirely eliminated.

Before determining the optimal n, corresponding to ideal null filtering, we define the sequence Zn = {z(m)} such that,

[Formula ID: bmcM18]

Ideally, y(m) = s(m) and from (18), zideal(m) = r(m). Now, the optimal n={λopt(m)} is determined by minimizing the mean square error, E[eλ2(m)], with respect to λ(m) where

[Formula ID: bmcM19]

The optimal post IMF scaling sequence λopt(m) obtained by carrying out the above mean square minimization is [35]

[Formula ID: bmcM20]

where SNR is the input signal-to-noise ratio (considering r(m) to be noise).

In order to implement SONF, the value of the input SNR should be known. To circumvent this problem, a suboptimal case, as shown in (17), is assumed considering c(m) >> 1/SNR, leading to

[Formula ID: bmcM21]

It can be shown that as m increases, λsubopt(m)→λopt(m) because the second term in the equation

[Formula ID: bmcM22]

approaches zero (as the value of c(m) progressively increases with m). So, the value of initial SNR in (20) will influence only the starting few samples in Yn.

The SONF can easily be implemented recursively using the following equations [35]

[Formula ID: bmcM23]

In this case of DNA analysis, one may choose the initial value of the gain P(0) to be 1 and ι(0) = ι(1).

The proposed SONF-based CGI identification algorithm for a DNA sequence of length N can now be summarized as follows:

Initialization: Set the base location index n = 0.

Step 1: Apply a rectangular window of length L = 200 starting at the base location n of the DNA sequence X to obtain the windowed sequence Xn.

Step 2: Obtain the binary indicator sequence XCG for the windowed sequence, Xn, from Step 1.

Step 3:XCG from Step 2, along with the binary basis sequence Φ, form the inputs to SONF. The corresponding SONF output sequence, Yn, is evaluated using the recursive relations given in (23), by assuming P(0) = 1 and ι(0) = ι(1).

Step 4: Compute the SNR power gain G(Xn), which is the ratio of the variance of the SONF output, Yn, to the variance of the corresponding input Xn.

Step 5: Increment the value of n by 1, i.e., n = n + 1. If n ≤ (NL) go to Step 1, else go to Step 7.

Step 6: Plot G(Xn) as a function of n + L and get its upper envelope. The peaks in the resulting plot which are above the threshold, η, indicate the locations of CGIs identified in X.

Step 7: Exit the algorithm.

Figure 6 shows the SONF implementation for better understanding of the proposed approach. Figure 6a,b shows an example of a CGI and a non-CGI with 80 bp. Naturally, in Figure 6a there are greater number of ones. Figure 6c,d shows the IMF output for a CGI and a non-CGI, respectively. It can be seen that the IMF output corresponding to a CGI progressively increases to a greater value of 35 as compared to 6 of that of a non-CGI. Figure 6e,f is the scaling functions for a CGI and a non-CGI, respectively. They are obtained using the relation λ(m) = P(m)ϕ(m) in (23). Finally, Figure 6g,h shows the estimated CGI characteristic in a CGI and a non-CGI, respectively. The SONF output corresponding to a CGI has greater amplitude as compared with that of a non-CGI.

Prediction measures

The identification of CGIs can have four possible outcomes; true positive (TP), true negative (TN), false positive (FP), or false negative (FN) as shown in Figure 7. Two basic measures of determining the accuracy of prediction are sensitivity (Sn) and specificity (Sp) [40]. Sensitivity, given by

[Formula ID: bmcM24]

and is defined as the proportion of CGIs that have been predicted correctly. Whereas, specificity given by

[Formula ID: bmcM25]

is defined as the proportion of the predicted CGIs that are real. Sensitivity and specificity can take on values from 0 to 1. For a perfect prediction, Sn = 1 and Sp = 1. Neither sensitivity nor specificity alone can provide a good measure of the global accuracy, because high sensitivity can be achieved with little specificity and vice versa. A measure that combines sensitivity and specificity values is called the correlation coefficient (CC) and is given by

[Formula ID: bmcM26]

The value of CC ranges from −1 to 1, where a value of 1 corresponds to a perfect prediction; a value of −1 indicates that every CGI has been predicted as non-CGI, and vice versa.

Another measure, called the performance accuracy (Acc), used in our analysis is given by

[Formula ID: bmcM27]

In this article, we have evaluated the performance of different CGI identification methods at the nucleotide level. For example, the value of TP is obtained by adding all the nucleotides predicted to to true positive, and the other outcomes are calculated in the similar manner. At the CGI level, even if one nucleotide (or a threshold of a minimum number of nucleotides) corresponding to a CGI is predicted to be true positive the entire CGI is assumed to be predicted correctly.

Results and discussion

The proposed CGI prediction scheme is tested on several genomic sequences of varying lengths taken from the human chromosomes 21 and 22. More precisely, we have used the three contigs, NT_113952.1, NT_113954.1, and NT_113958.2 from chromosome 21, and the contig NT_028395.3 from chromosome 22 for our analysis. All the sequence data considered for this study are obtained from the GenBank Database [39]. The performance of the proposed scheme is compared with the other popular DSP-based approaches such as Markov chain [2], IIR low-pass filters [25], and multinomial model [26].

First, a DNA sequence from human chromosome X with the GenBank accession number of L44140 is analyzed for illustrative purpose. The sequence is of length 219447 bp and is already annotated, i.e., the locations of its CGIs are already known and can be obtained from [39]. The sequence L44140 is also used to obtain the values of threshold, η, used by the DSP-based methods being compared in this article.

Figure 8 shows the comparative performance of CGI prediction by the above-mentioned four approaches. Figure 8a shows the performance of Markov chain approach, where log-likelihood ratio S(n) is plotted against base index of the sequence. The transition probability tables given in Tables 1 and 2 are used to calculate S(n). All the base locations, n, with S(n) > 0 imply that they are very likely to be a part of a CGI. A window length of 200 bp is considered for the method. Markov chain method is able to detect most of the CGIs in the DNA sequence and it can be seen that the CGIs and non-CGIs can reasonably be differentiated by looking at the sign of S(n). However, one of the major drawbacks of this method is the presence of a lot of false positives that falsely categorize non-CGIs into CGIs.

Figure 8b shows the performance of IIR low-pass filter approach where the log-likelihood ratio, S(n), is plotted against base index of the sequence, n. The transition probability tables given in [25] are used to calculate S(n). For fair comparison, instead of a bank on M filters, we have used one pole filter with optimized parameter α = 0.99 for this method. All the base locations, n, with S(n) > 0 imply that they are very likely to be a part of a CGI. A window length of 200 bp is considered for the method. Similar to the Markov chain method, this method also produces a lot of false positives affecting the prediction accuracy.

Figure 8c shows the prediction of CGIs using the multinomial model in [26]. An underlying multinomial statistical model is employed to estimate the Markov chain model parameters that result in the transition probability tables given in [26]. A Blackman window of length 100 bp is employed for calculating the filtered log-likelihood ratio. The Blackman window gives larger weights for central samples of the window, thus reducing the edge effects. Windows with the positive filtered log-likelihood ratio are considered to be a part of a CGI. This method shows considerably high false positives making the CGI prediction unreliable.

Figure 8d shows performance of the proposed SONF scheme in predicting the CGIs. Unlike the above-mentioned methods, our scheme utilizes the binary basis sequence, Φ, instead of the probability transition tables. The proposed scheme first maximizes SNR of the output at each time instant using IMF, then it further enhances the estimated signal using least-square optimization criterion, to estimate the presence of Φ in the input windowed DNA sequence. A window size of 200 is used for the proposed method. Effectiveness of the proposed scheme is clearly visible in Figure 8d, which depict more contrasting peaks as compared to the other three approaches. These contrasting peaks make the identification process comparatively easier resulting in less number of false positives.

It can be seen from Figure 8 that the default threshold on η = 0 produces a lot of false positives for the methods using transition probability tables. The optimal threshold values for the methods is obtained by calculating the prediction Acc for varying thresholds for each method (Figure 9). The optimal values of thresholds obtained for the Markov chain method, IIR filter method, and the proposed SONF approach are 0.1, 0.05, and 0.6, respectively. The actual locations of the CGIs, obtained from NCBI website, present in the sequence L44140 are represented by red horizontal spots in Figure 8. Figure 10 is receiver operating characteristic (ROC) curves plotted for the four methods. It can be seen that the proposed approach has better overall performance for the sequence L44140 with the area under the curve 0.7460. The Markov chain method is next with the area under the ROC curve 0.6072. The area under the curve for IIR filter method is 0.3106. It can be seen that the multinomial model method has the least area under the ROC curve. The dismal performance of the multinomial model does not indicate anything about the method in itself but merely implies that the transition probability tables used may not be appropriate for the example considered.

We have evaluated the time complexity of the proposed method using the tic-toc function in MATLAB. Taking the necessary precautions (such as all applications except MATLAB were closed, a fresh session of MATLAB was started for each task, and MATLAB was warmed up with the code, i.e., the first run of the code was ignored), the CPU time for processing a fixed length of sequence, the Markov chain method was found to be the least followed by SONF, IIR and multinomial approaches with an additional CPU time of 1.29%, 1.78%, and 1.82%, respectively. This difference is not substantial considering today’s computing resources.

Figure 11 shows the performance of the four methods for the prediction of CGIs in the first 15000 bps of L44140. The red horizontal lines are the actual locations of CGIs. The blue binary decision curve depicts the locations of the predicted CGI by the methods. As can be seen from Figure 11c, the multinomial-based approach fails to detect the CGI located between base pairs 3095 and 3426 as opposed to other three methods implying that the probability transition parameters used for the CGI identification play a crucial role. Hence, it is important to have a CGI identification characteristic which is devoid of any ambiguity with the choice of different probability transition tables available. The binary basis sequence Φ in the proposed scheme successfully identifies the CGIs and can be reliably used as CPG identification characteristic.

Table 3 presents the summary of performance measures Sn, Sp, CC, and Acc obtained for the analysis of four contigs NT_113952.1, NT_113954.1, NT_113958.2, and NT_028395.3. The performance of the proposed scheme is also compared with that of CpGCluster [23], which uses the distance between CpG dinucleotides (and not the transition probability tables) for identifying CGIs. The proposed approach has the highest values of Sn for all the contigs (shown in bold) and has the highest values of CC for the contigs NT_113954.1 and NT_113958.2. The performance accuracy is also quiet high, consistently above 97% which is a good sign. This shows that the proposed method is reliable and the proposed binary basis sequence Φ is an alternative CGI identification characteristic. The multinomial method did not identify any of the CGIs in the contig NT_028395.3 and hence its Sn and Sp values are zero. The corresponding Acc value is high because the method predicting most of the true negatives correctly. The contig NT_028395.3 has short CGIs of the order of 200 bps and the proposed approach with better sensitivity is capable of identifying them.


In this article, a new DSP-based technique using SONFs is proposed for the prediction of CGIs in DNA sequences. A novel CPG identification characteristic is presented in the form of a binary basis sequence which is shown to identify CGIs reliably. It has also been shown that the performance of the existing methods which use discriminating transition probability tables for CGIs/non-CGIs is not consistent. The prediction accuracy of these methods are highly dependent on the training data used to obtain the transition probabilities of CGIs and non-CGIs. The inability of finding a unique CGI identification characteristic has resulted in failure in predicting many of the CGIs. This article makes an attempt to present a unique CGI identification characteristic which does not require any training. Furthermore, the ability of SONF to track short duration signals is exploited in identifying the CGIs in DNA sequences. SONF combines maximum signal-to-noise ratio and least squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. The performance of the proposed technique is tested on four randomly chosen contigs in chromosomes 21 and 22 of human beings. The simulation results comparing the performance of the proposed technique with the other three DSP-based CGI prediction techniques have shown that the proposed approach enjoys superior prediction accuracy in terms of sensitivity. The overall predicting accuracy of the proposed approach is also consistently above 97% and is comparable to that of the Markov chain method making it a reliable method.

Competing interests

The authors declare that they have no competing interests.


This study was supported in parts by the Natural Sciences and Engineering Research Council (NSERC) of Canada and in part by the Regroupement Strategic en Microelectronique du Quebec (ReSMiQ).

Lodish H,Berk A,Zipursky S,Matsudaira P,Baltimore D,Darnell J,Molecular Cell biologyYear: 1995Scientific American, New York,
Durbin R,Eddy S,Krogh A,Mitchison G,Biological sequence analysisYear: 1998Cambridge University Press, Cambridge,
Antequera F,Bird A,Number of CpG islands and genes in human and mouseProc. Natl Acad. Sci. USAYear: 19939024119951199910.1073/pnas.90.24.119957505451
Antequera F,Bird A,CpG islands as genomic footprints of promoters that are associated with replication originsCurr. BiolYear: 1999966166710.1016/S0960-9822(99)80290-510375532
Ioshikhes I,Zhang M,Large-scale human promoter mapping using CpG islandsNat. GenetYear: 200026616310.1038/7918910973249
Antequera F,Structure, function, evolution of CpG island promotersCell. Mol. Life SciYear: 20036081647165810.1007/s00018-003-3088-614504655
Saxonov S,Berg P,Brutlag D,A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promotersProc. Natl Acad. Sci. USAYear: 200610351412141710.1073/pnas.051031010316432200
Larsen F,Gundersen G,Lopez R,Prydz H,CpG islands as gene markers in the human genomeGenomics (San Diego, CA)Year: 199213410951107
Wang Y,Leung F,An evaluation of new criteria for CpG islands in the human genome as gene markersBioinformaticsYear: 2004207117010.1093/bioinformatics/bth05914764558
Bird A,DNA methylation patterns and epigenetic memoryGenes DevYear: 20021662110.1101/gad.94710211782440
Herman J,Baylin S,Gene silencing in cancer in association with promoter hypermethylationNew Engl. J. MedYear: 200334921204210.1056/NEJMra02307514627790
Issa J,CpG island methylator phenotype in cancerNat. Rev. CancerYear: 200441298899310.1038/nrc150715573120
Illingworth R,Kerr A,DeSousa D,Jorgensen H,Ellis P,Stalker J,Jackson D,Clee C,Plumb R,Rogers J,A novel CpG island set identifies tissue-specific methylation at developmental gene lociPLoS BiolYear: 20086e2210.1371/journal.pbio.006002218232738
Heisler L,Torti D,Boutros P,Watson J,Chan C,Winegarden N,Takahashi M,Yau P,Huang T,Farnham P,CpG Island microarray probe sequences derived from a physical library are representative of CpG Islands annotated on the human genomeNucleic Acids ResYear: 2005339295210.1093/nar/gki58215911630
Gardiner-Garden M,Frommer M,CpG islands in vertebrate genomesJ. Mol. BiolYear: 1987196226110.1016/0022-2836(87)90689-93656447
Rouchka E,Mazzarella R,States David J,Computational detection of CpG islands in DNA, Report: WUCS-97-39Year: 1997
Rice P,Longden I,Bleasby A,EMBOSS: the European molecular biology open software suiteTrends GeneticsYear: 200016627627710.1016/S0168-9525(00)02024-2
Ponger L,Mouchiroud D,CpGProD: identifying CpG islands associated with transcription start sites in large genomic mammalian sequencesBioinformaticsYear: 200218463110.1093/bioinformatics/18.4.63112016061
Dasgupta N,Lin S,Carin L,Sequential modeling for identifying CpG island locations in human genomeIEEE Signal Process. LettYear: 2002912407409
Luque-Escamilla P,Martínez-Aroza J,Oliver J,Gómez-Lopera J,Román-Roldán R,Compositional searching of CpG islands in the human genomePhys. Rev. EYear: 200571661925
Bock C,Walter J,Paulsen M,Lengauer T,CpG island mapping by epigenome predictionPLoS Comput. BiolYear: 200736e11010.1371/journal.pcbi.003011017559301
Sujuan Y,Asaithambi A,Liu Y,CpGIF: an algorithm for the identification of CpG islandsBioinformationYear: 20082833533810.6026/9732063000233518685720
Hackenberg M,Previti C,Luque-Escamilla P,Carpena P,Martínez-Aroza J,Oliver J,CpGcluster: a distance-based algorithm for CpG-island detectionBMC BioinformYear: 2006744610.1186/1471-2105-7-446
Takai D,Jones P,Comprehensive analysis of CpG islands in human chromosomes 21 and 22Proc. Natl Acad. SciYear: 20029963740374510.1073/pnas.05241009911891299
Yoon B,Vaidyanathan P,Identification of CpG islands using a bank of IIR lowpass filtersProceedings of 11 th Digital Signal Processing WorkshopYear: Aug. 2004Taos Ski Valley, New Mexico
Rushdi A,Tuqan J,A new DSP-based measure for CpG islands detectionDigital Signal Processing Workshop, 12th-Signal Processing Education Workshop, 4thYear: 2006IEEE, Teton National Park, Wyoming
Rabiner L,A tutorial on hidden Markov models and selected applications in speech recognitionProc. IEEEYear: 198977225728610.1109/5.18626
Won K,Prugel-Bennett A,Krogh A,Evolving the structure of hidden Markov modelsIEEE Trans. Evol. ComputYear: 2006103949
Anastassiou D,Genomic signal processingIEEE Signal Process. MagYear: 200118482010.1109/79.939833
Vaidyanathan P,Yoon B,The role of signal-processing concepts in genomics and proteomicsJ. Franklin InstYear: 20043411–2111135
Ramachandran P,Antoniou A,Identification of hot-spot locations in proteins using digital filtersIEEE J. Sel. Topics Signal ProcessYear: 200823378389
Rao K,Swamy M,Analysis of genomics and proteomics using DSP techniquesIEEE Trans. Circuits Syst. 1: Regular PapersYear: 200855358
Song N,Yan H,Short exon detection in DNA sequences based on multifeature spectral analysisEURASIP J. Adv. Signal ProcessYear: 20112011210.1186/1687-6180-2011-2
Liu B,Statistical Genomics: Linkage, Mapping, and QTL AnalysisYear: 1998CRC Press, Boca Raton,
Agarwal R,Plotkin E,Swamy M,Statistically optimal null filter based on instantaneous matched processingCircuits Syst. Signal ProcessYear: 200120376110.1007/BF01204921
Kakumani R,Devabhaktuni V,Ahmad M,Prediction of protein-coding regions in DNA sequences using a model-based approachIEEE International Symposium on Circuits and SystemsYear: 2008Seattle
Yadav R,Agarwal R,Swamy M,A new improved model-based seizure detection using statistically optimal null filterEngineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEEYear: 2009Minneapolis, Minnesota
Voss R,Evolution of long-range fractal correlations and 1/f noise in DNA base sequencesPhys. Rev. LettYear: 199268253805380810.1103/PhysRevLett.68.380510045801
National Centre for Biotechnology Information
Burset M,Guigo R,Evaluation of gene structure prediction programsGenomicsYear: 199634335336710.1006/geno.1996.02988786136


[Figure ID: F1]
Figure 1 

Difference between mythelated and unmythelated CpG Island.

[Figure ID: F2]
Figure 2 

Statistically optimal null filter.

[Figure ID: F3]
Figure 3 

Comparison of relative occurrence of dinucleotides in CGIs and non-CGIs of L44140.

[Figure ID: F4]
Figure 4 

Relative occurrence of various gap sizes in CGIs and non-CGIs of L44140.

[Figure ID: F5]
Figure 5 

Difference of relative occurrence of a particular gap in a CGI and a non-CGI for different window lengths.

[Figure ID: F6]
Figure 6 

SONF implementation: (a) An example of a CGI; (b) An example of a non-CGI; (c) IMF output for CGI; (d) IMF output for non-CGI; (e) Scaling function for CGI; (f) Scaling function for non-CGI; (g) SONF output for CGI; and, (h) SONF output for non-CGI.

[Figure ID: F7]
Figure 7 

Four possible outcomes of CGI prediction.

[Figure ID: F8]
Figure 8 

CGI prediction in the DNA sequence L44140 using (a) Markov chain method; (b) IIR Filter method; (c) Multinomial model; (d) SONF scheme.

[Figure ID: F9]
Figure 9 

Relation between the performance Acc and threshold.

[Figure ID: F10]
Figure 10 

ROC curves obtained for the sequence L44140.

[Figure ID: F11]
Figure 11 

CGI prediction in the first 15000 bps of L44140 using (a) Markov chain method; (b) IIR filter method; (c) multinomial model; (d) SONF scheme. Binary decision based on respective threshold is plotted against the base location index

[TableWrap ID: T1] Table 1 

Transition probabilities inside a CGI

pβγ+ A C G T
T 0.079 0.355 0.384 0.182

[TableWrap ID: T2] Table 2 

Transition probabilities inside a non-CGI

pβγ− A C G T
T 0.177 0.239 0.292 0.292

[TableWrap ID: T3] Table 3 

Comparison of different methods for identification of CGIs

    Markov Chain IIR Filter Multinomial model CpGCluster SONF
Length = 184355
Length = 129889
Length = 209483
Length = 647850
  Acc 0.9945 0.9932 0.8710 0.9532 0.9887

Article Categories:
  • Research

Previous Document:  Are Hydrodynamic Interactions Important in the Kinetics of Hydrophobic Collapse?
Next Document:  Lack of associations between ultrasonographic appearance of parenchymal lesions of the canine liver ...