Document Detail

Cross chromosomal similarity for DNA sequence compression.
Jump to Full Text
MedLine Citation:
PMID:  18795115     Owner:  NLM     Status:  PubMed-not-MEDLINE    
Current DNA compression algorithms work by finding similar repeated regions within the DNA sequence and then encoding these regions together to achieve compression. Our study on chromosome sequence similarity reveals that the length of similar repeated regions within one chromosome is about 4.5% of the total sequence length. The compression gain is often not high because of these short lengths. It is well known that similarity exist among different regions of chromosome sequences. This implies that similar repeated sequences are found among different regions of chromosome sequences. Here, we study cross-chromosomal similarity for DNA sequence compression. The length and location of similar repeated regions among the sixteen chromosomes of S. cerevisiae are studied. It is found that the average percentage of similar subsequences found between two chromosome sequences is about 10% in which 8% comes from cross-chromosomal prediction and 2% from self-chromosomal prediction. The percentage of similar subsquences is about 18% in which only 1.2% comes from self-chromosomal prediction while the rest is from cross-chromosomal prediction among the 16 chromosomes studied. This suggests the importance of cross-chromosomal similarities in addition to self-chromosomal similarities in DNA sequence compression. An additional 23% of storage space could be reduced on average using self-chromosomal and cross-chromosomal predictions in compressing the 16 chromosomes of S. cerevisiae.
Choi-Ping Paula Wu; Ngai-Fong Law; Wan-Chi Siu
Related Documents :
18640135 - A novel functional polymorphism in the cdc6 promoter is associated with the risk for he...
2026425 - Determination of the origin of nondisjunction in a 49,xxxxy male using hypervariable di...
7847375 - A gene for familial total anomalous pulmonary venous return maps to chromosome 4p13-q12.
7824105 - Dna analysis in hereditary dentatorubral-pallidoluysian atrophy: correlation between ca...
8914635 - Detection of chromosomal abnormalities of chromosome 12 in uterine leiomyoma using fluo...
8449035 - Confirmation of 15q26.1 as the site of the fes protooncogene by fluorescence in situ hy...
Publication Detail:
Type:  Journal Article     Date:  2008-07-14
Journal Detail:
Title:  Bioinformation     Volume:  2     ISSN:  0973-2063     ISO Abbreviation:  Bioinformation     Publication Date:  2008  
Date Detail:
Created Date:  2008-09-16     Completed Date:  2010-06-09     Revised Date:  2013-05-23    
Medline Journal Info:
Nlm Unique ID:  101258255     Medline TA:  Bioinformation     Country:  Singapore    
Other Details:
Languages:  eng     Pagination:  412-6     Citation Subset:  -    
Centre for Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University.
Export Citation:
APA/MLA Format     Download EndNote     Download BibTex
MeSH Terms

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): Bioinformation
Journal ID (publisher-id): Bioinformation
ISSN: 0973-2063
Publisher: Biomedical Informatics Publishing Group
Article Information
Download PDF
? 2008 Biomedical Informatics Publishing Group
open-access: This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited.
Received Day: 30 Month: 5 Year: 2008
Revision Received Day: 27 Month: 6 Year: 2008
Accepted Day: 06 Month: 7 Year: 2008
collection publication date: Year: 2008
Electronic publication date: Day: 14 Month: 7 Year: 2008
Volume: 2 Issue: 9
First Page: 412 Last Page: 416
ID: 2533061
PubMed Id: 18795115
Publisher Id: 008900022008

Cross chromosomal similarity for DNA sequence compression
Choi-Ping Paula Wu1*
Ngai-Fong Law1
Wan-Chi Siu1
1Centre for Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University
*Corresponding author: E-mail:


A DNA sequence is a long stretch consisting of four types of nucleotides: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T). The lengths of the 24 chromosomes in human range from 50 to 250 million base pairs [1]. Compression is desirable to uncover similarities among sequences, and provide a means to understand their properties in addition to reduce storage requirement [2,3]. State-of-the-art compression algorithms work by finding approximate repeats and approximate reverse complement repeats in the current DNA sequence. The approximate repeats refer to those repeats that contain errors, i.e., with certain unmatched nucleotides between two subsequences. The reverse complement means nucleotides in a sequence is the reverse ordering of nucleotides in another sequence, but with each nucleotide replaced with its complement. For example, the subsequences AAACGT and ACGTTT are reverse complement repeats as (A, T) and (G, C) are complement bases.

Most DNA-based compression algorithms rely on encoding together similar repeated regions found within the sequence. Biocompress is the first algorithm designed specifically for compressing DNA sequences [4]. Both Biocompress and its second version Biocompress-2 are based on a sliding window algorithm known as LZ77 [4-6]. Exact matches and complementary palindromes are found so that the matched subsequences can be encoded with respect to identical subsequences occurred in the past. The matched sequences are replaced by two parameters: the start position of the previously occurred subsequence and the repeat length in the analysis. An order-2 arithmetic coding (arith-2) is used for or non-repeated regions.

Cfact [7] utilizes a two pass algorithm. A suffix tree is used for finding exact matches in the first pass. The matched subsequences are encoded using previous references if there is a compression gain. Otherwise, they are kept uncompressed in the second pass. Unlike Biocompress and Cfact, GenCompress [2,8] used approximate matches in addition to exact matches. GenCompress-1 uses substitutions while GenCompress-2 uses deletions, insertions and substitutions to encode repeats. CTW+LZ method is based on the context tree weighting method (CTW) and LZ based compression [3]. Long exact/approximate repeats and complementary palindromes repeats are encoded by the LZ-based algorithm, whereas short subsequences are compressed using CTW. Execution time is too high for long sequences despite obtaining good compression.

DNACompress [9] consists of two phases. The first phase finds all approximate repeats including complementary palindromes by a separate software tool called PatternHunter [10]. The second phase encodes those approximate repeats and non-repeating regions. DNACompress not only provides good compression, but is also significantly faster than GenCompress. DNAC [11] consists of four phases. The suffix tree is built in the first phase to locate exact matches. In the second phase, all the exact repeats are extended to approximate repeats by dynamic programming. In the third phase, the optimal non-overlapping repeats are extracted from the overlapping regions. All the repeats are then encoded in the last phase. Similar to DNAC, DNAPack uses dynamic programming approach for identification and encoding of repeats [12].

It is seen that all DNA-based compression algorithms find repetitions within the DNA sequence. Longer repetitive length implies higher compression gain. The compression ratio attained is high if highly similar subsequences are found. It is well known that there are similarities among different chromosome sequences. However, cross-chromosomal similarities are seldom exploited in DNA sequence compression. The objective of this paper is to study self-chromosomal and cross-chromosomal similarities; to investigate use of cross-chromosomal similarity for compression; and to demonstrate the advantage of cross-chromosomal similarity in multiple sequence compression. It should be noted that similar subsequences located within the chromosome sequence are called self-(chromosomal) similarity/ self-reference while those located in other chromosome sequence are called cross-(chromosomal) similarity/ cross-reference in this analysis.


The sixteen chromosomes of S. cerevisiae are used to investigate chromosome similarities. They are downloaded from elsewhere [13]. The search engine PatternHunter is employed to search for repeats. All repeats are ranked by a score. It defines similarity between two subsequences. A large score means that the two subsequences are similar to each other.

Similarity between two chromosome sequences

Repetitions between two chromosome sequences are first investigated. We found that the repetitive lengths found within a chromosome sequence are not necessarily longer than that found in another chromosome. In an example, the top three longest repetitive regions found within Chr I are about 15000, 2600 and 2300 bases long. However, the lengths of the repetitive regions found between Chr I and Chr VIII are 17000, 14000 and 6800. This example shows that the lengths of the repetitive regions found between Chr I and Chr VIII are always larger than those found within Chr I alone. The lengths of the repetitive regions found between Chr I and other chromosomes including Chr II, Chr IV, Chr VII, Chr X, Chr XII, Chr XIII, Chr XV and Chr XVI are also significant. Similar observations are found for other chromosomes. This shows that cross-chromosomal similarities between two sequences are often significant. They are exploited beneficially for compression purposes.

Cross-chromosomal predictions

The potential gain in cross-chromosomal compression is obtained by finding the total lengths of subsequence in the current chromosome sequence that is predicted from another chromosome sequence. The lengths of these cross-reference subsequences determine the potential compression gain in multiple DNA sequences compression. Long length implies a high compression ratio.

Chromosomes are classified into three classes as shown in Column 2 of Table 1 (supplementary material) using comparison of self-chromosomal and cross-chromosomal prediction length. The first class, consisting of Chr III, Chr V, Chr VIII, Chr XI and Chr XIV, has high similarities with chromosomes other than itself. More than 8 chromosomes show cross-repetitive lengths longer than the self-repetitive length. This implies that a potentially high compression gain can be obtained if these sequences employ cross-referencing strategy with subsequences predicted from other chromosomes.

The second class consists of Chr VII, Chr XIII, Chr XV and Chr XVI. Although just 3 to 7 chromosomes show cross-repetitive lengths longer than the self-repetitive length, a potential compression gain is also expected since the cross-repetitive lengths are still large. The last class consists of Chr I, Chr IV and Chr XII. There are less than 3 chromosomes having cross-repetitive lengths longer than the self-repetitive length. In Chr I, the number of nucleotides predicted from Chr VIII is almost doubled of the self-repetitive length within Chr I. In addition, in Chr XII and Chr IV, self-referencing and cross-referencing are indeed significant since the number of nucleotides respectively predicted from Chr IV and Chr XII is comparable to the self-repetitive length. Thus, the combination of self-repetitive and cross-repetitive lengths would still contribute to better compression.


Besides considering the total length of subsequences within a chromosome that can be referenced from other chromosomes, their distribution within the sequence are also important. Let the subsequence in a sequence S that is similar to a subsequence in sequence i be S(i) and the subsequence in S that is similar to a subsequence in sequence j be S(j), the total length of subsequences within S that can be referenced from i and j is given by T=|S(i)|+|S(j)|?|S(i)?S(j)|. Obviously if these subsequences are well spread out such that |S(i)?S(j)| is zero, i.e., they do not overlap in position, T is maximized. This implies that a high proportion of the nucleotides within S can be predicted by cross-referencing among chromosomes, resulting in a high compression gain.

Locations of similar subsequences among chromosomes

Figure 1 shows locations of similar subsequences found among chromosomes. The five chromosomes in Figure 1a prove that the portions of self-repetitive regions are very small, as compared to that of cross-repetitive regions with other chromosomes. In the case of Chr XI, Chr XIV, Chr VIII and Chr V, the self-repetitive subsequence is not seen. Similar subsequences predicted from other chromosomes appear in different locations. For example, in Chr XI, the four similar subsequences appear in four different regions. Similar observations are also seen from Figure 1b for second class.

Figure 1c shows locations of similar subsequences for the third class. In Chr I, we can see that the portions of cross-repetitive regions with either Chr VIII or Chr XV are much larger than that of self-repetitive region. In Chr XII, the portions of cross-repetitive regions with Chr XIII or Chr IV are comparable to that of self-repetitive region. In Chr IV, the portions of cross-repetitive regions with Chr XII are also comparable to that of self-repetitive region. Figure 1 illustrates that cross-repetitive regions are often significant when compared with self-repetitive regions. Furthermore, subsequences that are cross-referenced from other chromosomes appear in different locations within the chromosome.

Cross-chromosomal predictions

We considered two cases for cross-chromosomal prediction. In the first case named prediction-2, the prediction is restricted to only two chromosome sequences including the current chromosome sequence. In the second case named prediction-16, the prediction is from the current chromosome and the other 15 chromosome sequences. The self-prediction and cross-predictions are examined to remove all those overlapping regions and are sorted to produce a combined list. This combined list is then used to show all the repetitive regions including both self-chromosomal and cross-chromosomal repetitions. Table 1 (supplementary material) shows the results of the analysis.

In prediction-2, the cross-predictions come from another chromosome that gives the longest repetitive regions. In column 5(a) and (b), it is clear that the cross-predictions are always significant, as compared to the self-predictions. In particular, the cross-predictions are in the range of 5% to 22%. In contrast, the self-predictions are always less than 3.5%. In prediction 16, the cross-predictions from the other 15 chromosomes are listed in column 5(c). The cross-predictions are in the range of 12.5% to 32%, whereas the self-predictions are always less than 3%. As a result, our study indicates that it would be advantageous to compress different chromosomes together to take into account both self-similarity and cross-similarities.

Self-chromosomal and cross-chromosomal similarities for compression

Two chromosome sequences are compressed by considering self-chromosomal similarities or by both self-chromosomal and cross-chromosomal similarities using GenCompress.

Column 6(b) shows the number of bits used if two chromosomes are compressed separately (consider only self-chromosomal similarities). Column 6(c) shows the number of bits if the two chromosomes are concatenated and compressed together. The savings are shown in Column 7. Column 7(a) is the savings resulting from self-chromosomal predictions as compared to the no compression case. Column 7(b) is the savings from cross-chromosomal predictions as compared to the self-chromosomal predictions. We can see that there is always extra savings by considering cross-chromosomal predictions in addition to self-chromosomal predictions. Since the cross-prediction found between Chr I and Chr VIII is the highest as shown in Column 5(a), the saving from cross-chromosome predictions is the largest. While the size of repetitive regions in cross-predictions ranged from 5% to 22%, their savings are between 9% and 60%.


The state-of-the-art DNA compression algorithms consider repetitions within the current sequence. However, similarities exist across different chromosome sequences. Here, we described cross-chromosomal similarities in S. cerevisiae.

We find that cross-chromosomal similarities are always significant as compared to self-chromosomal similarities. For example, the average percentage of similar subsequences between two chromosome sequences is about 10% in which 8% comes from cross-chromosomal prediction and 2% from self-chromosomal prediction. For 16 chromosome sequences of S. cerevisiae, the average percentage is about 18% in which 16.8% comes from cross-chromosomal prediction and 1.2% from self-chromosomal prediction. Therefore, it would be advantages to compress different chromosome sequences together to take advantage of cross-chromosomal similarities.

Our experimental results demonstrate that on average an additional 23% of storage is reduced in cross-chromosomal predictions as compared to self-chromosomal predictions. Therefore, a high compression ratio is obtained by considering both self-prediction and cross-predictions for the entire set of chromosomes. Our future work is to extend this analysis to cross-similarities between species and to develop a systematic approach for incorporating both self-chromosomal and cross-chromosomal predictions into DNA sequence compression.

This work is supported by the Centre for Signal Processing (1BB9F), the Hong Kong Polytechnic University. Paula Wu acknowledges the research studentships provided by the University.

2. Li M,et al. Bioinformatics 2001;17:149. [pmid: 11238070]
3. Matsumoto T,et al. Genome Inform Ser Workshop Genome Inform 2000;11:43. [pmid: 11700586]
4. Delgrange O,et al. Pac Symp Biocomput 1999;254 [pmid: 10380202]
5. Gusev VD,et al. Bioinformatics 1999;15:994. [pmid: 10745989]
6. Chen X,et al. IEEE Eng Med Biol Mag 2001;20:61. [pmid: 11494771]
7. Rivals ?,et al. Biochimie 1996;78:315. [pmid: 8905150]
8. Chen X,et al. Genome Inform Ser Workshop Genome Inform 1999;10:51. [pmid: 11072342]
9. Chen X,et al. Bioinformatics 2002;18:1696. [pmid: 12490460]
10. Ma B,et al. Bioinformatics 2002;18:440. [pmid: 11934743]
12. Pinho AJ,et al. IEEE Trans Biomed Eng 2006;53:2148. [pmid: 17073319]

Article Categories:
  • Hypothesis

Keywords: DNA, sequence, chromosome, prediction, S. cerevisiae.

Previous Document:  Effect of single nucleotide polymorphisms on Affymetrix match-mismatch probe pairs.
Next Document:  SOSUI-GramN: high performance prediction for sub-cellular localization of proteins in gram-negative ...