Origin and fate of repeats in bacteria.
MedLine Citation:
PMID:  12087185     Owner:  NLM     Status:  MEDLINE    
We investigated 53 complete bacterial chromosomes for intrachromosomal repeats. In previous studies on eukaryote chromosomes, we proposed a model for the dynamics of repeats based on the continuous genesis of tandem repeats, followed by an active process of high deletion rate, counteracted by rearrangement events that may prevent the repeats from being deleted. The present study of long repeats in the genomes of Bacteria and Archaea suggests that our model of interspersed repeats dynamics may apply to them. Thus the duplication process might be a consequence of very ancient mechanisms shared by all three domains. Moreover, we show that there is a strong negative correlation between nucleotide composition bias and the repeat density of genomes. We hypothesise that in highly biased genomes, non-duplicated small repeats arise more frequently by random effects and are used as primers for duplication mechanisms, leading to a higher density of large repeats.
G Achaz; E P C Rocha; P Netter; E Coissac
Publication Detail:
Type:  Journal Article; Research Support, Non-U.S. Gov't    
Journal Detail:
Title:  Nucleic acids research     Volume:  30     ISSN:  1362-4962     ISO Abbreviation:  Nucleic Acids Res.     Publication Date:  2002 Jul 
Date Detail:
Created Date:  2002-06-27     Completed Date:  2002-07-19     Revised Date:  2009-11-18    
Medline Journal Info:
Nlm Unique ID:  0411011     Medline TA:  Nucleic Acids Res     Country:  England    
Other Details:
Languages:  eng     Pagination:  2987-94     Citation Subset:  IM    
Structure et Dynamique des Génomes, Institut Jacques Monod, Tour 43-44, 1er Étage, 4 Place Jussieu, F-75251 Paris Cedex 05, France.
MeSH Terms
Archaea / genetics
Bacteria / genetics*
Evolution, Molecular*
Genome, Archaeal
Genome, Bacterial
Models, Genetic
Repetitive Sequences, Nucleic Acid / genetics*

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): Nucleic Acids Res
ISSN: 0305-1048
ISSN: 1362-4962
Publisher: Oxford University Press, Oxford, UK
Article Information
Copyright © 2002 Oxford University Press
Received Day: 12 Month: 12 Year: 2001
Revision Received Day: 12 Month: 4 Year: 2002
Accepted Day: 8 Month: 5 Year: 2002
Print publication date: Day: 1 Month: 7 Year: 2002
Volume: 30 Issue: 13
First Page: 2987 Last Page: 2994
ID: 117046
Publisher Id: gkf391
PubMed Id: 12087185

Origin and fate of repeats in bacteria
G. Achaz (a)
E. P. C. Rocha (1,2)
P. Netter
E. Coissac
Structure et Dynamique des Génomes, Institut Jacques Monod, Tour 43-44, 1er Étage, 4 Place Jussieu, F-75251 Paris Cedex 05, France, (1) Atelier de Bioinformatique, Université Pierre et Marie Curie, Paris, France and (2) URA2171, Unité GGB, Institut Pasteur, Paris, France
(a) To whom correspondence should be addressed. Tel: +33 1 44 27 76 94; Fax: +33 1 44 27 82 05; Email:


DNA repeats can be defined as sequences sharing extensive similarity with other sequences of the same genome. It is usually supposed that repeats arise by successive duplications. Several causal mechanisms have been proposed, including hyperploidisation (even polyploidisation), tandem duplication, and double-strand break repair by insertion or transposition. The underlying mechanisms are thought to act at different levels depending on the kingdom, or even on the organism [e.g. polyploidisation has been proposed to explain the presence of large repeats in eukaryotes (1,2), but is probably absent in Archaea and Bacteria]. Once a repeat is created, it can be targeted by the recombination apparatus and be subject to deletion. Thus, genome size results from a balance between duplication and deletion events. Deletion processes seem to be crucial in compact genomes, especially in those of intracellular endosymbionts or pathogens (3).

Usually, repeats in Bacteria are divided into two subclasses: low complexity repeats (sometimes mislabeled 'tandem repeats') and longer repeats (the centre of our interest). The first category is constituted of small oligonucleotides (typically ranging from mononucleotide to pentanucleotide in size) repeated many times in a head-to-tail configuration. These low complexity repeats, e.g. microsatellites, are very abundant in the genomes of eukaryotes, in which they have been widely studied (4). Although they are less abundant in bacterial and archaeal genomes (5), the mechanisms of their origin (6), their function (7), the consequences for genome dynamics (8) and the structural constraints imposed on the chromosome (9) have all been studied.

Longer repeats include transposable elements, minisatellites (mostly in Eukarya), large tandem repeats and spaced repeats. DNA transposable elements (like IS) are widely distributed among the Archaea and Bacteria. As specific mechanisms for the duplication of mobile elements have been identified (10), such self-replicating elements have to be considered separately when the origin of repeats is analysed. However, they must be taken into account when the influence of repeats on genome stability is considered.

Several mechanisms have been proposed for the genesis of tandem repeats: slipped strand mispairing, unequal crossover (by homologous recombination), rolling circle and circle excision with reinsertion (11). Some of these mechanisms could also result in a tandem repeat deletion. These mechanisms render tandem repeats unstable, easy to create but also easy to delete. In contrast, distant repeats can almost only be deleted by homologous recombination and at the cost of large deletions of genetic material. As a consequence, they may persist more easily during genome evolution. Two mechanisms have been envisaged to create spaced repeats ex nihilo. The first, known as Campbell-like insertion, creates repeats by inserting exogenous sequences and has been proposed to explain the peculiar distribution of many repeats in Bacillus subtilis (12). The second, referred to as 'conversion' or 'insertion', repairs a double-strand break by copying a sequence sharing similarity with the edges of the broken sequence: this mechanism works either by break-induced replication or by gap repair (for reviews in yeast see 13,14).

The first question we tackled in this work concerns the origin of interspersed repeats (excluding transposable elements). Our previous studies (15,16) had led us to propose a model (Fig. 1) for the origin of eukaryote intrachromosomal repeats based on the permanent genesis of close direct repeats (CDR, repeats with copies separated by <1 kb). Since our model is compatible with all mechanisms, we do not assume any particular one for the creation of CDR. Newly created CDR are then subject to a high rate of exchange (conversion and deletion). Experimental studies undertaken on B.subtilis (17) and Escherichia coli (18–20) have shown that the rate of illegitimate recombination is negatively correlated with the distance between the copies (spacer size) and positively correlated with repeat length. Recombination between close repeats tends to maintain neighbouring repeats identical (by conversion) but also to eliminate them (by deletion). At each round of exchange, both events are possible (although we do not know whether they are equally likely). While conversion can be followed by deletion, the opposite is not true: a deletion event cannot be followed by conversion. Over a long time, this will result in a bias in favour of deletions, with CDR disappearing sooner or later (depending on the relative rates of conversion and deletion). Thus, in the absence of strong selective pressure, long CDR are too unstable to persist, unless the copies are moved further apart by chromosomal rearrangements (i.e. insertion, translocation and inversion). In this case, the rate of illegitimate recombination will drop severely and the repeats may be maintained.

In this context, one expects CDR to be more similar than distant repeats, since either they are more recent or they are more subject to conversion. On the other hand, one expects that larger repeats will only survive fast deletion by frequent illegitimate recombination if they are placed distantly. Thus, under our model, CDR tend to have smaller and more identical repeats whereas distant repeats tend to be longer and less similar. This matches the observations we have made in eukaryote genomes, where repeats are both more identical and smaller when they are closer (15). The main goal of this work was to test if this model, first established in Eukarya, could be applied to Bacteria and Archaea.

The second focus of our attention concerns the factors influencing the dynamics of our model, i.e. rates of duplication, deletion and rearrangement. Here we analyse precisely the relation between the origin of tandem repeats and the genome composition biases. Duplication mechanisms typically require the pre-existence of a region of similarity. Levinson and Gutman (8) proposed that small non-duplicated repeats (afterwards referred to as repeats appearing by chance) are primers for mechanisms such as slipped strand mispairing, thus creating larger repeats. We have tried to analyse this proposition by deciphering the relations between repeat density and the relative frequencies of nucleotides in the chromosome.


Materials and Methods

We analysed the complete genomes of 40 Bacteria and 11 Archaea (Table 1). All sequences were extracted from GenBank, except for those of Pyrococcus furiosus, downloaded from

Construction of the repeats database

We followed the methodology previously developed to detect repeats in eukaryote genomes (15,16), but made an extra effort to detect smaller, but significant, repeats, since bacterial chromosomes are smaller. The methodology is described below and follows four main steps.

First step: detection of seeds. In this step, exact direct and inverse repeats (seeds) of 15 bp were detected using the REPuter software (21). Many seeds with lengths that are not statistically significant according to Karlin and Ost statistics were retained (22). The second step is intended to further extend these seeds into larger, non-strict repeats.
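As an illustration of what a seed is, the sketch below finds exact direct and inverted repeats of a fixed length by indexing k-mers in a dictionary. This is a swapped-in technique chosen for brevity, not the REPuter algorithm, and the names `find_seeds` and `reverse_complement` are hypothetical. It assumes an upper-case A/C/G/T sequence and ignores k-mers that are their own reverse complement.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_seeds(seq, k=15):
    """Return (direct, inverted) lists of start-position pairs.

    direct:   two positions carrying the same exact k-mer.
    inverted: a k-mer position paired with a position of its
              reverse complement.
    Hypothetical helper; not the REPuter implementation.
    """
    # Index every k-mer by the positions where it starts.
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)

    direct, inverted = [], []
    for kmer, positions in index.items():
        # All pairs of occurrences of the same k-mer.
        for a, first in enumerate(positions):
            for second in positions[a + 1:]:
                direct.append((first, second))
        # Pair each k-mer with its reverse complement once only
        # (kmer < rc); self-complementary k-mers are skipped.
        rc = reverse_complement(kmer)
        if kmer < rc:
            for first in positions:
                for second in index.get(rc, []):
                    inverted.append((first, second))
    return direct, inverted
```

The number of pairs grows quadratically with the copy number of a k-mer, so a dictionary index of this kind is only reasonable for small examples.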

Second step: from seeds to repeats. An algorithm based on a local alignment procedure (23) is used to extend the edges of the seeds into larger repeats. This method produces non-exact repeats by extending a seed on both sides for as long as similarity remains high. Except for the construction of the score matrix, the extension process is the same as the one we used to analyse eukaryote chromosomes (16).

Nucleotide composition differs widely between genomes, with G + C content ranging from 25 to 75% (24). Therefore, if an identity matrix is used for the local alignment, seeds of the same size in chromosomes with a very unbalanced distribution of nucleotides (e.g. Ureaplasma urealyticum where A ≈ T ≈ 0.37 and C ≈ G ≈ 0.13) tend to produce larger repeats than in genomes with equal frequencies (e.g. E.coli). In order to avoid this effect, we used an empirical scoring matrix for each chromosome, which takes into account its specific composition. These matrices provide a better score for matches between rare nucleotides:

\[ \mathrm{match}_{i/i} = 100 \times (1 - p_i^2); \qquad \mathrm{match}_{N/i} = 25 \]

\[ \mathrm{mismatch}_{i/j} = -100 \times (1 - p_i \times p_j); \qquad \mathrm{gapopen} = -400; \qquad \mathrm{gapext} = -100 \]

where $p_i$ is the frequency of nucleotide i. By building these matrices for all species, we observed scores for matches ranging from 86 to 98 and scores for mismatches ranging from −98 to −86. Thus the score of gapopen is always less than 4 × mismatch and the score of gapext always less than 1 × mismatch. We also tried other matrices that gave similar results.
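The composition-dependent scores can be sketched as follows. The mismatch term is read as 1 − p_i × p_j, which is the reading that reproduces the quoted score range (86 to 98 for matches, −98 to −86 for mismatches). The function name `scoring_matrix` and the dictionary layout are illustrative assumptions, not the authors' code.

```python
GAP_OPEN = -400   # gap opening penalty quoted in the text
GAP_EXT = -100    # gap extension penalty quoted in the text
MATCH_N = 25      # score for aligning an ambiguous N with any base


def scoring_matrix(freqs):
    """Build {(base_i, base_j): score} from nucleotide frequencies.

    freqs maps each of A, C, G, T to its frequency in the chromosome.
    Matches between rare bases score higher; mismatches between rare
    bases are penalised more. Hypothetical helper for illustration.
    """
    matrix = {}
    for i, p_i in freqs.items():
        for j, p_j in freqs.items():
            if i == j:
                matrix[(i, j)] = 100.0 * (1.0 - p_i ** 2)
            else:
                matrix[(i, j)] = -100.0 * (1.0 - p_i * p_j)
    return matrix
```

With the frequencies quoted for the A + T rich example (A ≈ T ≈ 0.37, C ≈ G ≈ 0.13), an A/A match scores about 86.3 and a C/C match about 98.3, in line with the range given above.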

Third step: removing repeats that are not statistically significant. Since seeds are rather small, many repeats may not have statistically significant lengths. To remove these non-significant repeats, we built, for each chromosome, 10 additional random chromosomes by shuffling it with respect to its trinucleotide composition (Markov chains of order 2). In these random sequences, repeats were detected as in real sequences (steps 1 and 2). Afterwards, we built a distribution of observed alignment scores from the set of repeats detected in the 10 random chromosomes. We then defined a threshold of significance, corresponding to 0.001 of this distribution. Below this minimal score (Smin), repeats were regarded as non-significant and removed from further analysis. Smin depends essentially on the size and composition of the genome (and naturally on our choice of scoring system) and ranges from 2052 (Chlamydia pneumoniae) to 2258 (Mycoplasma pulmonis). Using score (S), length (L) and identity (Id), characteristics of some pertinent repeats from these two organisms are given with more details: (i) for C.pneumoniae, the smallest score corresponds to S = 2052, L = 36 and Id = 80.6%; the medians of the distributions being S = 4505, L = 220 and Id = 63.1%; (ii) for M.pulmonis, the smallest score corresponds to S = 2258, L = 82, Id = 71.7%; the medians being S = 3005, L = 90 and Id = 68.9%.
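A rough sketch of this filtering step is given below, under two stated assumptions. First, the random chromosome is generated from an order-2 Markov chain fitted to the real sequence, which only approximately preserves trinucleotide composition (the text describes a shuffle). Second, "0.001 of this distribution" is read as the upper 0.1% tail of the null scores. Both function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def markov2_random_sequence(seq, rng):
    """Generate a random sequence of len(seq) from an order-2 Markov
    chain fitted to seq (approximation of a trinucleotide shuffle)."""
    transitions = defaultdict(Counter)
    for i in range(len(seq) - 2):
        transitions[seq[i:i + 2]][seq[i + 2]] += 1

    out = list(seq[:2])
    while len(out) < len(seq):
        counts = transitions.get("".join(out[-2:]))
        if not counts:
            # Context never followed by a base in seq: fall back to
            # the counts of a randomly chosen known context.
            counts = transitions[rng.choice(list(transitions))]
        bases, weights = zip(*sorted(counts.items()))
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


def significance_threshold(null_scores, tail=0.001):
    """Return the score reached or exceeded by a fraction `tail`
    of the null scores (assumed reading of the 0.001 cut-off)."""
    ranked = sorted(null_scores, reverse=True)
    k = max(1, round(len(ranked) * tail))
    return ranked[k - 1]
```

In use, repeats would be detected in each random chromosome, their alignment scores pooled into `null_scores`, and real repeats scoring below the returned value discarded.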

Fourth step: determining family sizes. At this stage, all significant repeats are given as a series of pairs. However, many repeats are organised in multicopy families (e.g. IS and rRNA operons). Hence, we developed a procedure to detect such multicopy families in our data set.

To do so, we built, for each chromosome, a map in which each position is linked to its 'n-plication' degree: unique, duplicated, triplicated, etc. These maps were built by counting, for each chromosome position, the number of times this position is found in repeats (direct and inverted ones were pooled together). Each pair was then associated with the map and the family size of each repeat was determined.
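The map described above can be sketched as a per-position counter. The sketch assumes half-open (start, end) coordinates and assumes that every pair within a family has been detected, so that a position found in c pairs belongs to a family of c + 1 copies. The function name `plication_map` is hypothetical.

```python
def plication_map(chrom_length, repeat_pairs):
    """Return the n-plication degree of every chromosome position.

    repeat_pairs is an iterable of ((start1, end1), (start2, end2))
    with half-open coordinates; direct and inverted pairs are pooled.
    Degree 1 = unique, 2 = duplicated, 3 = triplicated, and so on.
    """
    hits = [0] * chrom_length
    for copy1, copy2 in repeat_pairs:
        for start, end in (copy1, copy2):
            for pos in range(start, end):
                hits[pos] += 1
    # A position seen in c pairs is assumed to have c + 1 copies.
    return [count + 1 for count in hits]
```

For three mutually similar copies A, B and C, the pairs (A,B), (A,C) and (B,C) each cover every copy twice, which yields degree 3 at those positions.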

Density of repeats

In order to characterise the repeats, we used two measures of density, the density in number and the density in length. They are defined as:

DN = no. of copies/size of chromosome (Mb)

DL = 100 × [size of repeat sequence (bp)/size of chromosome (bp)]
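The two densities translate directly into code. The function names and the numbers in the usage note are hypothetical and serve only to show the units: DN is in copies per Mb and DL is a percentage.

```python
def density_in_number(n_copies, chrom_size_bp):
    """DN: number of repeat copies per megabase of chromosome."""
    return n_copies / (chrom_size_bp / 1e6)


def density_in_length(repeat_bp, chrom_size_bp):
    """DL: percentage of the chromosome covered by repeats."""
    return 100.0 * repeat_bp / chrom_size_bp
```

For a made-up 4 Mb chromosome carrying 200 repeat copies that cover 80 kb, these give DN = 50 copies per Mb and DL = 2%.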

Nucleotide complexity

Complexity is frequently used as a compact measure of how far the nucleotide distribution departs from equal frequencies. In this context, information entropy has been proposed to describe biases of mononucleotide distributions (25):

\[ H = -\sum_{i \in \{A,C,G,T\}} p_i \log_4 p_i \]

where $p_i$ is the frequency of nucleotide i. If a sequence exhibits an equal repartition of its four nucleotides (maximum complexity), its entropy is 1. In bacterial chromosomes it ranges from 0.91 to 1.
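A minimal sketch of this measure follows, assuming the entropy is the Shannon entropy taken with a base-4 logarithm, which is the normalisation that gives exactly 1 for four equally frequent nucleotides. The function name `nucleotide_entropy` is hypothetical.

```python
import math


def nucleotide_entropy(freqs):
    """Shannon entropy of mononucleotide frequencies, in base 4.

    freqs is an iterable of the four nucleotide frequencies
    (summing to 1). Zero frequencies contribute nothing.
    """
    return -sum(p * math.log(p, 4) for p in freqs if p > 0)
```

Equal frequencies give 1, and the strongly biased composition quoted earlier (A ≈ T ≈ 0.37, C ≈ G ≈ 0.13) gives about 0.91, which matches the lower end of the range reported for bacterial chromosomes.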

Proportion of CDR

CDR were originally defined as repeats with a distance between their two copies of <1 kb. We estimated the proportion of CDR expected if repeats are spread randomly along a chromosome. The proportion of CDR is calculated as the ratio between the number of CDR and the total number of repeats. Two cases were taken into account. (i) If the chromosome is circular, the largest spacer size is L/2, where L is the chromosome length. The distribution of spacer size is constant from 0 to L/2. So, the proportion of CDR in a circular chromosome is 1000 × 2/L. (ii) If the chromosome is linear, the largest spacer size is L and the spacer distribution decreases linearly from 0 to L. Using the intercept theorem of Thales (or any analytical demonstration), it can easily be shown that the proportion of CDR is 1000/L × (2 − 1000/L).
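The two expected proportions can be written as one small function. For the linear case the formula follows from the spacer density 2(L − d)/L² integrated from 0 to the 1 kb cut-off. The function name and the `cutoff` parameter are illustrative assumptions.

```python
def expected_cdr_proportion(chrom_length, circular=True, cutoff=1000):
    """Expected fraction of repeats with spacer < cutoff (bp) when
    the two copies are placed uniformly at random on a chromosome."""
    x = cutoff / chrom_length
    if x >= 1.0:
        return 1.0  # chromosome shorter than the cut-off
    if circular:
        # Spacer uniform on [0, L/2].
        return min(1.0, 2.0 * x)
    # Spacer density decreases linearly from 0 to L.
    return x * (2.0 - x)
```

For a chromosome of 1 Mb, the expected proportion is about 0.2% in both cases, so any substantially larger observed fraction of CDR indicates an excess over random placement.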

What repeats have we detected?

We have found a large number of repeats in most (but not all) bacterial genomes (Table 2). In order to characterise these repeats, we used two measures of repeat density, DN and DL (see Materials and Methods). As expected, both densities were positively correlated (τ = 0.63, P < 10⁻⁴, Kendall τ rank test): a chromosome with many repeats also devotes a high proportion of its sequence to duplications. However, the biological interpretation of these measures may be quite different: DN can be likened to the rate of amplification (a balance between duplication and deletion processes) and DL to the history of the chromosomes, a measure of the redundancy tolerated by a chromosome. Thus, DN and DL should be analysed in parallel as they give complementary information on chromosomal redundancy. The data in Table 2 bring to the fore two issues. (i) Chromosomes of related organisms often exhibit similar densities of repeats: both Chlamydia trachomatis strains, the three C.pneumoniae strains, the three Pyrococcus strains, both Mycobacterium tuberculosis strains, both Staphylococcus aureus strains, both Neisseria meningitidis strains and both Helicobacter pylori strains. However, exceptions do exist. Escherichia coli O157:H7 is more repeated than K12, in agreement with previous observations (26). Also, when we broaden the phylogenetic range, we observe that the four Mycoplasma spp. show very different densities (DN and DL), indicating fast divergence, possibly due to their rudimentary repair mechanisms and to the selective pressure for variation in these pathogens (27). (ii) Both DN and DL exhibit a positive correlation with chromosome size (τ = 0.24, P < 10⁻³ for DN and τ = 0.37, P < 10⁻⁴ for DL). These observations are in good agreement with previous observations on parts of both bacterial genomes and eukaryote genomes (16,28) (Fig. 2).

Since we were interested in the repeats' origins and in the supposition that they arise by duplication, we determined the proportions of two-copy repeats (and respective densities DN2 and DL2) among all repeats (Table 2). As expected, DN2 is positively correlated with DN (τ = 0.77, P < 10⁻⁴) and DL2 with DL (τ = 0.73, P < 10⁻⁴). Note that, in contrast to eukaryote genomes in which DN2 is similar for chromosomes of the same species (16), densities varied between the two chromosomes of Deinococcus radiodurans and also between the two chromosomes of Vibrio cholerae.

Chromosomes containing transposable elements exhibit lower DN2/DN and DL2/DL ratios (P < 0.01, Mann–Whitney rank tests). Since transposable elements are mostly multicopy families, this can be easily understood. We observed few exceptions (low ratios in the absence of transposable elements), involving small genomes and, in particular, Mycoplasma genitalium and Mycoplasma pneumoniae. These repeats are associated with the immunodominant proteins of these genomes and are related to antigenic and tissue tropism variation (27).

Did interspersed repeats originate from tandems?

In order to test whether our model holds for Bacteria and Archaea we have tested its four major predictions. If interspersed repeats originate massively from tandem repeats, one might expect that (i) direct repeats are more numerous than inverted ones and that (ii) CDR are in large excess. Since the exchange rate between CDR is expected to be negatively correlated with spacer size and positively correlated with repeat length, there should be (iii) a negative correlation between repeat similarity and spacer size and (iv) a positive correlation between repeat length and spacer size. Since we are interested in the origin of repeats, we decided to analyse only two-copy repeats further. This removed all low complexity repeats from our data set. Based on the annotations, repeats with at least half of their length in rRNA, tRNA or functional transposase genes represent ≤5% of our two-copy repeats, except for C.trachomatis (14%, four of 28) and the second chromosome of V.cholerae (7%, two of 29) (data not shown).

Direct repeats are more numerous than inverted ones. The large majority of the chromosomes (47 of 53) exhibit a higher density of two-copy direct repeats as compared with inverted ones (P < 0.001, binomial test), although sometimes the relative difference is not very high (Fig. 3). It is worth noticing that the two chromosomes that exhibit the largest excess of direct repeats are M.genitalium and M.pneumoniae. This is due to the previously described repeats located inside the adhesin genes.
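The binomial test quoted above can be reproduced with an exact tail sum. The sketch assumes a one-sided test against the null hypothesis that a chromosome is equally likely to show an excess of direct or of inverted repeats (p = 0.5); the text does not state whether the authors used a one-sided or two-sided version. The function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """Exact P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p ** k * (1.0 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# 47 of 53 chromosomes show more direct than inverted repeats.
p_value = binomial_upper_tail(47, 53)
```

Under these assumptions the tail probability is of the order of 10⁻⁹, well below the P < 0.001 reported, and doubling it for a two-sided test does not change that conclusion.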

CDR are over-represented. We estimated the numbers and densities of two-copy CDR, N2CDR and DN2CDR, respectively, and the theoretical number of CDR as a function of the number of direct repeats in linear and circular chromosomes (see Materials and Methods). As predicted by the model, CDR are over-represented in all chromosomes, taking into account the number of repeats (Table 3). The only exception is Buchnera sp., for which there are few CDR repeats, but it is unclear if this is a statistical artifact or has biological meaning. The Buchnera sp. genome is thought to be undergoing reductive evolution (3) and lacks an evident RecA homologue (29). Further, there is evidence that intracellular bacteria are subject to weaker selection (30). Thus, the absence of CDR could be the result of the reductive evolution process. Even if CDR are created, selection will not prevent them from being deleted. This deletion could arise easily since CDR deletion is mainly RecA independent.

Identity and length are constrained by spacer size. We looked for correlations between identity and spacer size within two-copy CDR for species in which there were at least 20 CDR (24 chromosomes). In 18 chromosomes identity was significantly negatively correlated with spacer size (P < 0.01, Table 4). In order to extend our analysis, we also took into account multicopy repeats for chromosomes with fewer than 20 two-copy CDR or for those exhibiting a non-significant correlation for two-copy CDR (17 + 6 chromosomes). However, because the number of couples increases rapidly when families become very large [c = n × (n − 1)/2, where c is the number of couples and n the number of copies], we retained only repeats with between two and five copies. This test identified significant negative correlations for 15 additional chromosomes (P < 0.01). Thus, out of the 41 chromosomes tested, 33 exhibited a significant negative correlation between identity and spacer size. Table 4 suggests that many others are weakly correlated.

Correlations between length and spacer size were tested under the same conditions as for identity (Table 5) and were also in agreement with the model. A positive correlation was found in 24 of the 41 chromosomes at P < 0.01 and in nine further chromosomes at a less significant α level (P < 0.05). Although very significant, these results are weaker than for the correlation between identity and spacer size and this deserves some comment. In the model, interspersed repeats are mostly created as identical tandem repeats, but their size can vary. Successive rounds of recombinational exchange constrain these repeats to be both highly identical and small due to the deletion bias mentioned above. Therefore, while the conversion process only maintains the pre-existing characteristics of the repeats (a high identity), the deletion process establishes an additional new constraint (small length). It is then conceivable that more rounds of exchange are required to establish the correlation between length and spacer size, which would explain the weaker correlations.

Is tandem repeat creation modulated by chromosomal characteristics?

Since the previous results suggest the adequacy of our model, we proceeded to test the influence of chromosomal features on the duplication process, and in particular of nucleotide composition biases. Bacterial chromosomes exhibit large differences in their nucleotide composition, especially in terms of G + C composition, which can vary from 25 to 75% (24). We used the information entropy to measure the composition bias and found a significant negative correlation between entropy (our measure of composition bias) and the density of two-copy repeats, DN2 (τ = −0.34, P < 10⁻³, Fig. 4), as well as with total repeat densities, DN (τ = −0.34, P < 10⁻³, Fig. 4). One would expect more biased random chromosomes to be more repetitive, since they use a subset of the possible symbols more frequently. However, our methodology to search for repeats already tackles this effect: we determined threshold scores based on empirical distributions for each genome and also defined specific scoring matrices, calculated taking into account the nucleotide compositions of the genomes (see Materials and Methods). This is why the minimal significant alignment score is larger for more biased genomes, such as some Mycoplasma spp. Since methodological biases were taken into account in the search for repeats, one is inclined to explain these results from a biological point of view.

Whatever the mechanism of tandem repeat genesis, it always requires pre-existing small repeats (11). Levinson and Gutman (8) have proposed that small repeats appear by chance and are at the origin of larger repeats that are created by slipped strand mispairing between these small repeats. It so happens that low complexity genomes, by chance alone, present a larger number of small repeats. If we accept the hypothesis that tandem genesis mechanisms are not down-regulated in low complexity genomes, then we are immediately led to the conclusion that tandem genesis must be more frequent in these genomes, simply due to their higher compositional bias. Thus, we propose that in such genomes a higher number of primers appear by chance and lead to more abundant repeats.

Small, non-duplicated repeats can be used as primers for initiation of tandem duplications. Thus, many types of repeats are related: small repeats are transformed into tandem repeats, which are then turned into interspersed repeats. As a consequence one gains by analysing these repeats together, instead of dividing them into different classes.

In this respect, it is interesting to note that chromosomes 2 and 3 of Plasmodium falciparum exhibit a very high density of repeats (as compared with eukaryote chromosomes of the same size) (16) which is associated with a very low G + C content (18%). It is therefore tempting to suggest that in eukaryote chromosomes complexity of the genome also plays an important role in the mechanisms of repeat generation. Naturally, the statistical testing of this generalisation will have to await the availability of a larger sample of complete eukaryote genomes.


We have shown that a model for the dynamics of repeats (previously established in Eukarya), based on tandem genesis with further dispersion, holds for most Bacteria and Archaea. As predicted by the model, we show that in most genomes (i) direct repeats are more numerous than inverted repeats, (ii) CDR are in large excess, (iii) there is a negative correlation between repeat identity and spacer size and (iv) there is a positive correlation between repeat length and spacer size. This strongly suggests that despite their diversity, intrachromosomal repeats of all genomes share similar dynamics that are probably related to very ancient mechanisms shared by the three domains of life. Naturally, this model is not exclusive of other mechanisms of duplication (transposition, horizontal gene transfer, insertions, hyperploidisation, etc.).

We have also shown that nucleotide composition biases of the chromosome strongly influence the rate of tandem repeat creation and thus the rate of repeat amplification. Other effects are likely to shape the dynamics of bacterial repeats and the large availability of complete genomes will shed light on them. This will certainly provide new clues in deciphering the dynamics of repeats in bacterial genomes and shed additional light on genome evolution.


We would like to thank I. Gonçalves, D. Higuet, E. Maillier and J. Pothier for their scientific help and their friendly support. We would also like to thank P. Avner and E. Leguern for their helpful remarks on previous versions of this manuscript. This work was supported by grants from the Association pour la Recherche sur le Cancer. G.A. was funded by the Fondation pour la Recherche Médicale. E.C. and P.N. are members of Université Pierre et Marie Curie (Paris, France).

1. Ohno,S. (1970) Evolution by Gene Duplication. Springer-Verlag, Heidelberg, Germany.
2. Wolfe,K.H. (2001) Yesterday's polyploids and the mystery of diploidization. Nature Rev. Genet., 2, 333–341. [pmid: 11331899]
3. Andersson,S.G. and Kurland,C.G. (1998) Reductive evolution of resident genomes. Trends Microbiol., 6, 263–268. [pmid: 9717214]
4. Katti,M.V., Ranjekar,P.K. and Gupta,V.S. (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol., 18, 1161–1167. [pmid: 11420357]
5. Le Fleche,P., Hauck,Y., Onteniente,L., Prieur,A., Denoeud,F., Ramisse,V., Sylvestre,P., Benson,G., Ramisse,F. and Vergnaud,G. (2001) A tandem repeats database for bacterial genomes: application to the genotyping of Yersinia pestis and Bacillus anthracis. BMC Microbiol., 1, 2. [pmid: 11299044]
6. Levinson,G. and Gutman,G.A. (1987) High frequencies of short frameshifts in poly-CA/TG tandem repeats borne by bacteriophage M13 in Escherichia coli K-12. Nucleic Acids Res., 15, 5323–5338. [pmid: 3299269]
7. van Belkum,A., van Leeuwen,W., Scherer,S. and Verbrugh,H. (1999) Occurrence and structure-function relationship of pentameric short sequence repeats in microbial genomes. Res. Microbiol., 150, 617–626. [pmid: 10673001]
8. Levinson,G. and Gutman,G.A. (1987) Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol. Biol. Evol., 4, 203–221. [pmid: 3328815]
9. Yeramian,E. and Buc,H. (1999) Tandem repeats in complete bacterial genome sequences: sequence and structural analyses for comparative studies. Res. Microbiol., 150, 745–754. [pmid: 10673012]
10. Mahillon,J. and Chandler,M. (1998) Insertion sequences. Microbiol. Mol. Biol. Rev., 62, 725–774. [pmid: 9729608]
11. Romero,D. and Palacios,R. (1997) Gene amplification and genomic plasticity in prokaryotes. Annu. Rev. Genet., 31, 91–111. [pmid: 9442891]
12. Rocha,E.P.C., Danchin,A. and Viari,A. (1999) Analysis of long repeats in bacterial genomes reveals alternative evolutionary mechanisms in Bacillus subtilis and other competent prokaryotes. Mol. Biol. Evol., 16, 1219–1230. [pmid: 10486977]
13. Kraus,E., Leung,W.Y. and Haber,J.E. (2001) Break-induced replication: a review and an example in budding yeast. Proc. Natl Acad. Sci. USA, 98, 8255–8262. [pmid: 11459961]
14. Paques,F. and Haber,J.E. (1999) Multiple pathways of recombination induced by double-strand breaks in Saccharomyces cerevisiae. Microbiol. Mol. Biol. Rev., 63, 349–404. [pmid: 10357855]
15. Achaz,G., Coissac,E., Viari,A. and Netter,P. (2000) Analysis of intrachromosomal duplications in yeast Saccharomyces cerevisiae: a possible model for their origin. Mol. Biol. Evol., 17, 1268–1275. [pmid: 10908647]
16. Achaz,G., Netter,P. and Coissac,E. (2001) Study of intrachromosomal duplications among the eukaryote genomes. Mol. Biol. Evol., 18, 2280–2288. [pmid: 11719577]
17. Chedin,F., Dervyn,E., Dervyn,R., Ehrlich,S.D. and Noirot,P. (1994) Frequency of deletion formation decreases exponentially with distance between short direct repeats. Mol. Microbiol., 12, 561–569. [pmid: 7934879]
18. Lovett,S.T., Gluckman,T.J., Simon,P.J., Sutera,V.J. and Drapkin,P.T. (1994) Recombination between repeats in Escherichia coli by a recA-independent, proximity-sensitive mechanism. Mol. Gen. Genet., 245, 294–300. [pmid: 7816039]
19. Bi,X. and Liu,L.F. (1996) recA-independent DNA recombination between repetitive sequences: mechanisms and implications. Prog. Nucleic Acid Res. Mol. Biol., 54, 253–292. [pmid: 8768077]
20. Peeters,B.P., de Boer,J.H., Bron,S. and Venema,G. (1988) Structural plasmid instability in Bacillus subtilis: effect of direct and inverted repeats. Mol. Gen. Genet., 212, 450–458. [pmid: 3138528]
21. Kurtz,S. and Schleiermacher,C. (1999) REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics, 15, 426–427. [pmid: 10366664]
22. Karlin,S. and Ost,F. (1985) Maximal segmental match length among random sequences from a finite alphabet. In Cam,L.M.L. and Olshen,R.A. (eds), Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer. Association for Computing Machinery, New York, NY, Vol. 1, pp. 225–243.
23. Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197. [pmid: 7265238]
24. Sueoka,N. (1962) On the genetic basis of variation and heterogeneity of DNA base composition. Proc. Natl Acad. Sci. USA, 48, 582–592. [pmid: 13918161]
25. Schneider,T.D., Stormo,G.D., Gold,L. and Ehrenfeucht,A. (1986) Information content of binding sites on nucleotide sequences. J. Mol. Biol., 188, 415–431. [pmid: 3525846]
26. Hayashi,T., Makino,K., Ohnishi,M., Kurokawa,K., Ishii,K., Yokoyama,K., Han,C.G., Ohtsubo,E., Nakayama,K., Murata,T. et al. (2001) Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res., 8, 11–22. [pmid: 11258796]
27. Rocha,E.P.C. and Blanchard,A. (2002) Genomic repeats, genome plasticity and the dynamics of Mycoplasma evolution. Nucleic Acids Res., 30, 2031–2042. [pmid: 11972343]
28. Coissac,E., Maillier,E. and Netter,P. (1997) A comparative study of duplications in bacteria and eukaryotes: the importance of telomeres. Mol. Biol. Evol., 14, 1062–1074. [pmid: 9335146]
29. Shigenobu,S., Watanabe,H., Hattori,M., Sakaki,Y. and Ishikawa,H. (2000) Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature, 407, 81–86. [pmid: 10993077]
30.. Ochman, H.. and Moran,N.A. (2001) Genes lost and genes found: evolution of bacterial pathogenesis and symbiosis. Science, 292, 1096?1099. [pmid: 11352062]


[Figure ID: gkf391f1]
Figure 1 

A model of interspersed repeat dynamics. In this model, interspersed repeats originate mainly from tandem repeats, which can be separated by further chromosomal rearrangements. In newly created repeats with a small spacer, (i) the conversion rate is high, keeping the two copies identical, and (ii) the deletion rate is also high, so that over a longer time scale only small repeats are retained. However, if one or more rearrangements (e.g. insertion, translocation and/or inversion) separate the two copies, both deletion and conversion rates decrease markedly. Both copies are then free to evolve.

[Figure ID: gkf391f2]
Figure 2 

Repeat density as a function of chromosome size. DN (number of repeat copies per Mb) is plotted against chromosome size for each chromosome, illustrating the positive correlation between the two.

[Figure ID: gkf391f3]
Figure 3 

Densities of inverted repeats versus direct repeats. For each of the 53 chromosomes, we plotted the densities in number (DN2 = two-copy number/size in Mb) of inverted repeats as a function of DN2 of direct repeats. Because of the large difference in densities between genomes, two scales have been used. Abbreviations of species used in this figure correspond to those described in Table 1. The density of direct repeats is generally greater than the density of inverted repeats, but both are of the same order of magnitude.

[Figure ID: gkf391f4]
Figure 4 

Complexity of chromosomes as a function of repeat density. The entropy of each of the 53 chromosomes is plotted as a function of its global repeat density. Entropy measures the nucleotide complexity of a sequence: it reaches its maximum (1) when each nucleotide frequency is 0.25 and is lower otherwise. This figure illustrates that entropy is negatively correlated with repeat density.
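
The normalisation described in this caption (a maximum of 1 when each nucleotide frequency is 0.25) is consistent with a Shannon entropy taken with a base-4 logarithm. The Python sketch below works under that assumption; the function name `nucleotide_entropy` and the choice to ignore characters other than A, C, G and T are illustrative and are not taken from the article.

```python
import math
from collections import Counter


def nucleotide_entropy(sequence):
    """Shannon entropy of the nucleotide composition, in base 4.

    Assumption: base-4 logarithm, so that equal frequencies (0.25 each)
    give the maximum value of 1. Characters other than A, C, G and T
    (for example N) are ignored.
    """
    counts = Counter(base for base in sequence.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    entropy = 0.0
    for count in counts.values():
        frequency = count / total
        entropy -= frequency * math.log(frequency, 4)
    return entropy
```

With this definition, a sequence with equal nucleotide frequencies scores 1, a sequence made of a single nucleotide scores 0, and a sequence that is half A and half T scores 0.5, so a more biased composition gives a lower entropy.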

[TableWrap ID: gkf391tb1] Table 1.  Organisms analysed

[TableWrap ID: gkf391tb2] Table 2.  Densities of repeats

(a) Chromosomes containing transposable elements.

(b) Abbreviations and order are those used in Table 1.

(c) Size of the chromosome (in Mb).

(d) DN, number of copies per Mb. DN2 is the density for two-copy CDR only.

(e) DL is the proportion of the chromosome included in repeats. DL2 is the proportion of the chromosome included in two-copy CDR only.
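
As an illustration of the two densities defined in these footnotes, the following Python sketch computes DN (copies per Mb) and DL (proportion of the chromosome included in repeats) from a list of repeat copies. The input format (half-open `(start, end)` intervals in bp), the merging of overlapping copies when computing DL, and the function name `repeat_densities` are assumptions made for the example, not details from the article.

```python
def repeat_densities(copies, chromosome_length):
    """Return (DN, DL) for one chromosome.

    copies: list of (start, end) half-open intervals in bp, one per
            repeat copy (hypothetical input format).
    DN: number of copies per Mb.
    DL: fraction of the chromosome covered by at least one copy;
        overlapping copies are merged so no base is counted twice
        (an assumption of this sketch).
    """
    if chromosome_length <= 0:
        raise ValueError("chromosome_length must be positive")
    dn = len(copies) / (chromosome_length / 1_000_000)

    covered = 0
    current_start = current_end = None
    for start, end in sorted(copies):
        if current_end is None or start > current_end:
            # Close the previous merged block and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent copy: extend the current block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start

    return dn, covered / chromosome_length
```

For example, three copies at `(0, 500)`, `(400, 1000)` and `(5000, 5500)` on a 1 Mb chromosome give DN = 3 copies/Mb and DL = 0.0015, because the first two copies overlap and cover 1000 bp together.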

[TableWrap ID: gkf391tb3] Table 3.  Close direct repeats

(a) Abbreviations and order are those used in Table 1.

(b) Observed number of two-copy CDR.

(c) Probability of finding $N_{2\mathrm{CDR}}$ or more CDR under a random model. In the random model, one can estimate the probability of finding at least $N_{2\mathrm{CDR}}$ CDR among the $N_2$ two-copy direct repeats. This probability is $1 - [B(0) + \dots + B(N_{2\mathrm{CDR}} - 1)]$, where $B(n)$ is the probability of finding $n$ CDR among the $N_2$ direct repeats under a binomial law in which the frequency of CDR is $2000/L$ for circular chromosomes and $2000/L - (1000/L)^2$ for linear ones ($L$ = chromosome length). At an α risk of $10^{-4}$, we assumed that 52/53 chromosomes are over-represented in CDR (with a risk of 0.005 of getting one or more false positives).

(d) Density in number (copies/Mb) for two-copy CDR.
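
The random model of footnote c can be sketched in Python as follows. The CDR frequencies are those given in the footnote; the function names, the use of `math.lgamma` to evaluate the binomial terms in log space, and the choice to sum the upper tail directly (algebraically the same as one minus the lower tail, but numerically safer for very small probabilities) are choices made for this example.

```python
import math


def cdr_frequency(chromosome_length, circular=True):
    """Expected CDR frequency under the random model of footnote c."""
    p = 2000 / chromosome_length
    if not circular:
        # Correction for linear chromosomes given in the footnote.
        p -= (1000 / chromosome_length) ** 2
    return p


def binomial_pmf(n, trials, p):
    """B(n): probability of exactly n successes in `trials`, 0 < p < 1."""
    log_pmf = (
        math.lgamma(trials + 1)
        - math.lgamma(n + 1)
        - math.lgamma(trials - n + 1)
        + n * math.log(p)
        + (trials - n) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def prob_at_least(observed_cdr, n_direct_repeats, p):
    """P(X >= observed_cdr), i.e. 1 - [B(0) + ... + B(observed_cdr - 1)]."""
    upper_tail = sum(
        binomial_pmf(n, n_direct_repeats, p)
        for n in range(observed_cdr, n_direct_repeats + 1)
    )
    return min(1.0, upper_tail)
```

A chromosome would then be called over-represented in CDR when `prob_at_least(n_2cdr, n_2, cdr_frequency(L, circular))` falls below the chosen α risk, where `n_2cdr`, `n_2` and `L` stand for the observed values for that chromosome.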

[TableWrap ID: gkf391tb4] Table 4.  Correlations between identity and spacer size for CDR

(a) The 24 chromosomes with more than 20 two-copy CDR were used to test correlations.

(b) Abbreviations are those used in Table 1.

(c) Coefficients of Kendall τ rank tests between spacer size and identity for CDR.

(d) Probability associated with the Kendall τ rank tests. We assumed that, at an α risk of 0.01, 18/24 correlations are significant (with a risk of 0.21 of getting at least one false positive and of 0.02 of getting at least two false positives).

(e) We used two-copy, three-copy, four-copy and five-copy CDR to test 17 new chromosomes (those with fewer than 20 two-copy CDR) and to re-test the six chromosomes whose two-copy correlation was not significant at the 0.01 level.

(f) Probability associated with the Kendall τ rank tests. We assumed that, at an α risk of 0.01, 15/23 correlations are significant (with a risk of 0.21 of getting at least one false positive and of 0.02 of getting at least two false positives).
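
The false-positive risks quoted in footnotes d and f can be reproduced with a short binomial calculation. The sketch below assumes that the tests are independent and that each has a probability α of giving a false positive; the function name is illustrative.

```python
import math


def prob_at_least_k_false_positives(k, n_tests, alpha):
    """P(at least k false positives among n_tests independent tests).

    Assumption: each test independently yields a false positive with
    probability alpha, so the count follows a binomial law.
    """
    below_k = sum(
        math.comb(n_tests, i) * alpha**i * (1 - alpha) ** (n_tests - i)
        for i in range(k)
    )
    return 1.0 - below_k
```

For 24 tests at α = 0.01 this gives about 0.21 for at least one false positive and about 0.02 for at least two, matching footnote d. For the 53 tests of Table 3 at α = 10^-4, it gives about 0.005.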

[TableWrap ID: gkf391tb5] Table 5.  Correlations between length and spacer size for CDR

(a) The 24 chromosomes with more than 20 two-copy CDR were used to test correlations.

(b) Abbreviations are those used in Table 1.

(c) Coefficients of Kendall τ rank tests between spacer size and length for CDR.

(d) Probability associated with the Kendall τ rank tests. We assumed that, at an α risk of 0.01, 13/24 correlations are significant (with a risk of 0.21 of getting at least one false positive and of 0.02 of getting at least two false positives).

(e) We used two-copy, three-copy, four-copy and five-copy CDR to test 17 new chromosomes and re-test the 11 non-significant ones.

(f) Probability associated with the Kendall τ rank tests. We assumed that, at an α risk of 0.01, 11/27 correlations are significant (with a risk of 0.24 of getting at least one false positive and of 0.03 of getting at least two false positives).
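
For readers unfamiliar with the Kendall τ coefficient used in Tables 4 and 5, the following Python sketch computes it for paired observations such as spacer size and repeat length. The source does not state which variant was used; the tie-corrected τ-b form, the simple quadratic-time loop and the function name `kendall_tau_b` are assumptions of this example, and the associated probability is not computed here.

```python
import math


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b, with tie correction).

    Counts concordant and discordant pairs over all pairs of
    observations; pairs tied in x or in y are removed from the
    corresponding term of the denominator.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    concordant = discordant = ties_x = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    n_pairs = n * (n - 1) // 2
    denominator = math.sqrt((n_pairs - ties_x) * (n_pairs - ties_y))
    if denominator == 0:
        raise ValueError("tau is undefined when all values of x or y are tied")
    return (concordant - discordant) / denominator
```

With hypothetical spacer sizes `[10, 50, 200, 800]` and identities `[100, 99, 97, 95]`, every pair is discordant and the function returns -1, the value expected when identity strictly decreases as the spacer grows.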

