Adjust quality scores from alignment and improve sequencing accuracy.  
Jump to Full Text  
MedLine Citation:

PMID: 15459287 Owner: NLM Status: MEDLINE 
Abstract/OtherAbstract:

In shotgun sequencing, statistical reconstruction of a consensus from alignment requires a model of measurement error. Churchill and Waterman proposed one such model and an expectationmaximization (EM) algorithm to estimate sequencing error rates for each assembly matrix. Ewing and Green defined Phred quality scores for basecalling from sequencing traces by training a model on a large amount of data. However, sample preparations and sequencing machines may work under different conditions in practice and therefore quality scores need to be adjusted. Moreover, the information given by quality scores is incomplete in the sense that they do not describe error patterns. We observe that each nucleotide base has its specific error pattern that varies across the range of quality values. We develop models of measurement error for shotgun sequencing by combining the two perspectives above. We propose a logistic model taking quality scores as covariates. The model is trained by a procedure combining an EM algorithm and model selection techniques. The training results in calibration of quality values and leads to a more accurate construction of consensus. Besides Phred scores obtained from ABI sequencers, we apply the same technique to calibrate quality values that come along with Beckman sequencers. 
Authors:

Ming Li; Magnus Nordborg; Lei M Li 
Related Documents
:

11883387  Comparison of predictive models for postoperative nausea and vomiting. 11894037  Relationship between skill and outcome in the laboratorybased model. 2602577  Activationa predictor of need fulfillment in couples. 16925807  Minimal changes in health status questionnaires: distinction between minimally detectab... 21436107  The devil in the details: interactions between the branchlength prior and likelihood m... 21230207  Comparative monte carlo efficiency by monte carlo analysis. 
Publication Detail:

Type: Comparative Study; Evaluation Studies; Journal Article; Research Support, U.S. Gov't, P.H.S. Date: 20040930 
Journal Detail:

Title: Nucleic acids research Volume: 32 ISSN: 13624962 ISO Abbreviation: Nucleic Acids Res. Publication Date: 2004 
Date Detail:

Created Date: 20041001 Completed Date: 20041015 Revised Date: 20091118 
Medline Journal Info:

Nlm Unique ID: 0411011 Medline TA: Nucleic Acids Res Country: England 
Other Details:

Languages: eng Pagination: 518391 Citation Subset: IM 
Affiliation:

Computational Biology, University of Southern California, Los Angeles, CA, USA. 
Export Citation:

APA/MLA Format Download EndNote Download BibTex 
MeSH Terms  
Descriptor/Qualifier:

Algorithms Arabidopsis / genetics Base Sequence Campylobacter jejuni / genetics Consensus Sequence Logistic Models* Reproducibility of Results Sequence Alignment Sequence Analysis, DNA / methods* 
Comments/Corrections 
Full Text  
Journal Information Journal ID (nlmta): Nucleic Acids Res ISSN: 03051048 ISSN: 13624962 Publisher: Oxford University Press, Oxford, UK 
Article Information Download PDF Copyright ? 2004 Oxford University Press Received Day: 9 Month: 7 Year: 2004 Revision Received Day: 8 Month: 9 Year: 2004 Accepted Day: 8 Month: 9 Year: 2004 Print publication date: Year: 2004 Electronic publication date: Day: 30 Month: 9 Year: 2004 Volume: 32 Issue: 17 First Page: 5183 Last Page: 5191 ID: 521663 Publisher Id: gkh850 DOI: 10.1093/nar/gkh850 PubMed Id: 15459287 
Adjust quality scores from alignment and improve sequencing accuracy  
Ming Li  
Magnus Nordborg  
Lei M. Li*  
Computational Biology, University of Southern California, Los Angeles, CA, USA


*To whom correspondence should be addressed. Tel: +1 213 740 2407; Fax: +1 213 740 2437; Email: lilei@usc.edu 
Shotgun sequencing is the standard methodology for genome sequencing (^{1}). Starting with a whole genome, or a large genomic region, short random fragments are generated and sequenced. Enough fragments are sequenced so that almost all positions in the genome or region are covered multiple times just by chance. The standard sequencing procedure is exemplified by the commonly used Phred/Phrap suite of software (^{2},^{3}). First, each fragment is basecalled from its chromatogram, i.e. a vector times series of four fluorescence intensities, and a sequence of A, C, G and T, is inferred. The commercial producers of sequencing devices include Applied Biosystems, Inc., Beckman Coulter, Inc., etc. Typically each base is accompanied by a quality value that is meant to convey some idea of how likely the basecalling is correct. Second, the basecalled sequences are assembled into a contig using an ad hoc alignment algorithm that compares both strands and overlap between fragments. Quality values are taken into account during the alignment to eliminate low quality reads. Third, a consensus sequence is constructed from this alignment by comparing different reads for each position.
The accuracy of the consensus sequence depends on the coverage (i.e. how many independent observations we have for each nucleotide base pair in the genome) and the performance of the basecalling algorithm. The quality values of basecalling play a crucial role in the construction of consensus. If they are misleading or interpreted incorrectly, the consensus sequence will be less reliable. The Phred quality scores for basecalling are defined from sequencing traces in such a way that they have a probabilistic interpretation. This is achieved by training a model on a large amount of data. However, in practice, sample preparations and sequencing machines may work under different conditions and therefore quality scores need to be adjusted. Moreover, the information given by quality scores is incomplete in the sense that they do not describe error patterns. We observe that each nucleotide base has its specific error pattern that varies across the range of quality values.
Churchill and Waterman (^{4}) proposed another model to define a consensus. It is based on an assembly without assuming the availability of quality values. The parameters in the model include composition probabilities and sequencing error rates and are estimated by an expectation?maximization (EM) algorithm based on the alignment. The consensus is defined by the probability of the target sequence conditional on observations. This offers an evaluation of reliability.
In this article, we combine quality scores of basecalling and the idea in Churchill and Waterman's model (^{4}), to improve sequencing accuracy. Specifically, we start with assembled contigs and quality scores to build up complete probabilistic error models. One option is to represent the error pattern of each nucleotide by a multinomial model. Since the true sequence is unknown, we develop an EM algorithm to deal with missing data. In a more sophisticated logistic model, we take quality scores as covariates. To parsimoniously represent the nonlinear effect of quality scores, we adopt simple piecewise linear functions in the regression model. The model is trained by a procedure combining an EM algorithm, the Bayesian information criterion (BIC) criterion and backward deletion. The training results in calibration of quality values and leads to a more accurate consensus construction.
The first source of data in this article comes from the Campylobacter jejuni wholegenome shotgun sequencing project (^{5}). The raw data, generated on ABI 373 and 377 automated sequencers were downloaded from the Sanger Center (ftp.sanger.ac.uk/pub/pathogens/cj). The total length of the genome sequence is 1?641?481 bp. There are 33?824 reads and the average coverage is 10folds. The sequence assembly was obtained by Phrap (see http://www.phrap.org). We tested our methods on the first 100 kb of the reference sequence and the corresponding reads. The coverage of the C.jejuni sequencing project is unusually high, so we randomly removed some reads to decrease the average coverage from 10 to 6fold. Since the reference sequence was obtained on reads of 10fold, we will assume that it is close to the true sequence later when we calculate single base discrepancy (SBD).
To test our methods on data obtained using another sequencing technology, we analyzed data from an Arabidopsis thaliana resequencing project carried out at USC (http://walnut.usc.edu/2010) using Beckman Coulter CEQ automated sequencers. These data were obtained as part of a polymorphism survey and thus contain different haplotypes. Since we are interested only in sequencing error, in this paper, we selected ?500 kb of raw data from nonpolymorphic regions.
Throughout the article, we represent random variables by capital letters and their values by small letters. First, reads are aligned into an assembly matrix. We introduce two alphabets: = {A, C, G, T, ?} and = {A, C, G, T, ?, N, ?}, where ? denotes an internal gap, N denotes any ambiguous determination of a base and the null symbol ? is for nonaligned positions beyond the ends of a fragment. Each fragment is either in direct or in reverse complemented orientation. To deal with the issue of orientation, we introduce a complementary operation ? on the alphabet as follows: , and . An illustrative example of assembly matrices is shown in Figure 1.
We denote the target sequence by S = S_{1}S_{2} ??? S_{n}, where S_{j} takes any value from the alphabet . Random fragments generated from the template are aligned by an assembler. This results in an assembly matrix {X_{ij}}_{m?n}. The elements of the fragment assembly matrix, denoted by x_{ij}, take values from the alphabet . Each row in {X_{ij}} contains the ordered sequence of bases and possible gaps in a particular fragment. The column index j = 1, ?, n runs from the leftmost base in the assembly to the rightmost. We represent the orientation of the ith fragment in the assembly by
The observations X_{ij} are subject to measurement error. We denote the true base of fragment i at position j by Y_{ij} ? . Therefore
We denote the compositional probability by ?_{a} = Pr(S_{j} = a), a ? .
After appropriate preprocessing, each sequencing chromatogram contains a series of peaks of four colors. The rationale of basecalling is that each peak represents one base, and the order of peaks from the four channels is consistent with the order of nucleotide bases on the underlying DNA fragment. In addition to basecalling, Phred also assigns each basecall a quality score q, which takes integer values from 0 to Q (Q is 64 for Phred scores) (^{2}). Quality scores are based on trace features such as peak spacing, uncalled/called peak ratio and peak resolution. The model that defines quality scores was so trained, on a large amount of sequencing traces, that the scores could be interpreted as probabilities. Mathematically, the score is defined by
where, ?_{ij} is the error probability of basecalling. We randomly select one position from an assembly and let Y and X be its true base and called base, respectively. Let denote the event that the basecalling is incorrect, namely, = {X ? Y}. Then the correct calling probability given base a is: 1 ? ? = Pr(Y = aX = a), where a ? . Notice that
If the assumption of unbiased basecalling is valid, namely, Pr(X = a) = Pr(Y = a), then we have Pr(X = aY = a) = Pr(Y = aX = a) = 1 ? ?. Consequently, we are able to interpret the Phred scores as probabilities by
Even though Phred scores are valuable information for the construction of consensus, they are not the complete picture of measurement error. In general, for a ? b ? , we have
We denote sequencing error rates, conditional on event , by w(ba) = Pr(X = bY = a, ) for a ? b, and arrange them in the following table:
where {w(ba)} satisfy
The sequencing error rates relate to the conditional probabilities as follows.
Since Phred scores provide only partial information about sequencing error rates, we need to estimate the rest. For the sake of simplicity, we skip the issue of fragment orientation when we describe the sequencing error models.
Our perspective is to incorporate Phred quality scores into the Churchill?Waterman model (^{4}). We first adopt the parameterization in Equation 2 to model sequencing error, and refer to it as the conditional sequencing error model. The parameters ? in this model include the composition probability {?_{a}} and the conditional sequencing error rates {w(ba)}. The likelihood of the assembly and underlying sequence is given by
Since {S_{j}} are missing, we estimate the parameters by the EM algorithm. The following form of loglikelihood is easy for imputing the sufficient statistics.
From a regression perspective, we take Phred scores as a covariate. We denote
We assume that basecalling error rates follow a logistic form:
where each covariate h_{l}(q) is a function of quality score q and takes the form
Notice that each function has a knot o_{l}, where 0 ? o_{1} < o_{2}, ? < o_{L} < Q. Thus each regressor is a piecewise linear function of the quality score, which allows us to approximate any potential nonlinear effect. Equivalently, basecalling rates can be represented as:
Similar to Equation 3, this parameterization leads to the following form of loglikelihood function for the assembly and the underlying sequence, up to a term only relating to parameters.
where ? represents all the unknown parameters.
In both the conditional sequencing error model and the logistic model, the underlying sequence {s_{j}} is unknown. Its reconstruction relies on the estimates of parameters in the models. On the other hand, algorithms of estimating parameters are well established when {s_{j}} are known. Thus we use the EM algorithm to train the model iteratively. In the Estep, we impute the sufficient statistic from observations at the current parameter value and in the Mstep, we update the maximum likelihood estimate using the current imputed values of missing data. In the case of the conditional sequencing error model, the parameters are estimated by counting frequencies. In the case of the logistic model, the likelihood can be decomposed into five independent logistic regression models (^{6}). Consequently, we run reweighted least squares to estimate the parameters (^{7}). The mathematical and technical details are rather lengthy and tedious, and will be published in our technical report (^{8}).
According to the logistic model, the distribution of nucleotides at each position is given by:
As shown in the formula, the issue of orientation can generally be dealt with by the orientation indicators {r_{i}} and the complementary operator ?. In our convention, we observe directly when a fragment is in reverse orientation. After we plug in the estimated value of ?, we define the consensus at one position and its quality score by maximizing the above probability.
Although we can include piecewise linear functions at all possible knots in the logistic regression model (Equation 4), we seek a parsimonious model for several purposes. First, we would like to avoid potential overfitting, especially when the size of assembly is not large. Second, a parsimonious model may give us insights into quality scores.
The selection of knots is nothing but a model selection problem. To compare different models, we need an evaluation criterion. Based on quality scores, each fitted model defines a set of error rates, which in turn can be used to construct a consensus. If the truth is known, we can calculate SBD for a model (^{9}). SBD is thus one criterion for model comparison.
A practical solution to model selection ought to be selfevident from data. One such criterion is BIC (^{10}). It is defined as
That is, BIC penalizes loglikelihood by model complexity in terms of the number of parameters. For a logistic model with L knots, we have 20(L + 1) parameters. We calculate the BIC score for each model and choose the one that minimizes the quantity. The idea is to trade off goodness of fit and model complexity. Computationally, it is intensive to evaluate every model. We adopt the backward deletion strategy used in regression analysis to search for the optimal model (^{7}).
If we do not otherwise specify the data source, the results reported hereafter are based on the C.jejuni sequencing data explained earlier. In the conditional sequencing error model, quality scores are interpreted as error probabilities of basecalling. The model that defines the Phred scores is determined from a training data set (^{2},^{3}). When the model is applied to sequencing traces obtained under different working conditions, scores may deviate from probabilities to some extent. We examine this issue on sequencing reads from one BAC. After alignment, we count incorrect basecalls for each value of quality scores?Phred scores take integer values from 0 to 64. The observed score for the predicted quality score q is calculated from the assembly by:
where Err_{q} and Corr_{q} are, respectively, the number of incorrect and correct basecalls at quality score q. In Figure 2, we plot the observed scores against predicted Phred scores. When scores are >55, essentially no error is observed. When scores are <20, the prediction is fairly consistent. When scores are between 20 and 55, Phred scores overestimate probabilities. Thus calibration is desired for the purpose of improving accuracy of basecalling.
Next we apply the logistic model to the data. Let
where ? represents all the parameters. Under the assumption of unbiased basecalling, we have:
Then we can assign a new quality score to each basecall x_{ij}:
The bias of this adjusted quality score can be examined by:
where Err_{q?} and Corr_{q?} are respectively the number of incorrect and correct basecalls at the adjusted quality score q?. We plot the observed against the corrected quality score in Figure 3. Compared with Figure 2, we see that the corrected quality score is more consistent with the observed quality score. After adjustment, no error occurs above score value 42.
The CEQ software that comes along with Beckman sequencers offers quality values similar to Phred scores (^{11}). However, their scores are trained from a smaller data set when compared with Phred. As in Figure 2, we plot the observed scores against the predicted CEQ scores in Figure 4. The overestimate pattern is seen across almost the entire region. Then we apply the adjustment procedure to correct for the obvious bias. The training data set is ?500 kb. In Figure 5 we plot the observed against the corrected quality scores. This calibration can benefit our haplotype construction project at USC.
In the conditional sequencing error model, we assume that the error patterns, or the conditional error rates, are constant regardless of quality scores. To check the assumption, we compare frequencies of each type of sequencing errors at each quality value ranging from 0 to 64. That is, given an assembly, we calculate the empirical conditional error rates as follows:
When the true base is , we plot these conditional error rates against quality scores in Figure 6. It indicates that error patterns do depend on quality scores. After we fit a logistic model to the assembly, the conditional error probabilities as a function of quality scores are shown in Figure 7. When quality scores are >55, no sequencing error is observed. Thus conditional error patterns make sense only for scores <55. Many sequencing projects use the Q_{20} rule as a rough measure of the effective length of a DNA read (^{12}). Scores <20 indicate low quality regions. As we can see, error patterns change significantly at scores ?20?24. Since we do not have many bases with high scores, the inference in the high quality range is less reliable than that in the low quality range. When quality scores are <20, C and G are similar to each other; when the scores are >24, a totally different error pattern is observed. This byproduct of the parsimonious model offers another perspective of the Q_{20} rule.
We have introduced a conditional sequencing model and a logistic model. In the literature, two different methods exist to estimate sequencing error rates. On the one hand, the method proposed by Churchill and Waterman (^{4}) relies only on assembly but not on quality scores, and we refer to it as the simple probability model. On the other hand, we can use Phred scores and simply assign equal error chances among bases. Hereafter, we refer to it as the simple quality score method. In Table 1, we compare the performance of these methods using the C.jejuni data set. The majority rule defines the consensus by choosing the most frequent nucleotide at each position. Compared with the majority rule, the simple probability model reduces errors by onequarter, not resorting to any other information other than the assembly itself. The simple quality score method cuts errors by more than half. The gain is from the training data set that defines Phred scores. The conditional sequencing error model reduces errors further. The best result, 346 SBDs, is achieved by the logistic model with five knots. BIC selects a logistic model with three knots that has 348 SBDs. The likelihood scores for these models are also shown in Table 1. The likelihood of a model measures its goodness of fit to the data. For the same data set, we slightly perturb the Phred scores associated with the called bases, and errors resulting from the simple quality score method increase substantially from 367 to 517 while the performance of the logistic method remains almost the same.
We have observed that different alignment algorithms may produce slightly different assembly matrices. Our adjustment of scores is adaptive to alignment in the sense that it optimizes performance based on each assembly. When a new alignment procedure is used, adjustment may change correspondingly.
Phrap (see http://www.phrap.org) examines all individual sequences at a given position, and generally uses the highest quality sequence to build the consensus. Phrap also uses the quality information of individual sequences to estimate the quality of the consensus sequence. By comparison, our method can be used with any other assembly algorithms. The reconstruction of consensus and definition of quality value are based on a probabilistic model. It can adjust potential bias of quality scores in basecalling.
In the logistic model, the inner loop computes the parameters in logistic regressions by either the Fisher scoring method or by the Newton?Raphson method. Both methods converge quadratically. The outer loop is an EM procedure, and in general an EM algorithm converges at a linear rate. Thus, the computing complexity hinges on the EM algorithm. Specifically, let ?, L, D_{1} and D_{2} be the coverage, number of knots, number of Newton iterations and number of EM iterations, respectively; then the complexity is about O[(n? + L^{3}D_{1})D_{2}], where n is the size of the target DNA. Similarly, the complexity for the conditional sequencing error model is O(n?D), where D is the number of EM iterations.
We have checked the errors that are left uncorrected by the procedure described in this article. Almost all of them are from regions with repeat patterns. They can be single, di or trinucleotide repeats. Situations become even more subtle if two repeats are next to one another. In one example, an A is in the middle of four Cs and is missed. In these cases, it is not appropriate to assume that the sequencing error pattern is independent of local contexts. We are considering more sophisticated models to deal with regions with repeats. Li and Speed (^{13},^{14}) proposed a parametric deconvolution procedure to improve accuracy of sequencing for regions with repeats.
We have applied our method to data sets of different sizes. The larger the data set, the greater the number of knots selected in the optimal model. We can achieve satisfactory training with an assembly of size 30 kb and a coverage of six. The result is not sensitive to the ?quality? of the quality scores. In comparison, the training of Phred scores requires several hundred million basecalls, and it has been carried out on the sequencing traces generated from ABI sequencers. It is difficult to obtain reliable quality scores for other sequencers by the Phred training method if only limited basecalling data are available. In this situation, we can apply the method proposed in this article to adjust the preliminary quality scores obtained under roughly the same conditions and obtain probabilistically meaningful quality scores. Earlier, we reported one such example that calibrates Beckman CEQ quality scores using 500 kb from an Arabidopsis resequencing project.
We thank Prof. Michael Waterman for his various suggestions. This work is supported by a NIH CEGS grant to University of Southern California.
REFERENCES
1..  Adams, M.D.. , Fields,C. and Ventor,J.C. (eds). (1994) Automated DNA Sequencing and Analysis. Academic Press, London, San Diego. 
2..  Ewing, B.. and Green,P. (1998) Basecalling of automated sequencer traces using phred. 2. Error probabilities. Genome Res., 8, 186?194. [pmid: 9521922] 
3..  Ewing, B.. , Hillier,L., Wendl,M.C. and Green,P. (1998) Basecalling of automated sequencer traces using phred. 1. Accuracy assessment. Genome Res., 8, 175?185. [pmid: 9521921] 
4..  Churchill, G.A.. and Waterman,M.S. (1992) The accuracy of DNA sequences: estimating sequence quality. Genomics, 14, 89?98. [pmid: 1358801] 
5..  Parkhill, J.. , Wren,B.W., Mungall,K., Ketley,J.M., Churcher,C., Basham,D., Chillingworth,T., Davies,R.M., Feltwell,T., Holroyd,S., et al. (2000) The genome sequence of the foodborne pathogen Campylobacter jejuni reveals hypervariable sequences. Nature, 403, 665?668. [pmid: 10688204] 
6..  McCullagh, P.. and Nelder,J.A. (1989) Generalized Linear Model, 2nd edn. Chapman and Hall, London. 
7..  Venables, W.N.. and Ripley,B.D. (1994) Modern Applied Statistics with Splus, 2nd edn. Springer, New York. 
8..  Li, M.. and Li,L.M. (2004) Estimate Sequencing Error Patterns by Logistic Regression Models with Missing Responses. Technical report, Computational Biology, University of Southern California. 
9..  Felsenfeld, A.. , Peterson,J., Schloss,J. and Guyer,M. (1999) Assessing the quality of the DNA sequence from the human genome project. Genome Res., 9, 1?4. [pmid: 9927479] 
10..  Schwarz, G.. (1978) Estimating the dimension of a model. Ann Stat, 6, 461?464. 
11..  Winer, R.. , Yen,G. and Huang,J. (2002) Call Scores and Quality Values: Two Measures of Quality Produced by the CEQ? Genetic Analysis Systems. Beckman Coulter, Inc, Fullerton, CA. 
12..  Nelson, D.O.. and Fridlyand,J. (2003) Designing meaningful measures of real length for data produced by DNA sequencers. In Goldstein,D. (ed.), Science and Statistics: A Festschrift for Terry Speed, Vol. 40 of Lecture Notes?Monograph series. Institute of Mathematical Statistics, Beachwood, OH, pp. 295?306. 
13..  Li, L.M.. (2002) DNA sequencing and parametric deconvolution. Stat Sin, 12, 179?202. 
14..  Li, L.M.. and Speed,T.P. (2002) Parametric deconvolution of positive spike trains. Ann Stat, 28, 1279?1301. 
Figures
Tables
w  A  C  G  T  ? 

A  ?  w(CA)  w(GA)  w(TA)  w(?A) 
C  w(AC)  ?  w(GC)  w(TC)  w(?C) 
G  w(AG)  w(CG)  ?  w(TG)  w(?G) 
T  w(AT)  w(CT)  w(GT)  ?  w(?T) 
?  w(A?)  w(C?)  w(G?)  w(T?)  ? 
Method  SBD  Loglikelihood of assembly 

Majority rule  810  ? 
Simple probability  591  ?339704 
Simple quality score  367  ?292781 
Conditional sequencing error  358  ?281411 
Logistic (five knots)  346  ?272341 
The majority rule is a straightforward counting strategy; the simple probability model is the method proposed by Churchill and Waterman (^{2}); the simple quality score method uses Phred scores and assigns equal error chances among bases; the conditional sequencing model uses Phred scores and estimates error pattern from data by an EM algorithm; the logistic model predicts sequencing errors by Phred scores.
Article Categories:

Previous Document: AtmtPNPase is required for multiple aspects of the 18S rRNA metabolism in Arabidopsis thaliana mitoc...
Next Document: DNAbinding domain of GCN4 induces bending of both the ATF/CREB and AP1 binding sites of DNA.