|Optimizing amino acid substitution matrices with a local alignment kernel.|
|Jump to Full Text|
|PMID: 16677385 Owner: NLM Status: MEDLINE|
|BACKGROUND: Detecting remote homologies by direct comparison of protein sequences remains a challenging task. We had previously developed a similarity score between sequences, called a local alignment kernel, that exhibits good performance for this task in combination with a support vector machine. The local alignment kernel depends on an amino acid substitution matrix. Since commonly used BLOSUM or PAM matrices for scoring amino acid matches have been optimized to be used in combination with the Smith-Waterman algorithm, the matrices optimal for the local alignment kernel can be different.
RESULTS: Contrary to the local alignment score computed by the Smith-Waterman algorithm, the local alignment kernel is differentiable with respect to the amino acid substitution and its derivative can be computed efficiently by dynamic programming. We optimized the substitution matrix by classical gradient descent by setting an objective function that measures how well the local alignment kernel discriminates homologs from non-homologs in the COG database. The local alignment kernel exhibits better performance when it uses the matrices and gap parameters optimized by this procedure than when it uses the matrices optimized for the Smith-Waterman algorithm. Furthermore, the matrices and gap parameters optimized for the local alignment kernel can also be used successfully by the Smith-Waterman algorithm.
CONCLUSION: This optimization procedure leads to useful substitution matrices, both for the local alignment kernel and the Smith-Waterman algorithm. The best performance for homology detection is obtained by the local alignment kernel.
|Hiroto Saigo; Jean-Philippe Vert; Tatsuya Akutsu|
Related Documents :
|23572755 - Accelerated fermentation of 'idli' batter using soy residue okara.
12391595 - Golden mean to longevity: rareness of mitochondrial cytochrome b variants in centenaria...
24195735 - Determination of the composition in sugars and organic acids in peach using mid-infrare...
10867005 - Platelet-activating factor (paf)-dependent transacetylase and its relationship with paf...
18483775 - Milk composition studies in transgenic goats expressing recombinant human butyrylcholin...
6433515 - Decalcification for electron microscopy with l-ascorbic acid.
|Type: Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't Date: 2006-05-05|
|Title: BMC bioinformatics Volume: 7 ISSN: 1471-2105 ISO Abbreviation: BMC Bioinformatics Publication Date: 2006|
|Created Date: 2006-07-24 Completed Date: 2006-08-14 Revised Date: 2013-06-07|
Medline Journal Info:
|Nlm Unique ID: 100965194 Medline TA: BMC Bioinformatics Country: England|
|Languages: eng Pagination: 246 Citation Subset: IM|
|Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, 611-0011, Japan. email@example.com|
|APA/MLA Format Download EndNote Download BibTex|
Amino Acid Sequence
Amino Acid Substitution
Molecular Sequence Data
Pattern Recognition, Automated / methods
Proteins / chemistry*
Sequence Alignment / methods*
Sequence Analysis, Protein / methods*
Sequence Homology, Amino Acid
|R33 HG003070/HG/NHGRI NIH HHS|
Journal ID (nlm-ta): BMC Bioinformatics
Publisher: BioMed Central, London
Copyright © 2006 Saigo et al; licensee BioMed Central Ltd.
open-access: This is an Open Access article distributed under the terms of the Creative Commons Attribution License (), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Received Day: 4 Month: 2 Year: 2006
Accepted Day: 5 Month: 5 Year: 2006
collection publication date: Year: 2006
Electronic publication date: Day: 5 Month: 5 Year: 2006
Volume: 7First Page: 246 Last Page: 246
Publisher Id: 1471-2105-7-246
PubMed Id: 16677385
|Optimizing amino acid substitution matrices with a local alignment kernel|
|Hiroto Saigo1||Email: firstname.lastname@example.org|
|Jean-Philippe Vert2||Email: Jean-Philippe.Vert@ensmp.fr|
|Tatsuya Akutsu1||Email: email@example.com|
1Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, 611-0011, Japan
2Center for Computational Biology, Ecole des Mines de Paris, 35 rue Saint-Honoré, 77300 Fontainebleau, France
Sequence comparison for homology detection remains one of the core tools in bioinformatics. For example, BLAST  and PSI-BLAST  are widely used for this task, from wet biologists to bioinformaticians. Thanks to those tools, more than half of the newly identified protein sequences are nowadays recognized as having homologs . The identification of remote homologs, however, remains a challenging task because sequence divergence can prevent sequence comparison algorithms from recognizing those homologies. In order to improve the performance of sequence comparison algorithms, a possible strategy is to use data from large databases like SCOP , PFAM  and COG  in order to optimize the parameters of the algorithm to detect homology. Following this strategy, we previously developed a score to compare protein sequences, called the local alignment kernel (LA kernel) , which in combination with a support vector machine could detect remote homology better than several state-of-the-art methods, including the Smith-Waterman (SW) algorithm , in a benchmark experiment based on the SCOP database. Although the LA kernel was used as a kernel function in combination with a support vector machine in , it can also be independently thought of as a measure of similarity between biological sequences, based on the scoring of local alignments between the sequences. In fact it bears similarities to the AAS algorithm , Hybrid Alignment algorithm  and BALSA algorithm  for sequence comparison, in the sense that all of these algorithms compute a summation of the scores over all possible local alignments (using a forward algorithm), instead of computing the score of only the best alignment (using the Viterbi algorithm), as the SW algorithm does.
Both the SW algorithm and the LA kernel depend critically on gap parameters and on a substitution matrix (also called a score matrix) that quantifies the contribution in the score of an alignment between any two given amino acids. Different substitution matrices lead to different alignment scores, and potentially to, different performance in terms of homology detection. Although homology detection is the actual goal of sequence comparison, most substitution matrices used in bioinformatics have been optimized for different purposes (a cluster of such matrices is available from the AAindex database ). For example, the PAM (point accepted mutation) matrices  are based on the probability of single point mutations and the theory of Markov chains. Among the PAM series the PAM250 matrix, which corresponds to the 250 PAM evolution time, is most frequently used in bioinformatics. Subsequently, Gonnet et al.  and Jones et al.  applied the same method to different and larger databases, resulting in different amino acid substitution matrices (GCB and JTT, respectively). The BLOSUM matrices  are constructed from the Blocks database of aligned protein sequences. The popular BLOSUM62 matrix is constructed from the blocks of sequence segments with identity larger than 62%.
A different methodology to construct a substitution matrix has been followed by Hourai et al.  and Kann et al. . Following the idea that the final goal of sequence comparison is to detect homologies, these authors investigated the possibility to automatically optimize a substitution matrix to improve the performance of the final score in terms of homology detection. This optimization, based on a training dataset of pairs of proteins extracted from the Cluster of Orthologous Group (COG) database , uses an objective function that quantifies how well the final score separates the true homologs from non-homologs. Homology detection is known to be particularly difficult for pairs of proteins with less than 25% sequence identity, and the main motivations for these studies are to go further in this so-called "twilight zone" by assessing the performance of homology detection as the main objective function for the optimization of the substitution matrix. The methods of Hourai et al. and Kann et al. differ in the objective function that is optimized. Hourai et.al. try to separate the distribution of homologs from the distribution of non-homologs by minimizing the Bayes error rate. Kann et al. prepared a dataset of homologous pairs from the COG database and maximized the average C(confidence)-value of the pairs, where the C value is designed to be large when the expected number of non-homologous sequences scoring higher than the candidate pair is small. In spite of these differences, the methods by Hourai et al. and Kann et al. both suffer from the difficulty to optimize the SW score with respect to the substitution matrix. Indeed, the fact that the SW score only takes into account the maximum scoring alignment makes it non-differentiable with respect to the substitution matrix. As a result, the final objective function which is based on SW scores is itself not differentiable with respect to the substitution matrix, and therefore difficult to optimize. The trick used by both algorithms is to observe that the SW score of a pair of sequences is piecewise differentiable, as long as the maximum scoring alignment remains the same. Hence the authors suggest to alternate both local optimization of the substitution matrix by gradient descent and computation of the best scoring alignment that depends on the current substitution matrix. A drawback of this approach is that the local moves of the substitution matrices, based on a given set of alignments, might be very different from those required to globally optimize the objective function.
This paper is devoted to the extension of these approaches to the LA kernel, instead of the SW local alignment score. The motivations for this work are twofold. First, the LA kernel was previously shown to be a more sensitive measure of similarity for remote homologs, suggesting that it could also remain competitive with an optimized substitution matrix. Second, contrary to the SW score, the LA kernel is differentiable with respect to the elements of the substitution matrix and the gap parameters, and we show below that these derivatives can be computed efficiently by dynamic programming. This means that any objective function that is itself differentiable with respect to the LA kernel is differentiable with respect to the substitution matrix and can be optimized by simple gradient descent methods, without the need to alternate between the gradient descent steps and alignment steps used in the optimization of the SW score. Applying this procedure to the objective function used in , we optimized the substitution matrix as well as the gap parameters to separate true homologs from non-homologs in a dataset of protein sequences extracted from the COG database, and evaluated the performance of the resulting methods for homology detection on several independent test sets. We compared these results with those obtained after optimizing the substitution matrix with the Smith-Waterman algorithm , and compared how each scoring algorithm performs with each optimized matrix.
Pairs of homologous sequences with identity smaller than 20% were collected from the COG database and used for the training and testing of the method. For each pair, an E-value measuring the significance of the alignment score was computed, from which the corresponding confidence value C = 1/(1 + E) was derived. The objective of the optimization procedure is to maximize the mean confidence value ⟨C⟩ over the training set, and its performance is evaluated by the average confidence value on the test set. In order to avoid the risk of falling into local optima, we used several amino acid substitution matrices (BLOSUM62, PAM250, JTT, GCB) with default gap parameters (12 and 2 for gap open and extension penalties, respectively) as starting points of the optimization. Among them, BLOSUM62 led to the best local optimum, and we present the performance of this optimization below.
The mean confidence values ⟨C⟩ over the 300 training, 48 validation and 47 test pairs during the optimization procedure for both the LA kernel and the SW score are plotted in Figure 1. The optimization was carried out on the training set until the criterion reached a maximum on the validation set, to prevent over-fitting of the parameters to the training set. The performance of this procedure was then evaluated on the independent test set. As expected, we observe that the confidence value on the training set smoothly increases during the optimization. In the case of the LA kernel, a maximum is reached around 30 iterations in the validation set. The mean C value on the test set also seems to have reached its maximum around 30 iterations. The learning curve for the SW score also increases on the training, validation and test sets, and reaches a maximum after the first iteration. However, we observed that the convergence is not always as fast when starting from different substitution matrices (data not shown). This figure also demonstrates that the optimization with the LA kernel goes further than with the SW algorithm in terms of mean C value. Figure 2 is the comparison of the optimized C values for the LA kernel and the SW algorithm. In the graph, each point corresponds to a training pair of sequeces. This graphs illustrate the fact that the optimization process progresses further for the LA kernel than for the SW algorithm, and that many pairs (although not all) reach a remarkable confidence value.
The amino acid substitution matrices as well as gap parameters optimized for the SW score (BLOSUM62SWOPT) and the LA kernel (BLOSUM62LAOPT) on the training set are shown in Tables 1 and 2, respectively. It should be noted that the score matrix optimized for the SW score is based on Kann et al.'s method, and the score matrix optimized for the LA kernel is based on our method. In spite of a global conservation of most values, slight variations can be observed. To see the difference of those slight changes, we performed a principal component analysis (PCA) for each obtained matrix and plotted along the 1st and the 2nd principal components (Figure 3).
Moreover, we calculated the average l1-distance between two matrices M1 and M2 as
where M(a, b )denotes the substitution score between amino acids a and b, |a| and |b| are the number of amino acids (= 20). The averaged l1-distances between BLOSUM62 and BLOSUM62SWOPT, BLOSUM62 and BLOSUM62LAOPT, and finally BLOSUM62SWOPT and BLOSUM62LAOPT, are 0.074, 0.26 and 0.23, respectively. These differences show that the optimization with the LA kernel diverged further from the original matrix (0.26) than with the SW score (0.23), and that both optimizations did not necessarily go in the same direction. The matrix shown in Table 3 is the final matrix optimized for the LA kernel using both the COG training and test data. The l1-distance between this matrix and the original BLOSUM62 matrix is 0.30, and the result of PCA on this matrix is shown in Figure 3. The overall placement of each amino acid residue was similar through the matrices, reflecting physicochemical properties of the different amino-acids. For example, charged (D, E, K, R, H) or polar amino acids (S, T, P, N, Q) are placed on the left side of the figure, while non-polar amino acids (A, C, L, M, I, V, F, W) are placed on the right side. One exception is glycine (G), which is known to be a non-polar amino acid but is placed within the cluster of polar amino acids. This may be because of its conformational flexibility with only a proton constituting its side chain. For PC2, we can observe that amino acids with rings (H, Y, F, W) are placed distinctly from other amino acids.
The reason why the optimized matrix at first sight look very similar to the BLOSUM62 matrix is certainly that the latter is already a very good substitution matrix extensively used by the research community for homology detection. The slight differences in the substitution matrices, however, lead to significant improvements in the mean ⟨C⟩ value.
Performances of algorithms over the COG test set were evaluated for both the LA kernel and the SW score in combination with both BLOSUM62LAOPT and BLOSUM62SWOPT. Figure 4 shows the Errors Per Query plot proposed by Brenner et al. , where coverage was defined as the fraction of true homologs that have scores above the threshold, and errors per query was defined as the number of non-homologous pairs above the threshold divided by the number of queries. This approach is closely related to ROC (Receiver Operating Characterisitc) analysis: coverage can be considered as a fraction of true positives and errors per query is a rate of false positives. More "Coverage" with less "Errors Per Query" means better performance. Therefore the methods with the left-most distributions in this plot can be thought of as being capable of going deeper into the twilight zone, in terms of of detecting remote homologs. The normalized area under the ROC curve is also used as a common measure to compare the performance of ranking, and takes values between 0 and 1. An ROC score of 1 means a perfect ranking, and an ROC score close to 0.5 means almost as good as random. The ROC score for each method is shown in Table 4. In the table, the result of PSI-BLAST  with the BLOSUM62 matrix was shown together as a baseline method. PSI-BLAST was trained using the training dataset only with the option "-j 2". In order to evaluate the performance of the score matrix, no large external database was used to make a profile for PSI-BLAST, since incorporating a huge database into a profile eliminates the effect of a initial score matrix. The ROC score for PSI-BLAST with BLOSUM62 was 0.811 in the COG distant test set. For the LA kernel, they were 0.852 and 0.895 with BLOSUM62 and BLOSUM62OPT, respectively. The ROC score for PSI-BLAST with BLOSUM62 was 0.854 in the PFAM distant test set. For LA kernel, they were 0.932 and 0.947 with BLOSUM62 and BLOSUM62OPT, respectively. This shows that the gain in performance resulting from the use of the LA kernel instead of the PSI-BLAST method is in the range 5.1% – 9.1%, while the gain resulting from the optimization of the substitution matrix with the LA kernel is in the range 1.6% – 5.1%. Overall the performance improvement resulting from the use of the LA kernel with BLOSUM62LAOPT over the standard PSI-BLAST method with BLOSUM62 is around 10% for the distant test sets, and almost zero for the close test sets in terms of ROC score (Table 4).
With these criteria, the LA kernel based on the BLOSUM62LAOPT substitution matrix is the most effective. Interestingly, the BLOSUM62LAOPT matrix is as good as the BLOSUM62SWOPT matrix when used with the SW algorithm as well, although the BLOSUM62LAOPT matrix was optimized with the LA kernel. This highlights the fact that optimization based on the SW score encounters more difficulty in finding a good maximum than the optimization based on the LA kernel. Finally we observe that the LA kernel outperforms the SW algorithm also in the independent PFAM distant test set (Figure 5) with both optimized matrices, confirming its superiority over a large choice of parameters.
Interestingly, although the BLOSUM62LAOPT matrix and BLOSUM62SWOPT matrix are optimized for the detection of remote homologs, they also performed competitively in the COG and PFAM close homologs dataset (Figure 6, 7). The average C value that is calculated for the whole test dataset with various optimized matrices in combination with the SW algorithm and the LA kernel is shown in Table 5. Results of an ROC analysis for each method in each database are also shown in Table 4. We can observe that in the dataset of close homologs, the differences of performance are much smaller than those in the dataset of distant homologs. Interestingly, the superiority of this method is particularly important in the case of the PFAM distant test set (Figure 5), that is, in the detection of remote homologs. This can certainly be attributed to the fact that the training set itself (COG distant) was made of distant homologs, and suggests that different substitution matrices might be optimized for different levels of sequence similarities.
The main contribution of this paper is to propose an optimization framework for substitution matrices based on an exact gradient descent method. This approach is made possible by the fact that the alignment score we consider, the LA kernel, is differentiable with respect to the substitution matrix, contrary to the SW alignment score. The fact that the matrix optimized with this approach outperforms the matrix optimized with the SW score even for the SW algorithm itself suggests that there is an important benefit for the LA kernel approach compared to the more heuristic nature of the optimization in the case of the SW score. It should be pointed out, however, that there is a continuum between the LA kernel and the SW score . Indeed the LA kernel depends on a parameter β set to 0.5 in this study, but increasing β high enough finally leads to the SW score. In other words, the SW score, which is piecewise linear with respect to the elements of the substitution matrix (and therefore only piecewise differentiable), can be seen as the limit of infinitely differentiable functions. Intuitively, the optimization of the LA kernel can be thought of as the optimization of a smooth approximation of the SW score, which can more easily find good local optima. This suggests, in the spirit of simulated annealing, that further improvements for the SW algorithm might be obtained by optimizing the LA kernel and increasing β simultaneously; however, we leave this avenue of research for future work.
It should be pointed out that fixing the scaling parameter β to 0.5 is not a restriction in itself. Indeed, although the derivative of the LA kernel with respect to β can also be computed efficiently by dynamic programming, leading to the possibility of optimizing β as well as the substitution matrix and gap parameters, this would have no effect on the optimal score function that only depends on the products of β with the substitution and gap parameters. In other words, allowing β to vary would lead to an over-parameterized model. Although fixing β has therefore no effect on the global optimum of the objective function, it might nevertheless have an important effect on our optimization procedure because it defines where the optimization starts. Taking β = 0.5 with default gap and substitution parameter, which were shown in previous studies to perform well on remote homology detection, is certainly a safe choice as starting point of the optimization.
A second point to be highlighted is the good performance of the LA kernel as a similarity score compared to the SW score. While it was shown in  that the LA kernel outperforms the SW score as a kernel function for support vector machines, the present studies validate the relevance of the LA kernel as a measure of similarity. It should be pointed out that the advantage of the LA kernel over the SW score is expected to increase for remote homologs, because when the sequence identity is small the best local alignment computed by the SW score is likely not to be the correct one, and in this case multiple hits of relatively short suboptimal alignments (motifs )between two sequences would be of importance, leading to the idea that averaging scores over a large number of candidate alignments might provide better evidence for homology .
Finally, let us mention that the LA kernel is in fact infinitely differentiable, and its second derivative (Hessian) with respect to the substitution matrix could be computed, also by dynamic programming. It would therefore be possible, in principle, to use faster gradient descent algorithms such as Newton's method for the optimization. We did not follow this avenue in our experiments because this would require the computation of a 212 × 212 Hessian matrix at each iteration, which would need more than 100 times the amount of computation than without the computation of the Hessian. Of course, the Hessian is of no help for the SW algorithm, because it is constantly equal to zero on the points where the SW score is differentiable.
We proposed a method to optimize amino acid substitution matrices for the LA kernel, based on the properties of differentiability of the LA kernel with respect to the substitution matrix. This is the first time amino acid substitution matrices for pairwise sequence comparison are optimized for use with the forward algorithm . The optimized matrices exhibit good performance on distant datasets both with the SW algorithm and the LA kernel, and they are competitive on close datasets. The derived matrices may be useful when standard methods fail to detect homologs.
In this section, we first show how to compute the derivative of the LA kernel with respect to the elements of an amino acid substitution matrix. Then we present an objective function meant to favor the discrimination between true homologs and non-homologs, and finally explain how we created the datasets used in this study.
The LA kernel  between two sequences x and y is defined by
KLA(x,y)=∑πeβs(x,y,π), (1) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGlbWsdaWgaaWcbaGaemitaWKaemyqaeeabeaakmaabmaabaacbeGae8hEaGNaeiilaWIae8xEaKhacaGLOaGaayzkaaGaeyypa0ZaaabuaeaacqWGLbqzdaahaaWcbeqaaGGaciab+j7aIjabdohaZnaabmaabaGae8hEaGNaeiilaWIae8xEaKNaeiilaWIae4hWdahacaGLOaGaayzkaaaaaaqaaiab+b8aWbqab0GaeyyeIuoakiabcYcaSiaaxMaacaWLjaWaaeWaaeaacqaIXaqmaiaawIcacaGLPaaaaaa@4B8A@
where β is a parameter, π runs over the possible local alignments between x and y, and s(x, y, π) is the score of an alignment π between x and y. The score of an alignment π is itself given by the sum of substitution scores for the letters paired together, minus an affine gap penalty:
where na,b(x, y, π) represents the number of times that the amino acid a is aligned with the amino acid b, S(a, b) denotes the substitution score between amino acids a and b, ngd(x, y,π) and nge(x, y, π) are the number of gap opens and extensions, respectively, and gd and ge are penalties for gap open and gap extension, respectively.
As shown in (1) the LA kernel takes into account all possible alignments between two strings by summing the scores, and can be computed by the following algorithm.
In the above algorithm, M stands for the matching state between amino acids, while X, Y, X2 and Y2 are for the states corresponding to insertions or deletions.
The score of the LA kernel is then described as the logarithm of (1):
K^LA(x,y)=logKLA(x,y)=log∑πeβs(x,y,π). (2) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGlbWsgaqcamaaBaaaleaacqWGmbatcqWGbbqqaeqaaOWaaeWaaeaaieqacqWF4baEcqWFSaalcqWF5bqEaiaawIcacaGLPaaacqGH9aqpcyGGSbaBcqGGVbWBcqGGNbWzcqWGlbWsdaWgaaWcbaGaemitaWKaemyqaeeabeaakmaabmaabaGae8hEaGNae8hlaWIae8xEaKhacaGLOaGaayzkaaGaeyypa0JagiiBaWMaei4Ba8Maei4zaC2aaabuaeaacqWGLbqzdaahaaWcbeqaaGGaciab+j7aIjabdohaZnaabmaabaGae8hEaGNae8hlaWIae8xEaKNae8hlaWIae4hWdahacaGLOaGaayzkaaaaaaqaaiab+b8aWbqab0GaeyyeIuoakiabc6caUiaaxMaacaWLjaWaaeWaaeaacqaIYaGmaiaawIcacaGLPaaaaaa@5DAC@
The derivative of (2) with respect to S(a, b) can therefore be written as:
∂K^LA(x,y)∂S(a,b)=∂∂S(a,b)∑πeβs(x,y,π)∑πeβs(x,y,π). (3) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaWcaaqaaiabgkGi2kqbdUealzaajaWaaSbaaSqaaiabdYeamjabdgeabbqabaGcdaqadaqaaGqabiab=Hha4jabcYcaSiab=Lha5bGaayjkaiaawMcaaaqaaiabgkGi2kabdofatnaabmaabaGaemyyaeMaeiilaWIaemOyaigacaGLOaGaayzkaaaaaiabg2da9maalaaabaWaaSaaaeaacqGHciITaeaacqGHciITcqWGtbWudaqadaqaaiabdggaHjabcYcaSiabdkgaIbGaayjkaiaawMcaaaaadaaeqaqaaiabdwgaLnaaCaaaleqabaacciGae4NSdiMaem4Cam3aaeWaaeaacqWF4baEcqGGSaalcqWF5bqEcqWFSaalcqGFapaCaiaawIcacaGLPaaaaaaabaGae4hWdahabeqdcqGHris5aaGcbaWaaabeaeaacqWGLbqzdaahaaWcbeqaaiab+j7aIjabdohaZnaabmaabaGae8hEaGNaeiilaWIae8xEaKNae8hlaWIae4hWdahacaGLOaGaayzkaaaaaaqaaiab+b8aWbqab0GaeyyeIuoaaaGccqGGUaGlcaWLjaGaaCzcamaabmaabaGaeG4mamdacaGLOaGaayzkaaaaaa@6D93@
Note that the denominator of (3) is the same as (1), and can therefore be calculated by Algorithm 1 above, while the numerator of (3) is calculated by Algorithm 2 below.
In the above algorithm, δ((xi, yj) = (a, b)) is the Kronecker delta function which returns one if the ith amino acid of x is a and the jth amino acid of y is b, and zero otherwise. Derivative of local alignment kernel with respect to the gap open parameter gd and gap extension parameter ge can be calculated similarly.
To assess the significance of the score on a database search, the Z-score is widely used:
where s is the score of a query against a candidate homolog in the database, and μ and σ are the mean and variance of the scores of a query versus possible non-homologs in the database. For extreme values (maxima or minima) such as the SW score, the extreme value distribution (EVD) is commonly used to assess the statistical significance of the scores. The probability that a given random score is equal to or greater than s is given by
where a and b are parameters for the extreme value distribution. Then for the search of a database of size D, the expected number of scores which are higher than s is E = Dp(μ > s). A natural objective function to quantify the performance of an algorithm for remote homology is therefore to minimize the E-values obtained on pairs of distant homologs. Following Kann et al. , we consider the confidence value C = 1/(1 + E), setting D = 100000 for the computation of the E-value, and define our objective function to be maximized as the average of C over a training set of homologous pairs.
If the score s is differentiable with respect to an amino acid substitution matrix and gap penalties (which we denote as a parameter θ here), then C and the derivative of C can be written as:
C=(1+D(1−e−e−αZ−β))−1, (4)∂C∂θ=αC2De−aZ−b−e−aZ−b∂Z∂θ. MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaafaqaaeGadaaabiGaaW9=aqlacaWLjaGaem4qameabaGaeyypa0dabaWaaeWaaeaacqaIXaqmcqGHRaWkcqWGebarcqGGOaakcqaIXaqmcqGHsislcqWGLbqzdaahaaWcbeqaaiabgkHiTiabdwgaLnaaCaaameqabaGaeyOeI0ccciGae8xSdeMaemOwaOLaeyOeI0Iae8NSdigaaaaakiabcMcaPaGaayjkaiaawMcaamaaCaaaleqabaGaeyOeI0IaeGymaedaaOGaeiilaWIaaCzcaiaaxMaadaqadaqaaiabisda0aGaayjkaiaawMcaaaqaamaalaaabaGaeyOaIyRaem4qameabaGaeyOaIyRae8hUdehaaaqaaiabg2da9aqaaiab=f7aHjabdoeadnaaCaaaleqabaGaeGOmaidaaOGaemiraqKaemyzau2aaWbaaSqabeaacqGHsislcqWGHbqycqWGAbGwcqGHsislcqWGIbGycqGHsislcqWGLbqzdaahaaadbeqaaiabgkHiTiabdggaHjabdQfaAjabgkHiTiabdkgaIbaaaaGcdaWcaaqaaiabgkGi2kabdQfaAbqaaiabgkGi2kab=H7aXbaacqGGUaGlaaaaaa@6C94@
The derivative of Z with respect to θ is itself obtained by:
Concerning the validity of assuming that the score of the LA kernel follows an extreme value distribution, we randomly shuffled non-homologous sequence of the same length 100000 times, and observed that the extreme value distribution is still a good approximation for the distribution of scores of the LA kernel (Figure 8). We chose β = 0.5 obtained from previous research, i.e., by moving β from 0 to 1 with an interval of 0.1 and choosing the best performing β . In fact, we can run gradient descent with respect to the matrix and β together. But the point is that the system is over-parameterized, and fixing β = 0.5 will have no influence at the end.
We used both the algorithms of Smith-Waterman and the LA kernel together with their derivatives, then maximized the objective function using gradient descent with Armijo's rule for line search .
For the optimization using Smith-Waterman algorithm, we adopted the same method as in , that is, to alternate both local optimization of the substitution matrix by gradient descent and computation of the best scoring alignment. Note that this alternation is not necessary the LA kernel.
Since there is no guarantee of reaching the global optimum, we used several starting matrices such as BLOSUM62, PAM250, GCB and JTT, with default gap parameters (12 and 2 for open and extension, respectively). In order to limit the over-fitting of the parameters to the training set, the optimization was carried out until the objective function reached a maximum on an independent validation set. The performance of the parameters selected by this procedure was then assessed on an independent test set.
It is known that the LA kernel is an approximation of the SW score for large β . More precisely, the following holds:
Furthermore, the derivative of the LA kernel (1) with respect to the substitution score S(a, b) is equal to:
where Eπ denotes expectation with respect to the probability distribution p(π)=eβs(x,y,π)∑πeβs(x,y,π) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGWbaCdaqadaqaaGGaciab=b8aWbGaayjkaiaawMcaaiabg2da9maalaaabaGaemyzau2aaWbaaSqabeaacqWFYoGycqWGZbWCdaqadaqaaGqabiab+Hha4jabcYcaSiab+Lha5jabcYcaSiab=b8aWbGaayjkaiaawMcaaaaaaOqaamaaqababaGaemyzau2aaWbaaSqabeaacqWFYoGycqWGZbWCdaqadaqaaiab+Hha4jabcYcaSiab+Lha5jabcYcaSiab=b8aWbGaayjkaiaawMcaaaaaaeaacqWFapaCaeqaniabggHiLdaaaaaa@4F08@ on the set of possible alignments π . The probability of an alignment π therefore contributes to a proportion of the score of an alignment π to the score K^LA(β)(x,y) MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacuWGlbWsgaqcamaaDaaaleaacqWGmbatcqWGbbqqaeaadaqadaqaaGGaciab=j7aIbGaayjkaiaawMcaaaaakmaabmaabaacbeGae4hEaGNaeiilaWIae4xEaKhacaGLOaGaayzkaaaaaa@38CC@, and forms a Gibbs distribution over tt with energy –s(x, y, π). β can be thought of as the inverse temperature: at low temperature (large β), only the low-energy states (large score) have non-vanishing probability; at large temperature (small β), all states (all scores) have similar probability. Denoting by Π0(x, y) the set of alignments π that have the maximum score, this shows that at low temperature one gets:
In the case where there exists a single alignment π0 that maximizes the score, then this reduces to na,b(x, y, π0), which is exactly the derivative of the SW score in this case. This result clarifies the difference between taking the derivative of the LA kernel and that of the SW score when it exists. The derivative of the SW score is the amino acid residue count in the optimal alignment, while the derivative of the LA kernel is an expectation of an amino acid residue count over possible alignments. As a result, up to the β factor, the derivative of the LA kernel is an approximation of the derivative of the SW score when it exists. In particular the gradient of the LA kernel approximates the gradient used in the parameter optimization step of Kann et al.'s algorithm for large β. The same approximation properties hold for higher-order differentials, although the LA kernel is everywhere infinitely differentiable while the SW score is only piecewise linear over the space of substitution matrices.
Training and testing to discriminate homologs from non-homologs was performed on the Cluster of Orthologous Group (COG) database . We were interested in homologs whose homology is hard to detect, and collected sequences with less than 20% identity only from the COG database, resulting in 395 pairs of protein sequences. We used 300 of them for training, 48 for validation and the rest (47) for evaluation. Note that this threshold of identity (20%) is harder than that of Kann et al.'s methods in order to learn known but clearly distant relationships of homologs. Also, since it is always important to assess the confidence in an independent way, we prepared sequence pairs of distant homologs from the PFAM  database in a similar manner, resulting in 200 additional pairs, and used them as the second test set. The third and the fourth data sets are the COG close and PFAM close datasets – prepared by keeping the identity between 25% and 40%. We ran SSEARCH on all the sequences in the PFAM database against all the training and test set sequences from COGs in order to remove the similar sequences from the PFAM dataset using a threshold of E <10.
HS extended the code, carried out experiments and drafted the manuscript. JPV conceived of the study and did the original coding. TA provided discussion on the methodology and algorithm. JPV and TA participated in discussion and preparation of manuscript. All three authors read and approved the final manuscript.
The computational resource was provided by Bioinformatics Center, Institute for Chemical Research, Kyoto University and the Supercomputer Laboratory, Kyoto University. Part of this work was supported by Grants-in-Aid for Scientific Research #16300092 and "Systems Genomics" from MEXT of JAPAN, and the Japanese-French SAKURA grant. JPV is supported by NIH award R33 HG003070.
|Altschul SF,Gish W,Miller W,Myers EW,Lipman DJ. A basic local alignment search toolJournal of Molecular Biology 1990;215:403–410. [pmid: 2231712] [doi: 10.1006/jmbi.1990.9999]|
|Altschul SF,Madden TL,Schaffer AA,Zhang J,Zhang Z,Miller W,Lipman DJ. Gapped blast and psi-blast: A new generation of protein database search programsNucleic Acids Research 1997;25:3389–3402. [pmid: 9254694] [doi: 10.1093/nar/25.17.3389]|
|Fischer D,Eisenberg D. Finding families for genomic ORFansBioinformatics 1999;15:759–762. [pmid: 10498776] [doi: 10.1093/bioinformatics/15.9.759]|
|Murzin AG,Brenner SE,Hubbard T,Chothia C. SCOP: A structural classification of proteins database for the investigation of sequences and structuresJournal of Molecular Biology 1995;247:536–540. [pmid: 7723011] [doi: 10.1006/jmbi.1995.0159]|
|Bateman A,Birney E,Durbin R,Eddy SR,Finn RD,Sonnhammer ELL. Pfam 3.1: 1313 multiple alignments match the majority of proteinsNucleic Acids Res 1999;27:260–262. [pmid: 9847196] [doi: 10.1093/nar/27.1.260]|
|Tatusov RL,Galperin MY,Koonin EV. The COG database: a tool for genome-scale analysis of proteins functions and evelutionNuleic Acids Res 2000;25:3389–3402.|
|Saigo H,Vert JP,Akutsu T,Ueda N. Protein homology detection using string alignment kernelsBioinformatics 2004;20:1682–1689. [pmid: 14988126] [doi: 10.1093/bioinformatics/bth141]|
|Smith T,Waterman M. Identification of common molecular subsequencesJournal of Molecular Biology 1981;147:195–197. [pmid: 7265238] [doi: 10.1016/0022-2836(81)90087-5]|
|Bucher P,Hofmann Kay. A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring SystemProceedings of the Fourth International Conference on Inteligent Systems for Molecular Biology 1996;96:44–51.|
|Yu Y,Bundscuh R,Hwa T. Hybrid alignment: High performance with universal statisticsBioinformatics 2002;18:864–872. [pmid: 12075022] [doi: 10.1093/bioinformatics/18.6.864]|
|Webb BM,Liu JS,Lawrence CE. BALSA: Bayesian algorithm for local sequence alignmentNucleic Acids Research 2002;30:1268–1277. [pmid: 11861921] [doi: 10.1093/nar/30.5.1268]|
|Durbin R,Eddy S,Krogh A,Mitchison G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. 1998Cambridge University Press;|
|Tomii K,Kanehisa M. Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteinsProtein Eng 1996;9:27–36. [pmid: 9053899]|
|Dayhoff MO,Schwarts RM,Orcutt BC. A Model of Evolutionary Change in ProteinsAtlas Prot Seq Struct 1978;5:345–252.|
|Gonnet GH,Cohen MA,Brenner SA. Exahustive matching of the entire protein sequence databaseScience 1992;256:1443–1445. [pmid: 1604319]|
|Jones DT,Taylor WR,Thornton JM. The rapid generation of mutation data matrices from protein sequencesCABIOS 1992;8:275–282. [pmid: 1633570]|
|Henikoff S,Henikoff JG. Amino acid substitution matrices from protein blocksProc Natl Acad Sci, USA 1992;89:10915–10919. [pmid: 1438297]|
|Hourai Y,Akutsu T,Akiyama Y. Optimizing substitution matrices by separating score distributionsBioinformatics 2004;20:863–873. [pmid: 14752003] [doi: 10.1093/bioinformatics/btg494]|
|Kann M,Qian B,Goldstein RA. Optimization of a New Score Function for the Detection of Remote HomologsPROTEINS: Structure, Function, and Genetics 2000;41:498–503. [doi: 10.1002/1097-0134(20001201)41:4<498::AID-PROT70>3.0.CO;2-3]|
|Brenner SE,Chothia C,Hubbard TJP. Assessing sequence comparison methods with reliable structually identified distant evolutionary relationshipsProc Natl Acad Sci USA 1998;98:6073–6078. [pmid: 9600919] [doi: 10.1073/pnas.95.11.6073]|
|Press WH,Teukolsky SA,Vetterling WT,Flannery BP. Numerical Recipes in C. 1993Cambridge University Press;|
Amino acid substitution matrix optimized for the SW score. Optimization was started from the BLOSUM62 matrix with gap open and extension penalties initialized to 12 and 2 respectively, on COG training data. After the optimization procedure, the open and extension penalties are 12.3 and 2.8, respectively.
Amino acid substitution matrix optimized for the LA kernel (β = 0:5). Optimization was started from the BLOSUM62 matrix with gap open and extension penalties initialized to 12 and 2 respectively, on COG training data. After the optimization procedure, the gap open and extension penalties are 12.5 and 5.0, respectively.
Final amino acid substitution matrix optimized for the LA kernel (β = 0:5). Optimization was started from the BLOSUM62 matrix with gap open and extension penalties initialized to 12 and 2 using all the COG distant data. After the optimization procedure, the gap open and extension penalties are 11.6 and 5.7, respectively.
ROC scores for the SW algorithm and the LA kernel in the independent dataset. The first column shows the scoring method. For example, BLOSUM62SWOPT is the matrix optimized for the SW algorithm starting from the BLOSUM62. The second column shows the performance of each score matrix by the SW algorithm on the COG distant test set. The following columns show the performance, in terms of average ROC score, of each matrix used in combination with either the SW algorithm or the LA kernel on four different datasets. The second row shows the performance of PSI-BLAST with the BLOSUM62 with gap open and extension parameters set to 11 and 1 (default), respectively. The best ROC score in each dataset is highlighted in bold font.
|Method||COG distant||COG close||PFAM distant||PFAM close|
Comparison of various scoring matrices and scoring algorithms. The first column shows the scoring matrices. For example, BLOSUM62SWOPT is the matrix optimized for the SW algorithm starting from the BLOSUM62 matrix. The second column shows the performance of each score matrix by the SW algorithm on the COG distant test set. The following columns show the performance, in terms of average C, of each matrix used in combination with either the SW algorithm or the LA kernel on four different datasets. The best ⟨C⟩ in each column is highlighted in bold font.
|Score matrix||COG distant||COG close||PFAM distant||PFAM close|
Previous Document: Maximum static inspiratory and expiratory pressures with different lung volumes.
Next Document: A new extract of the plant Calendula officinalis produces a dual in vitro effect: cytotoxic anti-tum...