The difficulty of protein structure alignment under the RMSD.  
MedLine Citation:

PMID: 23286762 Owner: NLM Status: PubMednotMEDLINE 
Abstract/OtherAbstract:

BACKGROUND: Protein structure alignment is often modeled as the largest common point set (LCP) problem based on the Root Mean Square Deviation (RMSD), a measure commonly used to evaluate structural similarity. In the problem, each residue is represented by the coordinates of its Cα atom, and a structure is modeled as a sequence of 3D points. Out of two such sequences, one is to find two equal-sized subsequences of the maximum length, and a bijection between the points of the subsequences which gives an RMSD within a given threshold. The problem is considered to be difficult in terms of time complexity, but the reasons for its difficulty are not well understood. Improving this time complexity is considered important in protein structure prediction and structural comparison, where the task of comparing very large numbers of structures is commonly encountered. RESULTS: To study why the LCP problem is difficult, we define a natural variant of the problem, called the minimum aligned distance (MAD). In the MAD problem, the length of the subsequences to obtain is specified in the input; and instead of fulfilling a threshold, the RMSD between the points of the two subsequences is to be minimized. Our results show that the difficulty of the two problems does not lie solely in the combinatorial complexity of finding the optimal subsequences, or in the task of superimposing the structures. By placing a limit on the distance between consecutive points, and assuming that the points are specified as integral values, we show that both problems are equally difficult, in the sense that they are reducible to each other. In this case, both problems can be exactly solved in polynomial time, although the time complexity remains high. CONCLUSIONS: We showed insights and techniques which we hope will lead to practical algorithms for the LCP problem for protein structures. 
The study identified two important factors in the problem's complexity: (1) The lack of a limit in the distance between the consecutive points of a structure; (2) The arbitrariness of the precision allowed in the input values. Both issues are of little practical concern for the purpose of protein structure alignment. When these factors are removed, the LCP problem is as hard as that of minimizing the RMSD (MAD problem), and can be solved exactly in polynomial time. 
Authors:

Shuai Cheng Li 
Publication Detail:

Type: Journal Article Date: 20130104 
Journal Detail:

Title: Algorithms for molecular biology : AMB Volume: 8 ISSN: 17487188 ISO Abbreviation: Algorithms Mol Biol Publication Date: 2013 
Date Detail:

Created Date: 20130318 Completed Date: 20130319 Revised Date: 20130418 
Medline Journal Info:

Nlm Unique ID: 101265088 Medline TA: Algorithms Mol Biol Country: England 
Other Details:

Languages: eng Pagination: 1 Citation Subset:  
Affiliation:

Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong. shuaicli@cityu.edu.hk. 
Journal Information Journal ID (nlm-ta): Algorithms Mol Biol Journal ID (iso-abbrev): Algorithms Mol Biol ISSN: 1748-7188 Publisher: BioMed Central 
Article Information Copyright ©2013 Li; licensee BioMed Central Ltd. open-access: Received Day: 8 Month: 12 Year: 2011 Accepted Day: 17 Month: 12 Year: 2012 collection publication date: Year: 2013 Electronic publication date: Day: 4 Month: 1 Year: 2013 Volume: 8 First Page: 1 Last Page: 1 PubMed Id: 23286762 ID: 3599502 Publisher Id: 1748-7188-8-1 DOI: 10.1186/1748-7188-8-1 
A common approach to understanding the properties of a protein is to compare it to other proteins. Proteins that are similar, in terms of either their amino acid sequences or three-dimensional structures, often share similar functions, or are related evolutionarily. The latter, structural comparison, is particularly interesting since protein structures are known to be more evolutionarily conserved than the biological sequences which encode them. Furthermore, proteins of similar structures may have similar functionality, even when their sequences differ [^{1}].
Structural comparison is typically a problem of aligning two sets of three-dimensional coordinates. (In most of the known structural alignment problems, each point is the 3D coordinates of the Cα atom, one per residue. Hence, a structure can be modeled for structural alignment purposes as a sequence of 3D points.) The alignment usually involves a rigid transformation to superimpose the two sequences of points, and a mapping which specifies the matched points. The parameters to optimize in the alignment may differ in different situations, because it is not easy to single out a set of parameters that best captures the similarity between two given structures [^{2}]. In many situations, the alignment need not match every point in the two sequences. At present, there is a consensus among molecular biologists on the use of the following two parameters [^{2}-^{4}]:
1. the number of residues (points) or percentage of total residues (points) matched in the alignment.
2. the root mean square deviation (RMSD) of the matched residues (points).
In general, the RMSD need not be minimized. It suffices that it is within a reasonable threshold. Hence, a good alignment is customarily taken to be one which maximizes the number of residue matches, within a given RMSD threshold. Many structural alignment methods are based on this principle. The computational complexity of finding an optimal solution to the problem is not well understood. Shibuya et al. formulated a restricted version of the problem, and showed the problem to be NP-hard when the dimensionality is arbitrary. It is open whether their problem is NP-hard in three dimensions [^{5}]. Other problems related to structural comparison based on the RMSD have been found to be difficult. For example, the problem of finding a substructure from multiple three-dimensional structures which minimizes the total RMSD is NP-hard [^{6}].
For the variants of the alignment problem that are not based on the RMSD, we have the following results. When the objective is to maximize the number of point matches which are no more than a threshold distance apart, the problem is solvable in O(n^{32.5}) time, where n is the number of points [^{7}]. The contact map overlap problem, where a graph is created out of each structure, and the problem is one of comparing the two graphs, is NP-hard [^{8}], and remains NP-hard even when we require points that are matchable to be within a threshold distance [^{9}]. These results, together with an early result which shows a related problem called threading to be NP-hard [^{10}], have traditionally led molecular biologists to believe that the structural alignment problem is difficult in general (e.g. [^{11}-^{13}]), even though a PTAS exists for the problem under a broad class of distance measures [^{14}]. Heuristic algorithms have also been proposed for many variants of the structural alignment problem [^{15}-^{23}]. While these methods perform reasonably well in general, they provide no guarantee on the quality of their results.
As noted by Shibuya et al., relatively few theoretical results have been obtained on problems defined over the RMSD, and the general problem of structural alignment under the RMSD remains open [^{5}]. At present, whether the problem is intractable or not is not only of theoretical interest but also of practical concern, due to advances in protein structure prediction which require the comparison of very large numbers of structures. In this paper we show mathematical insights and techniques which we hope will lead to practical algorithms for the problem.
We first show that the difficulty of the problem does not lie solely in the individual components of its requirements. More precisely,
– if either a mapping that contains the optimal mapping is known (Theorem 3), or
– if the optimal superposition is known (Lemma 1),
then the problem can be solved in polynomial time.
Our study shows that the difficulty of the LCP problem is also very much due to two factors: (1) the problem allows the input coordinates to be of any arbitrary precision, and (2) it assumes no limit on the distance between two consecutive Cα atoms.
We consider the case where the input coordinates are integral, and the distance between two consecutive points is restricted. The first requirement is practical since in protein structures, coordinates are typically specified to a fixed precision (e.g. three decimal places in protein structures [^{24}]), and can be trivially scaled up to integral values. Similar assumptions are made in Euclidean problems such as the Euclidean TSP [^{25}]. The second requirement likewise does not add any restriction to the problem of protein structure alignment, since there is a natural upper bound (∼3.8 Å) to the distance between two consecutive Cα atoms. In this case, the following results hold.
– Given a polynomial time algorithm for finding a largest alignment of RMSD below a threshold d, one can efficiently compute an alignment of a given size ℓ which minimizes the RMSD (Theorem 7). (Since the other direction is easy, this shows that the two problems are of similar difficulty.)
– The structural alignment problem under the RMSD is solvable exactly in polynomial time (Theorem 10).
A protein structure for alignment purposes is modeled as a finite, ordered sequence of three-dimensional (3D) points. Hence, a structure of n residues is written as (p_{1},p_{2},...,p_{n}), where each point p_{i} ∈ R^{3}. In the ‘Results assuming integral coordinates and restricting distance between points’ section, we will further assume each p_{i} to be integral. We write P^{′} ⊆ P iff P^{′} is a subsequence of P, and write f:P ↦ Q iff f is a mapping which maps points in the sequence P to points in the sequence Q.
We now state our problems. The main problem we consider is the largest common point set (LCP) problem under the RMSD, a well-known problem in protein structure alignment. In the LCP, the objective is to find a mapping of the largest cardinality where the RMSD of the matched points is no more than a given threshold (Table 1).
We do not require the optimal superposition of P and Q in the output, since that can be computed from P^{′}, Q^{′}, and f in linear time [^{26}]. We refer to f as an alignment. An alignment can be sequential or nonsequential: an alignment is sequential iff for any two points p_{i_{1}}, p_{i_{2}} ∈ P^{′}, where the corresponding f(p_{i_{1}}) = q_{j_{1}} and f(p_{i_{2}}) = q_{j_{2}}, we have i_{1} < i_{2} iff j_{1} < j_{2}. Otherwise the alignment is nonsequential. The LCP problem which requires alignments to be sequential is said to be sequential, otherwise it is nonsequential. We mainly discuss sequential alignment in this paper. The techniques developed can be easily adapted to the nonsequential case. Given two equal-sized sequences P^{′} = (p_{1},…,p_{n}) and Q^{′}, together with a bijection f between P^{′} and Q^{′}, the root mean square deviation (RMSD) is defined as
(1)
$$\mathrm{RMSD}(P', Q') = \min_{t} \sqrt{\frac{\sum_{1 \le i \le n} \lVert t(f(p_i)) - p_i \rVert^{2}}{n}},$$
where t is a rigid transformation. The RMSD, with its corresponding transformation t, can be computed in linear time [^{26}].
A natural variant of the LCP problem is to minimize the RMSD instead of maximizing the size of the mapping, as follows. Given an integer ℓ, find subsequences of size ℓ of the input, such that the RMSD between the points of the subsequences is minimized. We call this problem the minimum aligned distance (MAD) problem (Table 2).
Clearly, if the MAD problem is solvable in polynomial time, then the LCP problem is solvable in polynomial time. However, the other direction is unclear. Theorem 7 will show that for P and Q of integral coordinates, if the LCP problem is solvable in polynomial time, then the MAD problem is solvable in polynomial time.
We let P^, Q^, f^, ℓ and d_{opt} denote an optimal P^{′}, Q^{′}, f, l and d, respectively. The optimal rigid transformation for superimposing P^ and Q^ is denoted T, and can be computed from P^, Q^ and f^. The symbol c^{P}_{max} denotes the largest value in the coordinates of P, c^{Q}_{max} the largest value in the coordinates of Q, and c_{max} = max{c^{P}_{max}, c^{Q}_{max}}; we know that c_{max} = O(n) for protein structures.
Since two point sequences with a known mapping can be superimposed optimally under the RMSD in linear time [^{26}], it is natural to ask if the difficulty in LCP or MAD lies solely in the combinatorial complexity of finding the optimal subsets, i.e. P^ and Q^. Our results show the contrary: if the optimal superposition T is known, both problems can be solved in polynomial time.
We first consider the sequential case. Let d_{p,q} = ‖T(p) − q‖^{2}, where T is the known transformation, and let M[i,j;k] denote the minimum sum of squared distances over k pair matches for the point sequences (p_{1},p_{2},...,p_{i}) and (q_{1},q_{2},...,q_{j}). If 1 ≤ k ≤ ℓ, 2 ≤ i ≤ m and 2 ≤ j ≤ n, we have the recurrence relation
(2)
$$M[i,j;k] = \min\left\{ M[i-1,j-1;k-1] + d_{p_i,q_j},\; M[i,j-1;k],\; M[i-1,j;k] \right\}$$
The base case of the recursion is obvious. Dynamic programming can be employed to fill in the respective M[i,j;k] values. After all the values are filled in, one can find the maximum k such that the squared sum is no more than kθ^{2} for the LCP problem. The MAD problem can be solved similarly.
The nonsequential case can be solved similarly using the minimum-cost maximum-flow problem [^{27}]. The following lemma states these results.
Lemma 1. If an optimal transformation T is known, both the LCP problem and the MAD problem can be solved in O(mnℓ) time.
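The dynamic program behind this lemma can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes the transformation has already been applied to P, and the function names (`min_cost_table`, `solve_mad`, `solve_lcp`) are hypothetical. The RMSD here is evaluated at the given transformation only; no re-superposition is performed.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def min_cost_table(P, Q, max_k):
    """best[k] = minimum sum of squared distances over sequential
    alignments with exactly k matched pairs (inf if impossible).
    P is assumed to be already transformed (hypothetical convention)."""
    m, n = len(P), len(Q)
    prev = [[0.0] * (n + 1) for _ in range(m + 1)]  # layer k = 0
    best = [0.0]
    for k in range(1, max_k + 1):
        cur = [[math.inf] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cur[i][j] = min(
                    prev[i - 1][j - 1] + sq_dist(P[i - 1], Q[j - 1]),  # match p_i with q_j
                    cur[i][j - 1],                                     # skip q_j
                    cur[i - 1][j],                                     # skip p_i
                )
        best.append(cur[m][n])
        prev = cur
    return best

def solve_mad(P, Q, ell):
    """Smallest RMSD over sequential alignments of size ell, at the fixed transformation."""
    return math.sqrt(min_cost_table(P, Q, ell)[ell] / ell)

def solve_lcp(P, Q, theta):
    """Largest k whose best alignment has RMSD <= theta, at the fixed transformation."""
    best = min_cost_table(P, Q, min(len(P), len(Q)))
    return max(k for k, cost in enumerate(best) if cost <= k * theta ** 2)

P = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
Q = [(0, 0, 0), (5, 5, 5), (2, 0, 0)]
print(solve_mad(P, Q, 2), solve_lcp(P, Q, 1.0), solve_lcp(P, Q, 5.0))
```

Only two layers of the table are kept at a time, so the sketch uses O(mn) memory while performing the O(mnℓ) work stated in the lemma.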
We next ask if the difficulty in the LCP and MAD could be due to the task of superimposing P and Q in an optimal manner to identify the subsets P^ and Q^. To examine this possibility, we remove the combinatorial task of examining each of the possible mappings between the points, by assuming a bijection F which contains the optimal mapping. (Note that this results in the problem known as model superposition in structural biology.) Again, our results show the resultant LCP and MAD problems to be solvable exactly in polynomial time.
Assume that F is a bijection that maps points in P to points in Q. Let P^{′} = (p^{′}_{1},…,p^{′}_{l}) be the domain of F and Q^{′} = (q^{′}_{1},…,q^{′}_{l}) be the range of F. Without loss of generality assume that F is sequential, and hence F(p^{′}_{i}) = q^{′}_{i}.
One can exhaustively evaluate all the subsequences of P^{′} for one with the least RMSD. However, since the number of such subsequences is exponential in l, this does not immediately give us a polynomial time solution.
If a rigid transformation T for Q^{′} is given, and all the pairs (p^{′}_{i},q^{′}_{i}), 1 ≤ i ≤ l, are sorted according to the value ‖T(q^{′}_{i}) − p^{′}_{i}‖, the MAD problem is then to choose the first ℓ pairs from (p^{′}_{i},q^{′}_{i}), 1 ≤ i ≤ l, and the LCP problem is to choose the first ℓ pairs such that RMSD((p^{′}_{1},...,p^{′}_{ℓ}),(q^{′}_{1},...,q^{′}_{ℓ})) ≤ θ and either ℓ = l or RMSD((p^{′}_{1},...,p^{′}_{ℓ},p^{′}_{ℓ+1}),(q^{′}_{1},...,q^{′}_{ℓ},q^{′}_{ℓ+1})) > θ. This gives us an incentive to obtain a total ordering of the values ‖T(q^{′}_{i}) − p^{′}_{i}‖, which will allow us to solve the MAD problem by selecting the first ℓ pairs in the ordering. The transformations which produce the same total order according to ‖T(q^{′}_{i}) − p^{′}_{i}‖ yield the same result for the MAD problem, and therefore these transformations are equivalent. This enables us to design a discrete version of the problem.
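The selection step described above, for one fixed transformation, might look like the following sketch. The helper names are hypothetical, `transform` is any callable applied to the points of Q′, and the deviation is measured at that transformation without re-optimizing the superposition.

```python
import math

def sorted_sq_devs(pairs, transform):
    """Squared deviations ||T(q) - p||^2 of all pairs (p, q), in ascending order."""
    return sorted(
        sum((a - b) ** 2 for a, b in zip(transform(q), p)) for p, q in pairs
    )

def mad_for_transform(pairs, transform, ell):
    """RMSD of the ell pairs that are closest under the given transformation."""
    devs = sorted_sq_devs(pairs, transform)
    return math.sqrt(sum(devs[:ell]) / ell)

def lcp_for_transform(pairs, transform, theta):
    """Largest prefix of the sorted pairs whose RMSD stays within theta."""
    total, size = 0.0, 0
    for k, d in enumerate(sorted_sq_devs(pairs, transform), start=1):
        total += d
        if total > k * theta ** 2:
            break  # the prefix mean only grows, so no longer prefix can qualify
        size = k
    return size

identity = lambda q: q
pairs = [((0, 0, 0), (0, 0, 0)), ((1, 0, 0), (1, 0, 2)), ((0, 1, 0), (0, 1, 1))]
print(mad_for_transform(pairs, identity, 2), lcp_for_transform(pairs, identity, 1.0))
```

Because the deviations are sorted in ascending order, the mean of a prefix never decreases as the prefix grows, which is why the loop can stop at the first violation.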
For clarity, we first present an algorithm with only translation.
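Consider two pairs (p^{′}_{i},q^{′}_{i}) and (p^{′}_{j},q^{′}_{j}). The transformations T that separate the transformations T_{1} with ‖T_{1}(q^{′}_{i}) − p^{′}_{i}‖ > ‖T_{1}(q^{′}_{j}) − p^{′}_{j}‖ from the transformations T_{2} with ‖T_{2}(q^{′}_{i}) − p^{′}_{i}‖ < ‖T_{2}(q^{′}_{j}) − p^{′}_{j}‖ are those for which
(3)
$$\lVert T(q'_i) - p'_i \rVert^{2} - \lVert T(q'_j) - p'_j \rVert^{2} = 0.$$
Let ∙ denote the dot product. If the transformation is a translation t, we have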
(4)
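$$\begin{aligned}
\lVert T(q'_i) - p'_i \rVert^{2} - \lVert T(q'_j) - p'_j \rVert^{2}
&= \lVert q'_i - p'_i - t \rVert^{2} - \lVert q'_j - p'_j - t \rVert^{2} \\
&= \sum_{v = x,y,z} (v_{q'_i} - v_{p'_i} - v_t)^{2} - \sum_{v = x,y,z} (v_{q'_j} - v_{p'_j} - v_t)^{2} \\
&= \lVert q'_i - p'_i \rVert^{2} - \lVert q'_j - p'_j \rVert^{2} - 2\, t \bullet \big( (q'_i - p'_i) - (q'_j - p'_j) \big) = 0
\end{aligned}$$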
Consider the space of all translation vectors in R^{3}, and consider each vector as a point in this space (not the space that P and Q are in). The values that the variable t in Equation 4 may take form a plane in this translation space. The plane partitions the translation space into two types of translations, T_{1} and T_{2} say, where ‖T_{1}(q^{′}_{i}) − p^{′}_{i}‖ > ‖T_{1}(q^{′}_{j}) − p^{′}_{j}‖ and ‖T_{2}(q^{′}_{i}) − p^{′}_{i}‖ < ‖T_{2}(q^{′}_{j}) − p^{′}_{j}‖. Since there are l pairs, there are O(l) planes, which partition the space into O(l^{3}) cells.
The translations in each cell result in the same ordering of the pairs with respect to ‖T(q^{′}_{i}) − p^{′}_{i}‖. For each cell, this total order can be obtained in O(l) time from any given total order of its neighbor cells, since the change is O(1). Therefore, the MAD solution can be obtained in amortized time O(l) for each cell, and the LCP solution can thus be obtained in time O(l^{2}). Hence the total runtime is O(l^{4}) for the MAD problem, and O(l^{5}) for the LCP problem under translations, with the given mapping F.
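The role of the separating plane can be checked numerically. The sketch below (hypothetical helper names, integer inputs so that the arithmetic is exact) evaluates the linear form in t from Equation 4 and shows that its sign agrees with the ordering of the two pairs under the translation t.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def plane_side(pair_i, pair_j, t):
    """Sign of ||q_i - p_i - t||^2 - ||q_j - p_j - t||^2, computed from the
    expanded form, which is linear in t."""
    (p_i, q_i), (p_j, q_j) = pair_i, pair_j
    a, b = sub(q_i, p_i), sub(q_j, p_j)
    value = dot(a, a) - dot(b, b) - 2 * dot(t, sub(a, b))
    return (value > 0) - (value < 0)

def order_for_translation(pairs, t):
    """Indices of the pairs sorted by their deviation under translation t."""
    def sq_dev(pair):
        p, q = pair
        return sum((qv - pv - tv) ** 2 for pv, qv, tv in zip(p, q, t))
    return sorted(range(len(pairs)), key=lambda i: sq_dev(pairs[i]))

pairs = [((0, 0, 0), (3, 0, 0)), ((0, 0, 0), (0, 1, 0))]
for t in [(0, 0, 0), (3, 0, 0)]:
    print(t, plane_side(pairs[0], pairs[1], t), order_for_translation(pairs, t))
```

The two sample translations lie on opposite sides of the plane, and the order of the two pairs flips accordingly.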
The rigid transformations which separate the two relations ‖T_{1}(q^{′}_{i}) − p^{′}_{i}‖ > ‖T_{1}(q^{′}_{j}) − p^{′}_{j}‖ and ‖T_{2}(q^{′}_{i}) − p^{′}_{i}‖ < ‖T_{2}(q^{′}_{j}) − p^{′}_{j}‖ are as in Equation 3.
Suppose the rigid transformation T is composed of a rotation R and a translation t. Then
(5)
$$\begin{aligned}
\lVert T(q'_i) - p'_i \rVert^{2} - \lVert T(q'_j) - p'_j \rVert^{2}
&= \lVert R(q'_i) - p'_i - t \rVert^{2} - \lVert R(q'_j) - p'_j - t \rVert^{2} \\
&= \lVert R(q'_i) - p'_i \rVert^{2} - \lVert R(q'_j) - p'_j \rVert^{2} - 2\, t \bullet \big[ (R(q'_i) - p'_i) - (R(q'_j) - p'_j) \big]
\end{aligned}$$
A rotation matrix contains three variables, which are specified using three angles, say α_{1}, α_{2}, α_{3}, each from −π to π. Let r_{i} = cos α_{i} and s_{i} = sin α_{i}; then Equation 5 can be considered as a polynomial of degree six in nine variables. The nine variables are the r_{i}, the s_{i} and the three variables for translation. In total, there are O(l) such polynomials.
We know the following theorem from the literature.
Theorem 2 [^{7},^{28}]. Given a set of k polynomials, P = {f_{1},...,f_{k}}, where each polynomial has a maximum degree of s, contains at most r variables, and in addition all the coefficients are rational, then all the sign conditions can be determined by O(k(k/r)^{r}s^{O(r)}) arithmetic operations. A sign condition V is the vector of signs for some point u ∈ R^{r}; that is, V = (sign(f_{1}(u)),...,sign(f_{k}(u))). Two points u, u^{′} ∈ R^{r} are equivalent if their sign condition vectors are the same.
Each sign vector represents the transformations of the cell it belongs to, and it determines a total order of the pairs. As in the case of translation, with Theorem 2, we have
Theorem 3. Given a bijection F:P^{′} → Q^{′}, where |P^{′}| = l, the MAD problem can be solved in O(l^{10}) time and the LCP problem can be solved in O(l^{11}) time.
One possible contributing factor to the difficulty of the LCP problem could be its flexibility in allowing input coordinates of any arbitrary precision. This is because intuitively, this arbitrariness in the precision introduces the burden of examining the solution space in an unbounded manner. However, such an exhaustive search is not necessary for the purpose of protein structure comparison, since coordinates of protein structures are specified only to three decimal places in the commonly used PDB format.
In this section, we restrict the precision in which the input coordinates may be specified. Without loss of generality, we assume that the input coordinates are given in integers, since numbers of any fixed precision can be trivially scaled up to integral values. This assumption is used to obtain Lemma 5 and Theorem 7.
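As a small illustration of this scaling step, the sketch below converts coordinates given to three decimal places into integers and reads off the largest absolute coordinate. The helper names and the sample Cα trace are hypothetical and not taken from the paper.

```python
def to_integer_coords(points, decimals=3):
    """Scale fixed-precision coordinates up to integers (hypothetical helper)."""
    scale = 10 ** decimals
    # round() absorbs the tiny binary floating-point error of the multiplication
    return [tuple(round(v * scale) for v in p) for p in points]

def largest_coordinate(points):
    """Largest absolute coordinate value of one structure."""
    return max(abs(v) for p in points for v in p)

ca_trace = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.7, 3.291, 0.0)]
scaled = to_integer_coords(ca_trace)
print(scaled, largest_coordinate(scaled))
```

Scaling multiplies every distance by the same factor, so an RMSD threshold has to be scaled in the same way before the integer instance is solved.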
We also place an upper bound on the distance between consecutive points according to the structure of proteins. As a result, c_{max} is bounded by n, as follows.
Points drawn from protein structures have upper bounds on their diameters because they are connected, and many are globular.
– For a connected structure, the points are at most O(n) distance apart. That is, c_{max} is of O(n).
– For a globular structure, the points are at most O(n^{1/3}) distance apart [^{14}]. That is, c_{max} is of O(n^{1/3}).
Given a point p, let the x coordinate of p be denoted x_{p}. Similarly, we can define y_{p} and z_{p}. Without loss of generality, we assume that the first point of a protein structure is at the origin. The largest coordinate of a protein is bounded by O(n), and the largest coordinate of a globular protein structure is bounded by O(n^{1/3}); that is
(6)
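$$\max_{p \in P,\; v = x,y,z} \lvert v_p \rvert = O(n), \quad \text{if } P \text{ is a protein structure}$$
(7)
$$\max_{p \in P,\; v = x,y,z} \lvert v_p \rvert = O(n^{1/3}), \quad \text{if } P \text{ is a globular protein structure}$$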
Our results show that, under these two conditions,
1. the LCP problem is of similar difficulty as the MAD problem, and
2. both problems can be solved exactly in polynomial time.
We first establish some bounds to the RMSD. The minimum RMSD is zero if T brings Q^ to coincide exactly with P^. This case is referred to as the exact matching, which can be easily solved by the method in [^{29}]. However, if we assume the RMSD to be nonzero, then a lower bound and an upper bound for it can be computed.
Let Π be a permutation of {1,...,ℓ}. For the sequence X = (x_{1},…,x_{n}), let d^{X}_{i,j}, 1 ≤ i,j ≤ n, denote the Euclidean distance between x_{i} and x_{j}. The following results, which are proven in the Appendix, can be obtained.
Lemma 4.
$$\sqrt{\frac{1}{2\ell} \sum_{i=1}^{\lfloor \ell/2 \rfloor} \left( d^{\hat{P}}_{\Pi(i),\Pi(i+\lfloor \ell/2 \rfloor)} - d^{\hat{Q}}_{\Pi(i),\Pi(i+\lfloor \ell/2 \rfloor)} \right)^{2}} \le \mathrm{RMSD}(\hat{P},\hat{Q}).$$
Lemma 5. If RMSD(P^,Q^) ≠ 0, then
$$\mathrm{RMSD}(\hat{P},\hat{Q}) \ge \frac{\sqrt{12 c_{max}^{2}} - \sqrt{12 c_{max}^{2} - 1}}{\sqrt{2\ell}}.$$
Lemma 6.
$$\mathrm{RMSD}(\hat{P},\hat{Q}) \le 4\sqrt{3}\, c_{max}.$$
Suppose there is a polynomial time algorithm for solving the LCP problem. To use it to solve the MAD problem, we assume that d_{opt} ∈ [l,u], for some real l and u, l ≤ u. We use a binary search strategy in the interval [l,u], as shown in Table 3, to search for the minimum threshold value such that the LCP solution size is ℓ. However, the search will not terminate if an arbitrary accuracy of the d_{opt} value is required. We prove below that the accuracy of d_{opt} can be defined by polynomially many bits. Given two thresholds t_{1} and t_{2}, assume that we obtain two different LCP solutions, and the RMSD values of the two solutions are θ_{1} and θ_{2}, where θ_{1} > θ_{2}. Similar to the arguments in Lemma 5, the difference between θ_{1} and θ_{2} is at least (√(12c_{max}^{2}) − √(12c_{max}^{2} − 1))/√(2ℓ). Therefore, if the threshold values of two consecutive binary search steps differ by less than (√(12c_{max}^{2}) − √(12c_{max}^{2} − 1))/√(2ℓ), the search can be terminated. The values of l and u are the bounds obtained in the previous subsection. Hence,
Theorem 7. Solving the MAD problem is equivalent to solving O(log(ℓc_{max})) instances of the LCP problem.
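The binary search behind this reduction can be sketched as below. The LCP solver is passed in as a callable `lcp_size(theta)`, which is a hypothetical oracle interface and not something defined in the paper; the termination gap and the upper end of the interval follow the bounds of Lemmas 5 and 6, and the search starts from 0 rather than from the Lemma 5 bound for simplicity.

```python
import math

def mad_via_lcp(lcp_size, ell, c_max):
    """Approximate d_opt from above to within the Lemma 5 gap, using an
    LCP oracle lcp_size(theta) -> size of the largest alignment with
    RMSD <= theta (hypothetical interface)."""
    gap = (math.sqrt(12 * c_max ** 2) - math.sqrt(12 * c_max ** 2 - 1)) / math.sqrt(2 * ell)
    lo, hi = 0.0, 4 * math.sqrt(3) * c_max   # hi is the Lemma 6 upper bound
    if lcp_size(lo) >= ell:
        return lo                            # exact matching, RMSD = 0
    # invariant: lcp_size(lo) < ell <= lcp_size(hi)
    while hi - lo >= gap:
        mid = (lo + hi) / 2
        if lcp_size(mid) >= ell:
            hi = mid
        else:
            lo = mid
    return hi

# Toy oracle for illustration only: an alignment of size 5 needs RMSD >= 1.25.
toy_oracle = lambda theta: 5 if theta >= 1.25 else 4
print(mad_via_lcp(toy_oracle, ell=5, c_max=10))
```

One further oracle call at the returned threshold would yield the alignment itself. The loop runs for about log2((4√3 c_max)/gap) iterations, which is logarithmic in ℓ and c_max.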
Since the reduction from the LCP problem to the MAD problem is obvious, we conclude that the two problems are of similar difficulty.
We now show that under the two conditions, the LCP and MAD problems can be solved in polynomial time.
As shown in the ‘Complexity of the LCP and MAD when the optimal superposition is known’ section, when the optimal superposition is known, there are polynomial time algorithms for LCP and MAD. We consider an enumeration of all the possible superpositions. Under the two conditions, we claim that there are at most polynomially many such superpositions.
First, if we know P^ and Q^, then the optimal superposition can be computed in the following two steps:
1. Translate P^ and Q^ such that their centroids are at the origin.
2. Then, rotate Q^ to find the superposition with the minimum distance [^{26}].
Denote the translations to obtain the optimal solution for P and Q as t_{P} and t_{Q}, respectively, and denote the optimal rotation by R^.
We now show that one needs to examine only polynomially many translations and rotation combinations to discover the values for t_{P}, t_{Q}, and R^. These numbers can be effectively bounded by n when properties of protein structures are taken into account. We first describe these properties.
The centroid of P^ is (1/ℓ) ∑_{p′ ∈ P^} p′. To bring the centroid of P^ to the origin, the translation is −(1/ℓ) ∑_{p′ ∈ P^} p′. Clearly, all three coordinates of −∑_{p′ ∈ P^} p′ are integers, since all the coordinates of the points p ∈ P are integers. The absolute value of the x-coordinate of −(1/ℓ) ∑_{p′ ∈ P^} p′ is bounded by |(1/ℓ) ∑_{p′ ∈ P^} x_{p′}| ≤ (1/ℓ) ∑_{p′ ∈ P^} c^{P}_{max} ≤ c_{max}. Similarly, all three coordinates of the translation −(1/ℓ) ∑_{p′ ∈ P^} p′ are bounded within the interval [−c_{max},c_{max}]. To obtain an optimal MAD solution, each coordinate of the translation on P^ must be of the form I/ℓ, where I is an integer. Since it is possible to examine all the possible values for I, we have the following result.
Lemma 8. t_{P}, t_{Q} ∈ {I/ℓ : I an integer, −ℓc_{max} ≤ I ≤ ℓc_{max}}^{3}.
With the centroids of P^ and Q^ translated to the origin, we proceed to identify the rotation in our algorithm. Let X_{P} denote the vector 〈x_{p_{1}},...,x_{p_{ℓ}}〉 for structure P. Similarly we define Y_{P} and Z_{P}.
Let $\hat{P}^{t} = (p'_1 - t, \ldots, p'_\ell - t)$ and $\hat{Q}^{t} = (q'_1 - t, \ldots, q'_\ell - t)$, where $p'_i \in \hat{P}$ and $q'_i \in \hat{Q}$.
Given P^ and Q^, to compute the rotation R^, the first step is to create the 3 × 3 matrix, which is (from [^{26}])
(8)
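$$M = \begin{pmatrix}
X_{\hat{P}^{t_P}} \bullet X_{\hat{Q}^{t_Q}} & X_{\hat{P}^{t_P}} \bullet Y_{\hat{Q}^{t_Q}} & X_{\hat{P}^{t_P}} \bullet Z_{\hat{Q}^{t_Q}} \\
Y_{\hat{P}^{t_P}} \bullet X_{\hat{Q}^{t_Q}} & Y_{\hat{P}^{t_P}} \bullet Y_{\hat{Q}^{t_Q}} & Y_{\hat{P}^{t_P}} \bullet Z_{\hat{Q}^{t_Q}} \\
Z_{\hat{P}^{t_P}} \bullet X_{\hat{Q}^{t_Q}} & Z_{\hat{P}^{t_P}} \bullet Y_{\hat{Q}^{t_Q}} & Z_{\hat{P}^{t_P}} \bullet Z_{\hat{Q}^{t_Q}}
\end{pmatrix}$$
Each such matrix is decomposed by singular value decomposition, from which a rotation matrix is obtained.
We know that the coordinate of each point in the protein is within the interval [−c_{max},c_{max}]. This implies that for U = X,Y,Z and V = X,Y,Z,
$$U_{\hat{P}^{t_P}} \bullet V_{\hat{Q}^{t_Q}} = \sum_{k=1}^{\ell} (p_{i,k} - t_P)(q_{j,k} - t_Q) \le \sum_{k=1}^{\ell} (2c_{max})^{2} \le 4\ell c_{max}^{2}.$$
Also, it is clear that $U_{\hat{P}^{t_P}} \bullet V_{\hat{Q}^{t_Q}}$ is of the form I/ℓ^{2}, where I is an integer. The matrix in Equation 8 has nine elements; we write e ∈ M if e is one of these elements. The following lemma follows.
Lemma 9. For each element e ∈ M in Equation 8, e ∈ {I/ℓ^{2} : I an integer, −4ℓ^{3}c_{max}^{2} ≤ I ≤ 4ℓ^{3}c_{max}^{2}}.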
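The discreteness claims of Lemmas 8 and 9 can be checked on a small example with exact rational arithmetic. The sketch below is only an illustration under the integer-coordinate assumption; the function names and the sample points are hypothetical.

```python
from fractions import Fraction

def centroid(points):
    """Exact centroid of integer points; each coordinate has the form I / ell."""
    ell = len(points)
    return tuple(Fraction(sum(p[a] for p in points), ell) for a in range(3))

def cross_matrix(P_hat, Q_hat):
    """3x3 matrix of dot products between the centered coordinate vectors
    of the matched points, as in Equation 8."""
    c_p, c_q = centroid(P_hat), centroid(Q_hat)
    return [
        [sum((p[a] - c_p[a]) * (q[b] - c_q[b]) for p, q in zip(P_hat, Q_hat))
         for b in range(3)]
        for a in range(3)
    ]

P_hat = [(0, 0, 0), (1, 2, 0), (3, 1, 1)]
Q_hat = [(1, 0, 0), (0, 2, 2), (2, 2, 1)]
ell, c_max = len(P_hat), 3

M = cross_matrix(P_hat, Q_hat)
for row in M:
    print([str(e) for e in row])
```

Every entry multiplied by ℓ^{2} is an integer, and every entry stays within the 4ℓc_{max}^{2} bound, which is what makes the enumeration over candidate matrices finite.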
To compute the optimal MAD solution, we first enumerate all the possible translations and rotations. A solution is computed for each translation and rotation combination according to Lemma 1. An optimal solution can be chosen from these computed solutions (Table 4).
According to Lemma 8, the optimal translations t_{P} and t_{Q} must be within {I/ℓ : I an integer, −ℓc_{max} ≤ I ≤ ℓc_{max}}^{3}. To find the optimal rotation matrix, it suffices that we try all the possible values for each entry in Equation 8. Since there are O(ℓ^{27}c_{max}^{18}) matrices, the total number of transformations to examine is bounded by O(ℓ^{33}c_{max}^{24}). It takes time O(mnℓ) to identify the MAD solution for each transformation. An LCP solution can be obtained by iterating ℓ from 1 to min{m,n} for the MAD problem.
The running time is the product of three parts: the number of possible translations, the number of possible rotation matrices, and the running time for a given rotation matrix and translation combination (that is, the running time when the transformation is known). These numbers are bounded by polynomials in ℓ and c_{max}, and c_{max} is bounded by O(m) when we consider the properties of protein structures. Likewise, c_{max} is polynomial with respect to the input size if coded in unary.
Theorem 10. The MAD problem can be solved in O(ℓ^{34}m^{25}n) time for protein structures, and in O(ℓ^{34}m^{9}n) time for globular protein structures. The LCP problem can be solved in O(ℓ^{35}m^{25}n) time for protein structures, and in O(ℓ^{35}m^{9}n) time for globular protein structures. Both the MAD and LCP problems are pseudo-polynomially solvable for general point sets.
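The exponents in these bounds can be traced with a short count. This is a worked sketch assuming the candidate sets of Lemmas 8 and 9 and the O(mnℓ) cost per transformation, with c_{max} = O(m) for protein structures and c_{max} = O(m^{1/3}) for globular ones:

$$\begin{aligned}
\#\{(t_P, t_Q)\} &\le (2\ell c_{max} + 1)^{3} \cdot (2\ell c_{max} + 1)^{3} = O(\ell^{6} c_{max}^{6}), \\
\#\{M\} &\le (8\ell^{3} c_{max}^{2} + 1)^{9} = O(\ell^{27} c_{max}^{18}), \\
\#\{\text{transformations}\} &= O(\ell^{6} c_{max}^{6}) \cdot O(\ell^{27} c_{max}^{18}) = O(\ell^{33} c_{max}^{24}), \\
\text{MAD time} &= O(\ell^{33} c_{max}^{24}) \cdot O(mn\ell) = O(\ell^{34} c_{max}^{24}\, m n), \\
c_{max} = O(m) &\;\Rightarrow\; O(\ell^{34} m^{25} n), \qquad
c_{max} = O(m^{1/3}) \;\Rightarrow\; O(\ell^{34} m^{9} n).
\end{aligned}$$

Iterating ℓ from 1 to min{m,n} for the LCP problem contributes the additional factor of ℓ in the LCP bounds.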
We studied the LCP problem under the RMSD in this paper. As it turns out, the difficulty of the problem does not lie in its combinatoric aspect or its structural superposition aspect alone. That is, if the problem is hard, then it must be a consequence of both aspects. Our results show that if one is allowed to compromise on one of the aspects, then the problem is solvable exactly in polynomial time. Regrettably, we do not see how the optimal solution can be obtained in both cases.
On the other hand, we showed an encouraging result: There is a polynomial time algorithm which solves the problem optimally, if one restricts the input coordinates in the problem to be integral, and places a limit on the distance between consecutive points. These requirements do not pose any restriction to typical uses in the analysis of protein structures, since protein structures are specified only to a fixed precision in practice, and there is an upper bound to the distance between protein residues.
One problem is that our proposed polynomial time algorithm remains high in time complexity. We hope that the present work will provide the foundation for future efforts to obtain algorithms with lower runtime complexities.
In this Appendix, we include the proofs of the results in the paper which have been omitted to enhance readability.
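Lemma 4.
$$\sqrt{\frac{1}{2\ell} \sum_{i=1}^{\lfloor \ell/2 \rfloor} \left( d^{\hat{P}}_{\Pi(i),\Pi(i+\lfloor \ell/2 \rfloor)} - d^{\hat{Q}}_{\Pi(i),\Pi(i+\lfloor \ell/2 \rfloor)} \right)^{2}} \le \mathrm{RMSD}(\hat{P},\hat{Q}).$$
Without loss of generality, we just show that
$$\sqrt{\frac{1}{2\ell} \sum_{i=1}^{\lfloor \ell/2 \rfloor} \left( d^{\hat{P}}_{i,i+\lfloor \ell/2 \rfloor} - d^{\hat{Q}}_{i,i+\lfloor \ell/2 \rfloor} \right)^{2}} \le \mathrm{RMSD}(\hat{P},\hat{Q}).$$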
Let
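$$r_i = \lVert T(q_i) - p_i \rVert^{2} + \lVert T(q_{i+\lfloor \ell/2 \rfloor}) - p_{i+\lfloor \ell/2 \rfloor} \rVert^{2},$$
u_{i}=〈p_{i},p_{i + ⌊ℓ/2⌋}〉, and
v_{i}=〈q_{i},q_{i + ⌊ℓ/2⌋}〉, where 1≤i≤⌊ℓ/2⌋.
First, we prove that r_{i} ≥ (u_{i} − v_{i})^{2} / 2, for 1 ≤ i ≤ ⌊ℓ/2⌋, where u_{i} and v_{i} in such expressions stand for the lengths of the respective segments. We first superimpose u_{i} and v_{i} to optimize the squared sum; that is, to find a transformation T such that ‖T(q_{i}) − p_{i}‖^{2} + ‖T(q_{i + ⌊ℓ/2⌋}) − p_{i + ⌊ℓ/2⌋}‖^{2} is minimized. The centroids have to coincide to minimize the squared distance. Assume the centroids are at the origin and that the angle between the vectors from o to p_{i} and from o to q_{i} is α, where o is the origin; then by trigonometry
$$\lVert T(q_i) - p_i \rVert^{2} + \lVert T(q_{i+\lfloor \ell/2 \rfloor}) - p_{i+\lfloor \ell/2 \rfloor} \rVert^{2} = 2\left[ \left(\tfrac{1}{2} u_i\right)^{2} + \left(\tfrac{1}{2} v_i\right)^{2} - 2 \cdot \tfrac{1}{2} u_i \cdot \tfrac{1}{2} v_i \cos\alpha \right] \ge \frac{(u_i - v_i)^{2}}{2}$$
r_{i} is the squared distance under the transformation T, which may not be optimal for superimposing u_{i} and v_{i}. Therefore,
$$r_i \ge \frac{(u_i - v_i)^{2}}{2}$$
Putting things together, we have
$$\mathrm{RMSD}(\hat{P},\hat{Q}) \ge \sqrt{\frac{r_1 + \ldots + r_{\lfloor \ell/2 \rfloor}}{\ell}} \ge \sqrt{\frac{(u_1 - v_1)^{2} + \ldots + (u_{\lfloor \ell/2 \rfloor} - v_{\lfloor \ell/2 \rfloor})^{2}}{2\ell}} \ge \sqrt{\frac{1}{2\ell} \sum_{i=1}^{\lfloor \ell/2 \rfloor} \left( d^{\hat{P}}_{i,i+\lfloor \ell/2 \rfloor} - d^{\hat{Q}}_{i,i+\lfloor \ell/2 \rfloor} \right)^{2}}$$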
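Lemma 5. If RMSD(P^,Q^) ≠ 0, then
$$\mathrm{RMSD}(\hat{P},\hat{Q}) \ge \frac{\sqrt{12 c_{max}^{2}} - \sqrt{12 c_{max}^{2} - 1}}{\sqrt{2\ell}}.$$
If RMSD(P^,Q^) is nonzero, then there is at least one pair of indices i and j such that |d^{P^}_{i,j} − d^{Q^}_{i,j}| > 0. According to Lemma 4,
$$\mathrm{RMSD}(\hat{P},\hat{Q}) \ge \sqrt{\frac{1}{2\ell} \left( d^{\hat{P}}_{i,j} - d^{\hat{Q}}_{i,j} \right)^{2}} = \frac{\lvert d^{\hat{P}}_{i,j} - d^{\hat{Q}}_{i,j} \rvert}{\sqrt{2\ell}} \ge \frac{\sqrt{12 c_{max}^{2}} - \sqrt{12 c_{max}^{2} - 1}}{\sqrt{2\ell}}.$$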
□
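Lemma 6.
$$\mathrm{RMSD}(\hat{P},\hat{Q}) \le 4\sqrt{3}\, c_{max}.$$
Denote the furthest point to the origin in P as p_{max}, and the furthest point to the origin in Q as q_{max}. Then,
$$\begin{aligned}
\ell \times \mathrm{RMSD}^{2}(\hat{P},\hat{Q}) &= \sum_{i=1}^{\ell} \lVert T(q_i) - p_i \rVert^{2} \le \sum_{i=1}^{\ell} \lVert q_i - p_i \rVert^{2} \\
&\le \sum_{i=1}^{\ell} \max\{ \lVert q_{max} - p_{max} \rVert^{2}, \lVert q_{max} + p_{max} \rVert^{2} \} \le \ell \max\{ \lVert q_{max} - p_{max} \rVert^{2}, \lVert q_{max} + p_{max} \rVert^{2} \} \\
&\le \ell \left( 2\sqrt{(c_{max} + c_{max})^{2} + (c_{max} + c_{max})^{2} + (c_{max} + c_{max})^{2}} \right)^{2} = 48\, \ell\, c_{max}^{2}
\end{aligned}$$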
□
LCP: Largest Common Point set; MAD: Minimum Aligned Distance; RMSD: Root Mean Square Deviation
The author declares that he has no competing interests.
SCL conceived of the results herein and wrote the manuscript.
This work is supported by the CityU SRG 7002731. The author thanks Prof. Richard M. Karp for useful discussions.
References
1. Perutz MF, Rossmann MG, Cullis AF, Muirhead H, Will G, North ACT. Structure of myoglobin: a three-dimensional Fourier synthesis at 5.5 Angstrom resolution. Nature 1960, 185:416-422. doi:10.1038/185416a0. PMID: 18990801.
2. Kolodny R, Koehl P, Levitt M. Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J Comput Biol 2005, 346(4):1173-1188.
3. Sippl MJ. On distance and similarity in fold space. Bioinformatics 2008, 24(6):872-873. doi:10.1093/bioinformatics/btn040. PMID: 18227113.
4. Ilyin VA, Abyzov A, Leslin CM. Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci 2004, 13(7):1865-1874. doi:10.1110/ps.04672604. PMID: 15215530.
5. Shibuya T, Jansson J, Sadakane K. Linear-time protein 3D structure searching with insertions and deletions. Algorithms Mol Biol 2010, 5:7-18. PMID: 20047657.
6. Bu D, Li M, Li SC, Qian J, Xu J. Finding compact structural motifs. Theor Comput Sci 2009, 410:2834-2839. doi:10.1016/j.tcs.2009.03.023.
7. Ambühl C, Chakraborty S, Gärtner B. Computing Largest Common Point Sets under Approximate Congruence. In ESA ’00: Proceedings of the 8th Annual European Symposium on Algorithms, Volume 1876 of Lecture Notes in Computer Science. Saarbrücken, Germany: Springer-Verlag; 2000:52-63.
8. Goldman D, Papadimitriou CH, Istrail S. Algorithmic Aspects of Protein Structure Similarity. New York City, NY, USA: IEEE Computer Society; 1999.
9. Li SC, Ng YK. On protein structure alignment under distance constraint. In 20th International Symposium on Algorithms and Computation, ISAAC 2009, Proceedings, Volume 5878 of Lecture Notes in Computer Science. Honolulu, HI, USA: Springer; 2009:65-76.
10. Lathrop RH. The Protein Threading Problem With Sequence Amino Acid Interaction Preferences Is NP-Complete. Protein Eng 1995, 7:1059-1068. PMID: 7831276.
11. Godzik A. The structural alignment between two proteins: is there a unique answer? Protein Sci 1996, 5(7):1325-1338. doi:10.1002/pro.5560050711. PMID: 8819165.
12. Eidhammer I, Jonassen I, Taylor WR. Structure Comparison and Structure Patterns. J Comput Biol 2000, 7(5):685-716. doi:10.1089/106652701446152. PMID: 11153094.
13. Zhao Z, Fu B, Alanis FJ, Summa CM. Feedback algorithm and web-server for protein structure alignment. J Comput Biol 2008, 15(5):505-524. doi:10.1089/cmb.2008.0075. PMID: 18549304.
14. Kolodny R, Linial N. Approximate Protein Structural Alignment in Polynomial Time. Proc Natl Acad Sci 2004, 101:12201-12206. doi:10.1073/pnas.0404383101. PMID: 15304646.
15. Akutsu T, Tashimo H. Protein structure comparison using representation by line segment sequences. In Proc. Pacific Symposium on Biocomputing ’96. Hawaii, USA: World Scientific Press; 1996:25-40.
16. Alexandrov NN. SARFing the PDB. Protein Eng 1996, 9(9):727-732. doi:10.1093/protein/9.9.727. PMID: 8888137.
17. Caprara A, Lancia G. Structural alignment of large-size proteins via lagrangian relaxation. In RECOMB ’02: Proceedings of The Sixth Annual International Conference on Computational Biology. New York, NY, USA: ACM; 2002:100-108.
18. Comin M, Guerra C, Zanotti G. PROuST: A Comparison Method of Three-Dimensional Structure of Proteins using Indexing Techniques. J Comp Biol 2004, 11:1061-1072. doi:10.1089/cmb.2004.11.1061.
19. Gerstein M, Levitt M. Using Iterative Dynamic Programming to Obtain Accurate Pairwise and Multiple Alignments of Protein Structures. In Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology. St. Louis, MO, USA: AAAI Press; 1996:59-67.
20. Gibrat JF, Madej T, Bryant SH. Surprising similarities in structure comparison. Curr Opin Struct Biol 1996, 6(3):377-385. doi:10.1016/S0959-440X(96)80058-3. PMID: 8804824.
21. Holm L, Sander C. Protein structure comparison by alignment of distance matrices. J Mol Biol 1993, 233:123-138. doi:10.1006/jmbi.1993.1489. PMID: 8377180.
Lancia G,Carr R,Walenz B,Istrail S,101 optimal PDB structure alignments: a branchandcut algorithm for the maximum contact map overlap problemRECOMB ’01: Proceedings of The Fifth Annual International Conference on Computational BiologyYear: 2001New York, NY, USA: ACM193202  
Singh AP,Brutlag DL,Hierarchical Protein Structure Superposition Using Both Secondary Structure and Atomic RepresentationsProceedings of the 5th International Conference on Intelligent Systems for Molecular BiologyYear: 1997Halkidiki, Greece: AAAI Press284293  
Berman HM,Westbrook J,Feng Z,Gilliland G,Bhat TN,Weissig H,Shindyalov IN,Bourne PE,The protein data bankNucl Acids ResYear: 200028235242 [http://nar.oxfordjournals.org/cgi/content/abstract/28/1/235]. 10.1093/nar/28.1.23510592235  
Papadimitriou C,The Euclidean Traveling Salesman Problem is NPCompleteTheoretial Compututer SciYear: 19774323724410.1016/03043975(77)900123  
Arun KS,Huang TS,Blostein SD,Leastsquares fitting of two 3D point setsIEEE Trans Pattern Anal Mach IntellYear: 19879569870021869429  
Ahuja RK,Magnanti TL,Orlin JB,Network Flows: Theory, Algorithms, and ApplicationsYear: 1993Englewood Cliffs, NJ: Prentice Hall  
Caviness B, Johnson J A New Algorithm to Find a Point in Every Cell Defined by a Family of PolynomialsYear: 1998Springer Vienna: SpringerVerlag  
de Rezende PJ,Lee DT,Point Set Pattern Matching in dDimensionsAlgorithmicaYear: 199513438740410.1007/BF01293487 
Tables
Largest Common Point (LCP) set problem definition under RMSD
LCP problem by the RMSD measure
Input: sequences P = (p_{1},…,p_{n}), Q = (q_{1},…,q_{m}) and distance threshold θ ∈ R. Without loss of generality assume m ≥ n.
Output: (i) subsequences P^{′} ⊆ P, Q^{′} ⊆ Q, |P^{′}| = |Q^{′}|, and
(ii) bijection f: P^{′} ↦ Q^{′}, fulfilling the following conditions:
(A) RMSD(P^{′}, f(P^{′})) ≤ θ,
(B) the score ℓ = |P^{′}| is maximized.
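Conditions (A) and the bijection requirement can be checked mechanically for any candidate alignment. The sketch below is a minimal illustration, not the method of the paper: it assumes the two structures have already been superimposed (no rotation or translation is optimized), and the names `rmsd` and `is_feasible` are hypothetical helpers introduced here.

```python
import math

def rmsd(pairs):
    """RMSD over aligned pairs [(p, q), ...] of 3D points.

    Assumption: the points are used as given, i.e. the superposition
    is taken to be fixed already.
    """
    if not pairs:
        raise ValueError("need at least one aligned pair")
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in pairs)
    return math.sqrt(total / len(pairs))

def is_feasible(P, Q, alignment, theta):
    """Check an LCP candidate given as index pairs (i, j) into P and Q.

    Returns True when the pairs form a bijection between the chosen
    points and condition (A), RMSD <= theta, holds.
    """
    if not alignment:
        return False
    i_idx = [i for i, _ in alignment]
    j_idx = [j for _, j in alignment]
    # A bijection uses every chosen point of P and of Q exactly once.
    if len(set(i_idx)) != len(i_idx) or len(set(j_idx)) != len(j_idx):
        return False
    return rmsd([(P[i], Q[j]) for i, j in alignment]) <= theta
```

Condition (B) then amounts to searching for the largest alignment for which `is_feasible` returns True.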
Minimum Aligned Distance (MAD) problem definition under RMSD
MAD problem by the RMSD measure
Input: sequences P = (p_{1},…,p_{n}), Q = (q_{1},…,q_{m}) and ℓ ∈ I. Without loss of generality assume m ≥ n.
Output: (i) subsequences P^{′} ⊆ P, Q^{′} ⊆ Q, |P^{′}| = |Q^{′}|, and
(ii) bijection f: P^{′} ↦ Q^{′}, fulfilling the following conditions:
(A) |P^{′}| = ℓ,
(B) d = RMSD(P^{′}, f(P^{′})) is minimized.
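When the superposition is fixed, one simple way to solve a restricted form of MAD is dynamic programming. The sketch below rests on two assumptions that go beyond the definition above: the points are compared as given (no rotation or translation), and the bijection f is order-preserving along both sequences. The function name `mad_fixed_superposition` is hypothetical.

```python
import math

def mad_fixed_superposition(P, Q, ell):
    """Minimum RMSD over order-preserving alignments of exactly ell pairs.

    Assumptions: superposition already fixed; f preserves sequence order.
    Returns (rmsd, pairs) with pairs as index tuples (i, j).
    """
    n, m = len(P), len(Q)
    if not 1 <= ell <= min(n, m):
        raise ValueError("ell must satisfy 1 <= ell <= min(|P|, |Q|)")

    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    inf = math.inf
    # best[k][i][j]: least sum of squared distances using exactly k pairs
    # drawn from the prefixes P[:i] and Q[:j].
    best = [[[inf] * (m + 1) for _ in range(n + 1)] for _ in range(ell + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            best[0][i][j] = 0.0
    for k in range(1, ell + 1):
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                best[k][i][j] = min(
                    best[k][i - 1][j],          # skip P[i-1]
                    best[k][i][j - 1],          # skip Q[j-1]
                    best[k - 1][i - 1][j - 1] + d2(P[i - 1], Q[j - 1]),
                )
    # Trace back one optimal alignment.
    pairs, k, i, j = [], ell, n, m
    while k > 0:
        if best[k][i][j] == best[k][i - 1][j]:
            i -= 1
        elif best[k][i][j] == best[k][i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            k, i, j = k - 1, i - 1, j - 1
    pairs.reverse()
    return math.sqrt(best[ell][n][m] / ell), pairs
```

The table has (ℓ+1)(n+1)(m+1) entries, so this restricted case runs in O(ℓnm) time; the general MAD problem additionally has to find the superposition.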
Employing an algorithm for the LCP problem to solve the MAD problem
Input: sequences P = (p_{1},…,p_{n}), Q = (q_{1},…,q_{m}) and ℓ ∈ I. Without loss of generality assume m ≥ n.
Output: (i) subsequences P^{′} ⊆ P, Q^{′} ⊆ Q, |P^{′}| = |Q^{′}|, and
(ii) mapping f: P^{′} ↦ Q^{′}, fulfilling the following conditions:
(A) |P^{′}| = ℓ,
(B) d = RMSD(P^{′}, f(P^{′})) is minimized.
1. l ← 0, u ← ℓc_{max}
2. m ← 1/2(l + u)
3. Call LCP to solve the instance (P, Q, m).
4. If the LCP solution has size no less than ℓ
       u ← m
   else
       l ← m
5. If u−l≤12cmax2−12cmax2−12ℓ,
       output the most recent LCP solution of size no less than ℓ.
   Otherwise, repeat Steps 2–5.
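The reduction is a binary search on the RMSD threshold, with the LCP algorithm used as an oracle. A minimal sketch follows, under stated assumptions: the stopping gap of Step 5 is passed in as a parameter `eps` rather than computed, the upper bound is passed in as `upper`, and `toy_lcp` is a hypothetical brute-force stand-in oracle (fixed superposition, order-preserving alignment) included only so the example runs. The midpoint is called `mid` to avoid a clash with m = |Q|.

```python
import math
from itertools import combinations

def toy_lcp(P, Q, theta):
    """Hypothetical stand-in oracle: the largest order-preserving alignment
    with RMSD <= theta, points compared as given. Exponential time."""
    for size in range(min(len(P), len(Q)), 0, -1):
        for I in combinations(range(len(P)), size):
            for J in combinations(range(len(Q)), size):
                sq = sum(math.dist(P[i], Q[j]) ** 2 for i, j in zip(I, J))
                if math.sqrt(sq / size) <= theta:
                    return list(zip(I, J))
    return []

def mad_via_lcp(P, Q, ell, lcp_solver, upper, eps):
    """Binary search over the threshold, following Steps 1-5.

    upper: initial upper bound u on the RMSD (assumed supplied by caller).
    eps:   stopping gap for u - l (assumed supplied by caller).
    """
    lo, hi = 0.0, upper                    # Step 1
    best = None
    while True:
        mid = 0.5 * (lo + hi)              # Step 2
        sol = lcp_solver(P, Q, mid)        # Step 3
        if len(sol) >= ell:                # Step 4
            hi, best = mid, sol
        else:
            lo = mid
        if hi - lo <= eps:                 # Step 5
            if best is None:
                raise ValueError("no alignment of size ell within the bound")
            return best
```

Each oracle call halves the interval, so the number of calls is about log2(upper / eps).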
A polynomial time algorithm for the MAD problem
Input: sequences P = (p_{1},…,p_{n}), Q = (q_{1},…,q_{m}) and ℓ ∈ I. Without loss of generality assume m ≥ n.
Output: (i) subsets P^{′} ⊆ P, Q^{′} ⊆ Q, |P^{′}| = |Q^{′}|, and
(ii) mapping f: P^{′} ↦ Q^{′}, fulfilling the following conditions:
(A) |P^{′}| = ℓ,
(B) d = RMSD(P^{′}, f(P^{′})) is minimized.
1. For each translation t ∈ {I/ℓ | −ℓc_{max} ≤ I ≤ ℓc_{max}}^{3},
       For each 3 × 3 matrix M, where ∀e ∈ M, e ∈ {I/ℓ^{2} | −4ℓ^{3}c_{max}^{2} ≤ I ≤ 4ℓ^{3}c_{max}^{2}},
           Compute rotation matrix R from M.
           Q ← RQ − t.
           Apply an algorithm for the case where the superposition is known to P and Q (as discussed in the ‘Complexity Of The LCP And MAD When The Optimal Superposition Is Known’ section), and denote the solution MAD(P, Q).
2. Output the MAD(P, Q) of the smallest RMSD as the solution.
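The enumeration in Step 1 is polynomial in ℓ and c_max but very large. The sketch below counts the candidate superpositions, reading the two index sets as t ∈ {I/ℓ : −ℓc_max ≤ I ≤ ℓc_max}^3 and e ∈ {I/ℓ^2 : −4ℓ^3 c_max^2 ≤ I ≤ 4ℓ^3 c_max^2} for each of the nine matrix entries. That reading is an assumption, since the notation is flattened in the source, and `candidate_count` is a hypothetical helper.

```python
def candidate_count(ell, c_max):
    """Number of (translation, matrix) candidates enumerated in Step 1.

    Assumed ranges: each translation coordinate takes 2*ell*c_max + 1
    values, each of the nine matrix entries 8*ell**3*c_max**2 + 1 values.
    """
    per_coordinate = 2 * ell * c_max + 1
    per_entry = 8 * ell ** 3 * c_max ** 2 + 1
    return per_coordinate ** 3 * per_entry ** 9

# Even the smallest instance is large: 3**3 * 9**9 candidates.
print(candidate_count(1, 1))
```

Under this reading the count grows roughly as ℓ^30 · c_max^21, which is consistent with the remark that the time complexity remains high.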
Article Categories:
Keywords: Protein Structure, Alignment, RMSD, LCP. 