Document Detail

Nearest-neighbor classifier as a tool for classification of protein families.
MedLine Citation:
PMID:  20975888     Owner:  NLM     Status:  PubMed-not-MEDLINE    
Abstract/OtherAbstract:
Knowledge of protein function is essential for understanding biological processes. The member sequences of a specific class or family of proteins share common structural and chemical properties. The set of properties that distinguishes a family clearly enough to assign a protein sequence to it needs to be studied. Our study of four major classes of proteins, namely Globins, Homeoboxes, Heat Shock Proteins (HSP) and Kinase, shows that the following properties play a significant role in correctly classifying a protein into its corresponding class or family: the frequency of the twenty naturally occurring amino acids, the hydrophobic content of the protein, its molecular weight, its isoelectric point, the secondary structure composition of amino acid residues as helices, coils and sheets, and the composition of helices, coils and sheets in the secondary structure topology. This is indicated by the overall efficiency of the Nearest Neighbor Classifier, 84.92%.
Authors:
Mona Chaurasiya; Gohel Bakul Chandulah; Krishna Misra; Vivek Kumar Chaurasiya
Publication Detail:
Type:  Journal Article     Date:  2010-03-31
Journal Detail:
Title:  Bioinformation     Volume:  4     ISSN:  0973-2063     ISO Abbreviation:  Bioinformation     Publication Date:  2010  
Date Detail:
Created Date:  2010-10-26     Completed Date:  2011-07-14     Revised Date:  2013-05-29    
Medline Journal Info:
Nlm Unique ID:  101258255     Medline TA:  Bioinformation     Country:  Singapore    
Other Details:
Languages:  eng     Pagination:  396-8     Citation Subset:  -    
Affiliation:
Indian Institute of Information Technology, Allahabad, India.

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): Bioinformation
Journal ID (publisher-id): Bioinformation
ISSN: 0973-2063
Publisher: Biomedical Informatics Publishing Group
Article Information
© 2010 Biomedical Informatics Publishing Group
open-access:
Received Day: 07 Month: 12 Year: 2009
Revision Received Day: 30 Month: 1 Year: 2010
Accepted Day: 13 Month: 11 Year: 2010
collection publication date: Year: 2010
Electronic publication date: Day: 31 Month: 3 Year: 2010
Volume: 4 Issue: 9
First Page: 396 Last Page: 398
PubMed Id: 20975888
Publisher Id: 008300042010

Nearest-neighbor classifier as a tool for classification of protein families
Mona Chaurasiya1
Gohel Bakul Chandulah1
Krishna Misra1*
Vivek Kumar Chaurasiya2
1Indian Institute of Information Technology, Allahabad, India
2Indian Institute of Technology, Roorkee, India
*Krishna Misra: kkmisra@yahoo.com

Background

There is a vast gap between the amount of available sequence information and the functional characterization of proteins. Hence, fast computational methods are required for the correct characterization of protein function [1,2]. Present protein classification systems are based on either sequence-sequence similarity or sequence-structure similarity [3,4]. These two methods play a critical role in predicting a possible function for a new sequence, but they do not work well when clear sequence or structural similarities do not exist, as in the case of distantly related proteins. Moreover, not all homologous proteins have analogous functions [5]. Proteins that share domains need not perform the same function; for example, proteins containing SH2 or WD40 domains are known to have different functions [6]. Proteins of a specific functional class share common structural and chemical features essential for performing similar functions [7]. It is therefore of interest to consider protein functional family classification as a means of facilitating protein function prediction. It is expected to be particularly useful in the cases described above and may thus serve as a function prediction tool that complements sequence alignment methods.

It has been reported that the physical and chemical properties of a protein's primary sequence play an important role in determining its function [8]. We explored new physicochemical properties, together with secondary structure information, for the correct characterization of a protein's family. Instead of comparing sequences directly, a Nearest Neighbor classifier [9] was applied to the physicochemical properties derived from the protein primary sequence. Samples of proteins known to belong to a functional class are used to train the Nearest Neighbor system to recognize specific features and to classify protein sequences. Such an approach may be applied to closely related proteins as well as to distantly related proteins.


Methodology

The physical parameters of a protein are very important in assigning an unknown protein to a specific class. The most important physical parameters are the hydrophobic and polar residues. Research on transmembrane proteins has found that discriminatory features appear in the intermediate stretches where a patch of hydrophobic residues is followed by neutral amino acids, and likewise where a string of polar residues is followed by neutral amino acids [4]. Thus a protein family can be classified on the basis of physicochemical and structural properties, using parameters derived from the primary sequence.

Our dataset comprised protein sequences chosen at random from the Swissprot database [10]. Every protein sequence was represented by a feature vector assembled from encoded representations of tabulated residue properties: the composition of the twenty naturally occurring amino acids, the hydrophobic content of the protein, its molecular weight, its isoelectric point, the secondary structure composition of amino acid residues as helices, coils and sheets, and the composition of helices, coils and sheets in the secondary structure topology. Each of these features is defined as follows.

(1) Amino acid composition (aa)

$aa_i = a_i / N$, where $a_i$ is the frequency (count) of amino acid $i$ in the protein, $i$ ranges over the twenty different amino acids, $aa_i$ is the proportion of amino acid $i$ in the protein, and $N$ is the total number of residues in the protein sequence.
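As a minimal sketch of this definition, the composition can be computed by counting residues in a one-letter-code sequence. The function name, the residue ordering and the example sequence are illustrative assumptions, not taken from the paper.

```python
from collections import Counter

# The twenty standard amino acids in one-letter code (ordering is an arbitrary choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def amino_acid_composition(sequence):
    """Return the list [aa_i] with aa_i = a_i / N for the twenty standard residues."""
    sequence = sequence.upper()
    n = len(sequence)  # N: total number of residues
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(sequence)  # a_i: count of each residue
    return [counts[aa] / n for aa in AMINO_ACIDS]

# Hypothetical 10-residue peptide used only for illustration.
composition = amino_acid_composition("MKVLAAGIVK")
```

The result has twenty entries that sum to 1 when the sequence contains only standard residues.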

(2) Hydrophobic content (H)

$H = h / N$, where $h$ is the number of hydrophobic residues in the protein sequence and $N$ is the total number of residues in the protein sequence.
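The hydrophobic content can be sketched in the same way. The paper does not list which residues it counts as hydrophobic, so the set below (A, V, I, L, M, F, W, P) is an assumption made only for illustration; other conventions also include G, C or Y.

```python
# ASSUMPTION: this residue set is one common convention, not the paper's stated choice.
HYDROPHOBIC = set("AVILMFWP")

def hydrophobic_content(sequence):
    """Return H = h / N, the fraction of residues that are hydrophobic."""
    sequence = sequence.upper()
    n = len(sequence)  # N: total number of residues
    if n == 0:
        raise ValueError("empty sequence")
    h = sum(1 for residue in sequence if residue in HYDROPHOBIC)  # h: hydrophobic count
    return h / n

# Hypothetical peptide: 7 of its 10 residues are in the assumed hydrophobic set.
H = hydrophobic_content("MKVLAAGIVK")
```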

(3) Isoelectric point of protein (Ip)

The isoelectric point is the pH at which the protein has no net charge. The net charge of a protein was calculated as the number of positively charged residues (protonated lysine, arginine, histidine), minus the number of negatively charged residues (deprotonated tyrosine, cysteine, glutamate, aspartate), plus the number of protonated amino termini, minus the number of deprotonated carboxyl termini. The net charge calculation does not take into account any electrostatic interactions within the protein that may perturb ionization. For each amino acid of interest, the number of protonated residues is determined by the following equation:

$N_p = N_t \dfrac{[H^+]}{[H^+] + K_N}$,

where $N_p$ is the number of protonated residues, $N_t$ is the total number of residues of a specific amino acid, $[H^+]$ is the hydrogen ion concentration, and $K_N$ is the dissociation constant for the amino acid of interest, $K_N = 10^{-pK_N}$.
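The charge model above can be sketched as follows, with the isoelectric point found by bisection on pH. The pK values are illustrative assumptions (the paper does not list the ones it used), and the bisection search is one possible way to solve for zero net charge, not necessarily the authors' procedure.

```python
# ASSUMPTION: illustrative pK values; the paper does not state which set it used.
PK_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}             # charged +1 when protonated
PK_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}    # charged -1 when deprotonated
PK_N_TERMINUS = 8.6
PK_C_TERMINUS = 3.6

def protonated(count, pk, ph):
    """N_p = N_t * [H+] / ([H+] + K_N), with K_N = 10**(-pK_N)."""
    h = 10.0 ** (-ph)
    k = 10.0 ** (-pk)
    return count * h / (h + k)

def net_charge(sequence, ph):
    """Net charge of a single chain at the given pH, ignoring internal interactions."""
    seq = sequence.upper()
    charge = protonated(1, PK_N_TERMINUS, ph)           # protonated amino terminus
    charge -= 1 - protonated(1, PK_C_TERMINUS, ph)      # deprotonated carboxyl terminus
    for aa, pk in PK_POSITIVE.items():
        charge += protonated(seq.count(aa), pk, ph)
    for aa, pk in PK_NEGATIVE.items():
        n = seq.count(aa)
        charge -= n - protonated(n, pk, ph)             # deprotonated side chains
    return charge

def isoelectric_point(sequence, lo=0.0, hi=14.0, tol=1e-4):
    """Bisection on pH; the net charge decreases monotonically as pH rises."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

Ip = isoelectric_point("MKVLAAGIVK")  # hypothetical peptide with two lysines
```

Because every term in the net charge falls as pH rises, the zero crossing is unique and bisection converges on it.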

(4) Secondary structure features (S and ss)

We predicted the secondary structure of each protein sequence using SSPro4.0 [11] and then calculated the following features.

Proportion of amino acids falling within a particular secondary structure: $S_j = a_j / N$, where $a_j$ is the number of amino acids that fall in secondary structure $j$, $j \in \{\text{helix}, \text{sheet}, \text{loop}\}$, and $N$ is the total number of residues in the protein sequence.

Secondary structure proportion composition: $ss_j = n_j / M$, where $n_j$ is the number of secondary structure elements of type $j$ in the topology, $j \in \{\text{helix}, \text{sheet}, \text{loop}\}$, and $M$ is the total number of secondary structure elements (helix, sheet, loop) in the sequence topology.
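Both secondary structure features can be sketched from a per-residue prediction string. The sketch assumes the prediction is available as a string over the letters H (helix), E (sheet) and C (loop), and that a secondary structure element is a maximal run of identical letters; both are assumptions about the input format, not details given in the paper.

```python
from itertools import groupby

STATES = ("H", "E", "C")  # helix, sheet, loop (assumed three-state alphabet)

def secondary_structure_features(ss_string):
    """Return (S, ss): per-residue proportions S_j and per-element proportions ss_j."""
    n = len(ss_string)  # N: total number of residues
    if n == 0:
        raise ValueError("empty prediction string")
    # Collapse consecutive identical states into elements, e.g. "CCHHHC" -> C, H, C.
    elements = [state for state, _ in groupby(ss_string)]
    m = len(elements)  # M: total number of secondary structure elements
    S = [ss_string.count(j) / n for j in STATES]
    ss = [elements.count(j) / m for j in STATES]
    return S, ss

# Hypothetical 12-residue prediction: elements are C, H, C, E, C.
S, ss = secondary_structure_features("CCHHHHCCEEEC")
```

For this example string, 4 of 12 residues are helical while 1 of 5 elements is a helix, which shows why the two features carry different information.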

(5) Molecular weight of protein (mw)

$mw = \sum_{i=1}^{N} w_i$, where $w_i$ is the molecular weight of the amino acid at position $i$ and $N$ is the total number of residues in the protein sequence.

Combining all the features described above, the final feature vector is $F = [aa, H, Ip, S, ss, mw]$, which becomes the input for the Nearest Neighbor classifier.
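A minimal sketch of the molecular weight sum and of assembling the feature vector follows. The weight table uses average molecular weights of the free amino acids, which is an assumption: the paper does not say which weights it used, and a plain sum of free amino acid weights does not subtract the water lost in peptide bond formation. The function names and placeholder values are hypothetical.

```python
# ASSUMPTION: average molecular weights (Da) of the free amino acids.
AMINO_ACID_WEIGHT = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.16,
    "E": 147.13, "Q": 146.15, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.23, "Y": 181.19, "V": 117.15,
}

def molecular_weight(sequence):
    """mw = sum of w_i over all N residues, as in the formula above."""
    return sum(AMINO_ACID_WEIGHT[residue] for residue in sequence.upper())

def feature_vector(aa, H, Ip, S, ss, mw):
    """Concatenate F = [aa, H, Ip, S, ss, mw] into one flat list of 29 numbers."""
    return list(aa) + [H, Ip] + list(S) + list(ss) + [mw]

# Placeholder feature values, for illustration only.
F = feature_vector(
    aa=[0.05] * 20,            # twenty amino acid proportions
    H=0.40,                    # hydrophobic content
    Ip=6.5,                    # isoelectric point
    S=[0.30, 0.20, 0.50],      # residue proportions in helix, sheet, loop
    ss=[0.25, 0.25, 0.50],     # element proportions for helix, sheet, loop
    mw=molecular_weight("MKVLAAGIVK"),
)
```

Under these definitions the vector has 20 + 1 + 1 + 3 + 3 + 1 = 29 components.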

Classification of proteins

Proteins were assigned to classes on the basis of the physicochemical feature vector F using the nearest neighbor (NN) classifier. NN is popular in the pattern recognition community mainly because of its simplicity and good performance. Consider the problem of classifying an unknown object into one of P classes given a training data set $X = \{(x_1, c_1), \ldots, (x_N, c_N)\}$, where $x_i$ is the feature vector of the $i$-th object, $c_i$ is its class label (one of the P classes), and N is the number of objects. According to the NN rule, a new unclassified object $x$ is assigned the class $c_i$ of the training vector $x_i$ nearest to $x$. In our case there are four classes, $c \in \{\text{globin}, \text{homeobox}, \text{heat shock protein}, \text{kinase}\}$, and N is 2000. We used Euclidean distance as the measure of similarity.
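The NN rule with Euclidean distance can be sketched in a few lines. The two-dimensional training vectors below are hypothetical toy data, not the paper's 2000 training sequences, and the function name is an assumption.

```python
import math

def nearest_neighbor_classify(x, training_set):
    """Assign x the class label of the closest training vector (1-NN, Euclidean)."""
    # training_set is a list of (feature_vector, class_label) pairs.
    _, label = min(training_set, key=lambda pair: math.dist(x, pair[0]))
    return label

# Hypothetical toy training data with two features per object.
training = [
    ([0.10, 0.20], "globin"),
    ([0.20, 0.10], "globin"),
    ([0.80, 0.90], "kinase"),
    ([0.90, 0.70], "homeobox"),
]

predicted = nearest_neighbor_classify([0.85, 0.85], training)
```

One design point worth noting: the features in F live on very different scales (proportions between 0 and 1, an isoelectric point up to 14, a molecular weight in the thousands), so an unscaled Euclidean distance would be dominated by the largest feature. The text does not say whether the features were normalized, so any scaling step would be an additional assumption.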


Results and Discussion

The study was conducted on four major classes of proteins, namely Globins, Homeoboxes, Heat Shock Proteins and Kinase. Sequences of all four protein classes were downloaded from the Swissprot database. The proposed physicochemical features were then evaluated using the Nearest Neighbor classification technique. The classifier was trained on a set of five hundred sequences from each protein family and tested on a set of three hundred sequences from each protein family, chosen randomly, with the following results.

The results show that the properties used play a significant role in correctly classifying a protein into its corresponding class or family: the frequency of the twenty naturally occurring amino acids, the hydrophobic content of the protein, its molecular weight, its isoelectric point, the secondary structure composition of amino acid residues as helices, coils and sheets, and the composition of helices, coils and sheets in the secondary structure topology. This is indicated by the overall efficiency of the Nearest Neighbor classifier, 84.92% (Table 1 in supplementary material). On excluding the class Kinase, the overall efficiency is 97.67% (Table 2 in supplementary material), suggesting that the protein family "Kinase" is more widely distributed than the Globin and Homeobox protein families. Our further analysis of the individual protein families showed that Kinase and Heat Shock Protein are closely related with respect to the parameters defined above.
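As a rough consistency check on the reported percentages, one can ask which whole numbers of correct predictions would round to them. The sketch assumes that overall efficiency means correct predictions divided by test set size, that the four-class test set has 4 × 300 = 1200 sequences, and that the test set without Kinase has 3 × 300 = 900; the per-class figures themselves are in the supplementary tables and are not reproduced here.

```python
def matching_correct_counts(reported_percent, n_test):
    """Integer counts k for which 100 * k / n_test rounds to the reported percentage."""
    return [
        k for k in range(n_test + 1)
        if abs(100.0 * k / n_test - reported_percent) < 0.005
    ]

# ASSUMPTION: test set sizes inferred from "three hundred sequences of each family".
all_four = matching_correct_counts(84.92, 4 * 300)
without_kinase = matching_correct_counts(97.67, 3 * 300)
print(all_four, without_kinase)
```

Under these assumptions the only matching counts are 1019 of 1200 and 879 of 900, so the reported figures are at least arithmetically consistent with the stated test set sizes.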


Conclusion

The study suggests that the feature set described above has the potential to classify proteins into protein families. Adding one or more further properties should help discriminate protein families that are widely distributed.


Supplementary material Data 1

Notes

Citation: Chaurasiya et al., Bioinformation 4(9): 396-398 (2010)

References
1. Pellegrini M, et al. Curr Opin Chem Biol. 2001; 5: 46-50. PMID: 11166647
2. Teichmann SA, Mitchison G. Nat Biotechnol. 2000; 18: 27. PMID: 10625385
3. Bork P, Koonin EV. Nature Genet. 1998; 18: 313-318. PMID: 9537411
4. Teichmann SA, et al. Curr Opin Struct Biol. 2001; 11: 354-363. PMID: 11406387
5. Benner SA, et al. Res Microbiol. 2000; 151: 97-106. PMID: 10865954
6. Marcotte EM, et al. Science. 1999; 285: 751-753. PMID: 10427000
7. Ladunga I, et al. Bioinformatics. 1999; 15: 1028-1038. PMID: 10745993
8. Cai CZ, Han LY, Ji ZL, Chen X, et al. Nucleic Acids Research. 2003; 31: 3692-3697. PMID: 12824396
9. Keck HP, Wetter T, et al. In Silico Biology. 2003; 3: 0023. PMID: 12954089
10. http://www.uniprot.org/downloads
11. Cheng J, Randall A, et al. Nucleic Acids Research. 2005; 33: W72-W76. PMID: 15980571

Article Categories:
  • Hypothesis

Keywords: proteins, family, classification, classifier.
