Document Detail

SOSUI-GramN: high performance prediction for sub-cellular localization of proteins in gram-negative bacteria.
Jump to Full Text
MedLine Citation:
PMID:  18795116     Owner:  NLM     Status:  PubMed-not-MEDLINE    
A predictive software system, SOSUI-GramN, was developed for assessing the subcellular localization of proteins in Gram-negative bacteria. The system does not require the sequence homology data of any known sequences; instead, it uses only physicochemical parameters of the N- and C-terminal signal sequences, and the total sequence. The precision of the prediction system for subcellular localization to extracellular, outer membrane, periplasm, inner membrane and cytoplasmic medium was 92.3%, 89.4%, 86.4%, 97.5% and 93.5%, respectively, with corresponding recall rates of 70.3%, 87.5%, 76.0%, 97.5% and 88.4%, respectively. The overall performance for precision and recall obtained using this method was 92.9% and 86.7%, respectively. The comparison of performance of SOSUI-GramN with that of other methods showed the performance of prediction for extracellular proteins, as well as inner and outer membrane proteins, was either superior or equivalent to that obtained with other systems. SOSUI-GramN particularly improved the accuracy for predictions of extracellular proteins which is an area of weakness common to the other methods.
Kenichiro Imai; Naoyuki Asakawa; Toshiyuki Tsuji; Fumitsugu Akazawa; Ayano Ino; Masashi Sonoyama; Shigeki Mitaku
Related Documents :
16408016 - Assignment of protein function in the postgenomic era.
14734306 - Computational identification of novel chitinase-like proteins in the drosophila melanog...
15099056 - A combinatorial selective labeling method for the assignment of backbone amide nmr reso...
17473316 - Subcellular localization prediction with new protein encoding schemes.
23056356 - Qsdh, a novel ahl lactonase in the rnd-type inner membrane of marine pseudoalteromonas ...
18096186 - Function of the c-terminal domain of the dead-box protein mss116p analyzed in vivo and ...
Publication Detail:
Type:  Journal Article     Date:  2008-07-14
Journal Detail:
Title:  Bioinformation     Volume:  2     ISSN:  0973-2063     ISO Abbreviation:  Bioinformation     Publication Date:  2008  
Date Detail:
Created Date:  2008-09-16     Completed Date:  2010-06-09     Revised Date:  2013-05-23    
Medline Journal Info:
Nlm Unique ID:  101258255     Medline TA:  Bioinformation     Country:  Singapore    
Other Details:
Languages:  eng     Pagination:  417-21     Citation Subset:  -    
Department of Applied Physics, Graduate School of Engineering, Nagoya University, Furocho, Chikusa-ku, Nagoya 464-8606, Japan.
Export Citation:
APA/MLA Format     Download EndNote     Download BibTex
MeSH Terms

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): Bioinformation
Journal ID (publisher-id): Bioinformation
ISSN: 0973-2063
Publisher: Biomedical Informatics Publishing Group
Article Information
Download PDF
? 2008 Biomedical Informatics Publishing Group
open-access: This is an open-access article, which permits unrestricted use, distribution, and reproduction in any medium, for non-commercial purposes, provided the original author and source are credited.
Received Day: 04 Month: 7 Year: 2008
Accepted Day: 06 Month: 7 Year: 2008
collection publication date: Year: 2008
Electronic publication date: Day: 14 Month: 7 Year: 2008
Volume: 2 Issue: 9
First Page: 417 Last Page: 421
ID: 2533062
PubMed Id: 18795116
Publisher Id: 009000022008

SOSUI-GramN: high performance prediction for sub-cellular localization of proteins in Gram-negative bacteria
Kenichiro Imai123*
Naoyuki Asakawa1
Toshiyuki Tsuji1
Fumitsugu Akazawa1
Ayano Ino1
Masashi Sonoyama1
Shigeki Mitaku1
1Department of Applied Physics, Graduate School of Engineering, Nagoya University, Furocho, Chikusa-ku, Nagoya 464-8606, Japan
2Toyota Physical and Chemical Research Institute, Nagakute-cho, Aichi 480-1192, Japan
3Computational Biology Research Center, AIST, Tokyo 135-0064, Japan
*Corresponding author: E-mail:


Subcellular localization is one of the most important characteristics of proteins and is central to understanding their function and the constitution of biological systems. In bacteria, information related to the subcellular location of pathogen proteins can facilitate the development of drugs and vaccines for treatment. We previously developed a system called SOSUI for predicting inner membrane proteins from amino acid sequences [1,2]. Because of its high accuracy, this system can be used to improve the performance of the prediction of subcellular localization. An advantage of SOSUI is that all of the parameters used for the prediction [1,2] are the physicochemical properties of amino acids. By adopting a similar approach, we developed in this work a novel method for predicting subcellular localization in Gram-negative bacteria for proteins in the extracellular, the outer membrane, the periplasm, and the medium cytoplasm, which are less hydrophobic than the proteins in the inner membrane. The physicochemical properties of the N- and C-terminal signal regions, as well as a whole sequences were also used for predicting protein localization. By combining this method with the SOSUI membrane prediction system, we constructed a unified system, referred to as SOSUI-GramN, for the complete characterization of subcellular protein localization in the five compartments of Gram-negative bacteria.

Several systems for predicting the subcellular localization of proteins in bacterial cells have already been developed: PSORTb, CELLO, PSLpred and P-CLASSIFIER [3-6]. The common weak point of those systems is precision and recall for the prediction of proteins localized in the extracellular medium are low [7]. The reasons for the low predictive performance associated with extracellular proteins may be due to the complex secretory pathways found in Gram-negative bacteria; a minimum of seven mechanisms have been reported to date [8,9]. In this work, we have assumed that the pathways through the cytoplasmic membrane can be categorized into two main groups: Sec dependent pathways and Sec independent pathways. Therefore, we classified proteins of the same localization into two main groups based on the physicochemical properties of their N-terminal signal regions prior to discrimination of the final localization of proteins. In addition, proteins with the same subcellular localization in each of the main groups were classified into several subclasses based on the physicochemical parameters of their N- and C-terminal signal sequences. Here we describe the development of a novel software system, SOSUI-GramN, which was achieved by combining the SOSUI system used for the prediction of inner membrane proteins with a new protein prediction system capable of discriminating among four different localizations of less hydrophobic proteins in Gram-negative bacteria, leading to an overall precision of 92.9%. Precision and recall obtained using for extracellular proteins SOSUI-GramN were markedly higher than the currently available systems, i.e. PSORTb, CELLO, PSLpred and P-CLASSIFIER etc.


The dataset used for developing the novel system was obtained from ePSORTdb [10] and Swiss-Prot (release 54.4) by omitting data with sequence identities exceeding 50%. The dataset contained 1795 proteins: 733 in the cytoplasm, 396 in the inner membrane, 248 in the periplasm, 240 in the outer membrane and 178 in the extracellular medium. The dataset was randomly partitioned into five sets, four of which were used for training and the remaining data were used to evaluate the system. Furthermore, an additional test set of 299 proteins was created in order to be used for the comparison of different prediction systems [7] because a number of proteins of our test data were included in training data of other Method. There is no sequence, which is identical to training and test data, in the data of 299, and the sequence identity between this calibration data, the training data, and the test data was less than at most 50%.

The prediction procedure of the SOSUI-GramN system consists of three layers of filters may reflects several pathways for the same subcellular localization site as shown in Figure 1. The first layer is the filter, consisting of SOSUI [1,2], required for distinguishing inner membrane proteins from the other classes of proteins. Some physiochemical parameter such as the distribution of hydrophobicty and amphiphilicity index [11,12] of amino acids around transmembrane regions and the size of protein, were combined in SOSUI to discriminate an inner membrane protein from other types of proteins. SOSUI achieved 98.0% accuracy of prediction for inner (or cytoplasmic) membrane protein of Prokayote. Because of its high accuracy, this system can distinctly differentiate inner membrane proteins from other proteins.

Using the physicochemical properties of N-terminal signal sequence region and whole sequence, the second filter then classifies all of the proteins that are not inner membrane proteins into two groups depending on whether they are involved in the Sec dependent or independent pathways. The proteins of the Sec dependent pathway generally have the signal sequence, consisting of the positively charged region, hydrophobic regions, and polar region in N-terminal and these are not found in the proteins involved in the Sec independent pathway. The physicochemical parameters, required for the second filter, are hydrophobicity index [11], the densities of positively and negatively charged residues, small polar and non-polar residues, aromatic residues, secondary structure-breaking residues such as proline and glycine, the helical periodicity of 3.6 residues, and the alternate periodicity of the two residues of the hydrophobicity index. We calculated the average of these parameters of the region of 60 residues around N-terminal sorting signal and total sequence, and then distinguished Sec dependent proteins from Sec independent proteins by the canonical discriminant analysis with using the average of these parameters of each target region.

The fundamental process required for the discrimination is described below. (i) The target region was divided into ten segments, except for total sequence, and then an average of parameter for segment was calculated by equation (1) (see supplementary material). When the target region is total sequence, the region is defined as one segment. (ii) To obtain the normalized p-th parameter of the target region j, the average values is normalized by the difference between positive data and negative data by equation (2) (see supplementary material). Finally, we discriminate positive data from negative data by the canonical discriminant analysis with using normalized parameters.

The third filter consists of several prediction modules based on the subclassification of the extracellular proteins, the outer membrane proteins, the periplasmic proteins and the cytoplasmic proteins. Assuming that there are several pathways for the same subcellular localization site, we classified proteins into several subclasses by the canonical discriminant analysis based on the average of physicochemical parameters of the 60 residues of the most of N- and C-terminal, which may include sorting signals, prior to discrimination analysis. We then distinguished a subclass from the other subclasses by the same discriminant analysis with the average of physicochemical parameters of 60, 100 and 200 residues of N- and C-terminal residues, as well as the entire sequence. Consequently, the assembly of the prediction modules required for the third layer are relatively complicated, but basically, each prediction module correspond to the discrimination of each subclass from the other subclasses and the discrimination method is the same as the method is used in second filter. The parameters, for these subclassification and prediction modules, are also identical to that for the discrimination in the second filter. The number of prediction modules was five for the prediction of the extracellular proteins, six for the outer membrane proteins and three for the periplasmic proteins, one for the cytoplasmic proteins as shown in Figure 1.


When the performance of the SOSUI-GramN prediction system for characterizing subcellular localization to the compartments of the extracellular, the outer membrane, the periplasm, the inner membrane, and the cytoplasm medium was evaluated using test data, precision estimates of 92.3%, 89.4%, 86.4%, 97.5% and 93.5%, and corresponding recall values of 70.3%, 87.5%, 76.0%, 97.5% and 88.4% were obtained, respectively. The overall performance for precision and recall was 92.9% and 86.7% (Table 1a under supplementary material). In this study, precision was calculated as TP / (TP+FP) and recall as TP / (TP+FN), where TP, FP and FN represent the number of true positive, false positive and false negative data, respectively. Table 1b (supplementary material) shows the performance of our system relative to that of other systems using 299 proteins as the calibration data. The assessment of individual methods reveals that the highest overall precision and recall were achieved by Psortb and SOSUI-GramN, respectively. The prediction of the extracellular proteins was particularly low accuracy and common weakness of the other methods, however, our system archived the highest precision and recall, which are 90.6 and 50.0%, respectively. Our system SOSUI-GramN significantly improved the precision of extracellular proteins. The performance of the prediction of outer and inner membrane protein was superior or equivalent to that of the other method. The highest precision of outer membrane proteins was achieved by Psortb, however our method attained the highest recall at 92.1%. Psortb performed the highest precision and recall for inner membrane protein at 96.5 and 79.7%, respectively followed closely by our method at 96.2 and 72.5%. The other hand, the highest precision for periplasmic and cytoplasmic proteins are attained by Psortb, and CELLO, PSLpred and P-CLASSIFIER archived higher recall for cytoplasmic than our method. The accuracy of prediction for periplasmic and cytoplasmic protein is lower than the other methods, however, our method predicted extracellular protein, outer and inner membrane protein, which are important for facilitating the development of drugs for bacterial pathogen, with high accuracy, so SOSUI-GramN would be particularly useful for screening amino acid sequences in medical applications.


Gram-negative bacteria have five major subcellular localization sites, which are extracellular, outer membrane, periplasm, inner membrane and cytoplasm. Proteins synthesized in cytosol and then are targeted and transported to their localization sites through translocation pathways, however the translocation pathways to same localization site is not single pathway. The secretion pathways of extracellular proteins are at least seven pathways, so this complicated pathways result in the diversity of target signals, moreover the low prediction accuracy of extracelluar proteins. In this study, we developed the novel method for prediction the subcellular localization of proteins in Gram-negative bacteria based on the classification of protein on the assumption of the several pathways for the same localization site. The comparison of individual methods (Table 1b in supplementary material) reveals that the performance associated with the prediction of extracellular proteins, which is an area of weakness common to many of the other methods, was superior to that obtained with other systems. Most of sorting signals is not fully understood, but our classification based on physicochemical properties around the sorting signal may provide the clue to make clear features of sorting signals.

The SOSUI-GramN system is a web-based application that users can use after inputting their single or multiple sequences on the submission. Note that a minimum input sequence length of more than 60 amino acids is required. The predicted subcellular location of the proteins of interest in Gram-negative bacterial cells is shown on the output page. In the event that the sequence length is less than 60 amino acids or if the confident prediction cannot be estimated, the SOSUI-GramN returns a value of ?unknown?. The population of ?unknown? in data tested was only about 7% which is smaller than the currently available systems. The SOSUI-GramN system is available at

This work was supported in part by a Grant-in-Aid from SENTAN, JST.

1. Hirokawa T,et al. Bioinformatics 1998;14:378. [pmid: 9632836]
2. Tsuji T,Mitaku S. CBIJ 2004;4:110.
3. Gardy JL,et al. Bioinformatics 2005;21:617. [pmid: 15501914]
4. Yu CS,et al. Proteins 2006;64:643. [pmid: 1675241]
5. Bhasin M,et al. Bioinformatics 2005;21:2522. [pmid: 15699023]
6. Wang J,et al. BMC Bioinformatics 2005;6:174. [pmid: 16011808]
7. Gardy JL,Brinkman FS. Nat Rev Microbiol 2006;4:741. [pmid: 16964270]
8. Gerlach RG,Hensel M. Int J Med Microbiol 2007;297:401. [pmid: 17482513]
9. Kostakioti M,et al. J Bacteriol 2005;187:4306. [pmid: 15968039]
10. Rey S,et al. Nucleic Acids Res 2005;33:D164. [pmid: 15608169]
11. Kyte J,Doolittle RF. J Mol Biol 1982;157:105. [pmid: 7108955]
12. Mitaku S,et al. Bioinformatics 2002;18:608. [pmid: 12016058]

Article Categories:
  • Hypothesis

Keywords: subcellular localization of proteins, Gram-negative bacteria, physicochemical parameters, amino acids.

Previous Document:  Cross chromosomal similarity for DNA sequence compression.
Next Document:  Nicotine enemas for active Crohn's colitis: an open pilot study.