Addressing statistical biases in nucleotide-derived protein databases for proteogenomic search strategies.
MedLine Citation:
PMID: 23025403 Owner: NLM Status: Publisher
Abstract/OtherAbstract:
Proteogenomics has the potential to advance genome annotation through high-quality peptide identifications derived from mass spectrometry experiments, which demonstrate that a given gene or isoform is expressed and translated at the protein level. This can advance our understanding of genome function by discovering novel genes and gene structures that have not yet been identified or validated. Because of the high-throughput shotgun nature of most proteomics experiments, it is essential to carefully control for false positives and prevent any potential misannotation. A number of statistical procedures to deal with this are in wide use in proteomics, calculating false discovery rate (FDR) values for groups of peptide spectrum matches (PSMs) and posterior error probability (PEP) values for individual PSMs. These methods control for multiple testing and exploit decoy databases to estimate statistical significance. Here, we show that database choice has a major effect on these confidence estimates, leading to significant differences in the number of PSMs reported. We note that standard target:decoy approaches using six-frame translations of nucleotide sequences, such as assembled transcriptome data, apparently underestimate the confidence assigned to the PSMs. This error stems from the inflated and unusual nature of the six-frame database, in which for every target sequence there exist five "incorrect" targets that are unlikely to code for protein. The attendant FDR and PEP estimates lead to fewer accepted PSMs at fixed thresholds, and we show that this effect is a product of the database and statistical modelling rather than of the search engine. A variety of approaches to limit database size and remove non-coding target sequences are examined and discussed in terms of the altered statistical estimates generated and PSMs reported. These results are of importance to groups carrying out proteogenomics who aim to maximise the validation and discovery of gene structure in sequenced genomes whilst still controlling for false positives.
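The target:decoy FDR estimate that the abstract refers to can be illustrated with a minimal Python sketch. The abstract does not state which estimator the authors used, so this assumes the common simple form (decoy hits divided by target hits above a score cut-off) and assumes that higher scores are better. The names `PSM`, `q_values` and `accepted_targets` are hypothetical and do not come from the paper or from any search engine.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    score: float    # assumption: higher score means a better match
    is_decoy: bool  # True if the match came from the decoy database


def q_values(psms):
    """Return (psm, q_value) pairs sorted from best to worst score.

    FDR at each rank is estimated as decoys / targets among all PSMs
    scoring at least as well (an assumed, simple estimator).
    """
    ranked = sorted(psms, key=lambda p: p.score, reverse=True)
    fdrs = []
    targets = decoys = 0
    for p in ranked:
        if p.is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: the smallest FDR at this rank or any lower-scoring rank,
    # which makes the values monotonic in the score.
    q = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return list(zip(ranked, q))


def accepted_targets(psms, threshold=0.01):
    """Target PSMs whose q-value is at or below the threshold."""
    return [p for p, q in q_values(psms) if not p.is_decoy and q <= threshold]


# Toy data, invented purely for illustration.
example = [
    PSM(10.0, False), PSM(9.0, False), PSM(8.0, False),
    PSM(7.0, True), PSM(6.0, False), PSM(5.0, True),
]
strict = accepted_targets(example, threshold=0.01)
relaxed = accepted_targets(example, threshold=0.25)
```

Under this estimator, every additional decoy hit that outscores a target raises the estimated FDR for that target, so a search space that yields more high-scoring decoy matches leaves fewer target PSMs below a fixed threshold. This is consistent with the direction of the effect described in the abstract, although the sketch does not reproduce the authors' analysis.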
Authors:
Paul Blakeley; Ian M Overton; Simon J Hubbard
Publication Detail:
Type: JOURNAL ARTICLE Date: 2012-10-2
Journal Detail:
Title: Journal of proteome research Volume: - ISSN: 1535-3907 ISO Abbreviation: J. Proteome Res. Publication Date: 2012 Oct
Date Detail:
Created Date: 2012-10-2 Completed Date: - Revised Date: -
Medline Journal Info:
Nlm Unique ID: 101128775 Medline TA: J Proteome Res Country: -
Other Details:
Languages: ENG Pagination: - Citation Subset: -
From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine