Document Detail

Speeding up chemical searches using the inverted index: the convergence of chemoinformatics and text search methods.
MedLine Citation:
PMID:  22462644     Owner:  NLM     Status:  MEDLINE    
In ligand-based screening, retrosynthesis, and other chemoinformatics applications, one often seeks to search large databases of molecules in order to retrieve molecules that are similar to a given query. With the expanding size of molecular databases, the efficiency and scalability of data structures and algorithms for chemical searches are becoming increasingly important. Remarkably, both the chemoinformatics and information retrieval communities have converged on similar solutions whereby molecules or documents are represented by binary vectors, or fingerprints, indexing their substructures such as labeled paths for molecules and n-grams for text, with the same Jaccard-Tanimoto similarity measure. As a result, similarity search methods from one field can be adapted to the other. Here we adapt recent, state-of-the-art, inverted index methods from information retrieval to speed up similarity searches in chemoinformatics. Our results show a several-fold speed-up over previous methods for both threshold searches and top-K searches. We also provide a mathematical analysis that allows one to predict the level of pruning achieved by the inverted index approach and validate the quality of these predictions through simulation experiments. All results can be replicated using data freely downloadable from .
Ramzi Nasr; Rares Vernica; Chen Li; Pierre Baldi
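The abstract's central idea, an inverted index over fingerprint features combined with the Jaccard-Tanimoto measure, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the set-of-integers fingerprint representation, and the toy data are all assumptions made for illustration. The sketch maps each feature to the molecules containing it, counts shared features only for molecules that appear in the query's posting lists, and skips molecules whose fingerprint size makes the threshold unreachable (since the Tanimoto similarity of sets A and B can never exceed min(|A|,|B|)/max(|A|,|B|)).

```python
from collections import defaultdict


def build_inverted_index(fingerprints):
    """Map each feature to the list of molecule ids whose fingerprint has it.

    `fingerprints` is assumed to be a dict: molecule id -> set of feature ids.
    """
    index = defaultdict(list)
    for mol_id, features in fingerprints.items():
        for feature in features:
            index[feature].append(mol_id)
    return index


def threshold_search(query, fingerprints, index, threshold):
    """Return (molecule id, similarity) pairs with Tanimoto >= threshold.

    Only molecules sharing at least one feature with the query are touched,
    so the threshold must be strictly positive.
    """
    if threshold <= 0:
        raise ValueError("threshold must be > 0")
    q_size = len(query)
    shared = defaultdict(int)  # molecule id -> size of intersection with query
    for feature in query:
        for mol_id in index.get(feature, ()):
            size = len(fingerprints[mol_id])
            # Size bound: Tanimoto >= t requires t*|A| <= |B| <= |A|/t.
            if threshold * q_size <= size <= q_size / threshold:
                shared[mol_id] += 1
    hits = []
    for mol_id, inter in shared.items():
        union = q_size + len(fingerprints[mol_id]) - inter
        similarity = inter / union
        if similarity >= threshold:
            hits.append((mol_id, similarity))
    # Most similar first; ties broken by molecule id for determinism.
    hits.sort(key=lambda hit: (-hit[1], hit[0]))
    return hits


# Hypothetical toy database; real fingerprints have many more features.
toy_db = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3},
    "m3": {7, 8},
    "m4": {1, 9, 10, 11, 12},
}
toy_index = build_inverted_index(toy_db)
toy_hits = threshold_search({1, 2, 3}, toy_db, toy_index, 0.7)
```

In this toy run, `m3` is never visited because it shares no feature with the query, and `m4` is discarded by the size bound before any similarity is computed. That kind of pruning is what makes an inverted index attractive for large databases, although the methods evaluated in the paper are more elaborate than this sketch.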
Publication Detail:
Type:  Journal Article; Research Support, N.I.H., Extramural; Research Support, Non-U.S. Gov't; Research Support, U.S. Gov't, Non-P.H.S.     Date:  2012-04-10
Journal Detail:
Title:  Journal of chemical information and modeling     Volume:  52     ISSN:  1549-960X     ISO Abbreviation:  J Chem Inf Model     Publication Date:  2012 Apr 
Date Detail:
Created Date:  2013-06-20     Completed Date:  2013-12-03     Revised Date:  2013-12-06    
Medline Journal Info:
Nlm Unique ID:  101230060     Medline TA:  J Chem Inf Model     Country:  United States    
Other Details:
Languages:  eng     Pagination:  891-900     Citation Subset:  IM    
MeSH Terms
Databases, Chemical*
High-Throughput Screening Assays
Informatics / methods,  statistics & numerical data*
Likelihood Functions
Molecular Structure
Small Molecule Libraries / chemistry*,  classification
User-Computer Interface*
Grant Support
5T15LM007743/LM/NLM NIH HHS; LM010235-01A1/LM/NLM NIH HHS; R01 LM010235-01A1/LM/NLM NIH HHS; R01 LM010235-02/LM/NLM NIH HHS; R01 LM010235-03/LM/NLM NIH HHS; T15 LM007443-08/LM/NLM NIH HHS; T15 LM007443-09/LM/NLM NIH HHS; T15 LM007443-10/LM/NLM NIH HHS
Reg. No./Substance:
0/Small Molecule Libraries

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine
