
Improving imbalanced scientific text classification using sampling strategies and dictionaries.
MedLine Citation:
PMID:  21926439     Owner:  NLM     Status:  In-Data-Review    
Many real-world applications suffer from imbalanced class distributions, in which one class is represented by very few cases compared with the others. Systems for the retrieval and classification of scientific documentation are among those affected. Sampling strategies such as Oversampling and Subsampling are popular ways of tackling class imbalance. In this work, we study their effects on three types of classifiers (kNN, SVM and Naive Bayes) applied to searches of the PubMed scientific database. A second aim of this paper is to study the use of dictionaries in the classification of biomedical texts. Experiments are conducted with three different dictionaries (BioCreative, NLPBA, and an ad hoc subset of the UniProt database named Protein) using the aforementioned classifiers and sampling strategies. The best results were obtained with the NLPBA and Protein dictionaries and the SVM classifier using the Subsampling balancing technique. These results were compared with those obtained by other authors using the TREC Genomics 2005 public corpus.
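To illustrate the two balancing strategies named in the abstract, here is a minimal sketch in Python. It assumes plain random resampling, which the abstract does not confirm as the authors' procedure. The function names (group_by_class, subsample, oversample) and the toy labels are hypothetical and chosen only for illustration.

```python
import random


def group_by_class(docs, labels):
    """Collect documents into a dict keyed by class label."""
    by_class = {}
    for doc, label in zip(docs, labels):
        by_class.setdefault(label, []).append(doc)
    return by_class


def subsample(docs, labels, seed=0):
    """Random subsampling: keep only as many documents per class
    as the smallest class has (assumed strategy, not the paper's code)."""
    rng = random.Random(seed)
    by_class = group_by_class(docs, labels)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Draw `target` documents without replacement from each class.
        balanced.extend((doc, label) for doc in rng.sample(items, target))
    rng.shuffle(balanced)
    return balanced


def oversample(docs, labels, seed=0):
    """Random oversampling: duplicate minority-class documents until
    every class is as large as the largest one (assumed strategy)."""
    rng = random.Random(seed)
    by_class = group_by_class(docs, labels)
    target = max(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        # Keep all originals, then add random duplicates to reach `target`.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced.extend((doc, label) for doc in items + extra)
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    # Toy corpus: 2 relevant abstracts against 8 irrelevant ones.
    docs = [f"abstract {i}" for i in range(10)]
    labels = ["relevant"] * 2 + ["irrelevant"] * 8
    print(len(subsample(docs, labels)), len(oversample(docs, labels)))
```

Subsampling discards majority-class documents, so the training set shrinks; oversampling keeps every document but repeats minority-class ones. Either balanced set could then be passed to a classifier such as the kNN, SVM or Naive Bayes models the abstract mentions.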
Lourdes Borrajo; Rubén Romero; Eva Lorenzo Iglesias; Carmen María Redondo Marey
Publication Detail:
Type:  Journal Article     Date:  2011-09-15
Journal Detail:
Title:  Journal of integrative bioinformatics     Volume:  8     ISSN:  1613-4516     ISO Abbreviation:  J Integr Bioinform     Publication Date:  2011  
Date Detail:
Created Date:  2011-09-19     Completed Date:  -     Revised Date:  -    
Medline Journal Info:
Nlm Unique ID:  101503361     Medline TA:  J Integr Bioinform     Country:  Germany    
Other Details:
Languages:  eng     Pagination:  176     Citation Subset:  IM    
Univ. of Vigo, Computer Science Dept., Campus As Lagoas s/n, 32004 Ourense, Spain.

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine
