Text Classification Performance: Is the Sample Size the Only Factor to be Considered?
MedLine Citation:
PMID:  23920967     Owner:  NLM     Status:  In-Data-Review    
Abstract/OtherAbstract:
The use of text mining and supervised machine learning algorithms on biomedical databases has become increasingly common. However, a question remains: How much data must be annotated to create a suitable training set for a machine learning classifier? In prior research with active learning in medical text classification, we found evidence that not only sample size but also intrinsic characteristics of the texts being analyzed, such as the size of the vocabulary and the length of a document, may influence the resulting classifier's performance. This study attempts to create a regression model that predicts performance from sample size and other text features. While the model needs to be trained on existing datasets, we believe that, once it is built, it is feasible to predict performance on new datasets without obtaining annotations for them.
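As a rough illustration of the idea in the abstract, the sketch below fits an ordinary least-squares regression that maps sample size, vocabulary size, and mean document length to a performance score. It is a minimal sketch under stated assumptions, not the authors' model: the abstract does not specify the regression form, the feature scaling, or the performance metric, and every number here (including `TRUE_W` and the `datasets` list) is synthetic and chosen only so the example is checkable.

```python
import math

def features(n_samples, vocab_size, mean_doc_len):
    """Hypothetical feature vector; the scaling choices are assumptions."""
    return (math.log10(n_samples), vocab_size / 1000.0, mean_doc_len / 100.0)

def predict(weights, row):
    """Linear prediction: intercept plus weighted sum of the features."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))

def fit_linear_regression(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic "existing datasets": (sample size, vocabulary size, mean doc length).
datasets = [(100, 2000, 50), (200, 5000, 120), (500, 3000, 80),
            (1000, 8000, 200), (2000, 4000, 60), (5000, 10000, 150)]

# Arbitrary coefficients used only to generate noise-free synthetic targets.
TRUE_W = [0.60, 0.08, -0.01, 0.005]

rows = [features(*d) for d in datasets]
targets = [predict(TRUE_W, r) for r in rows]

weights = fit_linear_regression(rows, targets)

# Estimate performance for a new, unannotated dataset from its text features.
estimate = predict(weights, features(300, 6000, 100))
```

In practice the targets would be measured classifier scores from annotated datasets rather than values generated from known coefficients, and the fit would not be exact.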
Authors:
Rosa L Figueroa; Qing Zeng-Treitler
Publication Detail:
Type:  Journal Article    
Journal Detail:
Title:  Studies in health technology and informatics     Volume:  192     ISSN:  0926-9630     ISO Abbreviation:  Stud Health Technol Inform     Publication Date:  2013  
Date Detail:
Created Date:  2013-08-07     Completed Date:  -     Revised Date:  -    
Medline Journal Info:
Nlm Unique ID:  9214582     Medline TA:  Stud Health Technol Inform     Country:  Netherlands    
Other Details:
Languages:  eng     Pagination:  1193     Citation Subset:  T    
Affiliation:
Departamento de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de Concepción, Chile.

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine