Document Detail

BioPig: A Hadoop-based Analytic Toolkit for Large-Scale Sequence Data.
MedLine Citation:
PMID:  24021384     Owner:  NLM     Status:  Publisher    
MOTIVATION: The recent revolution in sequencing technologies has led to exponential growth in sequence data. As a result, most current bioinformatics tools are becoming obsolete because they fail to scale with the data. To tackle this "data deluge", we introduce the BioPig sequence analysis toolkit as one solution that scales with both data and computation.
RESULTS: We built BioPig upon Apache's Hadoop MapReduce system and the Pig data flow language. Compared to traditional serial and MPI-based algorithms, BioPig has three major advantages: first, BioPig's programmability greatly reduces development time for parallel bioinformatics applications; second, testing BioPig with up to 500 Gb of sequences demonstrates that it scales automatically with the size of the data; and finally, BioPig can be ported without modification to many Hadoop infrastructures, as tested with the Magellan system at NERSC and the Amazon Elastic Compute Cloud. In summary, BioPig represents a novel programming framework with the potential to greatly accelerate data-intensive bioinformatics analysis.
AVAILABILITY: BioPig is released as open source software under the BSD license.
Authors: Henrik Nordberg; Karan Bhatia; Kai Wang; Zhong Wang
Publication Detail:
Type:  JOURNAL ARTICLE     Date:  2013-9-10
Journal Detail:
Title:  Bioinformatics (Oxford, England)     Volume:  -     ISSN:  1367-4811     ISO Abbreviation:  Bioinformatics     Publication Date:  2013 Sep 
Date Detail:
Created Date:  2013-9-11     Completed Date:  -     Revised Date:  -    
Medline Journal Info:
Nlm Unique ID:  9808944     Medline TA:  Bioinformatics     Country:  -    
Other Details:
Languages:  ENG     Pagination:  -     Citation Subset:  -    
Department of Energy, Joint Genome Institute, Walnut Creek, CA 94598, USA.

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine