
BioJava: an open-source framework for bioinformatics.
MedLine Citation:
PMID:  18689808     Owner:  NLM     Status:  MEDLINE    
SUMMARY: BioJava is a mature open-source project that provides a framework for processing of biological data. BioJava contains powerful analysis and statistical routines, tools for parsing common file formats and packages for manipulating sequences and 3D structures. It enables rapid bioinformatics application development in the Java programming language.
AVAILABILITY: BioJava is an open-source project distributed under the Lesser GPL (LGPL). BioJava can be downloaded from the BioJava website. BioJava requires Java 1.5 or higher. All queries should be directed to the BioJava mailing lists. Details are available on the BioJava website.
R C G Holland; T A Down; M Pocock; A Prlić; D Huen; K James; S Foisy; A Dräger; A Yates; M Heuer; M J Schreiber
Publication Detail:
Type:  Journal Article; Research Support, Non-U.S. Gov't     Date:  2008-08-08
Journal Detail:
Title:  Bioinformatics (Oxford, England)     Volume:  24     ISSN:  1367-4811     ISO Abbreviation:  Bioinformatics     Publication Date:  2008 Sep 
Date Detail:
Created Date:  2008-09-08     Completed Date:  2008-10-31     Revised Date:  2013-06-05    
Medline Journal Info:
Nlm Unique ID:  9808944     Medline TA:  Bioinformatics     Country:  England    
Other Details:
Languages:  eng     Pagination:  2096-7     Citation Subset:  IM    
European Bioinformatics Institute, EMBL-EBI, Genome Campus, Hinxton, Cambridgeshire, UK.
MeSH Terms
Computational Biology / methods*
Nucleic Acid Conformation
Programming Languages*
Protein Conformation
Sequence Analysis
Grant Support
077198//Wellcome Trust; //Wellcome Trust

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): Bioinformatics
Journal ID (publisher-id): bioinformatics
Journal ID (hwp): bioinfo
ISSN: 1367-4803
ISSN: 1460-2059
Publisher: Oxford University Press
Article Information
© 2008 The Author(s)
Received: 23 May 2008
Revision received: 30 June 2008
Accepted: 25 July 2008
Print publication date: 15 September 2008
Electronic publication date: 8 August 2008
pmc-release publication date: 8 August 2008
Volume: 24 Issue: 18
First Page: 2096 Last Page: 2097
ID: 2530884
PubMed Id: 18689808
DOI: 10.1093/bioinformatics/btn397
Publisher Id: btn397

BioJava: an open-source framework for bioinformatics
R. C. G. Holland1
T. A. Down2
M. Pocock3
A. Prlić4*
D. Huen5
K. James4
S. Foisy6
A. Dräger7
A. Yates1
M. Heuer8
M. J. Schreiber9
1European Bioinformatics Institute (EMBL-EBI), Genome Campus, Hinxton, Cambridgeshire CB10 1SD, 2Gurdon Institute and Department of Genetics, Cambridge CB2 1QN, 3University Newcastle Upon Tyne, Newcastle Upon Tyne, NE1 7RU, 4Wellcome Trust Sanger Institute, Genome Campus, Hinxton, Cambridgeshire CB10 1SA, 5Department of Genetics, University of Cambridge, Cambridge CB2 3EH, UK, 6Laboratory in Genetics and Genomic Medicine of Inflammation, Montreal Heart Institute, Montreal, Canada H1T 1C8, 7Eberhard Karls University Tübingen, Center for Bioinformatics (ZBIT), Tübingen, Germany, 8Harbinger Partners, Inc. St. Paul, MN, USA and 9Novartis Institute for Tropical Diseases, 10 Biopolis Road, Chromos #05-01, Singapore 138670
Correspondence: *To whom correspondence should be addressed.
Associate Editor: Anna Tramontano


BioJava was conceived in 1999 by Thomas Down and Matthew Pocock as an Application Programming Interface (API) to simplify bioinformatics software development using Java (Pocock, 2003; Pocock et al., 2000). It has since evolved into a fully featured framework with modules for performing many common bioinformatics tasks. The goal of BioJava is to facilitate code reuse and to provide standard implementations that are easy to link to external scripts and applications.

BioJava is an open-source project that is developed by volunteers and coordinated by the Open Bioinformatics Foundation (OBF). It is one of several Bio* toolkits (Mangalam, 2002). All code is distributed under the LGPL license and can be freely used and reused in any form.

BioJava is a mature project and has been employed in a number of real-world applications and over 50 published studies. A list of these can be found on the BioJava website. According to the project tracking web site Ohloh, the BioJava code-base represents an estimated 47 person-years' worth of effort.


BioJava contains a number of mature APIs. The 10 most frequently used are: (1) nucleotide and amino acid alphabets, (2) BLAST parser, (3) sequence I/O, (4) dynamic programming, (5) structure I/O and manipulation, (6) sequence manipulation, (7) genetic algorithms, (8) statistical distributions, (9) graphical user interfaces and (10) serialization to databases. Below follows a short discussion of some of these modules.

At the core of BioJava is a symbolic alphabet API which represents sequences as a list of references to singleton symbol objects that are derived from an alphabet. Lists of symbols are stored whenever possible in a compressed form of up to four symbols per byte of memory.
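The figure of four symbols per byte follows from the size of the DNA alphabet: four fundamental symbols need only two bits each. The following is a minimal sketch of that idea in plain Java. The class name PackedDna and its methods are hypothetical and are not part of the BioJava API, and the actual BioJava packing code may differ.

```java
/** Illustrative 2-bit packing of DNA symbols; NOT BioJava's actual implementation. */
final class PackedDna {
    private static final char[] SYMBOLS = {'a', 'c', 'g', 't'};
    private final byte[] data;   // four 2-bit codes per byte
    private final int length;    // number of symbols stored

    PackedDna(String seq) {
        length = seq.length();
        data = new byte[(length + 3) / 4];          // round up to whole bytes
        for (int i = 0; i < length; i++) {
            int code = codeOf(seq.charAt(i));
            data[i / 4] |= (byte) (code << ((i % 4) * 2));
        }
    }

    /** Maps an unambiguous DNA character to its 2-bit code. */
    private static int codeOf(char c) {
        switch (Character.toLowerCase(c)) {
            case 'a': return 0;
            case 'c': return 1;
            case 'g': return 2;
            case 't': return 3;
            default:
                throw new IllegalArgumentException("Not an unambiguous DNA symbol: " + c);
        }
    }

    /** Returns the symbol at a 0-based position by unpacking its two bits. */
    char symbolAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("Index: " + i);
        }
        int code = (data[i / 4] >> ((i % 4) * 2)) & 0x3;
        return SYMBOLS[code];
    }

    int length() { return length; }

    int bytesUsed() { return data.length; }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(symbolAt(i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PackedDna packed = new PackedDna("gattaca");
        System.out.println(packed + " stored in " + packed.bytesUsed() + " bytes");
    }
}
```

This sketch rejects ambiguous symbols, which is one reason a library can only use the compressed form "whenever possible".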

In addition to the fundamental symbols of a given alphabet (A, C, G and T in the case of DNA), all BioJava alphabets implicitly contain extra symbol objects representing all possible combinations of the fundamental symbols.
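One way to picture these implicit symbols is as the non-empty subsets of {A, C, G, T}, of which there are 15. The sketch below models each subset as a 4-bit mask and labels it with the standard IUPAC nucleotide code. The class AmbiguitySymbols and its methods are hypothetical illustrations, not BioJava classes, and BioJava itself represents these symbols as objects rather than bit masks.

```java
/** Illustrative bit-mask model of DNA ambiguity symbols; NOT part of the BioJava API. */
final class AmbiguitySymbols {
    static final int A = 1, C = 2, G = 4, T = 8;

    // IUPAC code for each subset of {A, C, G, T}; the string index is the bit mask.
    // Index 0 (the empty set) is a placeholder and is never returned.
    private static final String IUPAC = "?acmgrsvtwyhkdbn";

    /** Returns the bit mask of fundamental symbols that an IUPAC code stands for. */
    static int maskOf(char symbol) {
        int mask = IUPAC.indexOf(Character.toLowerCase(symbol));
        if (mask <= 0) {
            throw new IllegalArgumentException("Not an IUPAC DNA code: " + symbol);
        }
        return mask;
    }

    /** Returns the IUPAC code for a non-empty combination of fundamental symbols. */
    static char symbolFor(int mask) {
        if (mask < 1 || mask > 15) {
            throw new IllegalArgumentException("Mask out of range: " + mask);
        }
        return IUPAC.charAt(mask);
    }

    /** True if the two symbols share at least one fundamental symbol. */
    static boolean matches(char first, char second) {
        return (maskOf(first) & maskOf(second)) != 0;
    }

    public static void main(String[] args) {
        System.out.println("a or g is written as " + symbolFor(A | G));
        System.out.println("n matches g: " + matches('n', 'g'));
    }
}
```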

The symbol approach allows the construction of higher order alphabets and symbols that represent the multiplication of one or more alphabets. An example is the codon "alphabet" which is the cubed product of the DNA alphabet, each codon "symbol" comprising three DNA symbols. Such an alphabet allows construction of views over sequences without modifying the underlying sequence, which is useful for tasks such as translation. Other complex alphabets which can be described include conditional alphabets for the construction of conditional probability distributions, and heterogeneous alphabets such as the combination of the codon and protein alphabets for use with a DNA-protein aligning hidden Markov model (HMM). Other interesting applications of the alphabet API include chromosomes for genetic algorithms using, but not limited to, integer or binary symbol lists, and the representation of Phred quality scores (Ewing et al., 1998) as a multiplication of the DNA and integer alphabets.
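The codon case can be sketched in a few lines of plain Java: a view reads triplets from an unmodified underlying sequence, and each triplet has a rank in the 4 x 4 x 4 = 64-member product alphabet. CodonView, codonAt and rank are hypothetical names chosen for this illustration and do not correspond to BioJava classes or methods.

```java
/** Illustrative codon view over a DNA string; NOT part of the BioJava API. */
final class CodonView {
    private static final String DNA = "acgt";  // ordering of the base alphabet
    private final String sequence;             // underlying sequence, never modified or copied
    private final int frame;                   // reading-frame offset: 0, 1 or 2

    CodonView(String sequence, int frame) {
        if (frame < 0 || frame > 2) {
            throw new IllegalArgumentException("Frame must be 0, 1 or 2");
        }
        this.sequence = sequence.toLowerCase();
        this.frame = frame;
    }

    /** Number of complete codons visible in this frame. */
    int length() {
        return Math.max(0, (sequence.length() - frame) / 3);
    }

    /** Returns codon i (0-based) as a window onto the underlying sequence. */
    String codonAt(int i) {
        if (i < 0 || i >= length()) {
            throw new IndexOutOfBoundsException("Codon index: " + i);
        }
        int start = frame + 3 * i;
        return sequence.substring(start, start + 3);
    }

    /** Rank of a codon in the cubed DNA alphabet, from 0 ("aaa") to 63 ("ttt"). */
    static int rank(String codon) {
        if (codon.length() != 3) {
            throw new IllegalArgumentException("A codon has three symbols");
        }
        int rank = 0;
        for (int i = 0; i < 3; i++) {
            int digit = DNA.indexOf(Character.toLowerCase(codon.charAt(i)));
            if (digit < 0) {
                throw new IllegalArgumentException("Not a DNA symbol: " + codon.charAt(i));
            }
            rank = rank * 4 + digit;  // treat the codon as a base-4 number
        }
        return rank;
    }

    public static void main(String[] args) {
        CodonView view = new CodonView("atggcctaa", 0);
        for (int i = 0; i < view.length(); i++) {
            System.out.println(view.codonAt(i) + " has rank " + rank(view.codonAt(i)));
        }
    }
}
```

Changing the frame argument gives a different view of the same sequence without copying it, which is the property that makes such views convenient for translation.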

The typical user would most likely start out by using the sequence input/output API and the sequence/feature object model. These allow sequences to be loaded from a number of common file formats such as FASTA, GenBank and EMBL, optionally manipulated in memory, then saved again or converted into a different format. The simplicity of this process is demonstrated in Figure 1.
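To make the load-then-convert workflow concrete without reproducing library calls, the following plain-Java sketch converts a GenBank flat-file record to FASTA. It is deliberately not BioJava code: the class GenBankToFasta is hypothetical, it reads only the LOCUS, DEFINITION and ORIGIN parts of a record, and it ignores features and multi-line definitions, all of which a real parser would handle.

```java
/** Minimal GenBank-to-FASTA converter; an illustration only, NOT the BioJava API. */
final class GenBankToFasta {
    private static final int LINE_WIDTH = 60;  // residues per FASTA line

    /** Converts every record found in the GenBank text into FASTA format. */
    static String convert(String genbankText) {
        StringBuilder out = new StringBuilder();
        String name = null;
        String description = "";
        StringBuilder residues = null;  // non-null only inside an ORIGIN block

        for (String line : genbankText.split("\\r?\\n")) {
            if (line.startsWith("LOCUS")) {
                String[] fields = line.trim().split("\\s+");
                name = fields.length > 1 ? fields[1] : "unknown";
            } else if (line.startsWith("DEFINITION")) {
                description = line.substring("DEFINITION".length()).trim();
            } else if (line.startsWith("ORIGIN")) {
                residues = new StringBuilder();
            } else if (line.startsWith("//")) {
                // End of record: emit it and reset the state for the next one.
                if (name != null && residues != null) {
                    appendFasta(out, name, description, residues);
                }
                name = null;
                description = "";
                residues = null;
            } else if (residues != null) {
                // Sequence lines mix position numbers, spaces and residues.
                for (int i = 0; i < line.length(); i++) {
                    char c = line.charAt(i);
                    if (Character.isLetter(c)) {
                        residues.append(c);
                    }
                }
            }
        }
        return out.toString();
    }

    private static void appendFasta(StringBuilder out, String name,
                                    String description, StringBuilder residues) {
        out.append('>').append(name);
        if (!description.isEmpty()) {
            out.append(' ').append(description);
        }
        out.append('\n');
        for (int i = 0; i < residues.length(); i += LINE_WIDTH) {
            out.append(residues, i, Math.min(i + LINE_WIDTH, residues.length()));
            out.append('\n');
        }
    }

    public static void main(String[] args) {
        String record = "LOCUS       TEST1          12 bp    DNA     linear   UNK\n"
                + "DEFINITION  Example record.\n"
                + "ORIGIN\n"
                + "        1 acgtacgtac gt\n"
                + "//\n";
        System.out.print(convert(record));
    }
}
```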

Another useful API is the feature/annotation object model which associates sequences with located features and unlocated annotations. Features can be found either by keyword or by defining a location query from which all overlapping or contained features are returned, while annotations can be retrieved by keyword. The location model handles circular and stranded locations, split locations and multi-sequence locations allowing features to span complex sets of coordinates.
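The difference between overlapping and contained features can be illustrated with simple interval arithmetic. The sketch below assumes 1-based, inclusive coordinates on a linear, unstranded sequence, and its class and method names (FeatureQuery, Feature, overlapping, containedIn) are hypothetical. It does not attempt the circular, split or multi-sequence cases that the BioJava location model supports.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative location queries over simple features; NOT part of the BioJava API. */
final class FeatureQuery {

    /** A feature on a linear sequence with 1-based, inclusive coordinates. */
    static final class Feature {
        final String type;
        final int start;
        final int end;

        Feature(String type, int start, int end) {
            if (start > end) {
                throw new IllegalArgumentException("start must not exceed end");
            }
            this.type = type;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return type + ":" + start + ".." + end;
        }
    }

    /** Features that share at least one position with the query range. */
    static List<Feature> overlapping(List<Feature> features, int queryStart, int queryEnd) {
        List<Feature> hits = new ArrayList<Feature>();
        for (Feature f : features) {
            if (f.start <= queryEnd && f.end >= queryStart) {
                hits.add(f);
            }
        }
        return hits;
    }

    /** Features that lie entirely inside the query range. */
    static List<Feature> containedIn(List<Feature> features, int queryStart, int queryEnd) {
        List<Feature> hits = new ArrayList<Feature>();
        for (Feature f : features) {
            if (f.start >= queryStart && f.end <= queryEnd) {
                hits.add(f);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Feature> features = new ArrayList<Feature>();
        features.add(new Feature("gene", 100, 500));
        features.add(new Feature("exon", 150, 200));
        features.add(new Feature("exon", 450, 600));
        System.out.println("Overlapping 120..460: " + overlapping(features, 120, 460));
        System.out.println("Contained in 120..460: " + containedIn(features, 120, 460));
    }
}
```

Every contained feature is also an overlapping one, so the second query always returns a subset of the first.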

The protein structure API contains tools for parsing and manipulating PDB files (Berman et al., 2000). It contains utility methods to perform linear algebra calculations on atomic coordinates and can calculate 3D structure alignments. A simple interface to the 3D visualization library Jmol is included as well. An add-on allows the content of a PDB file to be serialized to a database using Hibernate.
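As a rough idea of the kind of coordinate calculation involved, the sketch below computes the distance between two atoms and the centroid of a set of atoms, with each atom given as an {x, y, z} array. The class AtomGeometry and its methods are hypothetical and only indicate the flavour of such utilities. They are not the BioJava structure API.

```java
/** Illustrative coordinate calculations on atoms; NOT part of the BioJava API. */
final class AtomGeometry {

    /** Euclidean distance between two atoms given as {x, y, z} arrays. */
    static double distance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        double dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Arithmetic mean position of a non-empty set of atoms. */
    static double[] centroid(double[][] atoms) {
        if (atoms.length == 0) {
            throw new IllegalArgumentException("At least one atom is required");
        }
        double[] sum = new double[3];
        for (double[] atom : atoms) {
            for (int k = 0; k < 3; k++) {
                sum[k] += atom[k];
            }
        }
        for (int k = 0; k < 3; k++) {
            sum[k] /= atoms.length;
        }
        return sum;
    }

    /** Returns a copy of the atoms shifted so that their centroid is at the origin. */
    static double[][] centre(double[][] atoms) {
        double[] c = centroid(atoms);
        double[][] shifted = new double[atoms.length][3];
        for (int i = 0; i < atoms.length; i++) {
            for (int k = 0; k < 3; k++) {
                shifted[i][k] = atoms[i][k] - c[k];
            }
        }
        return shifted;
    }

    public static void main(String[] args) {
        double[][] atoms = {{0.0, 0.0, 0.0}, {3.0, 4.0, 0.0}};
        System.out.println("Distance: " + distance(atoms[0], atoms[1]));
    }
}
```

Centring coordinates on the centroid is a common first step before superimposing two structures.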

Other APIs include those for working with chromatograms, sequence alignments, proteomics and ontologies. Parsers are provided for reading, amongst others, Blast reports (Altschul et al., 1997), ABI chromatograms and NCBI taxonomy definitions.

Recently, the BioJavaX module was added, which provides more detailed parsing of the common file formats and improved storage of sequence data in BioSQL databases. This allows BioJava to be incorporated into existing data processing pipelines that use alternative OBF toolkits such as BioPerl (Stajich et al., 2002).

The BioJava web site provides detailed manuals on how to use the different components. In particular, the "CookBook" section provides a quick introduction to solving many problems by demonstrating solutions with documented source code. There is also a section that demonstrates the performance of a few selected tasks via Java WebStart examples. To mention just one: the FASTA-formatted release 4 Drosophila genome sequence can be parsed in <20 s on a 1.80 GHz Core Duo processor.


BioJava aims to provide an API that is of use to anyone using Java to develop bioinformatics software, regardless of which specialization they may work in. Genomic features currently must be manipulated with reference to the underlying genomic sequence, which can make working with post-genomic datasets, such as microarray results, overly complex. Phylogenetics tools are already in development which will allow users to work with NEXUS tree files (Maddison et al., 1997).

Although the Blast parsing API is widely used, it does not support all of the existing blast-family output formats. We will continue the ongoing effort to add parsers for PSI-Blast and other currently unsupported formats.

Users are welcome to identify further areas of need and their suggestions will be incorporated into future developments.

BioJava is written entirely in the Java programming language, and will run on any platform for which a Java 1.5 run-time environment is available. Java 5 and 6 provide advanced language features, and we shall be taking advantage of these in the next major release, both to aid in maintenance of the library and to make it even easier for novice Java developers to make use of the BioJava APIs.


BioJava is one of the largest open-source APIs for bioinformatics software development. It is a mature project with a large user and support community. It offers a wide range of tools for common bioinformatics tasks. The BioJava homepage provides access to the source code and detailed documentation.


We want to thank everybody who made code or documentation contributions during the project's life. Each of these contributions is appreciated, though the total list of contributors is too long to be reproduced here. BioJava is not formally funded by any grants. Through the OBF we have received sponsorship from Sun Microsystems, Apple Computers and NESCent. The initial development of the phylogenetics module was undertaken as a Google Summer of Code 2007 project in collaboration with NESCent.

Funding: Funding for open access charge: Wellcome Trust.

Conflict of Interest: none declared.

Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389-3402. PMID: 9254694.
Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235-242. PMID: 10592235.
Pocock M. Computational analysis of genomes. PhD thesis. Cambridge, UK: University of Cambridge; 2003.
Pocock M, et al. BioJava: open source components for bioinformatics. ACM SIGBIO Newsl. 2000;20:10-12.
Ewing B, et al. Base-calling of automated sequencer traces using phred. Genome Res. 1998;8:175-185. PMID: 9521921.
Maddison DR, et al. NEXUS: an extensible file format for systematic information. Syst. Biol. 1997;46:590-621.
Mangalam H. The Bio* toolkits - a brief overview. Brief. Bioinform. 2002;3:296-302.
Stajich JE, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12:1611-1618. PMID: 12368254.


Fig. 1. Loading a GenBank file with BioJava and writing it out as FASTA. The example demonstrates the use of several convenience methods that hide the bulk of the implementation. If the developer desires a more flexible parser it is possible to make use of the interfaces hidden behind the convenience methods to expose a fully customizable, multi-component, event-based parsing model.

Article Categories:
  • Applications Note
    • Sequence Analysis
