OSAT: a tool for sample-to-batch allocations in genomics experiments.
MedLine Citation:

PMID: 23228338 Owner: NLM Status: MEDLINE 
Abstract/OtherAbstract:

BACKGROUND: Batch effect is one type of variability that is not of primary interest but is ubiquitous in sizable genomic experiments. To minimize the impact of batch effects, an ideal experiment design should ensure the even distribution of biological groups and confounding factors across batches. However, due to practical complications, the final collection of samples available in a genomics study may be unbalanced and incomplete, which, without appropriate attention to sample-to-batch allocation, could lead to drastic batch effects. Therefore, it is necessary to develop an effective and handy tool to assign collected samples across batches in an appropriate way in order to minimize the impact of batch effects. RESULTS: We describe OSAT (Optimal Sample Assignment Tool), a Bioconductor package designed for automated sample-to-batch allocation in genomics experiments. CONCLUSIONS: OSAT was developed to facilitate the allocation of collected samples to different batches in genomics studies. By optimizing the even distribution of samples in groups of biological interest across batches, it can reduce the confounding or correlation between batches and the biological variables of interest. It can also optimize the homogeneous distribution of confounding factors across batches. It can handle challenging instances where incomplete and unbalanced sample collections are involved, as well as ideally balanced designs.
Authors:

Li Yan; Changxing Ma; Dan Wang; Qiang Hu; Maochun Qin; Jeffrey M Conroy; Lara E Sucheston; Christine B Ambrosone; Candace S Johnson; Jianmin Wang; Song Liu 
Publication Detail:

Type: Comparative Study; Journal Article; Research Support, N.I.H., Extramural Date: 2012-12-10
Journal Detail:

Title: BMC Genomics Volume: 13 ISSN: 1471-2164 ISO Abbreviation: BMC Genomics Publication Date: 2012
Date Detail:

Created Date: 2013-01-21 Completed Date: 2013-06-11 Revised Date: 2013-07-11
Medline Journal Info:

Nlm Unique ID: 100965258 Medline TA: BMC Genomics Country: England 
Other Details:

Languages: eng Pagination: 689 Citation Subset: IM 
Affiliation:

Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA. Li.Yan@RoswellPark.org 
MeSH Terms  
Descriptor/Qualifier:

Algorithms* Data Collection / methods* Genomics / methods* Software* 
Grant Support  
ID/Acronym/Agency:

R01 CA095045/CA/NCI NIH HHS; R01CA095045/CA/NCI NIH HHS; R01CA133264/CA/NCI NIH HHS; R01HL102278/HL/NHLBI NIH HHS; R21CA162218/CA/NCI NIH HHS 
Full Text  
Journal Information Journal ID (nlm-ta): BMC Genomics Journal ID (iso-abbrev): BMC Genomics ISSN: 1471-2164 Publisher: BioMed Central
Article Information Copyright © 2012 Yan et al.; licensee BioMed Central Ltd. Open access. Received: 10 July 2012. Accepted: 4 December 2012. Collection publication date: 2012. Electronic publication date: 10 December 2012. Volume: 13. First Page: 689. Last Page: 689. PubMed Id: 23228338. ID: 3548766. Publisher Id: 1471-2164-13-689. DOI: 10.1186/1471-2164-13-689
OSAT: a tool for sample-to-batch allocations in genomics experiments
Li Yan1  Email: Li.Yan@RoswellPark.org 
Changxing Ma2  Email: cxma@buffalo.edu 
Dan Wang1  Email: Dan.Wang@RoswellPark.org 
Qiang Hu1  Email: Qiang.Hu@RoswellPark.org 
Maochun Qin1  Email: Maochun.Qin@RoswellPark.org 
Jeffrey M Conroy3  Email: Jeffrey.Conroy@roswellpark.org 
Lara E Sucheston4  Email: Lara.Sucheston@roswellpark.org 
Christine B Ambrosone4  Email: Christine.Ambrosone@roswellpark.org 
Candace S Johnson5  Email: Candace.Johnson@roswellpark.org 
Jianmin Wang1  Email: Jianmin.Wang@RoswellPark.org 
Song Liu1  Email: Song.Liu@RoswellPark.org 
1Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA 

2Department of Biostatistics, SUNY University at Buffalo, Buffalo, NY, 14214, USA 

3Cancer Genetics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA 

4Cancer Prevention and Control, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA 

5Pharmacology and Therapeutics, Roswell Park Cancer Institute, Buffalo, NY, 14263, USA 
A sizable genomics study, such as a microarray study, often involves the use of multiple batches (groups) of experiments due to practical complications. The systematic, non-biological differences between batches in a genomics experiment are referred to as batch effects. Batch effects are widespread in genomic studies, and it has been shown that noticeable variation between different batch runs can be a real concern, sometimes even larger than the biological differences [1-5]. Without sound experiment designs and statistical analysis methods to handle batch effects, misleading or even erroneous conclusions could be drawn. This especially important issue is unfortunately often overlooked, partially due to the complexity and multiple steps involved in genomics studies.
To minimize the impact of batch effects, a careful experiment design should ensure the even distribution of biological groups and confounding factors across batches. It would be problematic if one batch run contained most samples of a particular biological group. In an ideal genomics design, the groups of main interest, as well as important confounding variables, should be balanced and replicated across the batches to form a Randomized Complete Block Design (RCBD) [6-8]. This makes the statistical separation of the real biological effects of interest from the effects of other confounding factors more powerful.
However, despite best efforts, more often than not the collected samples do not comply with the original ideal RCBD design. This is because these studies are mostly observational or quasi-experimental, since we usually do not have full control over sample availability [1]. In clinical genomics studies, samples may be rare, difficult or expensive to collect, irreplaceable, or may fail QC before profiling. The resulting unbalanced and incomplete sample availability in a genomics study, without appropriate attention to sample-to-batch allocation, could lead to drastic batch effects. Therefore, it is necessary to develop an effective and handy tool to assign collected samples across batches in an appropriate way in order to minimize the impact of batch effects.
We developed OSAT to facilitate the allocation of collected samples to different batches in genomics studies. OSAT is not intended to be software for experimental design carried out before sample collection; rather, it was developed to meet needs arising from practical limitations in genomics experiments. Specifically, OSAT addresses one practical issue in genomics studies: once the experimental samples ready to be profiled on the genomics instruments have been collected, how should one allocate these samples to different batches in a proper way to achieve an optimal setup that minimizes the impact of batch effects at the genomic profiling stage? With a block randomization step followed by an optimization step, it produces a setup that optimizes the even distribution of samples in groups of biological interest across batches, reducing the confounding or correlation between batches and the biological variables of interest. It can also optimize the even distribution of confounding factors across batches. OSAT can handle challenging instances where incomplete and unbalanced sample collections are involved, as well as ideally balanced RCBDs.
An example dataset is used for demonstration. It represents samples from a study whose primary interest is to investigate expression differentiation between case and control groups (variable SampleType). Two additional variables, Race and AgeGrp, are clinically important variables that may have an impact on the final outcome. We consider them as confounding variables. A total of 576 samples are included in the study, with one sample per row in the example file. As shown in Additional file 1: Table S1-S2, none of the three variables has a balanced distribution.
The default algorithm implemented in OSAT will first block the three variables considered (i.e., SampleType, Race and AgeGrp) to generate a single initial assignment setup, and then identify the optimal one, with the most homogeneous cross-batch strata distribution, by shuffling the initial setup. Alternatively, if blocking the primary variable (i.e., SampleType) is most important and the optimization of the other two variables is less important (but desired), a different algorithm implemented in OSAT can be used. It works by first blocking SampleType only to generate a pool of assignment setups, and then selecting the optimal one with the most homogeneous cross-batch strata (i.e., SampleType, Race and AgeGrp) distribution.
As shown in Figure 1a-c, the final setup produced by the default algorithm is characterized by a relatively uniform distribution of all three variables across the batches. Pearson's χ² tests examining the association between batches and each of the variables considered indicate that all three variables are highly uncorrelated with batches (p-value > 0.99, Table 1). On the other hand, as shown in Figure 2a-c, the final setup produced by the alternative algorithm is characterized by an almost perfectly uniform distribution of the SampleType variable (with small variation only due to inherent limitations of the starting data, such as the unbalanced sample collection), while the uniformity of the other two variables, which were not included in the block randomization step, is decreased. Pearson's χ² tests (Table 1) show that the resulting chi-square statistic for SampleType decreases while those for Race and AgeGrp increase, indicating the trade-off in prioritizing the variable of primary interest for block randomization. Nevertheless, as shown in Figure 1d and Figure 2d, both algorithms produce final setups that show a more homogeneous cross-batch strata distribution than the corresponding starting ones.
Simply performing complete randomization might lead to an undesired sample-to-batch assignment: there is a substantial chance that variables will be statistically dependent on batches, especially for incomplete and/or unbalanced sample collections. As shown in Figure 3, an undesired setup can be produced through complete randomization of the sample-to-batch assignment. Pearson's χ² tests indicate that all three variables are statistically dependent on batches, with p-values < 0.05 (Table 1).
Genomics experiments are often driven by the availability of the final collection of samples, which might be unbalanced and incomplete. The unbalanced and incomplete nature of sample availability thus calls for the development of effective tools to assign collected samples across batches in an appropriate way in order to minimize the impact of batch effects at the genomics experiment stage. OSAT was developed to facilitate the allocation of collected samples to different batches in genomics studies. With a block randomization step followed by an optimization step, it produces a setup that optimizes the even distribution of samples in groups of biological interest across batches, reducing the confounding or correlation between batches and the biological variables of interest. It can also optimize the homogeneous distribution of confounding factors across batches. While motivated by challenging instances involving incomplete and unbalanced sample collections, OSAT can also handle ideally balanced RCBDs.
Partly due to its simplicity of implementation, complete randomization has frequently been used in the sample assignment step of experimental practice. When the sample size is large enough, a randomized design will be close to a balanced design. However, simple randomization could lead to an undesirably imbalanced design where efficiency and confounding might become issues after data collection. As we demonstrated in the manuscript, simply performing randomization might lead to an undesired sample-to-batch setup showing batch dependence, especially for unbalanced and/or incomplete sample sets that do not comply with the original ideal design. The OSAT package is designed to avoid such scenarios by providing a simple pipeline to create sample assignments that minimize the association between sample characteristics and batches. The software was implemented in a flexible way so that it can be adopted by genomics practitioners who may not be specialized in experiment design.
It should be emphasized that although the impact of batch effects on a genomics study might be minimized through proper design and sample allocation, it may not be completely eliminated. Even with a perfect design and best effort in all stages of the experiment, including sample-to-batch assignment, it is impossible to define or control all potential batch effects. Many statistical methods have been developed to estimate and reduce the impact of batch effects at the data analysis stage (i.e., after the experimental part is done) [1,9-12]. It would be helpful if analytic methods handling batch effects were employed in all stages of a genomics study, from experiment design to data analysis.
Experimental design has been applied in many areas, with methods tailored to the needs of various fields. A collection of R packages for experimental design is available at http://cran.r-project.org/web/views/ExperimentalDesign.html. Much of this existing experiment design software works for the ideal situation (i.e., before sample collection) where the sample size is fixed and/or a model is specified. For example, the software at the above link includes optimal design (e.g., AlgDesign, requiring model specification), orthogonal arrays for main-effects experiments (e.g., function oa.design, constrained by sample size and number of factors), factorial 2-level designs (e.g., package FrF2, particularly important in industrial experimentation), etc. We developed OSAT to facilitate the allocation of collected samples to different batches in genomics studies. Our software implements the general experiment design methodology to achieve the optimal sample-to-batch assignment in order to minimize the impact of batch effects. It is specifically used in the profiling stage of a genomics study, when the experimental samples ready to be profiled on the genomics instruments have been collected. It provides predefined batch layouts for some of the most commonly used genomics platforms. Written in a modularized style in the open source R environment, it gives users the flexibility to define the batch layout of their own experiment platform, as well as the optimization objective function for their specific needs in sample-to-batch assignment. To the best of our knowledge, there is no other tool for this important utility within the framework of Bioconductor.
The current version of OSAT provides two algorithms for the creation of sample assignments across batches based on the principle of block randomization, which is an effective approach for controlling variability from nuisance variables such as batches and their interaction with the variables of primary interest [6-8,13]. Both algorithms are composed of a block randomization step and an optimization step. The default algorithm (implemented in the function optimal.shuffle) first blocks all variables considered to generate a single initial assignment setup, then identifies the optimal one, which minimizes the objective function (i.e., the one with the most homogeneous cross-batch strata distribution), by shuffling the initial setup. The alternative algorithm (implemented in the function optimal.block) first blocks the specified variables (e.g., the list of variables of primary interest) to generate a pool of assignment setups, then selects the optimal one, which minimizes the objective function based on all variables considered (including those variables not included in the block randomization step). A detailed description is provided below.
By combining the variables of interest, we can create a unified variable whose levels are all possible combinations of the levels of the variables involved. Assume there are a total of s levels in the unified variable (referred to as optimization strata in this package), with S_{j} samples in stratum j, j = 1 … s, and assume we have m batches with B_{i}, i = 1 … m, wells available in each batch. In an ideal balanced RCBD experiment, we have an equal sample size in each stratum, S_{1} = … = S_{s} = S, and each batch includes the same number of available wells, B_{1} = … = B_{m} = B, with an equal number of samples from each stratum.
The expected number of samples from stratum j allocated to batch i is denoted E_{ij}. One can split it into its integer part and fractional part as

E_{ij} = (B_{i} S_{j}) / (∑_{i} B_{i}) = ⌊E_{ij}⌋ + δ_{ij}

where ⌊E_{ij}⌋ is the integer part of the expected number and δ_{ij} is the fractional part. In the case of equal batch sizes, this reduces to E_{ij} = S_{j}/m. For an RCBD, all δ_{ij} are zero.
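To make the split concrete, the following Python sketch (the package itself is written in R; the batch capacities and stratum sizes below are hypothetical) computes the expected counts E_ij = B_i S_j / ∑_i B_i together with their integer and fractional parts:

```python
import math

# Hypothetical example: 3 batches with well capacities B_i and
# 2 optimization strata with sample counts S_j (an unbalanced collection).
B = [8, 8, 6]          # wells available in each batch
S = [12, 10]           # samples in each stratum
total_wells = sum(B)   # 22

# E_ij = B_i * S_j / sum_i B_i, split into integer part floor(E_ij)
# and fractional part delta_ij.
E = [[b * s / total_wells for s in S] for b in B]
floors = [[math.floor(e) for e in row] for row in E]
deltas = [[e - f for e, f in zip(row, frow)] for row, frow in zip(E, floors)]

# Sanity check: expected counts for stratum j sum to S_j across batches.
for j, s in enumerate(S):
    assert abs(sum(row[j] for row in E) - s) < 1e-9
```

In an RCBD all entries of `deltas` would be zero; for this unbalanced example every δ_ij is non-zero, which is what the subsequent optimization step has to smooth out.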
For an actual sample assignment
         S_{1}    …    S_{s}
B_{1}    n_{11}   …    n_{1s}
⋮        ⋮        ⋱    ⋮
B_{m}    n_{m1}   …    n_{ms}
where n_{ij} is the number of samples in each optimization stratum in an actual sample assignment. Our goal is, through a block randomization step and an optimization step, to minimize the difference between the expected sample sizes E_{ij} and the actual sample sizes n_{ij}.
The block randomization step is to create initial setup(s) of randomized sample assignment based on strata combining the blocking variables considered. The blocking variables include all variables of interests in the default algorithm, but only a specified subset of variables in the alternative algorithm.
In this step, for each stratum S_{j} we sample m sets of samples of sizes ⌊E_{ij}⌋, and from each batch B_{i} we select s sets of wells of sizes ⌊E_{ij}⌋. The two selections are linked together by the ij subgroup and randomized within each of them. The remaining samples r_{j} = S_{j} − ∑_{i}⌊E_{ij}⌋ can be assigned to the available wells in each block, w_{i} = B_{i} − ∑_{j}⌊E_{ij}⌋. The probability of a sample in r_{j} from stratum S_{j} being assigned to a well from block B_{i} is proportional to the fractional part of the expected sample size, δ_{ij}. For an RCBD, each batch will have an equal number of samples with the same characteristics and there is no need for further optimization. However, for other instances where the collection of samples is unbalanced and/or incomplete, an optimization step is needed to create a more optimal setup of sample assignment.
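The two-part assignment can be sketched as follows in Python (a simplified illustration, not the OSAT internals: the stratum names and counts are hypothetical, and capacity bookkeeping for the leftover wells is omitted). Each batch first receives the deterministic ⌊E_ij⌋ samples per stratum, and the leftover r_j samples are then drawn into batches with probability proportional to the fractional parts δ_ij:

```python
import math
import random

random.seed(0)

B = [8, 8, 6]                              # wells per batch
S = {"case.White": 12, "ctrl.White": 10}   # samples per stratum (illustrative)
total = sum(B)

n = {g: [0] * len(B) for g in S}           # n_ij being built

for g, s in S.items():
    deltas = []
    for i, b in enumerate(B):
        e = b * s / total
        n[g][i] = math.floor(e)            # deterministic part: floor(E_ij)
        deltas.append(e - math.floor(e))
    # Leftover r_j samples go to batches with probability
    # proportional to the fractional parts delta_ij.
    r = s - sum(n[g])
    for _ in range(r):
        i = random.choices(range(len(B)), weights=deltas)[0]
        n[g][i] += 1

# Every sample in each stratum ends up assigned to some batch.
assert all(sum(counts) == S[g] for g, counts in n.items())
```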
The optimization step aims to identify an optimal setup of sample assignments from multiple candidates. To select the optimal sample assignment, we need to measure the variation of sample characteristics between batches. In this package, we define the optimal design as a sample assignment setup that minimizes our objective function, based on the principle of least squares [13]. The objective function can be defined as
V = ∑_{ij} (n_{ij} − E_{ij})²
where E_{ij} and n_{ij} were defined previously.
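The objective V = ∑_ij (n_ij − E_ij)² is straightforward to evaluate; the Python sketch below (hypothetical 3-batch, 2-stratum numbers, not taken from the package) shows the computation for one candidate assignment:

```python
# Hypothetical 3-batch x 2-stratum example; E_ij as defined in the text.
B, S = [8, 8, 6], [12, 10]
total = sum(B)
E = [[b * s / total for s in S] for b in B]

# A candidate actual assignment n_ij (rows sum to B_i, columns to S_j).
n = [[5, 3], [4, 4], [3, 3]]

def objective(n, E):
    """V = sum over i,j of (n_ij - E_ij)^2."""
    return sum((n[i][j] - E[i][j]) ** 2
               for i in range(len(n)) for j in range(len(n[0])))

V = objective(n, E)   # smaller V means a more homogeneous setup
```

A perfectly balanced RCBD would give V = 0; for an unbalanced collection such as this one, V is bounded away from zero and the optimization can only drive it to its integer-feasible minimum.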
In the default algorithm implemented in OSAT, optimization is conducted by shuffling the initial setup obtained in the block randomization step. Specifically, after the initial setup is created, we randomly select k samples from different batches and shuffle them between batches to create a new sample assignment. The value of the objective function is calculated for the new setup and compared to that of the original one. If the new value is smaller, the new assignment replaces the previous one. This procedure continues until a preset number of attempts is reached (5000 by default).
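The shuffle procedure can be illustrated with the following Python sketch (a simplified stand-in for the R function: it uses k = 2, i.e., a single pairwise swap per attempt, and hypothetical batch sizes and strata):

```python
import random

random.seed(42)

# Hypothetical data: 22 samples in 2 strata, 3 batches of sizes B_i.
B, S = [8, 8, 6], [12, 10]
total = sum(B)
E = [[b * s / total for s in S] for b in B]
stratum_of = [0] * 12 + [1] * 10     # stratum label per sample id

samples = list(range(22))
random.shuffle(samples)
batches = [samples[:8], samples[8:16], samples[16:]]   # initial setup

def objective(batches):
    n = [[sum(stratum_of[s] == j for s in batch) for j in range(len(S))]
         for batch in batches]
    return sum((n[i][j] - E[i][j]) ** 2
               for i in range(len(B)) for j in range(len(S)))

# Shuffle optimization: swap one sample between two batches and keep
# the new setup only when the objective decreases (5000 attempts).
best = objective(batches)
for _ in range(5000):
    i, k = random.sample(range(len(batches)), 2)
    a, b = random.randrange(len(batches[i])), random.randrange(len(batches[k]))
    batches[i][a], batches[k][b] = batches[k][b], batches[i][a]
    v = objective(batches)
    if v < best:
        best = v
    else:
        batches[i][a], batches[k][b] = batches[k][b], batches[i][a]  # revert
```

Because swaps preserve batch sizes, the search moves only among feasible assignments and monotonically drives V toward its minimum.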
In the alternative algorithm, multiple (typically thousands or more) sample assignment setups are first generated by the procedure described in the block randomization step above, based only on the list of specified blocking variable(s). The optimal one is then chosen by selecting the setup (from the pool generated in the block randomization step) that minimizes the value of the objective function based on all variables considered. This algorithm guarantees the identification of a setup that conforms to the blocking requirement for the list of specified blocking variables, while attempting to minimize the between-batch variation of the other variables considered.
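The pool-then-select idea can be sketched as below in Python (again a hypothetical illustration rather than the package internals; here candidates are fully random setups, whereas OSAT generates them by block randomization of the specified variables):

```python
import random

random.seed(1)

# Generate many candidate setups, keep the one minimizing the
# objective over all variables considered.
B, S = [8, 8, 6], [12, 10]
total = sum(B)
E = [[b * s / total for s in S] for b in B]
stratum_of = [0] * 12 + [1] * 10

def objective(batches):
    n = [[sum(stratum_of[s] == j for s in batch) for j in range(len(S))]
         for batch in batches]
    return sum((n[i][j] - E[i][j]) ** 2
               for i in range(len(B)) for j in range(len(S)))

def random_setup():
    order = random.sample(range(22), 22)    # random permutation of samples
    return [order[:8], order[8:16], order[16:]]

pool = [random_setup() for _ in range(2000)]
best = min(pool, key=objective)
```

Unlike the shuffle algorithm, nothing is modified after generation; the quality of the result depends only on the size and construction of the pool.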
We provide a brief overview of OSAT usage below. A more detailed description of package functionality can be found in the package vignette and manual.
To begin, the sample variables to be considered in the sample-to-batch assignment are encapsulated in an object using the function

sample <- setup.sample(x, optimal, ...)

where in the data frame x each sample is represented by a row, and categorical variables, including our primary interest and other variables, are listed as columns. The parameter optimal indicates the vector of variables to be considered.
Next, the number of plates to be used in the genomic experiment, the layout design of these plates, and the level at which batch effects are to be considered are captured in a container object using the constructor function

container <- setup.container(plate, n, batch, ...)

where the parameter plate is an object representing the layout (number and type of chips used, rows and columns of wells, their ordering, etc.) of the plate used in the experiment. Layouts of some commonly used plates and chips are predefined in our package (e.g., the IlluminaBeadChip plate). Users can define their own layout using the classes and methods provided in OSAT. The optional parameter batch has the default value "plates", indicating that batch effects will be considered at the plate level. Users can use batch="chips" to consider batch effects at the chip level.
Third, sampletobatch assignment can be created through function
create.optimized.setup(fun="optimal.shuffle", sample, container, ...)
The default algorithm is implemented in the function optimal.shuffle, while the alternative algorithm is implemented in the function optimal.block. Users can also define their own objective function following the instructions in the package vignette.
Last, bar plots of sample counts by batch for all variables considered are provided for visual inspection of the sample assignment. Chi-square tests are also performed to examine the dependence of sample variables on batches. The final sample-to-batch assignment can be output to a CSV file.
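The dependence check is an ordinary Pearson chi-square test on the batch-by-group count table. A pure-Python sketch of the statistic (OSAT performs this in R; the count table below is hypothetical) looks like:

```python
# Hypothetical 3-batch x 2-group count table (rows = batches).
table = [[5, 3], [4, 4], [3, 3]]

row_tot = [sum(r) for r in table]
col_tot = [sum(c) for c in zip(*table)]
N = sum(row_tot)

# Pearson chi-square: sum of (observed - expected)^2 / expected,
# with expected_ij = row_i * col_j / N under independence.
chi2 = sum((table[i][j] - row_tot[i] * col_tot[j] / N) ** 2
           / (row_tot[i] * col_tot[j] / N)
           for i in range(len(row_tot)) for j in range(len(col_tot)))
df = (len(row_tot) - 1) * (len(col_tot) - 1)
# A chi2 small relative to df indicates the grouping variable is
# close to independent of batch, i.e., a desirable assignment.
```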
Project name: OSAT
Project home page: http://bioconductor.org/packages/2.11/bioc/html/OSAT.html
Operating system(s): Windows, Unix-like (Linux, Mac OS X)
Programming language: R >= 2.15
License: Artistic-2.0
Any restrictions to use by non-academics: None
The authors declare that they have no competing interests.
LY, CM and SL conceived and designed the study. LY developed the software. LY, CM and SL drafted the manuscript. QH, DW, MQ, JMC, LES, CBA, CSJ and JW all contributed to the study design. All authors read and approved the final manuscript.
Table S1. Example data. Table S2. Data distribution. Figure S1. Number of samples per plate. Paired specimens are placed on the same chip. Sample assignment uses the optimal.block method.
We wish to thank the anonymous reviewers for their valuable comments and suggestions, which were helpful in improving the paper. The work was supported in part by National Institutes of Health grants R01HL102278 to LES, R01CA133264 to CBA, R01CA095045 to CSJ, and R21CA162218 to SL.
References
Lambert CG, Black LJ: Learning from our GWAS mistakes: from experimental design to scientific method. Biostatistics (Oxford, England) 2012, 13(2):195.
Baggerly KA, Coombes KR, Neeley ES: Run batch effects potentially compromise the usefulness of genomic signatures for ovarian cancer. J Clin Oncol 2008, 26(7):1186. doi:10.1200/JCO.2007.15.1951. PMID: 18309960.
Scherer A: Batch Effects and Noise in Microarray Experiments: Sources and Solutions. New York: John Wiley and Sons; 2009.
Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010, 11(10):733. doi:10.1038/nrg2825. PMID: 20838408.
Mak HC, Storey J: The importance of new statistical methods for high-throughput sequencing. Nat Biotechnol 2011, 29(4):331. doi:10.1038/nbt.1831. PMID: 21478851.
Murray L: Randomized Complete Block Designs. New York: John Wiley & Sons; 2005.
Montgomery DC: Design and Analysis of Experiments. 7th edition. Wiley; 2008.
Fang K, Ma C: Uniform and Orthogonal Designs. Beijing: Science Press; 2001.
Huang H, Lu X, Liu Y, Haaland P, Marron JS: R/DWD: Distance-Weighted Discrimination for classification, visualization and batch adjustment. Bioinformatics 2012, 28(8):1182. doi:10.1093/bioinformatics/bts096. PMID: 22368246.
Marsit CJ, Koestler DC, Christensen BC, Karagas MR, Houseman EA, Kelsey KT: DNA methylation array analysis identifies profiles of blood-derived DNA methylation associated with bladder cancer. J Clin Oncol 2011, 29(9):1133. doi:10.1200/JCO.2010.31.3577.
Li C, Rabinovic A: Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007, 8(1):118. doi:10.1093/biostatistics/kxj037. PMID: 16632515.
Ma C, Fang K, Liski E: A new approach in constructing orthogonal and nearly orthogonal arrays. Metrika 2000, 50(3):255. doi:10.1007/s001840050049.
Chen C, Grennan K, Badner J, Zhang D, Gershon E, Jin L, Liu C: Removing batch effects in analysis of expression microarray data: an evaluation of six batch adjustment methods. PLoS One 2011, 6(2):e17238. doi:10.1371/journal.pone.0017238. PMID: 21386892.
Tables
Table 1. Comparison of sample assignment by two algorithms implemented in OSAT and an undesired sample assignment through complete randomization

                     Default algorithm        Alternative algorithm    Complete randomization
                     (optimal.shuffle)        (optimal.block)          (undesired setup)
Variable     DF   Chi-square   P value     Chi-square   P value     Chi-square   P value
SampleType    5   0.2034518    0.9990763   0.03507789   0.9999879   13.25243     0.021124664
Race          5   0.2380335    0.9986490   3.68541503   0.5955359   14.22455     0.014244218
AgeGrp       20   0.8138166    1.0000000   5.08147313   0.9996856   39.75020     0.005371387