puma 3.0: improved uncertainty propagation methods for gene and transcript expression analysis.  
Jump to Full Text  
MedLine Citation:

PMID: 23379655 Owner: NLM Status: Publisher 
Abstract/OtherAbstract:

ABSTRACT: BACKGROUND: Microarrays have been a popular tool for gene expression profiling at genomescale for over a decadedue to the low cost, short turnaround time, excellent quantitative accuracy and ease of data generation.The Bioconductor package puma incorporates a suite of analysis methods for determining uncertainties from Affymetrix GeneChip data and propagating these uncertainties to downstream analysis. Asisoform level expression profiling receives more and more interest within genomics in recent years,exon microarray technology offers an important tool to quantify expression level of the majority ofexons and enables the possibility of measuring isoform level expression. However, puma does notinclude methods for the analysis of exon array data. Moreover, the current expression summarisationmethod for Affymetrix 3' GeneChip data suffers from instability for low expression genes. For thedownstream analysis, the method for differential expression detection is computationally intensiveand the original expression clustering method does not consider the variance across the replicatedtechnical and biological measurements. It is therefore necessary to develop improved uncertaintypropagation methods for gene and transcript expression analysis. RESULTS: We extend the previously developed Bioconductor package puma with a new method especially designed for GeneChip Exon arrays and a set of improved downstream approaches. The improvementsinclude: (i) a new gamma model for exon arrays which calculates isoform and gene expression measurements and a level of uncertainty associated with the estimates, using the multimappings betweenprobes, isoforms and genes, (ii) a variant of the existing approach for the probelevel analysis ofAffymetrix 3' GeneChip data to produce more stable gene expression estimates, (iii) an improvedmethod for detecting differential expression which is computationally more efficient than the existing approach in the package and (iv) an improved method for robust modelbased clustering of geneexpression, which takes technical and biological replicate information into consideration. CONCLUSIONS: With the extensions and improvements, the puma package is now applicable to the analysis of bothAffymetrix 3' GeneChips and Exon arrays for gene and isoform expression estimation. It propagatesthe uncertainty of expression measurements into more efficient and comprehensive downstreamanalysis at both gene and isoform level. Downstream methods are also applicable to otherexpression quantification platforms, such as RNASeq, when uncertainty information is availablefrom expression measurements. puma is available through Bioconductor and can be found athttp://www.bioconductor.org.Background. 
Authors:

Xuejun Liu; Zhenzhu Gao; Li Zhang; Magnus Rattray 
Related Documents
:

24034735  Exploring correlations in gene expression microarray data for maximum predictiveminimu... 23184475  A candidate malefertility femalefertility gene tagged by the soybean endogenous trans... 23945075  The influence of gene flow on species tree estimation: a simulation study. 23642045  Functional proteomics analysis to study atm dependent signaling in response to ionizing... 12529645  Periodic notch inhibition by lunatic fringe underlies the chick segmentation clock. 3003115  Nucleotide sequence and in vivo expression of the ilvy and ilvc genes in escherichia co... 
Publication Detail:

Type: JOURNAL ARTICLE Date: 201325 
Journal Detail:

Title: BMC bioinformatics Volume: 14 ISSN: 14712105 ISO Abbreviation: BMC Bioinformatics Publication Date: 2013 Feb 
Date Detail:

Created Date: 201325 Completed Date:  Revised Date:  
Medline Journal Info:

Nlm Unique ID: 100965194 Medline TA: BMC Bioinformatics Country:  
Other Details:

Languages: ENG Pagination: 39 Citation Subset:  
Export Citation:

APA/MLA Format Download EndNote Download BibTex 
MeSH Terms  
Descriptor/Qualifier:

Full Text  
Journal Information Journal ID (nlmta): BMC Bioinformatics Journal ID (isoabbrev): BMC Bioinformatics ISSN: 14712105 Publisher: BioMed Central 
Article Information Download PDF Copyright ©2013 Liu et al.; licensee BioMed Central Ltd. openaccess: Received Day: 13 Month: 9 Year: 2012 Accepted Day: 18 Month: 1 Year: 2013 collection publication date: Year: 2013 Electronic publication date: Day: 5 Month: 2 Year: 2013 Volume: 14First Page: 39 Last Page: 39 PubMed Id: 23379655 ID: 3626802 Publisher Id: 147121051439 DOI: 10.1186/147121051439 
puma 3.0: improved uncertainty propagation methods for gene and transcript expression analysis  
Xuejun Liu1  Email: xuejun.liu@nuaa.edu.cn 
Zhenzhu Gao1  Email: zhenzhu1717@yahoo.cn 
Li Zhang1  Email: leo.zhang@nuaa.edu.cn 
Magnus Rattray2  Email: magnus.rattray@manchester.ac.uk 
1College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, 29 Yudao St., Nanjing, 210016, China 

2Faculty of Life Sciences, University of Manchester, Oxford Road, Manchester M13 9PT, UK 
Microarrays have been applied to highthroughput gene expression profiling for over a decade due to several advantages, e.g. high coverage, low cost, short turnaround time, excellent quantitative accuracy and ease of data generation. It has been shown recently that microarrays still remain an efficient and reliable tool for expression quantification especially for lowabundance targets [^{1}]. We previously developed the Bioconductor package puma[^{2}] for Affymetrix GeneChip data analysis. In the initial probelevel analysis, puma uses the multimgMOS method [^{3}] to obtain an expression estimate for each gene and a level of uncertainty associated with this estimate. In the downstream analysis, puma propagates these uncertainties to principal component analysis, differential expression detection and gene expression clustering using methods NPPCA [^{4}], PPLR [^{5}] and PUMACLUST [^{6}], respectively, and obtains improved analysis results. In addition to expression measurements obtained from microarrays, these downstream methods are also applicable to other expression quantification platforms, e.g. RNASeq based on high throughput sequencing technology, providing a level of uncertainty is associated with each measurement.
As the analysis of alternative splicing gains more and more interest in recent years, exon microarray technology, such as Affymetrix GeneChip Exon arrays, provides an option for measuring isoform level expression. It is therefore necessary for puma to include methods for propagating isoform expression uncertainty in the analysis of exon array data. Furthermore, the current probelevel analysis method, multimgMOS, obtains unstable expression estimates for low expression genes which can adversely affect the downstream analysis results. For the downstream analysis, the PPLR method for differential expression detection is computationally expensive and the PUMACLUST method for expression clustering does not consider the variance across the replicated technical and biological measurements. For all these reasons, we present here a new version of the puma package which incorporates a suite of improved probelevel analysis methods for gene and transcript expression summarisation and uncertainty propagation methods for the downstream analysis. The new version of the package covers the wide range of quantitative expression analysis of microarray at both gene and isoform level with the great benefit from propagating uncertainty associated with expression estimates into various advanced downstream analyses.
Affymetrix microarrays use 25base long probes to measure transcript abundance. Traditional 3’ GeneChips use two types of probes, perfect match (PM) and mismatch (MM) probes. A PM probe matches the target sequence exactly, whereas the corresponding MM probe differs from the PM probe in the middle base which is changed to the complementary one. MM probes are introduced to act as a control for cross hybridisation and other types of background signal. The GeneChip Exon arrays use only PM probes to obtain higher density of coverage and make exon, isoform and gene level profiling possible. Many probelevel analysis methods for 3’ arrays such as PLIER [^{7}] and RMA [^{8}] which do not use MM probe intensities, can be applied to exon arrays directly for exon or gene level expression calculation by using probetoexon or probetogene mappings, respectively. With the estimated exon and gene expression, it is possible to perform alternative splicing detection by measuring exongene expression ratios [^{9}^{}^{11}]. In addition to calculating exon and gene expression ratios, isoform expression levels can also be quantified for a more refined downstream analysis.
The expression calculation at isoform level is nontrivial since one probe can be mapped to multiple transcripts or gene loci [^{12}]. Also, an important characteristic of Affymetrix microarray probes is that they have different sensitivity to transcript abundance according to their sequence content. Many probelevel analysis approaches for 3’ arrays account for these probespecific effects and have obtained improved results [^{3},^{13}]. Moreover, a level of uncertainty associated with estimated isoform expression would help downstream analyses to obtain more biologically relevant results. With available multimappings between probes and Ensembl transcripts, some methods have recently been proposed to address the expression calculation for known isoforms, such as MMBGX [^{14}] and MEAP [^{15}]. MMBGX uses a hierarchical Bayesian model to calculates the expression level of target transcripts and results in a posterior distribution of each isoform expression. MMBGX is solved by MCMC method and is therefore computationally intensive. After background removal, MEAP adopts a nonnegative matrix factorisation approach to summarise isoform expression as a point estimate and does not provide a level of uncertainty associated with this estimate. MMBGX and MEAP perform crosshybridisation correction according to different GC content for probes, removing probespecific effects to a certain extent. However, it has been shown that specific hybridisation also presents probespecific variations [^{8},^{16}]. We developed a new gamma model for exon array data (GME), which accounts for probeeffects in specific hybridisation and multimappings between probes, transcripts and genes. The GME model parameters are estimated by Maximum a Posteriori (MAP) optimisation to give isoform and gene level expression measurements with a level of uncertainty of these estimates, provided by a MAPLaplace approximation [^{17}]. The new method has been implemented as an R function in the new version of the puma package.
For traditional 3’ GeneChips, PM probes are thought to mainly measure specific hybridisation and MM probes measure nonspecific hybridisation and other background. However, probes for low expression genes often obtain higher background than true signal. When combining PM and the corresponding MM probe intensities to calculate gene expression, the resulting gene expression measurements can be unstable for low expression genes, especially on a log scale. For this reason, most popular methods provide an option of using PM probes only in order to obtain more stable expression values on the log scale, such as PLIER [^{7}], dCHIP [^{16}] and RMA [^{8}]. The previous method for 3’ GeneChips in puma, multimgMOS [^{3}], combines both PM and MM probe intensities to calculate gene expression values and provide a level of uncertainty associated with the measurements. For low expression genes the estimated logarithmic expression values are usually negative and the associated variance is typically large. These expression measurements with large error can further affect downstream analyses and may lead to incorrect biological conclusions. This is especially the case when the mean expression estimates are processed by methods outside of the puma package which do not account for measurement uncertainty. To alleviate this problem, we propose PMonly multimgMOS for 3’ arrays, which uses only PM probe intensities and obtains more stable gene expression estimates for low expression genes.
For the downstream analyses of gene expression, the new version of puma includes two newly improved approaches for finding differentially expressed (DE) genes and gene expression clustering. The previous method PPLR for finding DE genes considers the probelevel measurement error, which can improve results when there are few replicates available [^{5},^{18}]. PPLR uses an importance sampling procedure in the variational EM solver which leads to computational inefficiency since the number of samples needs to be increased to gain better accuracy. By adding a layer of hidden variables to the hierarchical Bayesian model, inference in the PPLR model is faster due to the elimination of this inefficient importance sampling step [^{19}]. The PUMACLUST method provided by the previous version of puma propagates probelevel uncertainty to improve results of standard Gaussian mixture clustering of gene expression [^{6}]. The recently proposed PUMACLUSTII [^{20}] approach improves PUMACLUST in several aspects. First, variance across the replicated technical and biological measurements for the same experimental condition is considered. Second, a Student’s tdistribution is adopted as the clustering components to improve the robustness of the method. Finally, the optimal number of components can be automatically found, and this is especially important for the clustering when the ground truth in the data is unknown.
puma includes two levels of analyses for expression data, expression summarisation and downstream analyses. At the summarisation level of analysis, the previous version of puma as described in [^{2}] can only processe 3’ GeneChip data using mainly multimgMOS. With the obtained gene expression measurements and the associated measurement uncertainty from microarrays or other platforms, puma propagates uncertainty into the downstream analyses, including PPLR for finding DE genes, PUMACLUST for gene expression clustering and NPPCA [^{4}] for principal component analysis of gene expression. The diagram of function components for the previous puma is shown in the upper part of Figure 1. After the extension and improvement in this paper, the functions of the new version of puma are illustrated in the lower part of Figure 1. The new version provides the following contributions:
•GME  In addition to traditional 3’ GeneChip data, the new version is capable of processing Exon array data using a new model GME at the summarisation level of analysis. From the Exon array data analysis, both gene and isoform expression can be computed.
•PMonly multimgMOS  PMonly multimgMOS is included to improve the stability of multimgMOS for gene expression estimation.
•IPPLR  At the downstream analyses, the new version of the package contains IPPLR as an improvement to speed up PPLR for detecting differential expression.
•PUMACLUSTII  For expression clustering, PUMACLUSTII is introduced to consider the technical and biological variance across experimental replicates. The new clustering method increases the robustness of clustering and automatically selects the optimal number of clusters by model selection.
With these contributions, methods in puma can process both gene and isoform expression, making puma useful in the analysis of alternative splicing. See Methods for more details on these algorithms.
The increasing availability of mappings of microarray probes to isoforms in the Ensembl database can be used to perform isoform expression estimation. In particular, multimappings between probes and isoforms are helpful in separating the intensity contributions from probes shared by multiple isoforms. Transcript expression estimation may benefit from this intensity separation. The database GATExplorer [^{12}] integrates information from multiple biological sources (including Ensembl database and probe sequences of Affymetrix microarrays) to provide the mappings between microarray probes and the functional transcriptional entities, i.e. gene loci, transcripts, exons and ncRNAs. We include the multimappings between Exon array probes, isoforms and genes obtained from GATExplorer into the separate Bioconductor data package pumadata which contains example and annotation data used by puma. Mappings for human, mouse and rat exon arrays are included and this makes puma applicable to all types of Affymetrix Exon arrays.
The new version of puma and the related pumadata package can be found at http://www.bioconductor.org. The GEM model is implemented in the function gmoExon to calculate gene and isoform level expression for Exon arrays. The PMonly multimgMOS method is implemented in the function PMmmgmos to estimate stable gene expression for Affymetrix GeneChips. The improved PPLR for detecting DE genes is implemented in the function pumaCombImproved. The PUMACLUSTII is implemented in the function pumaclustii for robust expression clustering. To use these functions, type library(puma) and library(pumadata) at R prompt to load puma package and the data package. A quick start of each of these functions is described below. For detailed use of these functions, please refer to the user manual of the puma package.
The expression summarisation method for Exon arrays is GME. The method makes use of multimappings between probes, isoforms and genes obtained from GATExplorer to aid the calculation of gene and isoform expression. The mappings are included in the individual package pumadata. The following code shows a quick start of this method.
The above code loads exon array data (CEL files) in the working directory as an AffyBatch object and processes it using GME method. Among the parameters, exontype can be one of “Human”, “Mouse” and “Rat”, indicating the exon chip type. GT can be one of “gene” and “transcript”, specifying the expression estimated at gene and isoform level, respectively. gsnorm specifies the algorithm used by the global scaling normalisation and can be one of “mean”, “median”, “meanlog” and “none”. “mean” and “meanlog” are meancentered normalisation on raw and the log scale, respectively, “median” is mediancentered normalisation and “none” means no global scaling normalisation. The value of gmoExon is an object of class exprReslt which stores the estimated expression and a level of uncertainty associated with this measurement.
PMonly multimgMOS increases the stability of the original multimgMOS method, especially for weakly expressed genes. We use an example dataset included in the pumadata package to demonstrate the use of this method.
The first parameter of the function PMmmgmos is an AffyBatch object containing the raw probe intensities. The parameter gsnorm has the same meaning as that in the function gmoExon. The value of PMmmgmos is an object of class exprReslt which contains the estimated gene expression and the corresponding estimation uncertainty.
IPPLR is designed to improve the computational efficiency of the original PPLR for finding differential expression. Similar to PPLR, it includes two steps to detect DE genes. At the first step, the function pumaCombImproved is used to combine expression from replicates to give a single measurement for the related condition. At the second step, the existing function pumaDE is used to calculate the PPLR (probability of positive logratio) values to identify DE genes. We use an example dataset in the puma package to demonstrate the use of this method as below.
The parameter of pumaCombImproved is an object of class ExpressionSet and can also be the outputs from GME, PMonly multimgMOS or multimgMOS. The function pumaDE generates lists of genes ranked by the PPLR values which indicate the significance of differential expression.
The existing clustering method PUMACLUST in puma considers uncertainty of gene expression but does not take into account the technical and biological variance when replicates are available. PUMACLUSTII is proposed to address this problem. It also adopts more robust components by using a Student’s t distribution instead of the Gaussian components used by PUMACLUST. We use an example dataset in the puma package to show the use of this method.
The first two parameters of pumaClustii are data frames containing the expression measurements and the associated uncertainty respectively. The minimum and maximum numbers of clusters are specified by the parameters mincls and maxcls, respectively. The parameter conds indicates the number of conditions involved in the data and reps is a vector specifying which condition each column of the input data frame belongs to. The result is a list containing the center of clustering components, the membership of components for each data point, the optional number of clusters and other auxiliary information.
We use the well studied Microarray Quality Control (MAQC) dataset [^{21}] to evaluate most of the extensions of the new version of puma at gene expression level. MAQC project measured gene expression levels from highquality RNA samples to assess the comparability across multiple platforms. We select two RNA samples, the universal human reference RNA (UHRR) and the human brain reference RNA (HBRR), from Affymetrix Exon array and Affymetrix U133 GeneChip platforms. Each sample type has five replicates for both platforms. Experiments of Exon arrays were carried out in two independent labs: McGill University (MU) and Virginia Tech (VT). We randomly selected data from MU for the evaluation of GME. For U133 GeneChips, we use data AFX_1_[AB][15] from GSE5350. Apart from microarray experiments, MAQC project also conducted qRTPCR experiments for around one thousand genes which can be served as a goldstandard to benchmark gene expression values estimated from other platforms [^{22},^{23}].
Among the qRTPCR data, we use the method similar to [^{23}] to filter out DE and nonDE genes with high certainty. Firstly, we select genes which were found to be “present” for at least three qRTPCR replicate assays. Secondly, average gene expression over replicates is calculated for each sample. Genes with absolute logratio between the UHRR and HBRR samples less than 0.2 are taken as “nonDE” genes. Those with logratio greater than 2.0 are “DE+” genes which are upregulated in UHRR sample and those with logratio less than 2.0 are “DE” genes being downregulated in UHRR sample. Finally, we map these nonDE and DE genes to Exon array and U133 GeneChip platforms and obtain the corresponding mapped genes and probesets for each platform as shown in Table 1. Using these qRTPCR validated data, we produce receiver operator characteristic (ROC) curves for various combinations of gene expression estimation methods and DE gene detection methods with the consideration of the direction sign of regulation.
The qRTPCR validated head and neck squamous cell carcinoma (HNSCC) dataset [^{15}] is used to verify the isoform expression calculated by GME. In HNSCC dataset, 15 cell lines from tongue and larynx were cultured and samples were assayed using Affymetrix Human Exon 1.0 ST microarrays. Amplification of the chromosome region 11q13 is a common genomic alteration in HNSCC. The 15 cell lines are divided into two sample groups, with 11q13 amplification (11q13+) and without 11q13 amplification (11q13). 11q13+ group contains seven cell lines and 11q13 group contains eight. qRTPCR experiments were performed for four alternatively spliced variants of two genes (ORAOV1 and NEO1) located in the 11q13 amplified region and associated with HNSCC. We use GME to calculate the expression levels for the four isoforms in all 15 cell lines and then apply PPLR to identify the differential expressed transcripts (DETs). The detected DETs are compared with qRTPCR findings to verify the performance of GME.
To evaluate the accuracy of GME for gene expression estimation from exon array data, we compare GME with the other two traditional methods RMA and PLIER. The functions implemented in Bioconductor package affy for RMA and PLIER methods are used to produce gene expression. We combine the different expression estimation methods with three DE detection methods, ttest, PPLR and IPPLR, to find DE genes on the MAQC dataset. ttest is applied to point estimates of gene expression from the three expression estimation methods. PPLR and IPPLR require a level of uncertainty associated with expression estimates, and they are therefore applied to GME and RMA which are able to provide expression measurement error. In addition to process all five replicates for each sample, we also randomly select two replicates to show the performance of each method with fewer number of replicates available. we repeat five runs for the processing of the 2replicate case. Figure 2 shows the average ROC curves of the comparison for 2replicate case and Figure 3 shows the results for 5replicate case. GME combined with PPLR obtains lower true positive rate (TPR) at the top of ranking list of DE genes. However, by increasing the number of sample in the importance sampling of PPLR, TPR gets obviously improved. The area under ROC curve (AUC) for the different expression estimation methods combined with various DE detection methods are shown in Table 2. We can see from Table 2 that GME outperforms the other alternatives at most cases, especially when combined with ttest and IPPLR. The comparison results show that GME is a competitive approach in gene expression calculation from Exon array data.
We use the qRTPCR validated HNSCC data set to verify the isoform expression calculated by GME. In HNSCC dataset, two ORAOV1 alternative splice variants (ORAOV1201 and ORAOV1202) and two NEO1 alternative splice variants (NEO1201 and NEO1202) are validated by qRTPCR experiments. We apply GME to this dataset and obtain the expression levels for the four transcripts. For each transcript in every one of the 15 cell lines, GME produces the expression estimate and a level of uncertainty associated with this estimate. Figures 4 and 5 show the distributions of isoform expression in each cell line of ORAOV1 and NEO1, respectively. The blue lines are for 11q13+ samples and the red lines for 11q13 samples. We can see from the figures that there is considerable variability in the transcript expression for the cell lines from each sample group. High expression is generally associated with low variance while low expression with large variance. For the expression distribution of NEO1201 as shown in the upper plot of Figure 5, there is extreme low expression for one cell line from each of the two sample groups. We then apply PPLR to the distributions of isoform expression to obtain the distributions of mean expression for each sample group, which are represented by the bold lines as shown in the figures. Note that the effects of low expression outliers are reduced by applying PPLR which accounts for technical and biological components of variance.
According to the qRTPCR results, the four transcripts are overexpressed in 11q13+ sample with less significant change for ORAOV1202 (p<0.0837). ORAOV1201 presents higher expression levels than ORAOV1202 in both 11q13+ and 11q13 samples, while NEO1202 is expressed at higher levels than NEO1201 in the two samples. Table 3 shows the directions of the relative expression change found by qRTPCR and GME. The results “+” and “” stand for up and downregulation in the first comparison component, respectively. For GME, the result of “+” indicates PPLR>0.5 and the result of “” indicates PPLR<0.5. We also show the probability of differential expression as calculated by max(PPLR,1−PPLR), with numbers close to 1.0 indicating strong support. It can been seen from Table 3 that the relative expression changes found by GME combined with PPLR are consistent with qRTPCR results for all comparisons. The results show that GME produces reliable isoform expression estimations for this specific dataset.
IPPLR accelerates the computation of PPLR by eliminating the importance sampling stage of the algorithm which significantly slows down PPLR computation. Table 4 shows the CPU run time of PPLR and IPPLR on 2replicate and 5replicate exon array data. The run time for 2replicate data is the average processing time over the 5 runs. It can be seen from Table 4 that the computation time of PPLR increases with the number of importance samples and IPPLR is therefore much more computationally efficient. The accuracy of detecting DE genes for different methods is shown in Table 2. We can see that with the same expression estimation method, IPPLR obtains the best accuracy for most datasets. PPLR and IPPLR outperform ttest. PPLR was compared with more sophisticated moderated ttests in the original publication [^{5}]. These show the usefulness of measurement error propagated into the downstream analysis. The improvement is especially significant for the 2replicate case demonstrating that probelevel measurement error helps to alleviate the need for experiment replicates. Note that as the number of importance samples increases the accurate of PPLR also gets improved. When the number of importance samples used is 10,000 then the accuracy of PPLR is close to that of IPPLR.
Our previous study [^{3}] shows that the original multimgMOS presents good sensitivity to the concentration change in samples due to the correction of nonspecific hybridisation by MM probe intensities. However, for weakly expressed genes the resulting logarithmic expression estimates are usually associated with large variance and this can cause instability in the downstream analysis. We divide the experimental data of Affymetrix U133 GeneChips into three groups, with “low”, “medium” and “high” expression respectively, to show this effect. Figure 6 shows the partition of the dataset with gene expression calculated from multimgMOS. Genes under line l_{1} belong to “low” expression group. Genes between line l_{1} and l_{2} belong to “median” expression group. Genes above line l_{2} belong to “high” expression group. The group of all genes is denoted as “all”. For each gene group, we plot ROC curves individually with the calculation from different expression methods combined with PPLR, as shown in Figure 7. The corresponding AUC values are shown in Table 5. We compare three expression estimation methods, PMonly multimgMOS, multimgMOS and the popular RMA approach. We can see that PMonly multimgMOS and multimgMOS outperform RMA for all gene groups. PMonly multimgMOS obtains better results than multimgMOS for “medium”, “low” and “all” groups, but fails in “high” group compared with multimgMOS. This shows PMonly multimgMOS performs better for relatively low expression genes while multimgMOS works well for high expression genes.
We randomly select two probesets, 220818_s_at and 203073_at, out of probesets whose PPLR values are significantly different between multimgMOS and PMonly multimgMOS. Probeset 220818_s_at is related to a low expression DE gene and 203073_at related to a high expression nonDE gene. The distributions of the expression difference between two conditions for the two probesets are shown in Figure 8. For the DE probeset in the left plot, the two methods obtain similar mean values of the expression difference, but obviously different measurement error. The variance of the expression difference calculated from multimgMOS is much larger than PMonly multimgMOS and this results in lower PPLR value, 0.747, compared with 1.000 from PMonly multimgMOS (PPLR values close to 0 or 1 indicate significant DE). Thus, this probeset is correctly classified as significant DE according to PMonly multimgMOS’s result while misclassified as nonDE according to multimgMOS’s computation. This shows that PMonly multimgMOS increases the stability of multimgMOS for gene expression calculation for lower expression. For the nonDE probeset on the right plot of Figure 8, multimgMOS correctly classifies this probeset with PPLR value 0.467 while PMonly multimgMOS misclassifies it with PPLR value 0.997 showing that PMonly multimgMOS can be less accurate in the high end.
PUMACLUSTII is a robust Student’s t mixture model and takes into accounts expression measurement error, and technical and biological variance. Our work in [^{20}] has already demonstrated that PUMACLUSTII obtained more accurate partitions compared with other alternatives on synthetic data. Furthermore, the method was shown to obtain numbers of clusters similar to the number of underlying groups in realistic simulated data. Applications of PUMACLUSTII on yeast metabolic cycle and cell cycle datasets have already shown that the method led to more biologically relevant clusters in terms of both GO category and TFgene interaction.
We have presented the extended and improved functions of the new version of the puma package and demonstrated the usefulness of these new functions on the well studied MAQC dataset and the qRTPCR validated HNSCC dataset. With these extensions and improvements, puma is able to provide accurate expression estimates for both Affymetrix 3’ GeneChips and Exon arrays. In addition to gene expression measurements, the new puma can also provide reliable estimation of isoform expression from Exon array data. For 3’ GeneChip data, the stability of expression measurements for low expression genes was improved. Furthermore, a level of uncertainty associated with these expression estimates can also be obtained and this measurement error can be propagated into our downstream analysis approaches to obtain improved results. With the consideration of expression measurement error in the downstream analyses, methods can be computationally demanding. The new puma package significantly improves the computational efficiency of the previous method for finding DE genes and obtains even better accuracy. As the final contribution, the new puma provides a robust clustering method which considers the withinchip measurement error and acrosschip technical and biological variance.
There are two main advantages of the new puma package. One is that the package processes Affymetrix 3’ GeneChips and Exon arrays to obtain accurate gene and isoform expression estimates with a level of uncertainty associated with these measurements. The other is that the package offers various downstream analysis approaches which make use of measurement error of expression to produce improved results at both gene and isoform level. Note that the data used for these downstream analyses is not limited to expression measurements from microarrays. The data can be expression measurement obtained from any other platform so long as a reasonable level of uncertainty can be associated with each measurement. For example, RNASeq is increasingly applied for transcript quantification [^{24}]. Some methods proposed to analyse RNASeq data are able to provide both expression estimates and measurement uncertainty [^{25},^{26}]. The transcript expression estimates and the related measurement error output by these methods can be used directly by the downstream analysis methods of puma. For all these reasons, puma is very useful to a large number of researchers who are interested in gene and transcript expression analysis.
Let y_{gjc} represent the jth PM probe intensity for the gth gene under the cth condition. Allowing any number of isoform contributions to y_{gjc}, we assume y_{gjc}=Σ_{k∈M(gj)}s_{gjkc}, where M(gj) is the set containing indices of isoforms mapping to probe j of gene g, and s_{gjkc} is the intensity contribution from the kth mapping isoform. Similar to the assumption of the multimgMOS method for 3’ array, we assume s_{gjkc} follow a gamma distribution, s_{gjkc}∼Ga(α_{gkc},β_{gj}), where β_{gj} is a probespecific latent variable which models the probe effects and is shared across the isoforms and experimental conditions of the same gene. As the summation of independent gammadistributed variables, y_{gjc} also follows a gamma distribution, y_{gjc}∼Ga(Σ_{k∈M(gj)}α_{gkc},β_{gj}). With a gamma prior for the latent variable β_{gj}, i.e. β_{gj}∼Ga(c_{g},d_{g}), the likelihood of probe intensities for a specific gene is
(1)
L{ygjc}{αgkc},cg,dg=∏jc∫p(ygjcΣk∈M(gj)αgkc,βgj)p(βgjcg,dg)dβgj. 
The integral in equation (1) can be computed analytically. The Maximum a Posteriori (MAP) solution of the model can thus be found by efficient numerical optimisation. With the estimated parameters {α^gkc}, ĉg and d^g, the distribution of the expression for each isoform is
(2)
p(sgjkc)=∫p(sgjkcα^gkc,βgj)p(βgjĉg,d^g)dβgj. 
We assume the expression of gene g is the sum of signal from its isoforms, i.e. Σ_{k}s_{gjkc}. Hence, the distribution of gene expression is also a gamma, Σ_{k}s_{gjkc}∼Ga(Σ_{k}α_{gkc},β_{gj}). Similarly, the posterior distribution of the gene expression can be expressed as
(3)
p(Σksgjkc)=∫p(ΣksgjkcΣkα^gkc,βgj)p(βgjĉg,d^g)dβgj. 
The posterior distributions of the logged gene/isoform expression can be estimated from equation (2) and (3), respectively. The expectation of the logged expression level is then computed and approximated by a Gaussian. The Gaussian approximation to the posterior distribution is useful for propagating the probelevel measurement error in subsequent downstream analyses of both gene and isoform expression.
Affymetrix 3’ GeneChips group probes into probesets. Most genes are covered by one probeset and gene expression level can be presented by the expression estimated from the grouped probe intensities. To improve the stability of gene expression measurements for the original multimgMOS [^{3}], we ignore the MM probe signal and assume PM probes measure specific hybridisation in a probespecific way. The intensities of PM probes within a probeset are assumed to follow a gamma distribution. Let y_{ijc} represent the jth PM intensity for the ith probeset under the cth condition. The model is defined by
(4)
yijc∼Ga(αic,bij) 
(5)
bij∼Ga(ci,di), 
where b_{ij} is a latent variable which models probespecific effects for the same type of chip.
The MAP solution of this model can be easily found by efficient numerical optimisation. With the estimated parameters α^ic, ĉi and d^i, the posterior distribution of PM intensities is
(6)
P(yijc)=∫P(yijcα^ic,bij)P(bijĉi,d^i)dbij. 
We use a Gaussian with a mean μ^ic and a variance σ^ic to approximate the posterior distribution of the expectation of log(y_{ijc}). The mean of the Gaussian is taken as the estimated gene expression and the variance shows the measurement error associated with this estimate.
In order to overcome the computation limitation of the original PPLR model, we propose an improved PPLR model (IPPLR) to detect DE genes. Similar to PPLR, IPPLR also considers both expression estimates and measurement uncertainty to obtain high accuracy in finding DE genes. We add a hidden variable x_{ij} to the original PPLR model, representing the true gene expression. We assume that the variable is Gaussian distributed xij∼N(μj,λ−1), where μ_{j} is the mean logged expression level under condition j and λ is the inverse of the betweenreplicate variance and is shared across different conditions. The measured expression level x^ij can be expressed as,
(7)
x^ij∼N(xij,sij2), 
where sij2 is the probelevel measurement error, which can be obtained from multimgMOS or PMonly multimgMOS.
We make a prior assumption that μ_{j} and λ^{−1} are independent and put a Gaussian prior on μ_{j},
(8)
μj∼N(μ0,η0−1), 
where μ_{0} and η_{0} are hyperparameters, on which we adopt noninformative hyperpriors. We assume a conjugate gamma prior on λ,
(9)
λ∼Ga(α,β). 
We use the EM algorithm combined with a variational method to work out the model. In the Estep of PPLR, the variational distribution of λ is obtained by importance sampling which slows down the computation of the method. In contrast, the computation in the Estep of IPPLR is analytical due to the introduction of the latent variable x_{ij}. IPPLR is therefore more computationally efficient than PPLR.
Once the posterior distribution of μ_{j} is obtained, the probability of positive logratio (PPLR) between a treatment μ_{t} and a control μ_{c} can be calculated by
(10)
PPLR=∫0+∞d(μt−μc)P(μt−μcD,ϕ^), 
where D is the observed dataset and ϕ^ is the set of ML estimates of hyperparameters. The examined transcript is upregulated in the treatment when PPLR>0.5 while downregulated when PPLR<0.5.
For the cases where technical or biological replicates are available, we propose a robust Student’s tmixture model to deal with the technical and biological variability. Suppose the expression estimate for gene n under condition j is x_{nji}, and the corresponding true expression and the known probelevel measurement error are t_{nji} and s_{nji} respectively, where i=1,…,R_{j} and R_{j} is the number of replicates under condition j. The expression estimate x_{nji} is assumed to be generated from the following Gaussian distribution,
(11)
xnji∼N(tnji,snji). 
The true gene expression t_{nji}’s for the replicates under the same condition is also assumed to be drawn from a Gaussian distribution,
(12)
tnji∼Nwnj,1ηn, 
with the mean expression w_{nj} for condition j and the precision η_{n}. By introducing a latent variable u_{n} for each gene, the tdistribution can be written as a convolution of a Gaussian with a Gamma placed on its precisions,
(13)
Stwnμk,Σk,νk=∫0∞Nwnμk,ΣkunGaunνk2,νk2du, 
where μ_{k} and Σ_{k} denote the mean and covariance matrix, respectively, and ν_{k} is degrees of freedom, for component k. The mean expression vector w_{n} is modelled as a robust mixture of Student’s tdistributions.
(14)
p(wn)=∑k=1KΠkSt(wnμk,Σk,νk). 
We share η_{n} across all conditions for each gene and assume that it captures the biological genespecific variability. The precision η_{n} is assumed to come from a Gamma distribution
(15)
ηnznk=1∼Ga(αk,βk). 
Inference can be carried out using the variational EM algorithm. Specifying the maximum and minimum numbers of components, the algorithm automatically converged to the optimal number of mixture components by employing the minimum message length (MML) principle [^{27}] for model selection.
Project name: puma Software
Project home page:http://www.bioinf.manchester.ac.uk/resources/puma
Operating systems: Platform independent
Programming language: R, C
Other requirements: R
Any restrictions to use: it is available for free download except that puma uses C scripts of donlp [^{28}].
The authors declare that they have no competing interests.
XL developed PUMACLUSTII, partially supervised the development of the extensions of the puma package and wrote the manuscript. ZG developed GME and PMonly multimgMOS methods. LZ developed IPPLR method. MR initiated the puma project and partially supervised the development of the puma package. All authors read and approved the final manuscript.
XL acknowledges support from NSFC (61170152) and Qing Lan Project. LZ acknowledges support by “the Fundamental Research Funds for the Central Universities” (CXZZ11_0217). MR was supported by BBSRC award BB/H018123/2.
References
Łabaj PP,Leparc GG,E LB,Markillie LM,S WH,P KD,Characterization and improvement of RNASeq precision in quantitative transcript expression profilingBioinformaticsYear: 20111413i383i39110.1093/bioinformatics/btr24721685096  
Pearson RD,Liu X,Sanguinetti G,Milo M,D LN,Rattray M,puma: a bioconductor package for propagating uncertainty in microarray analysisBMC BioinformaticsYear: 20091421110.1186/147121051021119589155  
Liu X,Milo M,Lawrence ND,Rattray M,A tractable probabilistic model for Affymetrix probelevel analysis across multiple chipsBioinformaticsYear: 2005143637364410.1093/bioinformatics/bti58316020470  
Sanguinetti G,MIlo M,Rattray M,Lawrence ND,Accounting for probelevel noise in principal component analysis of mmicroarray dataBioinformaticeYear: 2005143748375410.1093/bioinformatics/bti617  
Liu X,Milo M,Lawrence ND,Rattray M,Probelevel measurement error improves accuracy in detecting differential gene expressionBioinformaticsYear: 2006142107211310.1093/bioinformatics/btl36116820429  
Liu X,Lin KK,Andersen B,Rattray M,Including probelevel uncertainty in modelbased gene expression clusteringBMC BioinformaticsYear: 2007149817376221  
AffymetrixGuide to Probe Logarithmic Intensity ErrorYear: 2008 [Technical note].  
Irizarry RA,Hobbs B,Collin F,BeazerBarclay YD,Antonellis KJ,Exploreation, normalization, and summaries of high density oligonucleotide array probe level dataBiostatisticsYear: 20031424926410.1093/biostatistics/4.2.24912925520  
AffymetrixAlternative Transcript Analysis Methods for Exon ArraysYear: 2005 (11 October 2005, date last revised) [Http://media.affymetrix.com/support/technical/whitepapers/exon_alt_transcript_analysis_whitepaper.pdf].  
Purdom E,Simpson KM,Robinson MD,Conboy JG,Lapuk AV,Speed TP,FIRMA: a method for detection of alternative splicing from exon array dataBioinformaticsYear: 2008141707171410.1093/bioinformatics/btn28418573797  
Xing Y,Stoilov P,Kapur K,Han A,Jiang H,Shen S,Black DL,Wong WH,MADS: a new and improved method for analysis of differential alternative splicing by exontiling microarraysRNAYear: 2008141470147910.1261/rna.107020818566192  
Risueño A,Fontanillo C,E DM,J DLR,GATExplorer: genomic and transcriptomic explorer; mapping expression probe to gene loci, transcripts, exons and ncRNAsBMC BioinformaticsYear: 20101422110.1186/147121051122120429936  
Wu Z,Irizarry RA,Gentleman R,MartinezMurillo F,Spencer F,A modelbased background adjustment for oligonucleotide expression arraysJ Am Stat AssocYear: 20041490991710.1198/016214504000000683  
Turro E,Lewin A,Rose A,Dallman MJ,Richardson S,MMBGX: a method for estimating expression at the isoform level and detecting differential splicing using wholetranscript Affymetrix arraysNucleic Acids ResYear: 201014e410.1093/nar/gkp85319854940  
Chen P,Lepikhova T,Hu Y,Monni O,Hautamiemi S,Comprehensive exon array data processing method for quantitative analysis of alternative spliced variantsNucleic Acids ResYear: 201114e12310.1093/nar/gkr51321745820  
Li C,Wong W,Modelbased analysis of oligonucleotide arrays: Expression index computation and outlier detectionProc Natl Acad Sci USAYear: 200114313610.1073/pnas.98.1.3111134512  
Bishop CM,Pattern Recognition and Machine LearningYear: 2006New York: Springer  
Pearson RD,A comprehensive reanalysis of the Golden Spike data: Towards a benchmark for differential expression methodsBMC BioinformaticeYear: 20081416410.1186/147121059164  
Zhang L,Liu X,An improved probabilistic model for finding differential gene expressionProceedings of the 2nd International Conference on BioMedical Engineering and Informatics, BMEI 2009Year: 2009Tianjin, China  
Liu X,Rattray M,Including probelevel measurement error in robust mixture clustering of replicated microarray gene expressionStat Appl Genet Mol BiolYear: 20101442  
Consortium M,The MicroArray Quality Control (MAQC) project shows inter and intraplatform reproducibility of gene expression measurementsNat BiotechnolYear: 2006141151116110.1038/nbt1239  
Canales RD,Luo Y,Willey JC,Austermiller B,Barbacioru CC,Boysen C,Hunkapiller K,Jensen RV,Knight CR,Y LK,Ma Y,Maqsodi B,Papallo A,Peters EH,Poulter K,L RP,Samaha RR,Shi L,Yang W,Zhang L,M GF,Evaluation of DNA microarray results with quantitative gene expression platformsNat BiotechnolYear: 2006141115112210.1038/nbt123616964225  
Bullard JH,Purdom E,Hansen KD,Dudoit S,Evaluation of statistical methods for normalization and differential expression in mRNASeq experimentsBMC BioinformaticsYear: 2010149410.1186/14712105119420167110  
Nagalakshmi U,Wang Z,Waem K,Shou C,Raha D,Gerstein M,Snyder M,The transcriptional lanscape of the yeast genome defined by RNA sequencingScienceYear: 2008141344134910.1126/science.115844118451266  
Katz Y,Wang ET,Airoldi EM,Burge CB,Analysis and design of RNA sequencing experiments for identifying isoform regulationNat MethodsYear: 2010141009101510.1038/nmeth.152821057496  
Glaus P,Honkela A,Rattray M,Identifying differentially expressed transcripts from RNAseq data with biological variationBioinformaticsYear: 2012141721172810.1093/bioinformatics/bts26022563066  
Figueiredo MAT,Jain AK,Unsupervised learning of finite mixture modelsIEEE Trans Pattern Anal Mach IntellYear: 200214381396  
Spellucci PDB,An SQP method for general nonlinear programs using only equality constrained subproblemsMath ProgramYear: 199814413 
Figures
Tables
Number of qRTPCR validated nonDE and DE genes and probesets for Exon arrays and H133 GeneChips

nonDE

DE



DE+  DE  
Exon arrays

87

116

102

U133 GeneChips  204  185  267 
NonDE and DE genes obtained from qRTPCR data with high certainty are mapped to Exon arrays and Affymetrix U133 GeneChips. Exon arrays obtain 305 corresponding genes and U133 GeneChips contain 656 related probesets. The symbols “+” and “” stand for up and downregulation in UHRR, respectively.
Area under ROC curves from different methods for Exon array data
Methods

2 replicates

5 replicates



1  2  3  4  5  Average  
ttest

RMA

0.8945

0.8909

0.9107

0.9346

0.9316

0.9118

0.9475


PLIER

0.8806

0.8852

0.9004

0.9084

0.9083

0.8937

0.9291


GME

0.9082

0.9044

0.9415

0.9544

0.9427

0.9287

0.9580

PPLR_1000

RMA

0.9243

0.9234

0.9385

0.9417

0.9387

0.9323

0.9489


GME

0.9208

0.9093

0.9365

0.9297

0.8969

0.9188

0.9447

*PPLR_10000

RMA

0.9227

0.9226

0.9419

0.9453

0.9432

0.9348

0.9492


GME

0.9353

0.9317

0.9474

0.9374

0.9324

0.9274

0.9503

IPPLR

RMA

0.9246

0.9301

0.9464

0.9468

0.9463

0.9382

0.9493

GME  0.9379  0.9391  0.9457  0.9597  0.9549  0.9475  0.9589 
Gene expression estimation methods are combined with different findingDEgene methods. PPLR and IPPLR require a level of uncertainty associated with expression estimation, and they are therefore combined with GME and RMA since these two methods can provide variance of gene expression measurements. For ttest we use only the point estimates of gene expression. PLIER provides only a point estimate for gene expression and we only evaluate it combining with ttest. The number after PPLR indicates the sample number used in the importance sampling of the algorithm. The best result for each comparison is highlighted in bold.
GME results for the qRTPCR validated transcripts
Comparisons  qRTPCR  GME  m a x ( P P L R ,1− P P L R )  Consistency  

11q13+ vs. 11q13

ORAOV1201

+

+

1.0000

Y


ORAOV1202

+

+

0.9968

Y


NEO1201

+

+

0.8961

Y


NEO1202

+

+

0.9719

Y

ORAOV1201 vs. 202

11q13+

+

+

0.9154

Y


11q13

+

+

0.5782

Y

NEO1201 vs. 202

11q13+





0.9999

Y

11q13      1.0000  Y 
The expression changes between groups 11q13+ and 11q13 for each transcript, and between two transcripts of the same gene for each group, are examined. The results of qRTPCR and GME are “+” or “” for up and downregulation in the first comparison component, respectively. Column of max(PPLR,1PPLR) gives the probability of differential expression. The concordances between qRTPCR validation and GME results are given in the rightmost column.
Run time of PPLR and IPPLR
Datasets  PPLR_1000  PPLR_10000  IPPLR 

2 replicates

73.1

1330.8

27.5

5 replicates  125.4  3127.4  15.9 
The run time (CPU seconds) for 2replicate dataset is the average processing time over the 5 runs. The number after PPLR indicates the sample number used in the importance sampling of the algorithm. The program runs on the machine with Intel Pentium Dualcore 2.6GHz CPU and 8.0G RAM.
Area under ROC curves from different methods for U133 GeneChip data
Groups

# of probesets

PMonly multimgMOS

multimgMOS

RMA



nonDE  DE+  DE  
High

65

21

14

0.9842

0.9952

0.9308

Medium

90

73

82

0.9062

0.8880

0.8827

Low

49

91

171

0.8363

0.8180

0.8147

All  204  185  267  0.8227  0.8130  0.7971 
Genes are divided into three groups, labelled as “high”, “medium” and “low”, according to the expression levels. The numbers of “nonDE”, “DE+” and “DE” probesets are shown. AUC is calculated individually for each of the three groups from PMonly multimgMOS, multimgMOS and RMA combined with PPLR. The overall AUC is also shown in the bottom of the table. The winner is highlighted in bold for each group.
Article Categories:

Previous Document: 13yearold tuberous sclerosis patient with renal cell carcinoma associated with multiple renal angi...
Next Document: Hypoxiainduced Arterial Differentiation Requires Adrenomedullin and Notch Signaling.