A new family-based association test via a least-squares method.
MedLine Citation:

PMID: 16451567 Owner: NLM Status: MEDLINE 
Abstract/OtherAbstract:

To test the association between a dichotomous phenotype and a genetic marker based on family data, we propose a least-squares method using the vector of phenotypes and their cross products within each family. This new approach allows covariate adjustment and is numerically much simpler to implement than likelihood-based methods. The new approach is asymptotically equivalent to the generalized estimating equation approach with a diagonal working covariance matrix, thus avoiding some difficulties with the working covariance matrix reported previously in the literature. When applied to the data from the Collaborative Study on the Genetics of Alcoholism, this new method shows a significant association between the marker rs1037475 and alcoholism.
Authors:

Song Yang; Jungnam Joo; Ziding Feng; Jing-Ping Lin
Publication Detail:

Type: Journal Article Date: 2005-12-30
Journal Detail:

Title: BMC genetics Volume: 6 Suppl 1 ISSN: 1471-2156 ISO Abbreviation: BMC Genet. Publication Date: 2005
Date Detail:

Created Date: 2010-01-19 Completed Date: 2010-03-04 Revised Date: 2010-09-20
Medline Journal Info:

Nlm Unique ID: 100966978 Medline TA: BMC Genet Country: England 
Other Details:

Languages: eng Pagination: S110 Citation Subset: IM 
Affiliation:

Office of Biostatistics Research, National Heart, Lung, and Blood Institute, Bethesda, Maryland 20892, USA. yangso@nhlbi.nih.gov 
MeSH Terms  
Descriptor/Qualifier:

Family* Genome-Wide Association Study / methods* Humans Least-Squares Analysis
Journal Information Journal ID (nlmta): BMC Genet ISSN: 1471-2156 Publisher: BioMed Central, London
Article Information Copyright © 2005 Yang et al; licensee BioMed Central Ltd. Open access: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Collection publication date: Year: 2005 Electronic publication date: Day: 30 Month: 12 Year: 2005 Volume: 6 Issue: Suppl 1 First Page: S110 Last Page: S110 ID: 1866828 Publisher Id: 1471-2156-6-S1-S110 PubMed Id: 16451567 DOI: 10.1186/1471-2156-6-S1-S110
A new family-based association test via a least-squares method
Song Yang1  Email: yangso@nhlbi.nih.gov 
Jungnam Joo1  Email: jooj@nhlbi.nih.gov 
Ziding Feng2  Email: zfeng@fhcrc.org 
Jing-Ping Lin1  Email: linj@nhlbi.nih.gov
1Office of Biostatistics Research, National Heart, Lung, and Blood Institute, Bethesda, Maryland 20892, USA 

2Cancer Prevention and Research Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98109, USA 
Case-control studies provide an important tool to test for the association between disease outcomes and genetic markers [1,2]. Family-based association studies take advantage of existing data, such as data from a previous linkage analysis [3,4].
Incorporating covariates into the analysis should increase the power to detect associations. However, within-family correlations must be considered. For this purpose the generalized estimating equation (GEE) approach [5] is often used. In the case of dichotomous phenotypes, the GEE approach usually specifies that the mean response is related to a set of covariates via a link function. As for the correlation, a common correlation is usually assumed for each pair of relatives in the working correlation matrix, although more accurate correlation structures are possible and may be more efficient. However, a common problem with GEE is that the working correlation matrix may be singular. Recently Slager et al. [6] showed in various simulation studies that the failure rate for the GEE could be quite high in some cases and should not be ignored. To remedy this problem they proposed a score test approach for tests of association.
In this article we propose a new association test and apply it to the data from the Collaborative Study on the Genetics of Alcoholism (COGA). This new test is derived from a least-squares approach in which the dichotomous responses and their cross products are used, rather than the usual procedure in which the estimating equations use only the observed responses themselves. This approach is asymptotically equivalent to a GEE approach with a diagonal working correlation matrix, and therefore the estimating equation is always well defined.
Let $y_{ij}$ be the dichotomous phenotype of the $j$-th individual in the $i$-th family, where the $i$-th family has $k_i$ members and the sample contains $n$ families. Let $x_{ij}$ be the covariate vector, decomposed as $x_{ij} = (x_{ijm}, x_{ije})$, where $x_{ijm}$ and $x_{ije}$ represent the marker allele effect and the measured covariates, respectively. Suppose that, for the $i$-th family, the phenotypes are conditionally independent given a common random effect $u_i$, where the $u_i$ are independent and identically distributed with a gamma distribution with mean 1 and variance $\nu > 0$, and that, given $u_i$ and $x_{ij}$,

$$P(y_{ij} = 1 \mid x_{ij}, u_i) = \exp\{-u_i \exp(x_{ij}\beta)\}, \quad j = 1, \ldots, k_i, \; i = 1, \ldots, n, \qquad (1)$$

where $x_{ij}\beta = x_{ijm}\beta_m + x_{ije}\beta_e$. From this, we obtain that the mean of $y_{ij}$ is $\{1 + \nu \exp(x_{ij}\beta)\}^{-1/\nu}$. Numerically, it is more stable to work with the reparametrization $\alpha = \log(\nu)$. In this reparametrization, the mean of $y_{ij}$ is

$$P(y_{ij} = 1) = A_{ij}(\alpha, \beta), \quad j = 1, \ldots, k_i, \; i = 1, \ldots, n, \qquad (2)$$

where $A_{ij}(\alpha, \beta) = \{1 + \exp(\alpha + x_{ij}\beta)\}^{-\exp(-\alpha)}$. The joint distribution of $Y_i = (y_{i1}, \ldots, y_{ik_i})$ can also be obtained by integrating out the random effect $u_i$, and thus the likelihood function has a closed form. This is an appealing and important feature of the above modelling approach when using the log-log link function and log-gamma random effect. In comparison, for dealing with correlated dichotomous responses, a commonly used model specifies that, conditional on a normal random effect, the marginals of the conditional distribution are given by the logistic link. In that situation, the likelihood function does not have a closed form and extensive numerical methods are needed.
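The closed-form marginal mean in Equation (2) can be checked numerically. The following is a minimal Python sketch, not the authors' code: it evaluates the marginal mean for a given linear predictor and compares it with a Monte Carlo average over simulated gamma random effects. The function names, the argument `alpha` (log of the random-effect variance), the argument `eta` (the linear predictor for one individual), the number of draws, and the seed are all illustrative assumptions.

```python
import math
import random


def marginal_mean(alpha, eta):
    """Closed-form P(y = 1): {1 + exp(alpha + eta)}^(-exp(-alpha))."""
    return (1.0 + math.exp(alpha + eta)) ** (-math.exp(-alpha))


def simulated_mean(alpha, eta, draws=200_000, seed=0):
    """Monte Carlo average of exp(-u * exp(eta)) over gamma random effects u."""
    rng = random.Random(seed)
    nu = math.exp(alpha)  # variance of the random effect
    total = 0.0
    for _ in range(draws):
        # shape 1/nu and scale nu give a gamma variable with mean 1, variance nu
        u = rng.gammavariate(1.0 / nu, nu)
        total += math.exp(-u * math.exp(eta))
    return total / draws


if __name__ == "__main__":
    a, e = math.log(0.5), -0.3  # arbitrary illustrative values
    print(marginal_mean(a, e), simulated_mean(a, e))
```

As the random-effect variance shrinks (very negative `alpha`), the closed form approaches the plain complementary log-log probability `exp(-exp(eta))`, which gives a second sanity check.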
Note that Equation (1) imposes the same correlation structure regardless of family relations. More accurate descriptions are possible by assuming different random effects for different family relations, but this increases the number of parameters to estimate. Petersen [7] discussed some random effect models for correlated life times. Similar structures can also be adapted for dichotomous phenotypes. For ease of presentation and because of space limitations, we work with the simplified but illustrative Equation (1) here.
There have been some results on the analysis of Equation (1). Conaway [8] proposed the log-log link and log-gamma random effect for correlated binary data. He focused on the case in which there are no covariates, but this model can easily be extended to accommodate covariates. Pulkstenis et al. [9] used the log-log link and log-gamma random effect in a case study of longitudinal binary data for pain relievers. Both of these papers focused on the maximum likelihood estimators (MLE) based on the marginal likelihood function.
For families of larger size, the likelihood function becomes increasingly complicated: the contribution to the likelihood function from the $i$-th cluster has a number of terms that grows exponentially with $k_i$. The MLE may also be sensitive to model misspecification. Here we propose a new least-squares approach for testing $H_0: \beta_m = 0$. Note that the parameters $\alpha$ and $\beta$ can be identified from the marginal mean response function, so a natural and simple approach is to use the GEE based on Equation (2). We further observe that, for the cross products $y_{ij}y_{il}$, $j \neq l$, in the $i$-th family, we have

$$E(y_{ij}y_{il}) = P(y_{ij} = 1, y_{il} = 1) = B_{ijl}(\alpha, \beta),$$

where

$$B_{ijl}(\alpha, \beta) = \{1 + \exp(\alpha + x_{ij}\beta) + \exp(\alpha + x_{il}\beta)\}^{-\exp(-\alpha)}.$$
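To see how the shared random effect translates into within-family dependence, the joint probability above can be combined with the marginal means to give the correlation between two relatives' phenotypes. This is a hedged sketch under the model as stated; the function names and the arguments `alpha`, `eta_j`, and `eta_l` (log random-effect variance and the two linear predictors) are illustrative assumptions, not names from the paper.

```python
import math


def marginal_mean(alpha, eta):
    """P(y = 1) for one individual with linear predictor eta."""
    return (1.0 + math.exp(alpha + eta)) ** (-math.exp(-alpha))


def joint_prob(alpha, eta_j, eta_l):
    """P(y_j = 1, y_l = 1) for two members of the same family."""
    base = 1.0 + math.exp(alpha + eta_j) + math.exp(alpha + eta_l)
    return base ** (-math.exp(-alpha))


def pair_correlation(alpha, eta_j, eta_l):
    """Correlation between two binary phenotypes implied by the shared effect."""
    a_j = marginal_mean(alpha, eta_j)
    a_l = marginal_mean(alpha, eta_l)
    cov = joint_prob(alpha, eta_j, eta_l) - a_j * a_l
    return cov / math.sqrt(a_j * (1.0 - a_j) * a_l * (1.0 - a_l))


if __name__ == "__main__":
    for alpha in (-3.0, -1.0, 0.0):
        print(alpha, pair_correlation(alpha, 0.0, 0.0))
```

The covariance is always positive under this model, and it vanishes as the random-effect variance goes to zero, which is why the cross products carry information about the variance parameter.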
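Considering that $\alpha$ is involved in the random-effect-induced correlation among family members, it may be more efficient to work with $Y_i$ as well as the cross products $y_{ij}y_{il}$. For the $i$-th family, let $Z_i$ be the $k_i(k_i + 1)/2 \times 1$ vector consisting of $Y_i$ and the $k_i(k_i - 1)/2$ distinct cross products $y_{ij}y_{il}$, $j \neq l$, $j, l = 1, \ldots, k_i$. Let $m_i = E(Z_i)$, and let $V_i$ be the diagonal matrix with the variances of the components of $Z_i$ on the diagonal. Then we define $(\hat\alpha, \hat\beta)$ as the minimizer of

$$\sum_{i=1}^{n} (Z_i - m_i)^T V_i^{-1} (Z_i - m_i). \qquad (3)$$

To obtain $V_i$ and $m_i$, we have

$$E(y_{ij}) = P(y_{ij} = 1) = A_{ij}(\alpha, \beta), \qquad (4)$$

$$E(y_{ij}y_{il}) = B_{ijl}(\alpha, \beta), \quad \mathrm{Var}(y_{ij}) = A_{ij}(1 - A_{ij}), \quad \mathrm{Var}(y_{ij}y_{il}) = B_{ijl}(1 - B_{ijl}),$$

with $A_{ij}(\alpha, \beta)$ and $B_{ijl}(\alpha, \beta)$ defined previously; the variances follow because each component of $Z_i$ takes only the values 0 and 1. Note that $(\hat\alpha, \hat\beta)$ is asymptotically equivalent to the root of the estimating equation

$$\sum_{i=1}^{n} (\nabla m_i)^T V_i^{-1} (Z_i - m_i) = 0,$$

where $\nabla m_i$ is the matrix of partial derivatives of $m_i$ with respect to $(\alpha, \beta)$. In the above estimating equation the working covariance matrix is diagonal, and thus the estimating equation is always well defined. However, numerically it is more stable to use the least-squares approach.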
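The least-squares criterion described above can be sketched in a few lines of Python. This is an illustrative implementation under the stated model, not the authors' code: each family is assumed to be a pair `(y, X)` of a 0/1 phenotype list and a list of covariate rows, and `alpha`, `beta`, and all function names are hypothetical. Whether the variance weights are re-evaluated at every candidate parameter value or held fixed at a preliminary estimate is an implementation choice the text does not specify; the sketch re-evaluates them.

```python
import math
from itertools import combinations


def family_vector(y):
    """Phenotypes followed by all pairwise cross products (length k(k+1)/2)."""
    return list(y) + [y[j] * y[l] for j, l in combinations(range(len(y)), 2)]


def family_moments(alpha, beta, X):
    """Means and variances of the components of the family vector."""
    power = -math.exp(-alpha)
    s = [math.exp(alpha + sum(b * x for b, x in zip(beta, row))) for row in X]
    means = [(1.0 + sj) ** power for sj in s]
    means += [(1.0 + s[j] + s[l]) ** power
              for j, l in combinations(range(len(X)), 2)]
    variances = [m * (1.0 - m) for m in means]  # every component is 0 or 1
    return means, variances


def objective(alpha, beta, families):
    """Sum over families of squared residuals weighted by inverse variances."""
    total = 0.0
    for y, X in families:
        z = family_vector(y)
        means, variances = family_moments(alpha, beta, X)
        total += sum((zi - mi) ** 2 / vi
                     for zi, mi, vi in zip(z, means, variances))
    return total


if __name__ == "__main__":
    toy = [([1, 1], [[0.0], [0.0]]), ([1, 0, 1], [[1.0], [0.0], [1.0]])]
    print(objective(0.0, [0.0], toy))
```

In practice `objective` would be handed to a general-purpose numerical minimizer over `(alpha, beta)`; the diagonal weighting means no matrix inversion is needed at any step.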
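Once the estimators are obtained, due to the asymptotic equivalence to the GEE approach, the covariance matrix of $\hat\theta = (\hat\alpha, \hat\beta)$ can be estimated by the robust estimator

$$V_{\theta} = A^{-1} B A^{-1} \qquad (8)$$

with

$$A = \sum_{i=1}^{n} (\nabla m_i)^T V_i^{-1} \nabla m_i, \qquad B = \sum_{i=1}^{n} (\nabla m_i)^T V_i^{-1} (Z_i - m_i)(Z_i - m_i)^T V_i^{-1} \nabla m_i,$$

where $(\alpha, \beta)$ are replaced by $(\hat\alpha, \hat\beta)$. A more stable but numerically more intensive alternative for estimating the covariance matrix is to use a bootstrapping method that resamples family units a large number of times. Decompose $\hat\theta = (\hat\alpha, \hat\beta_m, \hat\beta_e)$, where $\hat\beta_m$ is the estimator for $\beta_m$. Now the hypothesis $H_0: \beta_m = 0$ can be tested using the asymptotic normality of the z score based on $\hat\beta_m$.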
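The bootstrap alternative and the final z test can be sketched as follows. This is a minimal illustration, not the authors' procedure: `estimator` stands for any user-supplied function that maps a list of families to a single number (for example, the fitted marker coefficient), and `affected_fraction` is only a toy stand-in so the sketch runs on its own. The function names, the number of resamples, and the seed are assumptions.

```python
import math
import random
import statistics


def bootstrap_se(families, estimator, n_boot=500, seed=0):
    """Standard error from resampling whole families with replacement."""
    rng = random.Random(seed)
    n = len(families)
    estimates = []
    for _ in range(n_boot):
        sample = [families[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(sample))
    return statistics.stdev(estimates)


def z_test(estimate, se):
    """z score and two-sided p-value under asymptotic normality."""
    z = estimate / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


def affected_fraction(sample):
    """Toy estimator: overall fraction of affected individuals."""
    ys = [y for family in sample for y in family]
    return sum(ys) / len(ys)


if __name__ == "__main__":
    toy = [[1, 1, 0], [0, 0], [1, 0, 1, 1], [0, 1], [1, 1], [0, 0, 0]]
    se = bootstrap_se(toy, affected_fraction)
    print(se, z_test(affected_fraction(toy), se))
```

Resampling families rather than individuals keeps each family's internal correlation intact in every bootstrap replicate, which is the point of resampling family units.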
Note that in the least-squares approach above, $\beta$ can be interpreted as a regression parameter in the mean response function $E Z_i$, which includes the cross product terms. We can similarly define a least-squares estimator of $(\alpha, \beta)$ by working with $Y_i$ and its mean response function $E Y_i$, without the addition of the cross product terms. In that case, $\beta$ would be interpreted as a regression parameter in the mean response function $E Y_i$. In various numerical studies, the addition of the cross product terms improved the efficiency for small and moderate sample sizes.
We applied the proposed approach to the data from COGA. The data provide alcoholism diagnoses on 1,614 individuals from 143 families. We focused on two distinct categories for the alcoholism diagnosis: "affected" as cases (609 individuals) and "purely unaffected" as controls (261 individuals). The preliminary genome scan carried out for linkage analysis using the microsatellite data identified the gene ADH3 on chromosome 4 as a candidate gene. In the Illumina SNP data we found 4 single-nucleotide polymorphisms (SNPs) (rs1037475, rs1491233, rs749407, rs980972) located at the physical map position of ADH3. Without correcting for the correlation between family members, a logistic regression on these 4 SNPs suggested that rs1037475 and rs980972 were significant predictors (p-values of 0.0032 and 0.0284, respectively). Also, a quick look at the 2 × 2 table stratified by sex showed some differences, which led us to consider sex as a covariate. Assuming a recessive genetic model, the new least-squares method showed a significant association between rs1037475 and alcoholism, with a p-value of 0.013. Further, the analysis showed a significant sex effect with p-value < 0.001. Without using the cross product terms, the corresponding least-squares method also showed a significant association between rs1037475 and alcoholism, with a p-value of 0.002, and a significant sex effect with p-value < 0.001. The smaller p-value obtained without the cross product terms might be due to the fact that a common correlation was assumed among all family members for the 870 individuals and 143 families. Violation of this assumption does not affect the mean response function $E y_{ij}$ but would introduce some bias in the mean response function $E(y_{ij}y_{il})$ for the cross product terms. This in turn might reduce the power of the corresponding association test.
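For readers unfamiliar with the term, a recessive genetic model codes the marker covariate as 1 only when an individual carries two copies of the allele of interest. The sketch below shows one way to build a covariate row from a genotype and sex. It is purely illustrative: the allele labels, the tuple representation of a genotype, the choice of which sex is coded 1, and the function names are assumptions and do not come from the COGA data or the paper.

```python
def recessive_code(genotype, risk_allele):
    """Return 1 if both alleles equal the risk allele, otherwise 0."""
    first, second = genotype
    return 1 if first == risk_allele and second == risk_allele else 0


def covariate_row(genotype, sex, risk_allele="A"):
    """Covariate row [marker, sex] with sex coded 1 for 'M' and 0 otherwise."""
    return [recessive_code(genotype, risk_allele), 1 if sex == "M" else 0]


if __name__ == "__main__":
    for genotype in (("A", "A"), ("A", "G"), ("G", "G")):
        print(genotype, covariate_row(genotype, "F"))
```

Under this coding, heterozygous carriers are grouped with non-carriers, so the marker coefficient compares homozygous carriers with everyone else.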
When we restricted our analysis to the 499 siblings in 141 families, we still found a significant sex effect, with or without using the cross product terms. However, with the cross product terms, a significant association between rs1037475 and alcoholism was found; without the cross product terms, no such association was established. In all cases with the reduced dataset, the p-value was smaller with the cross product terms than without them.
In this article we have proposed a new test of association between dichotomous disease outcomes and genetic markers for family data. When applied to the data from COGA, this new approach indicated an association between SNP marker rs1037475 and alcoholism. This new approach has the flexibility of adjusting for covariates, and sex was a significant covariate in this analysis. The use of the complementary log-log link function and the conjugate log-gamma random effect, rather than the more common combination of logistic link function and normal random effect, allowed us to obtain closed forms for the means and variances of the responses and their cross products. Using these quantities enables us to derive parametric estimators via the least-squares approach, which avoids the difficulty in the GEE approach created by singularity of the working correlation matrix. The least-squares approach is more robust and computationally much simpler to implement than the likelihood approach.
Simulation studies also yielded evidence that the efficiency of the new approach is high and that its behavior on small samples is often better than that of the more complicated likelihood-based approach.
COGA: Collaborative Study on the Genetics of Alcoholism
GEE: Generalized estimating equation
MLE: Maximum likelihood estimators
SNP: Single-nucleotide polymorphism
SY was involved in the design of the study and statistical analysis, and drafted the manuscript. JJ and JPL performed the statistical analysis and participated in revising the manuscript. ZF was involved in the design of the study and participated in revising the manuscript.
References
1. Risch N. Searching for genetic determinants in the new millennium. Nature 2000;405:847–856. [pmid: 10866211] [doi: 10.1038/35015718]
2. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science 1996;273:1516–1517. [pmid: 8801636] [doi: 10.1126/science.273.5281.1516]
3. Whittaker JC, Morris A. Family-based tests of association and/or linkage. Ann Hum Genet 2001;65:407–419. [pmid: 11806850] [doi: 10.1046/j.1469-1809.2001.6550407.x]
4. Witte JS, Gauderman W, Thomas D. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol 1999;149:693–705. [pmid: 10206618]
5. Liang KY, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika 1986;73:13–33. [doi: 10.2307/2336267]
6. Slager SL, Schaid DJ, Wang L, Thibodeau SN. Candidate-gene association studies with pedigree data: controlling for environmental covariates. Genet Epidemiol 2003;24:273–283. [pmid: 12687644] [doi: 10.1002/gepi.10228]
7. Petersen JH. An additive frailty model for correlated life times. Biometrics 1998;54:646–661. [pmid: 9660631] [doi: 10.2307/3109771]
8. Conaway MR. A random effects model for binary data. Biometrics 1990;46:317–328. [doi: 10.2307/2531437]
9. Pulkstenis EP, Ten Have TR, Landis JR. Model for the analysis of binary longitudinal pain data subject to informative dropout through remediation. J Am Stat Assoc 1998;93:438–450. [doi: 10.2307/2670091]
Article Categories:
Conference: Genetic Analysis Workshop 14: Microsatellite and single-nucleotide polymorphism. Noordwijkerhout, The Netherlands. 7–10 September 2004.