selscan: an efficient multithreaded program to perform EHHbased scans for positive selection.  
Jump to Full Text  
MedLine Citation:

PMID: 25015648 Owner: NLM Status: Publisher 
Abstract/OtherAbstract:

Haplotypebased scans to detect natural selection are useful to identify recent or ongoing positive selection in genomes. As both real and simulated genomic datasets grow larger, spanning thousands of samples and millions of markers, there is a need for a fast and efficient implementation of these scans for general use. Here we present selscan, an efficient multithreaded application that implements Extended Haplotype Homozygosity (EHH), Integrated Haplotype Score (iHS), and Crosspopulation Extended Haplotype Homozygosity (XPEHH). selscan accepts phased genotypes in multiple formats, including TPED, and performs extremely well on both simulated and real data and over an order of magnitude faster than existing available implementations. It calculates iHS on chromosome 22 (22, 147 loci) across 204 CEU haplotypes in 353s on one thread (33s on 16 threads) and calculates XPEHH for the same data relative to 210 YRI haplotypes in 578s on one thread (52s on 16 threads). Source code and binaries (Windows, OSX and Linux) are available at https://github.com/szpiech/selscan. 
Authors:

Zachary A Szpiech; Ryan D Hernandez 
Related Documents
:

22951808  Design and analysis of multiple diseases genomewide association studies without controls. 24081228  Estimating the range of obesity treatment response variability in humans: methods and i... 23984318  Transmission model of hepatitis b virus with the migration effect. 23583868  Environmental drivers and spatial dependency in wildfire ignition patterns of northwest... 21730748  Plasticitymodulated seizure dynamics for seizure termination in realistic neuronal mod... 9094928  Biochemical markers as additional measurements in dietary validity studies: application... 
Publication Detail:

Type: JOURNAL ARTICLE Date: 2014710 
Journal Detail:

Title: Molecular biology and evolution Volume:  ISSN: 15371719 ISO Abbreviation: Mol. Biol. Evol. Publication Date: 2014 Jul 
Date Detail:

Created Date: 2014712 Completed Date:  Revised Date:  
Medline Journal Info:

Nlm Unique ID: 8501455 Medline TA: Mol Biol Evol Country:  
Other Details:

Languages: ENG Pagination:  Citation Subset:  
Copyright Information:

© The Author(s) 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. 
Export Citation:

APA/MLA Format Download EndNote Download BibTex 
MeSH Terms  
Descriptor/Qualifier:

Full Text  
Journal Information Journal ID (nlmta): Mol Biol Evol Journal ID (isoabbrev): Mol. Biol. Evol Journal ID (publisherid): molbev Journal ID (hwp): molbiolevol ISSN: 07374038 ISSN: 15371719 Publisher: Oxford University Press 
Article Information Download PDF © The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. creativecommons: Print publication date: Month: 10 Year: 2014 Electronic publication date: Day: 10 Month: 7 Year: 2014 pmcrelease publication date: Day: 10 Month: 7 Year: 2014 Volume: 31 Issue: 10 First Page: 2824 Last Page: 2827 PubMed Id: 25015648 ID: 4166924 DOI: 10.1093/molbev/msu211 Publisher Id: msu211 
selscan: An Efficient Multithreaded Program to Perform EHHBased Scans for Positive Selection  
Zachary A. Szpiech*^{1}  
Ryan D. Hernandez^{1}^{2}^{3}  
^{1}Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco 

^{2}Institute for Human Genetics, University of California, San Francisco 

^{3}Institute for Quantitative Biosciences (QB3), University of California, San Francisco 

Correspondence: *Corresponding author: Email: zachary.szpiech@ucsf.edu. Associate editor: Sohini Ramachandran 
Extended Haplotype Homozygosity (EHH) (^{Sabeti et al. 2002}), Integrated Haplotype Score (iHS) (^{Voight et al. 2006}), and Crosspopulation Extended Haplotype Homozygosity (XPEHH) (^{Sabeti et al. 2007}) are statistics designed to use phased genotypes to identify putative regions of recent or ongoing positive selection in genomes. They are all based on the model of a hard selective sweep, where a de novo adaptive mutation arises on a haplotype that quickly sweeps toward fixation, reducing diversity around the locus. If selection is strong enough, this occurs faster than recombination or mutation can act to break up the haplotype, and thus a signal of high haplotype homozygosity can be observed extending from an adaptive locus.
As genetics data sets grow larger both in number of individuals and number of loci, there is a need for a fast and efficient publicly available implementation of these statistics. Below, we introduce these statistics and provide concise definitions for their calculations. We then evaluate the performance of our implementation, selscan.
In a sample of n chromosomes, let C denote the set of all possible distinct haplotypes at a locus of interest (named x_{0}), and let C(xi) denote the set of all possible distinct haplotypes extending from the locus x_{0} to the ith marker either upstream or downstream from x_{0}. For example, if the locus of interest x_{0} is a biallelic single nucleotide polymorphism (SNP) where 0 represents the ancestral allele and 1 represents the derived allele, then C:={0,1}. If x_{1} is an immediately adjacent marker, then the set of all possible haplotypes is C(x1):={11,10,00,01}.
EHH of the entire sample, extending from the locus x_{0} out to marker x_{i}, is calculated as
where n_{h} is the number of observed haplotypes of type h∈C(xi).In some cases, we may want to calculate the haplotype homozygosity of a subsample of chromosomes all carrying a “core” haplotype at locus x_{0}. Let Hc(xi) be a partition of C(xi) containing all distinct haplotypes carrying the core haplotype, c∈C, at x_{0} and extending to marker x_{i}. Note that
(2)
C(xi)=⋃c∈CHc(xi). 
Following the example above, if the derived allele (1) is chosen as the core haplotype, then H1(x1):={11,10}. Similarly, if the ancestral allele is the core haplotype, then H0(x1):={00,01}.
We calculate the EHH of the chromosomes carrying the core haplotype c to marker x_{i} as
where n_{h} is the number of observed haplotypes of type h∈Hc(xi) and n_{c} is the number of observed haplotypes carrying the core haplotype (c∈C).iHS is calculated by using equation (3) to track the decay of haplotype homozygosity for both the ancestral and derived haplotypes extending from a query site. To calculate iHS at a site, we first calculate the integrated haplotype homozygosity (iHH) for the ancestral (0) and derived (1) haplotypes (C:={0,1}) via trapezoidal quadrature.
(4)
iHHc=∑i=1D12(EHHc(xi−1)+EHHc(xi))g(xi−1,xi)+∑i=1U12(EHHc(xi−1)+EHHc(xi))g(xi−1,xi), 
(5)
ln(iHH1iHH0). 
Note that this definition differs slightly from that in ^{Voight et al. (2006)}, where unstandardized iHS is defined with iHH_{1} and iHH_{0} swapped.
Finally, the unstandardized scores are normalized in frequency bins across the entire genome.
(6)
iHS=ln(iHH1iHH0)−Ep[ln(iHH1iHH0)]SDp[ln(iHH1iHH0)], 
In practice, the summations in equation (4) are truncated once EHHc(xi)<0.05. Additionally with low density SNP data, if the physical distance b (in kbp) between two markers is >20, then g(xi−1,xi) is scaled by a factor of 20/b in order to reduce possible spurious signals induced by lengthy gaps. During computation if the start/end of a chromosome arm is reached before EHHc(xi)<0.05 or if a gap of b > 200 is encountered, the iHS calculation is aborted for that locus. iHS is not reported at core sites with minor allele frequency (MAF) < 0.05. In selscan, the EHH truncation value, gap scaling factor, and core site MAF cutoff value are all flexible parameters definable on the command line.
To calculate XPEHH between populations A and B at a marker x_{0}, we first calculate iHH for each population separately, integrating the EHH of the entire sample in the population (eq. 1).
(7)
iHH=∑i=1D12(EHH(xi−1)+EHH(xi))g(xi−1,xi)+∑i=1U12(EHH(xi−1)+EHH(xi))g(xi−1,xi) 
If iHH_{A} and iHH_{B} are the iHHs for populations A and B, then the (unstandardized) XPEHH is
(8)
ln(iHHAiHHB), 
(9)
XPEHH=ln(iHHAiHHB)−E[ln(iHHAiHHB)]SD[ln(iHHAiHHB)]. 
In practice, the sums in each of iHH_{A} and iHH_{B} (eq. 7) are truncated at x_{i}—the marker at which the EHH of the haplotypes pooled across populations is EHH(xi)<0.05. Scaling of g(xi−1,xi) and handling of gaps is done as for iHS, and these parameters are definable on the selscan command line.
Here, we evaluate the performance of selscan (https://github.com/szpiech/selscan, last accessed July 16, 2014) for computing the iHS and XPEHH statistics. In addition, we compare performance on these statistics with the programs rehh (^{Gautier and Vitalis 2012}, http://cran.rproject.org/package=rehh, last accessed July 16, 2014), ihs (^{Voight et al. 2006}), and xpehh (^{Pickrell et al. 2009}). Both ihs and xpehh are available for download at http://hgdp.uchicago.edu/Software/ (last accessed July 16, 2014). All computations were run on a MacPro running OSX 10.8.5 with two 2.4 GHz 6–core Intel Xeon processors with hyperthreading enabled.
For runtime evaluation of iHS calculations, we simulated a 4 Mbp region of DNA with the program ms (^{Hudson 2002}) and generated four independent data sets with varying numbers of sampled haplotypes (θ=1600 and ρ=1600). We sampled 250 haplotypes (9,625 SNP loci), 500 haplotypes (10,646 SNP loci), 1,000 haplotypes (11,655 SNP loci), and 2,000 haplotypes (12,724 SNP loci). We name these data sets IHS250, IHS500, IHS1000, and IHS2000, respectively. These data sets represent a densely typed region similar to nextgeneration sequencing data. Although these data sets are generated via strictly neutral processes, they serve the purpose of runtime evaulation perfectly well. We also use data from The 1000 Genomes Project (^{1000 Genomes Project Consortium 2012}) Omni genotypes, calculating iHS scores at 22,147 SNP loci on chromosome 22 across 102 CEU individuals (204 haplotypes). We name this data set CEU22.
Table 1 summarizes the runtimes of ihs, rehh, and selscan. We note that rehh integrates haplotype homozygosity over a physical map, whereas ihs and selscan integrate over a genetic map by default. This does not affect runtimes (data not shown), which are measured using genetic maps for ihs and selscan. Even operating on a single thread, selscan calculates iHS scores at least an order of magnitude faster than ihs and up to 1.8 × faster than rehh for large data sets.
We compare unstandardized iHS scores for the CEU22 data set using ihs and selscan and find excellent agreement (fig. 1A, Pearson’s r = 0.9946). The slight variance in scores between the two programs is likely due to an undocumented difference in the way ihs calculates its scores (supplementary material of ^{Sabeti et al. 2007}), but the effect is negligible. We also calculate unstandardized iHS scores for the CEU22 data set using rehh and selscan (using a physical map) and again find excellent agreement (Pearson’s r = 0.9953).
For runtime evaluation of XPEHH calculations, we simulated a 4Mbp region of DNA with the program ms (^{Hudson 2002}) with a simple two population divergence model (time to divergence t = 0.05, θ=1600, and ρ=1600) and generated four independent data sets with varying numbers of sampled haplotypes. We sampled 250 haplotypes (125 from each population, 12,920 SNP loci), 500 haplotypes (250 from each population, 14,989 SNP loci), 1,000 haplotypes (500 from each population, 17,142 SNP loci), and 2,000 haplotypes (1,000 from each population, 19,567 SNP loci). We name these data sets XP250, XP500, XP1000, and XP2000, respectively. These data sets represent a densely typed region similar to nextgeneration sequencing data. Although these data sets are generated via strictly neutral processes, they serve the purpose of runtime evaulation perfectly well. We also use data from The 1000 Genomes Project (^{1000 Genomes Project Consortium 2012}) Omni genotypes, calculating XPEHH scores at 22, 147 SNP loci on chromosome 22 across 102 CEU individuals (204 haplotypes) and 105 YRI individuals (210 haplotypes). We name this data set CEUYRI22.
Table 2 summarizes the runtimes of xpehh and selscan. Even operating on a single thread, selscan tends to calculate XPEHH scores at least an order of magnitude faster than xpehh. Figure 1B shows the correlation (Pearson’s r = 0.9999) of CEUYRI22 unstandardized XPEHH scores between the two programs.
selscan achieves a speed up of at least an order or magnitude over both ihs and xpehh and a speed up of nearly 2x over rehh for large data sets through general optimizations of the calculations. We also implement shared memory parallelism with multithreading to further speed up calculations on computers with multiple cores. Because iHS and XPEHH attempt to calculate a score for each site in the data and each score can be calculated indpendently of the others, selscan partitions the workload (sites at which to calculate a score) across threads, while maintaining each thread’s access to the entire data set required to make the calculation.
Additional empirical testing (data not shown) suggests that rehh, ihs, and selscan (for both iHS and XPEHH calculations) are O(ND2), and xpehh is O(N2D2), where N is the number of haploid samples and D is the SNP locus density.
Each of these statistics require phased haplotypes and a genetic or physical map as input data (TPED format) and missing genotypes must either be dropped or imputed. Because of the speed improvements we have implented, we expect that selscan will be a valuable tool for calculating EHHbased genomewide scans for positive selection in very large genetic data sets, including wholegenome sequencing and genomewide association study data, currently being generated for humans and other organisms. selscan will also allow for indepth examination of the performance of these statistics under a wide range of parameters in largescale simulation studies.
The authors would like to thank Trevor Pemberton and Paul Verdu for assistance in testing the Windows binaries. This work was partially supported by NIH (grants P60MD006902, UL1RR024131, 1R21HG007233, 1R21CA178706, 1R01HL11700401, and 1R01HG007644) and a Sloan Foundation Research Fellowship (to R.D.H.).
References
1000 Genomes Project ConsortiumAn integrated map of genetic variation from 1,092 human genomesNatureYear: 20124917422566523128226  
Gautier M,Vitalis R. rehh: an R package to detect footprints of selection in genomewide SNP data from haplotype structureBioinformaticsYear: 20122881176117722402612  
Hudson RR. Generating samples under a wrightfisher neutral model of genetic variationBioinformaticsYear: 200218233733811847089  
Pickrell JK,Coop G,Novembre J,Kudaravalli S,Li JZ,Absher D,Srinivasan BS,Barsh GS,Myers RM,Feldman MW,et al. Signals of recent positive selection in a worldwide sample of human populationsGenome Res.Year: 200919582683719307593  
Sabeti PC,Reich DE,Higgins JM,Levine HZP,Richter DJ,Schaffner SF,Gabriel SB,Platko JV,Patterson NJ,McDonald GJ,et al. Detecting recent positive selection in the human genome from haplotype structureNatureYear: 2002419690983283712397357  
Sabeti PC,Varilly P,Fry B,Lohmueller J,Hostetter E,Cotsapas C,Xie X,Byrne EH,McCarroll SA,Gaudet R,et al. Genomewide detection and characterization of positive selection in human populationsNatureYear: 2007449716491391817943131  
Voight BF,Kudaravalli S,Wen X,Pritchard JK. A map of recent positive selection in the human genomePLoS Biol.Year: 200643e7216494531 
Figures
Tables
Runtime Performance (in seconds) of ihs, rehh, and selscan for Calculating Unstandardized iHS for Various Data Sets.
Data Set  ihs  rehh^{a} 
selscan



Threads =1  2  4  8  16  
IHS250  19,275  563  618  306  162  84  58 
IHS500  45,547  1,652  1,554  782  399  220  150 
IHS1000  >100,000  4,834  4,018  2,019  1,040  566  380 
IHS2000  >100,000  12,652  7,054  3,633  1,869  1,046  752 
CEU22  19,434  588  353  182  93  50  33 
msu211TF1NOTE.—Calculations running over 100,000 s were aborted.
msu211TF2^{a}rehh integrates over a physical map instead of a genetic map. Using a physical map does not affect selscan’s runtime (data not shown).
Runtime Performance (in seconds) of xpehh and selscan for Calculating Unstandardized XPEHH for Various Data Sets.
Data Set  xpehh 
selscan



Threads =1  2  4  8  16  
XP250  11,113  287  141  71  38  25 
XP500  57,006  766  403  194  104  67 
XP1000  >100,000  2,037  1,018  515  274  180 
XP2000  >100,000  5,683  2,798  1,471  763  493 
CEUYRI22  37,271  578  291  150  78  52 
msu211TF3NOTE.—Calculations running over 100,000 s were aborted.
Article Categories:

Previous Document: The Loss of Adipokine Genes in the Chicken Genome and Implications for Insulin Metabolism.
Next Document: Multicenter, Phase III Trial Comparing Selenium Supplementation With Observation in Gynecologic Radi...