Document Detail

Determination of genetic structure of germplasm collections: are traditional hierarchical clustering methods appropriate for molecular marker data?
Jump to Full Text
MedLine Citation:
PMID:  21472410     Owner:  NLM     Status:  MEDLINE    
Despite the availability of newer approaches, traditional hierarchical clustering remains very popular in genetic diversity studies in plants. However, little is known about its suitability for molecular marker data. We studied the performance of traditional hierarchical clustering techniques using real and simulated molecular marker data. Our study also compared the performance of traditional hierarchical clustering with model-based clustering (STRUCTURE). We showed that the cophenetic correlation coefficient is directly related to subgroup differentiation and can thus be used as an indicator of the presence of genetically distinct subgroups in germplasm collections. Whereas UPGMA performed well in preserving distances between accessions, Ward excelled in recovering groups. Our results also showed a close similarity between clusters obtained by Ward and by STRUCTURE. Traditional cluster analysis can provide an easy and effective way of determining structure in germplasm collections using molecular marker data, and, the output can be used for sampling core collections or for association studies.
T L Odong; J van Heerwaarden; J Jansen; T J L van Hintum; F A van Eeuwijk
Publication Detail:
Type:  Journal Article; Research Support, Non-U.S. Gov't     Date:  2011-04-07
Journal Detail:
Title:  TAG. Theoretical and applied genetics. Theoretische und angewandte Genetik     Volume:  123     ISSN:  1432-2242     ISO Abbreviation:  Theor. Appl. Genet.     Publication Date:  2011 Jul 
Date Detail:
Created Date:  2011-06-14     Completed Date:  2011-10-04     Revised Date:  2013-06-30    
Medline Journal Info:
Nlm Unique ID:  0145600     Medline TA:  Theor Appl Genet     Country:  Germany    
Other Details:
Languages:  eng     Pagination:  195-205     Citation Subset:  IM    
Wageningen University and Research, Biometris, Wageningen, The Netherlands.
Export Citation:
APA/MLA Format     Download EndNote     Download BibTex
MeSH Terms
Biological Markers
Cluster Analysis*
Cocos / genetics*
Computer Simulation
Gene Library
Genes, Plant*
Genetic Structures
Genetic Variation*
Phaseolus / genetics*
Solanum / genetics*
Reg. No./Substance:
0/Biological Markers

From MEDLINE®/PubMed®, a database of the U.S. National Library of Medicine

Full Text
Journal Information
Journal ID (nlm-ta): Theor Appl Genet
ISSN: 0040-5752
ISSN: 1432-2242
Publisher: Springer-Verlag, Berlin/Heidelberg
Article Information
Download PDF
© The Author(s) 2011
Received Day: 9 Month: 11 Year: 2010
Accepted Day: 18 Month: 3 Year: 2011
Electronic publication date: Day: 7 Month: 4 Year: 2011
pmc-release publication date: Day: 7 Month: 4 Year: 2011
Print publication date: Month: 7 Year: 2011
Volume: 123 Issue: 2
First Page: 195 Last Page: 205
ID: 3114091
PubMed Id: 21472410
Publisher Id: 1576
DOI: 10.1007/s00122-011-1576-x

Determination of genetic structure of germplasm collections: are traditional hierarchical clustering methods appropriate for molecular marker data?
T. L. Odong1 Address:
J. van Heerwaarden2
J. Jansen1
T. J. L. van Hintum1
F. A. van Eeuwijk1
1Wageningen University and Research, Biometris, Wageningen, The Netherlands
2University of California, Davis, CA USA
Communicated by footnote: Communicated by M. Frisch.


Information about the structure of germplasm collections is of great importance for both the conservation and utilization of genetic resources collected in genebanks. Because of the diverse nature of genebank germplasm materials (landraces, selected lines from landraces, elite breeding lines, released varieties, wild and weedy relatives of the cultigen, and genetic stocks from different areas of origin), they provide all relevant allelic diversity necessary for plant improvement. These materials are therefore very suitable for example for association studies (D’hoop et al. 2010). However, the large numbers of accessions accumulated in genebanks reduce the efficiency and effectiveness with which these genetic resources can be exploited. The approach of forming core collections (core sub-sets) was introduced to solve the above problem. Frankel (1984) defined a core collection as a limited set of accessions representing, with minimum repetitiveness, the genetic diversity of a crop species and its wild relatives. Determination of the genetic structure (partitioning) of heterogeneous germplasm collections is an essential component in the sampling of core collections since partitioning of germplasm collections before sampling ensures that both the genetic and the ecological spectra of germplasm collections are fully represented in core collections (Brown 1995; van Hintum et al.2000). In addition, it may be necessary to associate accessions in the core collection with the entire collection; the association can be based on the group structure.

The determination of genetic structures of germplasm collections is also an important aspect of association studies (Wang et al. 2005; Shriner et al. 2007). General agreement exists among researchers that incorporating population structure into statistical models used in association mapping is necessary to avoid false positives (Pritchard et al. 2000b; Flint-Garcia et al. 2003; Zhu et al. 2008). The general model for association mapping can be written as “phenotype = marker + genotype + error”, and test for a marker effect is equivalent to testing for a QTL. Typically, genotype is a random factor whose effects are structured by kinship or population structure. This simple model can be improved by incorporating information on the relationships between the genotypes a.k.a. population structure. The relationship between phenotype and marker can be tested within the different groups (e.g. Remington et al. 2001; Simko et al. 2004) or genetic groups can be used as an extra factor or as a covariate in modeling the relationship (e.g. Thornsberry et al. 2001; Wilson et al. 2004). Yu et al. (2006) went further by introducing a mixed model approach which incorporates both population structure (Q) and kinship (K) in modeling the relationship between phenotype and marker. Another important method for incorporating population structure in association studies involves the use of principal components (Price et al. 2006).

Whether the genetic structure is needed for use in sampling core collections or for association studies, an important challenge still is the choice of a method for determining the genetic structure of germplasm collections. In the past, determination of the genetic structure of germplasm collections has mainly been done using traditional multivariate statistical methods such as cluster analysis, principal component analysis, and multidimensional scaling, usually based on agronomic data (Peeters and Martinelli 1989; Franco et al. 1997, 2005, 2006).

In recent years, many new methods have been developed especially for studying structure in natural populations using molecular markers, e.g. STRUCTURE (Pritchard et al. 2000a), PCA (Patterson et al. 2006) and PCO-MC (Reeves and Richards 2009). These methods can also be used for studying genetic structure in germplasm collections. However, traditional hierarchical clustering is still a very popular method for studying genetic diversity in crop species (see D’hoop et al 2010; Barro-Kondombo et al. 2010; Perumal et al. 2007; Folkertsma et al. 2005). Its popularity stems from the fact that it requires little computer time compared to other methods, it is available in many general statistical packages, it is frequently used in different types of applications and it is easy to understand. Moreover, it does not require genetic assumptions such as Hardy–Weinberg or linkage equilibrium. Hierarchical clustering requires decisions about the distance measure, the clustering algorithm, and the evaluation of dendrograms, amongst others. Most evaluations of the performance of hierarchical clustering methods were based on data sets of limited size (Milligan and Cooper 1985). In addition, most studies carried out to evaluate the performance of hierarchical clustering methods with respect to germplasm collections were on non-molecular marker data (Peeters and Martinelli 1989; Franco et al. 1997, 2005, 2006). We are not aware of any study in which the performance of hierarchical clustering techniques was evaluated specifically using molecular marker data. With the expected reduction in the cost of genotyping, we will be faced with datasets of thousands of accessions genotyped with several molecular markers, therefore, there is strong need to evaluate the performance of the traditional hierarchical clustering techniques using large sets of molecular marker data. The structure of genetic diversity in germplasm collections is totally different compared to natural populations. It is not clear how traditional clustering will perform under different factors affecting genetic diversity like migration and reproductive system of the materials that constitute germplasm collections. As pointed out by Mohammadi (2003), very few studies in plant genetic diversity have critically analyzed the performance of different clustering procedures especially with respect to molecular markers.

Several methods for evaluating the results of hierarchical clustering techniques exist. When performing hierarchical cluster analysis, we are interested in answering some of the following questions: (1) is there agreement between the original distances and the distances between individuals as represented by the dendrogram (2) what can the dendogram tell us about structure in the data set and (3) what is the optimum number of clusters for a given data set? One of the most popular measures of agreement between the original distances and the distances in dendrogram is the cophenetic correlation coefficient (CPCC) (Sokal and Rohlf 1962); another measure is the stress criterion of Kruskal (1964). Only a few measures for the presence of hierarchical structure can be found in the literature. Kaufman and Rousseeuw (1990) proposed the agglomerative coefficient (AC) as a criterion for measuring the amount of hierarchical structure in the data. A large number of methods have been proposed to deal with the optimum-number-of-clusters problem. A classical study is that of Milligan and Cooper (1985) who examined the performance of 30 of such criteria. Since then many criteria for determining the optimal number of clusters were introduced: the silhouette statistic (Rousseeuw 1987), Krzanowski and Lai’s index (Krzanowski and Lai 1988), the gap method (Tibshirani et al. 2001), the Clest method (Dudoit and Fridlyand 2002), the jump method (Sugar and James 2003) and the weighted gap method (Yan and Ye 2007). In general, little attention has been paid to the behavior of the above measures and methods in relation to molecular marker data from germplasm collections. A literature search indicated that so far no study tried to relate the amount of genetic structure in a germplasm collections to the performance of hierarchical cluster analysis techniques. The main objective of our study is to determine a relationship between dendogram evaluation criteria such as CPCC, AC to subgroup differentiation (genetic structure). In addition, we also compared the performance of hierarchical clustering techniques with model-based clustering methods.

In this paper, the merits of hierarchical clustering techniques for application in germplasm collections will be considered. The “Materials and methods” contains a brief description and overview of clustering techniques, the evaluation criteria and the methods used for generating simulated data. The real data set used for illustration in this paper is also described. In the results section, we present results of cluster analysis of both real and simulated data sets. We compare the results of two traditional hierarchical clustering techniques (UPGMA and Ward) with the model-based cluster analysis program STRUCTURE (Pritchard et al. 2000a), and show using simulated data how different evaluation criteria of hierarchical cluster analysis are related to subpopulation differentiation.

Materials and methods
Motivation of the study

This study was motivated by the need to study genetic diversity of several important food crops under the Generation Challenge Programme-GCP ( The Generation Challenge Programme is a broad network of partners from international agricultural research institutes and national agricultural research programs collectively working to improve crop productivity in the developing world, especially environments prone to drought, low soil fertility, pests and diseases. All the real data sets used in this study were generated under GCP subprogram I—Crop Genetic Diversity.

Real data

The real data that will be used to illustrate methods consist of 1,014 accessions of coconut (Cocos nucifera) genotyped with 30 SSR markers. The accessions were collected from different regions of the world: West Africa (32), North America (52), South Asia (62), Latin America (72), Central America and the Caribbean (109), East Africa (124), South East Asia (183) and the Pacific Islands (380). Coconut is a diploid, mainly out-crossing species. Most of the accessions in this collection were indicated as tall; 43 dwarf accessions were present mainly from South East Asia. Dwarf coconuts have a high degree of self-fertilization. Because of its usefulness, coconut has been extensively distributed around the world. For this study, the coconut data were selected because it contained larger numbers of accessions of each of the diverse origins (a typical genebank germplasm collection).

Two additional data sets, on potato (Solanum species) and common bean (Phaseolus vulgaris), are described, analyzed and discussed in the Electronic Supplementary Material, Appendix 2. The potato data (233 accessions; 50 SSR markers) contained several unique accessions which act like outliers. All accessions used in this study are diploid. Unlike coconut and potato, common bean is a predominantly selfing species. The common bean data (603 accessions; 36 SSR markers) consist of accessions of two distinct types, Mesoamerican and Andean.

Simulated data

Marker data were simulated by SimuPOP (Peng and Kimmel 2005), a forward-time population genetic simulation environment. We used a finite island (Wright 1931) and a stepping stone (Kimura 1953) migration models. In each generation, random mating (with 2% selfing) was assumed to produce a diploid genotype for 30 unlinked loci for each individual, which had a certain probability of migrating to another subpopulation. We simulated 1,000 individuals in five subpopulations of varying subpopulation differentiation levels (differentiation between subpopulations was determined by migration rates and number of generations). The migration rates used in this study were 0, 1 and 2 migrants per subpopulation. At each of the 30 loci, the average allele frequency of coconut data was used as the starting allele frequency for the simulation. Within each parameter set, all the loci had the same mutation dynamics, which occurs according to a K-allele model (KAM). Under the KAM model, there are K possible allelic states, and any allele has a constant probability of mutating into any of the other K − 1 allelic states (Crow and Kimura 1970). A mutation rate of 2 × 10−5 with 50 possible allelic states was used in the simulation. The mutation parameters were set to mimic highly polymorphic markers such as SSR markers. However, in this case, the role of mutation is very limited since we used a limited number of generations in the simulation. In addition to using alleles from real data as starting frequencies for simulation, the numbers of generations for the simulations were restricted (from 5 to 200 generations) to mimic the situation of agricultural crops in the genebanks. Full information about the whole set of simulations is given in the Electronic Supplementary Material, Appendix 3.


In this paper, we used genetic distances (D) based on the proportion of shared alleles (PSA) where D = 1 − PSA, and

[Formula ID: Equa]
[\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\text{PSA}} = \sum\limits_{m = 1}^{M} {\sum\limits_{a = 1}^{{A_{m} }} {\min (f_{1ma},\,f_{2ma} ),} } $$\end{document}]
where in diploids [\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ f_{1ma} $$\end{document}] and [\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ f_{2ma} $$\end{document}] are the frequencies of allele a (a = 1, 2…Am; Am ≤ 4) for molecular marker m (m = 1, 2…M) in individuals 1 and 2, respectively, and [\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ f_{1ma} ,\,f_{2ma} \, = \,0,\,{\tfrac{1}{2}}\;{\text{or}}\, 1 $$\end{document}] . For more information on the proportion of shared alleles as similarity measure, see Bowcock et al. (1994), Chakraborty and Jin (1994) and Chang et al. (2009). The effect of distance measures on the grouping of accessions will be considered in another paper.

Clustering techniques
Hierarchical clustering techniques

From the literature on determination of the structure of plant germplasm collections, the most popular clustering methods are Unweighted Pair Group Method with Arithmetic Mean (UPGMA; (Sokal and Michener 1958)) and Ward’s method (Ward 1963). For the purpose of this study, only these two hierarchical clustering methods (hereafter referred to as UPGMA and Ward) will be discussed; both methods are well described in Kaufman and Rousseeuw (1990) and Johnson and Wichern (2002).

The differences between hierarchical clustering algorithms lie mainly in how the distances between pairs of objects or clusters are defined. In UPGMA, the distance between two clusters is defined as the unweighted mean of the distances between all pairs of accessions, one from each cluster. At each step, the two nearest clusters are joined. Ward employs analysis of variance (ANOVA) approach for calculating the distances between clusters. For each pair of clusters, the sum of squared deviations between each accession and the centre of the new cluster (error sum of squares) is calculated and the pair of clusters that yields the lowest error sum of squares is merged. In other words, at each step, in the clustering process, the effect of the union of every possible pair of clusters is considered, and the two clusters that produce the smallest increase in error sum of squares are joined. It should be noted that both UPGMA and Ward use Lance and William’s recurrence formula (Lance and Williams 1967) to operate directly on any distance matrix.

Model-based clustering techniques

The most popular model-based clustering technique is STRUCTURE (Pritchard et al. 2000a; Falush et al. 2003, 2007). STRUCTURE assumes a model with K populations; K may be unknown. It is assumed that within populations loci are in linkage equilibrium and Hardy–Weinberg equilibrium; STRUCTURE assigns individuals to populations to achieve this.

Evaluation criteria
Cophenetic Correlation Coefficient

The Cophenetic Correlation Coefficient (CPCC) is a product–moment correlation coefficient between cophenetic distances and distance matrix (input distance matrix) obtained from the data. The cophenetic distance between two accessions is defined as the distance at which two accessions are first clustered together in a dendrogram going from the bottom to the top. The CPCC, therefore, measures the relationships between the original pair wise distance between accessions (true distances) and pair wise distances between accessions predicted using the dendogram. Farris (1969) proved algebraically that among the traditional hierarchical clustering algorithms, UPGMA always produces the highest CPCC; earlier this was shown empirically by Sokal and Rohlf (1962).

Agglomerative coefficient

The Agglomerative Coefficient (AC) described by Kaufman and Rousseeuw (1990), is one of the methods proposed for quantifying hierarchical structure. The agglomerative coefficient is defined as

[Formula ID: Equb]
[\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\text{AC}} = 1 - {\frac{{d_{\text{average}} }}{{d_{\text{final}} }}}, $$\end{document}]
where daverage denotes the average distance at which each object merges with one or more objects for the first time, dfinal is the distance at which all the objects are merged into one cluster. It is clear from the formula that AC is highly affected by the distance (dfinal) at the final merger of the algorithm, i.e. as long as the value of dfinal is high relative to daverage, AC will always be close to one. The use of AC in plant diversity studies is quite limited but it has been used in other fields.

Determining the optimal number of clusters

Milligan and Cooper (1985) evaluated 30 rules for determining the optimal number of clusters. For illustration, one of the best six methods according to Milligan and Cooper (1985), the point biserial correlation, will be compared with the average silhouette coefficient proposed by Rousseeuw (1987). The two criteria were chosen because of their easy interpretation. The Point-Biserial Correlation (PBC) (Milligan 1981) is defined as the correlation between corresponding entries in the original distance matrix and a matrix consisting of zeros and ones indicating whether two objects are in the same cluster or not. This is an easy measure of the resemblance between the distance matrix and the resulting tree.

The Average Silhouette Coefficient (ASC) (Rousseeuw 1987) combines the concepts of cluster cohesion and separation; it relates distances between objects within the same cluster with distances between objects in different clusters. The silhouette coefficient (s) of an object is calculated as: [\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ s = (b - a)/\max (b,a),$$\end{document}] where a is the average distance of an object to all the objects in the same cluster and b is the minimum average distance between an object to objects in any of the other clusters.

The average silhouette coefficient for each cluster is calculated by averaging the silhouette coefficients of all the objects in the cluster. An overall measure of the quality of the clustering is obtained by computing the average silhouette coefficient over of all objects in the data. Other criteria for determining the optimum number of clusters are discussed in the supplementary material (Appendices 1, 2 and 3). In applying the criteria for determining optimum numbers of clusters, each dendrogram was cut into a specified number of clusters K (= 2, 3,…, 10) and values of the criteria for determining the number of clusters were calculated and plotted against K. For both PBC and ASC, the number of clusters (K) at which the plot of K versus the value of the criterion is maximum is considered as the optimum number of cluster for a given data set. It should be noted that all these criteria do not directly test for the presence of one cluster (K = 1).

Data analysis
Real data

After performing cluster analysis using UPGMA and Ward, CPCC and AC were calculated. The results from hierarchical cluster analysis were also compared with the results from STRUCTURE with regard to cluster composition and appropriate number of clusters.

STRUCTURE was run under the assumption of an admixture model with independent allele frequency model. No prior information was used. Calculations were carried with the number of subgroups K ranging from 2 to 10 with 3 independent repeats for each K and with 100,000 iterations of which the first 30,000 were used as burn-in.

Simulated data

In this paper the analysis of variance (ANOVA) approach (algorithm described by (Yang 1998)) and implemented in Hierfstat package in R by (Goudet 2005) was used to calculate subgroup differentiation (FST). To explore the relationships between FST and clustering evaluation criteria, datasets from different simulations were pooled together and then grouped based on the strength of subgroup differentiation into groups (each containing 100 datasets) with similar realized values of FST. Hierarchical cluster analysis was performed using Agglomerative Nesting (Agnes) procedure (Kaufman and Rousseeuw 1990) of the package Cluster of R.

The ability of UPGMA and Ward to recover the subpopulations in the simulated data was evaluated using overall cluster purity (Zhao and Karypis 2004). Overall purity was calculated as follows. Let [\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ p_{ij} = {\frac{{m_{ij} }}{{m_{i} }}} $$\end{document}] be the probability that a member of cluster i (i = 1, 2,…, I) belongs in reality to subpopulation j (j = 1, 2,…., J), mij is the number of members of subpopulationj allocated to cluster i and mi is the number of members of cluster i. The purity for each cluster (pi) is defined as the maximum probability of correct assignment of cluster i to one of the subpopulations, i.e. [\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ p_{i} \, = \,\mathop {\max }\limits_{j} \left( {p_{ij} } \right), $$\end{document}] and over all purity is defined as [\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ \sum\nolimits_{i = 1}^{k} {{\frac{{m_{i} }}{m}}\,p_{i} } $$\end{document}] .


Both dendrograms (UPGMA and Ward) resulted into two major clusters (Fig. 1), but clear differences were evident within these clusters. For example, any attempt to produce more than two clusters from each dendogram result into groups of very different structures with UPGMA resulting into highly unbalanced clusters in terms of sizes, (many of the clusters contained one or two accessions) compared to Ward. UPGMA (CPCC = 0.82) preserved the original distance matrix better than Ward (CPCC = 0.74). The two dendrograms had very different values of AC (Ward: 0.97; UPGMA: 0.58).

When applied to the Ward dendogram, both criteria for determining the optimum number of clusters (PBC and ASC) identified two as the optimal number of clusters for the coconut data (Fig. 2a, b). However, when applied to UPGMA dendrogram, PBC was not able to identify an optimum number of clusters, i.e. changing the number clusters from 2 to 10 produced very similar correlations (Fig. 2a). STRUCTURE (method by Evanno et al. 2005) also showed two as the optimum number of clusters (see Electronic Supplementary Material, Appendix 1).

Composition of clusters

The two major groups identified by both UPGMA and Ward contained accessions associated with the Pacific Ocean versus accessions associated with the Atlantic and Indian oceans. These two major groups were also observed when clustering was done using STRUCTURE (K = 2) (see Fig. 3). While further subdivision obtained from Ward’s dendrogram led to formation of clusters/groups which coincided with groups based on passport data (region of origin), this was not possible with UPGMA. In terms of grouping of accessions, the results from STRUCTURE are quite similar to those of Ward. In fact, for the number of groups (K) equal two, three or four, the groups formed by STRUCTURE were almost identical to those produced by cutting Ward’s tree to produce the same number of clusters (Fig. 3). For example, by specifying (K = 3), both STRUCTURE and Ward resulted into the following three groups: (1) accessions associated with the Atlantic and Indian oceans, (2) accessions from Central America (Panama), and (3) other accessions associated with the Pacific ocean. Similarity between groups formed by STRUCTURE and Ward was also observed for the potato data (see Electronic Supplementary Material).

Simulated data

The two migration models (Island and Stepping stone) yielded identical results so only the results of the Island model will be shown. The simulated data sets varied greatly with respect to subpopulation differentiation with realized FST ranging from 0.010 to 0.431. In general, the values of CPCC increased with subgroup differentiation (expressed as FST); UPGMA produced a consistently higher CPCC than Ward (Fig. 4). The difference in CPCC between UPGMA and Ward decreased with increasing subgroup differentiation. AC also increased with subpopulation differentiation for both UPGMA and Ward (Fig. 4). In this case, Ward showed a higher AC than UPGMA; Ward reached the maximum value of one with FST just over 0.1, i.e. the curve flattens off much quickly.

Identification of the optimum number of groups

Cutting of UPGMA trees resulted into highly unbalanced clusters (one or two clusters containing the majority of accessions with several other clusters with 1 or 2 accessions like in real data); only results for Ward is presented. The performance of the criteria for determining optimum number of clusters also depended on the amount of subgroup differentiation (Fig. 5). With relatively weak population differentiations (FST < 0.08), all methods performed quite poorly in identifying the correct number of groups. At low differentiation levels, most criteria for determining optimum number of clusters gave two as the appropriate number of clusters. We also noticed that for a number of data sets with weak subgroup differentiations the values of criteria for determining optimum number of clusters either kept rising or falling, or kept fluctuating to an extent which did not allow determination of an optimum number of clusters. At higher levels of population differentiation (FST > 0.2), the performances became similar.

From Fig. 6, it can be observed that Ward performed well in recovering the subpopulations. Except for relatively weak subpopulation differentiation (FST < 0.05), by cutting the trees into five groups, Ward produced clusters of which over 90% of the members were from one subpopulation. The poor performance of UPGMA methods in recovering the original subpopulations, even with high subgroup differentiation, is because UPGMA produced highly unbalanced clusters.


This paper shows that, if used with care, traditional cluster analysis provides a simple and powerful tool for determining the genetic structure of germplasm collections using molecular marker data. Traditional cluster analysis is available in many standard statistical packages and does not require special purpose software like STRUCTURE. In addition, when clustering individual accessions, the performance of hierarchical clustering techniques depends only on subgroup differentiation, not on the migration models used to simulate the data, provided that discrete subgroups are present.

Based on our results, CPCC can be used as an indicator for the strength of subgroup differentiation. A high CPCC ([\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\text{CPCC}} \ge 0.8 $$\end{document}] ) with both UPGMA and Ward is an indication of the presence of reliable population structure in the data. Although it has been shown theoretically and empirically that UPGMA always produce dendograms with a higher CPCC than other clustering algorithms (Farris 1969), our simulation results showed that, if distinct groups exist, the difference in CPCC between UPGMA and Ward is expected to be small and this difference gets smaller as subgroup differentiation increases. The differences in CPCC between Ward and UPGMA in real data also appear to reflect the degree of distinction between the groups in the data. For example, the common bean data with two distinct groups (Mesoamerican versus Andean) had a much smaller difference (0.07) in CPCC between Ward and UPGMA compared to potato data (0.17) with many unique accessions. For taxonomic applications (see, Rohlf 1992), it is recommended that CPCC should be very high ([\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ {\text{CPCC}} \ge 0.9 $$\end{document}] ) for a dendrogram to be useful. Our results indicate that when clustering large numbers of accessions the CPCC obtained using Ward is not likely to be greater than 0.85 unless the subpopulations are highly differentiated (FST > 0.25). This is due to the fact that Ward tends to form balanced clusters which may include outlying accessions (Jobson 1992); UPGMA tends to form unbalanced clusters assigning outlying accessions to separate clusters.

The usefulness of AC as a method for quantifying the amount of hierarchical structure in the data appears to be quite limited especially when applied to Ward. For Ward, the distance at which all clusters finally join is often much larger than the distance at which objects are joined in a cluster for the first time. All the three real data sets show very similar AC (0.97, 0.94, and 0.90 for coconut, potato and common beans respectively) with Ward but marked differences observed for UPGMA (0.58, 0.77, and 0.67 for coconut, potato and common beans respectively). Several studies in the literature have also obtained high AC values (≥0.95) with Ward and have used these results to either justify the use of Ward clustering algorithms or to conclude that there is substantial amount of structure in the data (Fan et al, 2004, Cushman et al 2010, Negro et al 2010). Based on our results which showed that Ward can result in a high AC even for a homogenous population, these conclusions can be misleading. We suggest that further modification should be made before AC can be used in conjunction with Ward. It should be noted that AC was initially proposed to describe the strength of the hierarchical structure as obtained by UPGMA (Kaufman and Rousseeuw 1990). The rather low values of AC ([\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$ < 0.75 $$\end{document}] ) obtained from UPGMA dendrograms even for highly differentiated subgroups could be attributed to a chaining effect (tendency of a clustering algorithm to pick out long string-like clusters (see, Johnson and Wichern (2002)) caused by outliers. UPGMA dendrograms with high CPCC but a very low AC value (<0.6) often indicate the presence of many unique accessions or small groups of accessions (together with two or more large groups). The use of CPCC and AC (only with UPGMA) together can roughly tell us the degree of fit, the presence and strength of subgroup differentiation.

The poor performance of criteria for determining the number of clusters may be explained by the presence of weak, and often subgroup differentiation found in many germplasm collections. Accessions in genebanks are no random samples but selections based on factors such as geographical distribution/location, accessibility or even perceived uniqueness. The inability of criteria to determine the optimum number of groups or clusters in a dataset is not limited to hierarchical cluster analysis techniques. Falush et al. (2003, 2007) stated that the method for determining the number of populations in STRUCTURE most often fails in real-world data sets due to various reasons (e.g. isolation by distance or inbreeding). The tendency for these criteria to show two as an optimal number clusters for the real data could be attributed to the presence of dominant groups (Evanno et al 2005; Yan and Ye 2007). In the cases where dominant groups overshadow minor subdivision, sequential detection of structure as described by Yan and Ye (2007) could offer solutions. Based on the poor performance of criteria for determining optimum number of clusters with UPGMA, it is clear that when the cluster sizes are highly unequal, as will often be the case in germplasm collections, applying criteria for determining optimum number of clusters makes little sense. In the case of association studies, one way of getting around the problem of identifying optimal number of clusters could be to use the relatedness based on cophenetic distances (predicted pair wise distances between accessions) directly to correct for population structure just like kinship or other relatedness information is used (K matrix). Studies have shown that correcting for population structure using the K matrix may be sufficient (see Zhao et al. 2007; Stich et al 2008; Astle and Balding 2009). Our analysis show a high correlation between cophenetic distances and dissimilarity between accessions based on the first two axes of principal coordinate analysis (see Elecronic supplementary material, Appendix 2). However, further study is required to assess the usefulness of cophenetic distance in association mapping studies.

Our simulation results showed that Ward was very successful in recovering the original subgroups in the data if they were present and distinctly separated. In addition, because the nature of groups formed by Ward, the dendrograms can be evaluated using standard criteria such as those for determining the number of clusters. However, in the presence of many unique or intermediate accessions the groups formed by Ward will not be homogeneous. In this case, the differences in CPCC between UPGMA and Ward can be quite helpful in deciding which method to select. In situations in which both UPGMA and Ward have high CPCC (≥0.8), Ward will have many advantages over UPGMA. However, in a situation in which only UPGMA has CPCC ≥ 0.8 and there is a big difference (>0.1) in the values of CPCC between UPGMA and Ward, it will be preferable to use the groups formed by UPGMA.

In conclusion, traditional cluster analysis (UPGMA and Ward) provides an easy and effective way for determining structure in germplasm collections. In addition to being simple to apply (using standard statistical software) and simple to interpret, it is possible to determine the presence and strength of subgroup differentiation as well as the presence and influence of unique accessions in the collection. It provides a good alternative for STRUCTURE or PCA in association analyses. It can be combined easily with mixed model facilities that are available in standard statistical packages. Although our simulations were based on random mating, similarity of results between the real data from both out-crossing (coconut and potato) and selfing species (common bean) clearly indicate that traditional cluster analysis can be applied in both mating systems.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Click here for additional data file (122_2011_1576_MOESM1_ESM.doc)

We thank two anonymous reviewers for their critical review of the manuscript. This work was supported by the Generation Challenge Programme under GCP subprogram I—Crop Genetic Diversity. We would like to thank various people who participated in generating the data used in this study especially Carmen de Vicente, Patricia Lebrun-Turquay (PI-coconut), Matthew Blair (PI-Common bean) and Marc Ghislain (PI-Potato).

Open Access This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Astle W,Balding DJ. Population structure and cryptic relatedness in genetic association studiesStat SciYear: 200924445147110.1214/09-STS307
Barro-Kondombo C,Sagnard F,Chantereau J,vom Brocke K,Durand P,Goze′ E,Zong JD. Genetic structure among sorghum landraces as revealed by morphological variation and microsatellite markers in three agroclimatic regions of Burkina FasoTheor Appl GenetYear: 20101201511152310.1007/s00122-010-1272-220180097
Bowcock AM,Ruiz-Linares A,Tomfohrde J,Minch E,Kidd JR,Cavalli-Sforza LL. High-resolution of human evolutionary trees with polymorphic microsatellitesNatureYear: 199436845545710.1038/368455a07510853
D’hoop BB,Paulo MJ,Kowitwanich K,Senger M,Visser RGF,Eck HJ,Eeuwijk FA. Population structure and linkage disequilibrium unravelled in tetraploid potatoTheor Appl GenetYear: 20101211151117010.1007/s00122-010-1379-520563789
Brown AHD. Core collections—a practical approach to genetic-resources managementGenomeYear: 19893181882410.1139/g89-144
Brown AHD. Hodgkin T,Brown AHD,Hintum TJL,Morales EAVThe core collection at the crossroadsCore collections of plant genetic resourcesYear: 1995ChichesterWiley319
Chakraborty R,Jin L. Determination of relatedness between individuals using DNA-fingerprinting (VOL 65, PG 875, 1993)Human BiolYear: 199466363
Chang WH,Chu HP,Jiang YN,Li SH,Wang Y,Chen CH,Chen KJ,Lin CY,Ju YT. Genetic variation and phylogenetics of Lanyu and exotic pig breeds in Taiwan analyzed by nineteen microsatellite markersJ Anim SciYear: 2009871810.2527/jas.2007-056218708610
Cushman SA,McKelvey KS,Noon BR,McGarigal K. Use of abundance of one species as a surrogate for abundance of othersConserv BiolYear: 20102483084010.1111/j.1523-1739.2009.01396.x20067487
Crossa J,Franco J. Statistical methods for classifying genotypesEuphyticaYear: 2004137193710.1023/B:EUPH.0000040500.86428.e8
Crow JF,Kimura M. An introduction to population genetics theoryYear: 1970New YorkHarper and Row
Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 3:research0036–research0036.21; doi:10.1186/gb-2002-3-7-research0036
Evanno G,Regnaut S,Goudet J. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation studyMol EcolYear: 2005142611262010.1111/j.1365-294X.2005.02553.x15969739
Falush D,Stephens M,Pritchard JK. Inference of population structure using multilocus genotype data: linked loci and correlated allele frequenciesGeneticsYear: 20031641567158712930761
Falush D,Stephens M,Pritchard JK. Inference of population structure using multilocus genotype data: dominant markers and null allelesMol Ecol NotesYear: 2007757457810.1111/j.1471-8286.2007.01758.x18784791
Fan JB,Yeakley JM,Bibikova M,Chudin E,Wickham E,Chen J,Doucet D,Rigault P,Zhang B,Shen R,McBride C,Li HR,Fu XD,Oliphant A,Barker DL,Chee MS. A versatile assay for high-throughput gene expression profiling on universal array matricesGenome ResYear: 20041487888510.1101/gr.216750415123585
Farris JS. On cophenetic correlation coefficientsSyst ZoolYear: 196918327928510.2307/2412324
Flint-Garcia SA,Thornsberry JM,Buckler ES. Structure of linkage disequilibrium in plantsAnnu Rev Plant BiolYear: 20035435737410.1146/annurev.arplant.54.031902.13490714502995
Folkertsma RT,Rattunde FH,Chandra S,Raju GS,Hash CT. The pattern of genetic diversity of guinea-race Sorghum bicolor (L.) Moench landraces as revealed with SSR markersTheor Appl GenetYear: 200511139940910.1007/s00122-005-1949-015965652
Franco J,Crossa J,Villaseñor J,Taba S,Eberhart SA. Classifying Mexican maize accessions using hierarchical and density search methodsCrop SciYear: 19973797298010.2135/cropsci1997.0011183X003700030045x
Franco J,Crossa J,Villaseñor J,Taba S,Eberhart SA. A sampling strategy for conserving genetic diversity when forming core subsetsCrop SciYear: 2005451035104410.2135/cropsci2004.0292
Franco J,Crossa J,Warburton ML,Taba S,Eberhart SA. Sampling strategies for conserving maize diversity when forming core subsets using genetic markersCrop SciYear: 20064685486410.2135/cropsci2005.07-0201
Frankel OH,et al. Arber WK,et al.Genetic perspectives of germplasm conservationGenetic manipulation: impact on man and societyYear: 1984CambridgeCambridge University Press161170
Goudet J. HIERFSTAT, a package for R to compute and test hierarchical F-statisticsMol EcolYear: 2005518418610.1111/j.1471-8286.2004.00828.x
Gouesnard B,Bataillon TM,Decoux G,Rozale C,Schoen DJ,David JL. MSTRAT: an algorithm for building germ plasm core collections by maximizing allelic or phenotypic richnessJ HeredYear: 200192939410.1093/jhered/92.1.9311336240
Gower JC. Classification problemsBull Int Stat InstYear: 197345471477
Jansen J,Hintum TJL. Genetic distance sampling: a novel sampling method for obtaining core collections using genetic distances with an application to cultivated lettuceTheor Appl GenetYear: 200711442142810.1007/s00122-006-0433-917180377
Jobson JD. Applied multivariate data analysis, vol 2. Categorical and multivariate methodsYear: 1992New YorkSpringer
Johnson AR,Wichern DW. Applied multivariate statistical analysis, 5th ednYear: 2002New JerseyPrentice Hall
Kaufman L,Rousseeuw PJ. Finding groups in data. An introduction to cluster analysisYear: 1990New YorkWiley
Kim KW,Chung HK,Cho GT,Ma KH,Chandrabalan D,Gwag JG,Kim TS,Cho EG,Park YJ. PowerCore: a program applying the advanced M strategy with a heuristic search for establishing core setsBioinformaticsYear: 2007232155216210.1093/bioinformatics/btm31317586551
Kimura M. “Stepping Stone” model of populationAnn Rept Nat Inst Genet JpnYear: 195336263
Kruskal JB. Nonmetric multidimensional-scaling—a numerical methodPsychometrikaYear: 19642911512910.1007/BF02289694
Krzanowski WJ,Lai YT. A criterion for determining the number of groups in a data set using sum-of-squares clusteringBiometricsYear: 198844233410.2307/2531893
Lance GN,Williams WT. A general theory of classificatory sorting strategies I. Hierarchical systemComput JYear: 19679373380
Lee C,Abdool A,Huang CH. PCA-based population structure inference with generic clustering algorithmsBMC BioinformYear: 200910Suppl 1S7310.1186/1471-2105-10-S1-S73
Milligan GW. A Monte Carlo study of thirty internal criterion measures for cluster AnalysisPsychometrikaYear: 19814618719910.1007/BF02293899
Milligan GW,Cooper MC. An examination of procedures for determining the number of clusters in a data setPsychometrikaYear: 19855015917910.1007/BF02294245
Mohammadi SA. Analysis of genetic diversity in crop plants—salient statistical tools and considerationsCrop SciYear: 2003431235124810.2135/cropsci2003.1235
Negro SS,Caudron AK,Dubois M,Delahaut P,Gemmell NJ. Correlation between male social status, testosterone levels, and parasitism in a dimorphic polygynous mammalPLoS ONEYear: 201059e1250710.1371/journal.pone.001250720856933
Patterson N,Price AL,Reich D. Population structure and eigen analysisPlos GenetYear: 20062e19010.1371/journal.pgen.002019017194218
Peeters JP,Martinelli JA. Hierarchical cluster analysis as a tool to manage variation in germplasm collectionsTheor Appl GenetYear: 198978424810.1007/BF00299751
Peng B,Kimmel M. SimuPOP: a forward-time population genetics simulation environmentBioinformaticsYear: 2005213686368710.1093/bioinformatics/bti58416020469
Perumal R,Krishnaramanujam R,Menz MA,Katile S,Dahlberg J,Magill CW,Rooney WL. Genetic diversity among sorghum races and working groups based on AFLPs and SSRsCrop SciYear: 2007471375138310.2135/cropsci2006.08.0532
Price A,Patterson N,Plenge R,Weinblatt M,Shadick N,Reich D. Principal components analysis corrects for stratification in genome-wide association studiesNat GenetYear: 20063890490910.1038/ng184716862161
Pritchard JK,Stephens M,Donnelly P. Inference of population structure using multilocus genotype dataGeneticsYear: 200015594595910835412
Pritchard JK,Stephens M,Rosenberg NA,Donnelly P. Association mapping in structured populationsAm J Hum GenetYear: 20006717018110.1086/30295910827107
Reeves PA,Richards CM. Accurate inference of subtle population structure (and other genetic discontinuities) using principal coordinatesPLoS ONEYear: 20094e426910.1371/journal.pone.000426919172174
Remington DL,Thornsberry JM,Matsuoka Y,Wilson LM,Whitt SR,Doebley J,Kresovich S,Goodman MM,Buckler ES. Structure of linkage disequilibrium and phenotypic associations in the maize genomeProc Natl Acad Sci USAYear: 200198114791148410.1073/pnas.20139439811562485
Roger KB. Mixture model tests for cluster analysis: accuracy of four agglomerative hierarchical methodsPsychol BullYear: 19768337738810.1037/0033-2909.83.3.377
Rohlf FJ. NTSYS-pc (Numerical Taxonomy and Multivariate Analysis System). Version 1.70Year: 1992SetauketExeter
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysisJ Comput Appl MathYear: 198720536510.1016/0377-0427(87)90125-7
Shriner D,Vaughan LK,Padilla MA,Tiwari HK. Problems with genome-wide association studiesScienceYear: 20073161840184210.1126/science.316.5833.1840c17600199
Simko I,Haynes KG,Ewing EE,Costanzo S,Christ BJ,Jones RW. Mapping genes for resistance to Verticillium albo-atrum in tetraploid and diploid potato populations using haplotype association tests and genetic linkage analysisMol Genet GenomYear: 200427152253110.1007/s00438-004-1010-z
Sokal RR,Michener C. A statistical method for evaluating systematic relationshipsUniv Kansas Sci BullYear: 19583814091438
Sokal RR,Rohlf FJ. The comparison of dendrograms by objective methodsTaxonYear: 196211334010.2307/1217208
Stich B,Möhring J,Hans-Peter Piepho,Heckenberger M,Buckler ES,Melchinger AE. Comparison of mixed-model approaches for association mappingGeneticsYear: 20081781745175410.1534/genetics.107.07970718245847
Sugar CA,James GM. Finding the number of clusters in a dataset: an information-theoretic approachJ Am Stat AssocYear: 20039875076310.1198/016214503000000666
Thachuk C,Crossa J,Franco J,Dreisigacker S,Warburton M,Davenport GF. Core Hunter: an algorithm for sampling genetic resources based on multiple genetic measuresBMC BioinformYear: 20091024310.1186/1471-2105-10-243
Thornsberry JM,Goodman MM,Doebley J,Kresovich S,Nielsen D,Buckler ES. Dwarf8 polymorphisms associate with variation in flowering timeNat GenetYear: 20012828628910.1038/9013511431702
Tibshirani R,Walther G,Hastie T. Estimating the number of clusters in a data set via the gap statisticJ R Stat Soc BYear: 20016341142310.1111/1467-9868.00293
van Hintum TJL, Brown AHD, Spillane C, Hodgkin T (2000) Core collections of plant genetic resources. IPGRI Technical Bulletin No.3. International Plant Genetic Resources Institute, Rome, Italy
Wang WYS,Barrat BJ,Clayton GG,Todd JA. Genome-wide association studies: theoretical and practical concernsNat Rev GenetYear: 2005610911810.1038/nrg152215716907
Ward JH. Hierarchical groupings to optimize an objective functionJ Am Stat AssocYear: 19635823624410.2307/2282967
Wilson LM,Whitt SR,Ibanez AM,Rocheford TR,Goodman MM,Buckler ESIV. Dissection of maize kernel composition and starch production by candidate gene associationPlant CellYear: 2004162719273310.1105/tpc.104.02570015377761
Wright S. Evolution in Mendelian populationsGeneticsYear: 1931169715917246615
Yan M,Ye K. Determining the number of clusters using the weighted gap statisticBiometricsYear: 2007631031103710.1111/j.1541-0420.2007.00784.x17425640
Yang R. Estimating hierarchical F-statisticsEvolutionYear: 19985295095610.2307/2411227
Yu J,Pressoir G,Briggs WH,Vroh BI I,Yamasaki M,Doebley JF,McMullen MD,Gaut BS,Nielsen DM,Holland JB,Kresovich S,Buckler ES. A unified mixed-model method for association mapping that accounts for multiple levels of relatednessNat GenetYear: 20063820320810.1038/ng170216380716
Zhao K,Aranzana MJ,Kim S,Lister C,Shindo C,Tang C,Toomajian C,Zheng H,Dean C,Marjoram P,Nordborg M. An Arabidopsis example of association mapping in structured samplesPLoS GenetYear: 20073e410.1371/journal.pgen.003000417238287
Zhao Y,Karypis G. Empirical and theoretical comparisons of selected criterion functions for document clusteringMach LearnYear: 20045531133110.1023/B:MACH.0000027785.44527.d6
Zhu C,Gore M,Buckler ES,Yu J. Status and prospects of association mapping in plantsPlant GenomYear: 2008152010.3835/plantgenome2008.02.0089


[Figure ID: Fig1]
Fig. 1 

Dendrograms for the coconut data, a Ward, b UPGMA. Dendrograms produced by Ward and UPGMA are clearly different with respect to branching. Ward dendrogram had Cophenetic Correlation Coefficient (CPCC) of 0.74 and Agglomerative Coefficient (AC) of 0.97 while UPGMA had CPCC of 0.82 and AC of 0.58. The two major clusters in the two dendrograms had similar compositions (Accessions associated with Indian and Atlantic Oceans versus those associated with the Pacific Ocean)

[Figure ID: Fig2]
Fig. 2 

a Plot of the Point-Biserial Correlation (PBC) versus the number of groups for the UPGMA and Ward dendograms for the coconut data. b Plot of the Average Silhouette Coefficient (ASC) versus the number of groups for the UPGMA and Ward dendograms for the coconut data. For both criteria, the number of groups (K) for which the criterion is maximum (or point where the plot flattens off) indicates the optimum number of clusters. Both criteria show two as the optimum number of clusters

[Figure ID: Fig3]
Fig. 3 

a Bar plots for individual coconut accessions generated by cutting the Ward dendrogram into a specified number of clusters/groups; the numbers of clusters from top to bottom were 2, 3, 4 and 5. The clusters are represented by different colors. Each column represents one accession. The labels below the bar plots indicate the regions of origin of the coconut accessions. b Bar plots for individual coconut accessions generated by STRUCTURE 2.2 using the admixture model with independent allele frequency model based on 30 SSR markers; the numbers of clusters from top to bottom were 2, 3, 4 and 5. The groups are represented by different colors. Each bar is partitioned into segments indicating its genetic composition, the longer the segment the more an accession resembles one of the groups. The labels below the bar plots indicate the regions of origin of accessions

[Figure ID: Fig4]
Fig. 4 

a Relationship between Cophenetic Correlation Coefficient (CPCC) and subgroup differentiation (FST) for the simulated data. b Relationship between Agglomerative Coefficient (AC) and subgroup differentiation (FST) for the simulated data. Each data point is the average of 100 datasets with similar subgroup differentiation

[Figure ID: Fig5]
Fig. 5 

Percentages of simulated data sets for which the Point Biserial Correlation (PBC) and the Average Silhouette Coefficient (AC) identified the correct number of clusters versus the subgroup differentiation (FST) (results from Ward only). Each point is based on 30 simulated data sets

[Figure ID: Fig6]
Fig. 6 

Plot showing the difference in ability of Ward and UPGMA to recover known subgroups in the data based on cluster purity. Each point is based on 100 datasets of similar FST values. Data sets with zero migration rates were excluded since we were mainly interested in low to medium subgroup differentiation

Article Categories:
  • Original Paper

Previous Document:  Controlling flow in microfluidic channels with a manually actuated pin valve.
Next Document:  Genetic analysis of the sugarcane (Saccharum spp.) cultivar 'LCP 85-384'. I. Linkage mapping using A...