So you want to develop a survey: practical recommendations for scale development.
Abstract: Scale development represents a salient tool in the repertoire of public health education practitioners and researchers. However, too few professionals have a clear understanding of the complexities and intricacies associated with the development of a survey. Consequently, the purpose of this article is to a) briefly describe inherent issues associated with scale development; and b) provide a step-by-step decision-making process to guide practitioners in developing their own scale. Discussion is devoted to the theoretical foundations of scale development, validity, reliability, and essential pretesting stages. The scale development steps outlined in this article provide investigators with the general process involved in planning, developing, and validating a new scale. By using these steps, investigators can reduce survey error (i.e., measurement error, nonresponse error) and its associated consequences, lending credibility to the data being collected and analyzed.
Subject: Public health (Surveys)
Authors: Barry, Adam E.
Chaney, Elizabeth H.
Stellefson, Michael L.
Chaney, J. Don
Pub Date: 03/22/2011
Publication: Name: American Journal of Health Studies Publisher: American Journal of Health Studies Audience: Professional Format: Magazine/Journal Subject: Health Copyright: COPYRIGHT 2011 American Journal of Health Studies ISSN: 1090-0500
Issue: Date: Spring, 2011 Source Volume: 26 Source Issue: 2
Full Text: Scale development is a salient tool in the repertoire of many investigators, and a valuable method for collecting and analyzing data. Specifically outlined in the Responsibilities and Competencies of Health Educators and Public Health Core Competencies, scale development is central to the fields of health education and public health. Despite its obvious relevance to public health researchers, however, too few professionals truly understand how to properly utilize this tool-of-the-trade. DeVellis (2003) incisively contends

"Researchers often 'throw together' or 'dredge up' items and assume they constitute a suitable scale. These researchers may give no thought to whether items share a common cause.. share a common consequence... or merely are examples of a shared superordinate category that does not imply either a common causal antecedent or consequence" (p11).

Due to the complexities associated with measuring and interpreting behavior, scale development represents a tedious, enveloping, scientifically rigorous endeavor. Thyer (1992) has the following recommendation for individuals who are enticed to develop novel scales: "Avoid this temptation like the plague! ... If you ignore this advice and prepare your own scale, your entire study's results may be called into question because the reader will have no evidence that your new measure is a reliable and valid one" (pp. 139-140). While such guidance may appear overly antagonistic, those who are unaware of the intricacies associated with scale development should take heed. However, since numerous research situations will require the development of a novel scale/survey to appropriately measure intended variables, this manuscript seeks to sharpen the scale development skills of public health education researchers/practitioners. Specifically, our aim is to briefly describe inherent issues associated with scale development, and to provide a step-by-step decision-making process to guide practitioners in developing their own scale.

Since comprehensive scale development resources already exist, such as the Standards for Educational and Psychological Testing (1999), we seek to condense the existing literature on scale development complexities into an easily understandable, applicable guide for behavioral scientists/practitioners. However, this article will not serve as an exhaustive summary covering all psychometric properties and analyses. Instead, we have chosen to provide a methodological map that will guide investigators/practitioners through the dense forests of scale development protocol and assist them in developing efficacious survey instruments.

It is a rarity to directly measure relationships between variables in the social sciences, since human behavior seldom has one cause. Consequently, the phenomenon of interest to a public health researcher/practitioner typically originates from theory (DeVellis, 2003). Health behavior theories not only provide a systematic framework for examining and understanding events (NCI, 2005), they also afford insight into why individuals do or do not perform certain behaviors (Glanz, Rimer, & Lewis, 2002). Despite the insight garnered by invoking theoretical constructs, investigators "may not only fail to exploit theory in developing a scale but also may reach erroneous conclusions about theory by misinterpreting what a scale measures" (DeVellis, 2003, p. 11).

Considering that instruments with a firm foundation in theory provide the most precise and efficient measurements (Clark & Watson, 1995), it is a matter of practical utility to employ theory when developing a scale. Thus, as public health continues to stress evidence-based practice, scales grounded in a theoretical framework become increasingly important to investigators. Consequently, before beginning to develop a scale, it is crucial to consider relevant social science theories (DeVellis, 2003) and begin theorizing, i.e., asking questions that focus on a behavior's causes, or why a phenomenon occurs the way it does (Goodson, 2010).


Before the development of the Standards (1999), validity was thought of as a "holy trinity" (Guion, 1980), inclusive of content validity, criterion-related validity (subsuming concurrent and predictive validity), and construct validity. Dubbed the "holy trinity" in reference to Christian theology (i.e., the Father, Son, and Holy Spirit), Guion (1980) contended researchers have a penchant for treating validity as "three different roads to psychometric salvation" (p. 386). Content validity measured the extent to which an instrument sampled items from the desired content areas (Windsor, Clark, Boyd, & Goodman, 2004); whereas criterion validity was applied "when one [wished] to infer from a test score an individual's most probable standing on some other variable called a criterion" (Standards, 1999, pp. 179-180). Construct validity described whether a scale measured the construct or dimension it was purported to measure.

Scholars of that time asserted that construct validity served as "the whole of validity theory" (Shepard, 1993, p. 416), subsuming both content and criterion validity (Goodwin, 2002; Guion, 1980; Loevinger, 1957) and providing a general organizational framework (Kane, 2001; Messick, 1989). However, the Standards (1999) sought to eliminate the "holy trinity" categorization of validity, asserting "validity is a unitary concept. It is the degree to which all of the accumulated evidence supports the intended interpretation of test scores for the intended purpose" (Standards, 1999, p. 11). Consequently, researchers should focus on accumulating validity evidence based on five distinct categories: 1) test content (content-related), 2) response processes (construct-related), 3) internal structure (construct-related), 4) relations to other variables (criterion- and construct-related), and 5) consequences of testing (Goodwin, 2002).

Establishing a theoretical framework for a scale, to guide development and testing of measures, is of further importance because validity is inextricably tied to theory (Cronbach & Meehl, 1955; Goodwin, 2002). Specifically, validity fundamentally consists of subjecting a clearly articulated interaction or interpretation (i.e. theory) to empirical testing (Kane, 2001). Thus, validity is the degree to which the underlying theory supports test score interpretations (i.e. how accurately scores represent the underlying constructs). Therefore, "validation logically begins with an explicit statement of the proposed interpretation of test scores", which refers to "the construct or concepts the test is intending to measure" (Standards, 1999, p. 9).

Quintessentially, validity refers to a scale's accuracy (Huck, 2004) or the "trustworthiness of the inferences that are made from the results ..." (JCSEE, 1994, p. 145), while reliability refers to consistency (Huck, 2004). Between these two psychometric mainstays, validity unmistakably takes precedence (Hogan & Agnello, 2004; Standards, 1999). It does not matter how consistent a scale is if that scale is inaccurate in its measurements. It is fundamental to point out, however, that validity and reliability are not characteristics of a scale (Cronbach & Meehl, 1955; Rowley, 1976; Thompson, 1994, 2003). Rather, validity and reliability are properties of the scores produced by that scale among a particular sample (Goodwin & Goodwin, 1999; Thompson, 2003). Since a scale can produce scores that are both reliable and unreliable (Rowley, 1976) depending upon the given protocol, subjects, and occasion (Eason, 1991), investigators should always refrain from claiming "the test is reliable [or valid]" (Thompson, 1999, p. 63). This is "sloppy speaking" and declares an obvious untruth (Thompson, 1992, p. 436). Consequently, a scale's validity and reliability must be examined and reported with each administration.


Before attempting to develop a novel scale, two formative criteria must be met. First, investigators must determine whether applicable scales already exist. As Clark and Watson (1995) contend, "If reasonably good measures of the target construct already exist, why create another?" (p. 311). Investigators should have concrete methodological, theoretical, and/or empirical rationales driving their need to develop a novel scale. Assuming all available scales have been deemed inapplicable, investigators must next establish that they have a firm grasp (based on published literature) of the characteristics they intend to examine. If the scientific literature cannot inform investigators on these characteristics, they should employ some manner of exploratory research (e.g., focus groups, interviews). Posavac and Carey (2003) illustrate this criterion, asking "If you wanted to learn about a group you knew nothing about, would you go there and live among them or would you send surveys and questionnaires?" (p. 247). Before a novel scale can be developed, there must be empirical evidence to substantiate and inform a) the questions to be developed, b) the interaction between underlying theoretical constructs, and c) the major content areas.



The critical first step in developing a scale is to outline the specific domain and context of the construct(s) (Churchill, 1979; Clark & Watson, 1995; Comrey, 1988; DeVellis, 2003). Here, an investigator is essentially determining the purpose of an instrument (Chaney et al., 2007), or "what you want to know" (O'Rourke & O'Rourke, 2001, p. 156). Moreover, specific objectives for the study, which will provide a reference point for the questions you develop, are being formulated (Aday, 1996). Basically, investigators are beginning to determine the scope of content associated with a construct (i.e., what a construct is and is not).

Investigators should have clearly defined ideas about what actually constitutes the construct's characteristics. Ideally, these notions should be based upon published literature (Churchill, 1979). For example, in developing a culturally sensitive instrument to assess quality of distance education courses, Chaney (2006) based the domain and context of the respective constructs on a review of 160 articles and 12 textbooks. It cannot be overstated how important it is to begin mulling over these theoretical issues prior to developing a scale, since the underlying theory is what will give meaning to the numbers measured.


After determining what you are truly seeking to measure, begin to "delineate the format of items, tasks, or questions; the response format or conditions for responding; and the type of scoring procedures" (Standards, 1999, p. 38). Investigators can either create new items or modify existing items. While doing so, it is important to consider your data analysis plan and the appropriate response scale to facilitate those analyses (O'Rourke & O'Rourke, 2001). Consequently, establish the level of data (e.g., ordinal, categorical, continuous) needed to make the analyses possible. For instance, if the content area to be examined cannot be fully captured by using a dichotomous scale, it is useless to provide respondents with only 'yes' or 'no' options. Comrey (1988) is quite critical of dichotomous response scales, asserting "multiple-choice formats are more reliable, give more stable results, and produce better scales" (p. 758). Furthermore, Comrey (1988) proposes a Likert-type response scale consisting of seven possible choices as ideal [e.g., "7 = always, 6 = very frequently, 5 = frequently, 4 = occasionally, 3 = rarely, 2 = very rarely, and 1 = never" (p. 758)]. See Table 1 for scalar concepts that can be employed in place of the aforementioned option.
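Because the response scale fixes the level of data available for later analyses, it can help to settle the coding scheme before items are fielded. The sketch below (labels taken from Comrey's seven-point example above; the dictionary and function names are our own illustration, not part of any published instrument) maps text responses onto ordinal codes:

```python
# Hypothetical coding of Comrey's (1988) 7-point frequency scale.
# Unrecognized answers map to None so they can be flagged, not guessed at.
LIKERT_7 = {
    "never": 1, "very rarely": 2, "rarely": 3, "occasionally": 4,
    "frequently": 5, "very frequently": 6, "always": 7,
}

def code_responses(responses):
    """Map raw text responses to ordinal codes (1-7); None marks unusable answers."""
    return [LIKERT_7.get(r.strip().lower()) for r in responses]

print(code_responses(["Always", "rarely", "Occasionally"]))  # [7, 3, 4]
```

Coding up front also makes the ordinal nature of the data explicit, which matters when choosing among the analyses discussed later in this article.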

One important consideration when developing response scales is respondents' ability to meaningfully discriminate between options (DeVellis, 2003). Distinguishing between options has been shown to be influenced by the specific wording used to separate ordered options (e.g., very few, few, some, many) and the physical placement (i.e., order and position) of the responses (DeVellis, 2003). Investigators should maintain spacing between response options that is consistent with measurement intent (Dillman, 2007). If an odd-numbered Likert-type scale is to be used, the middle value/label must be selected with great care (Clark & Watson, 1995). Specifically, if using a neutral/undecided option, investigators should realize how such a response may complicate future analyses and interpretation. As is apparent in the example cited above, instead of a neutral/undecided option, respondents were provided a logical, hierarchical set of responses. To avoid such problems entirely, employ an even number of response options. This will force respondents to commit to the direction of one extreme or the other (DeVellis, 2003), ensuring respondents "fall on one side of the fence or the other" (Clark & Watson, 1995, p. 313). Lastly, investigators should a) provide only mutually exclusive response categories (O'Rourke & O'Rourke, 2001); and b) include enough options to ensure sufficient distribution and variance across answers. Due to the lack of variance associated with three- or four-category response scales, Likert-type response scales including at least five response options are preferable (Comrey, 1988).

Investigators should also devise a logical flow or order to their scale (Dillman, 2007; O'Rourke & O'Rourke, 2001). Grouping items that address a common content area or construct keeps readers from being needlessly distracted by 'jumping' from one topic to another. Such abrupt changes can influence respondents' ability to provide a well-thought-out answer and will induce "top-of-the-head responses" (Dillman, 2007, p. 87). Once items are ordered in an efficient manner, employ the strategies outlined in Table 2 to further the logical flow of one's scale.


The importance of having quality items cannot be overstated. As Clark and Watson (1995) assert, "No existing data-analytic technique can remedy serious deficiencies in an item pool" (p. 311). The fundamental objectives when writing new items and/or modifying existing items are to build questions that all respondents will a) understand in a similar manner, b) be able to answer precisely, and c) be willing to answer (Dillman, 2007). During the formation of an initial pool of items, one should exhaust all probable content that could potentially comprise the respective construct (DeVellis, 2003; Loevinger, 1957). A simple rule of thumb is to be as broad and comprehensive as possible; include more items than you believe necessary. Since further psychometric analyses will identify items that fail to capture the essence of the construct, excess items can be removed without losing predictive/explanatory power. While Comrey (1988) contends it is rare that more than twenty multiple-choice (i.e., Likert-scale) items are necessary, it is impossible to predict how many items will constitute a final scale (DeVellis, 2003). However, the chances that a scale accurately captures a particular content area are reduced if only one or two items are utilized (Churchill, 1979; Clark & Watson, 1995; DeVellis, 2003). Jacoby (1978) succinctly summarized this point, asking "How comfortable would we feel having our intelligence assessed on the basis of our responses to a single question?" (p. 93). Some have even advocated that the percentage of items dedicated to each content area be proportional to the relative importance of that content area within the construct (Loevinger, 1957).

Typical items consist of two parts: an item stem and a response scale (Comrey, 1988; DeVellis, 2003). Table 3 provides basic recommendations to follow when writing item stems. Table 4 outlines questions, to be asked of every item developed, which will help investigators diagnose inherent problems.

Finally, the authors suggest instrument developers consult Asking Questions: The Definitive Guide to Questionnaire Design--For Marketing Research, Political Polls, and Social and Health Questionnaires (Bradburn, Sudman & Wansink, 2004) and Mail and Internet Surveys: The Tailored Design Method (Dillman, 2007) when constructing items from scratch. Since "precise wording of questions plays a vital role in determining the answers given by respondents" (Bradburn, Sudman & Wansink, 2004, p. 3), developers should take great care in reducing measurement error caused by poorly worded items.


Adapted from recommendations by DeVellis (2003) and Dillman (2007), the following stages represent the checks and balances enveloped within the fourth phase of scale development.

Stage 1--Assessment of Evidence Based on Test Content


Have people who are knowledgeable of the content area review the initial item pool to determine each item's clarity and its relevance to the construct/content area being measured (DeVellis, 2003; Dillman, 2007). The underlying rationale for eliciting expert feedback concerns "maximizing item appropriateness" and examining the extent to which proposed items cover all potential dimensions (DeVellis, 2003, p. 50). Experts can also provide insight into alternative ways of measuring the phenomenon that were not operationalized. In this stage, investigators are seeking to finalize the substantive content of the scale (Dillman, 2007).

Stage 2--Assessment of Cognitive and Motivational Qualities

Cognitive interviewing will determine whether respondents understand each item as intended and whether questions can be accurately answered (Forsyth & Lessler, 1991). The cognitive interviewing process provides insight into the following questions: (1) "Are all the words understood?" (2) "Are all the questions interpreted similarly by all respondents?" (3) "Do all the questions have an answer that can be marked by every respondent?" (Dillman, 2007, p. 141). Cognitive interviewing consists of an individual participant reading through a scale in the presence of an interviewer and 'thinking out loud' (e.g., "I'm not sure how to answer," "None of these responses apply to me") as they proceed. The interviewer should probe the respondent in order to get a better understanding of how each question is being interpreted and whether the intent of the question is being realized (Dillman, 2007).

Stage 3--Implementation of a Pilot Study

Administer the instrument to a developmental sample to emulate the proposed data collection procedures (DeVellis, 2003; Dillman, 2007). This phase not only assists in identifying any difficulties in the procedures to be utilized, but also allows the researcher to "make reasonably precise estimates as to whether respondents are clustering into certain categories of questions" (Dillman, 2007, p. 147). Additionally, a pilot study provides insight into particular questions which frequently go unanswered (i.e., patterns in missing data); investigates whether the scales in the theoretical framework are measuring the intended factors; and determines whether particular items should be discarded before final testing. Some scholars suggest a sample size of 100-200 as adequate (Dillman, 2007), while others contend a sample of 300 is the absolute minimum necessary (Clark & Watson, 1995).

Stage 4--A Final Check: Did We Do Something Silly?

To ensure there are no issues that were somehow overlooked during the previous stages, have colleagues who have had no input or participation in the scale development process review the instrument and attempt to answer all items (Dillman, 2007).

Stage 5--Evaluation of the Items

DeVellis (2003) pinpoints the necessity of this stage, calling it "the heart of the scale development process" (p. 90). In order to evaluate scale items, and ultimately decide which items should comprise the final scale, the following statistical processes must be conducted:

Item Normality (Skewness and Kurtosis)

Investigators should first examine the response distributions (i.e., skewness and kurtosis) of each item (Clark & Watson, 1995). Specifically, investigators should identify items with highly distorted distributions (e.g., all individuals answering in an identical fashion). Unless a rationale can be established for why responses to an item are distributed in such a manner, these items should be eliminated. A common rule-of-thumb test for normality is to divide each item's skewness and kurtosis statistic by its standard error; the resulting values should fall within the -2 to +2 range when data are normally distributed.
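This rule of thumb can be sketched as follows. The standard errors here use the common large-sample approximations sqrt(6/n) for skewness and sqrt(24/n) for excess kurtosis (exact small-sample formulas differ slightly), and the function name is ours:

```python
import numpy as np

def normality_screen(item_scores):
    """Divide skewness and excess kurtosis by approximate standard errors;
    an item passes the rule-of-thumb screen if both ratios fall within +/-2."""
    x = np.asarray(item_scores, dtype=float)
    n = len(x)
    m = x.mean()
    s2 = ((x - m) ** 2).mean()
    skew = ((x - m) ** 3).mean() / s2 ** 1.5
    kurt = ((x - m) ** 4).mean() / s2 ** 2 - 3.0   # excess kurtosis
    z_skew = skew / np.sqrt(6.0 / n)                # approximate SE of skewness
    z_kurt = kurt / np.sqrt(24.0 / n)               # approximate SE of kurtosis
    return z_skew, z_kurt, bool(abs(z_skew) <= 2 and abs(z_kurt) <= 2)
```

A symmetric, roughly bell-shaped set of item responses passes the screen, while a pile-up at one extreme produces a large skewness ratio and flags the item for review.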

Item-Scale Correlations

Correlate individual items with the other scale items intended to address a common content area or construct. DeVellis (2003) advises examining the corrected item-total correlation, which evaluates whether items can discriminate, i.e., differentiate between high and low scorers. 'Corrected' indicates that the total is not the sum of all item scores, but the sum of the item scores excluding the item being correlated. In general, a higher value for this correlation is preferable to a lower value. Values usually range from 0 to 1, but they can also be negative. Consider dropping or revising items with discriminations lower than .30.

Coefficient Alpha

After examining each item's response distribution and correlations, investigators should next determine internal consistency (i.e., reliability). Cronbach's (1951) coefficient alpha is the most commonly used measure for investigating a scale's reliability (Thompson, 2003). This measure is especially useful since it can be applied to items of any format, whether dichotomous or Likert-scaled (Thompson, 2003). Cronbach's alpha provides insight into how well a set of item scores measures a single latent construct. DeVellis (2003) offers the following coefficient alpha "comfort ranges" for scales: Unacceptable = .60 or below; Undesirable = .60 to .65; Minimally Acceptable = .65 to .70; Respectable = .70 to .80; Very Good = .80 to .90 (p. 95). Nunnally (1978) recommends reliability coefficients of .80 through .90 as minimum standards.
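The two statistics discussed above, the corrected item-total correlation and Cronbach's coefficient alpha, can be sketched as below. This is a minimal illustration assuming a respondents-by-items score matrix; the function names are ours:

```python
import numpy as np

def corrected_item_total(scores):
    """Corrected item-total correlations: each item correlated with the sum
    of the remaining items (the item itself excluded from the total)."""
    X = np.asarray(scores, dtype=float)   # rows = respondents, cols = items
    return np.array([
        np.corrcoef(X[:, j], X.sum(axis=1) - X[:, j])[0, 1]
        for j in range(X.shape[1])
    ])

def cronbach_alpha(scores):
    """Cronbach's (1951) alpha: (k/(k-1)) * (1 - sum of item variances
    divided by the variance of the total score)."""
    X = np.asarray(scores, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars / total_var)
```

As a sanity check on the formulas: a set of items with identical scores for every respondent yields corrected item-total correlations of 1.0 and an alpha of exactly 1.0, the upper bound of DeVellis's comfort ranges.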

Factor Analysis

Factor analysis represents a means of "boiling down" the sources of variation present and determining the number of components underlying the instrument (Stevens, 1986, p. 365). When using these techniques, Clark and Watson (1995) contend "there is no substitute for good theory and careful thought ..." (p. 314). Principal components analysis (PCA), the most widely used extraction method for exploratory factor analysis (EFA), is recommended to assess construct validity, identify how many latent variables underlie each scale, and determine the "substantive content or meaning of the factors" (DeVellis, 2003, p. 103; Thompson, 2004). EFA should be chosen over confirmatory factor analysis if the investigator is developing a novel scale and does not have "specific expectations regarding the number or the nature of underlying constructs or factors" (Thompson, 2004, p. 5). It is important to note that factor analytic techniques require a minimum sample size. Thus, before conducting a factor analysis, scholars recommend ten participants per item (Thompson, 2004), or a minimum of 200 respondents (Comrey, 1988). For extensive coverage of factor analysis, and its underlying concepts and applications, interested readers are directed to Thompson (2004). Stellefson, Hanik, Chaney, and Chaney (2009) supply readers with information regarding EFA factor retention decisions and techniques.


The costs associated with using a poor scale outweigh any benefits that may be attained with its use (DeVellis, 2003). Thus, in order to ensure variables can be accurately measured and interpreted, and that subsequent recommendations have sound footing, a thoughtfully conceived, well-designed survey is vital. Jacoby (1978) summarizes the utility of scale development, asking "What does it mean if a finding is significant or that the ultimate in statistical analytical techniques have been applied, if the data collection instrument generated invalid data at the outset?" (p. 90). The scale development steps outlined in this article provide investigators with the general process involved in planning, developing, and validating a new scale. Each step is time- and labor-intensive, and requires instrument developers to make informed decisions regarding item/instrument construction and validation. However, by using these steps, investigators can reduce survey error (i.e., measurement error, nonresponse error) and its consequences, and lend credibility to the data being collected and analyzed.


Aday, L. A. (1996). Designing and conducting health surveys, 2nd ed. San Francisco, CA: Jossey-Bass.

Bradburn, N., Sudman, S. & Wansink, B. (2004). Asking Questions: The Definitive Guide to Questionnaire Design--For Marketing Research, Political Polls, and Social and Health Questionnaires. San Francisco, CA: Jossey-Bass.

Chaney, B. H. (2006). History, theory, and quality indicators of distance education: A literature review. Available online at

Chaney, B. H., Eddy, J. M., Dorman, S. M., Glessner, L., Green, B. L., & Lara-Alecio, R. (2007). Development of an instrument to assess student opinions of the quality of distance education courses. The American Journal of Distance Education, 21(3), 145-164.

Churchill, G. A. (1979). A paradigm for developing better measures of marketing constructs. Journal of Marketing Research, 16(February), 64-73.

Clark, L. A., Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3): 309-319.

Comrey, A. L. (1988). Factor-analytic methods of scale development in personality and clinical psychology. Journal of Consulting and Clinical Psychology, 56(5), 754-761.

Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302.

DeVellis, R. F. (2003). Scale Development: Theory and Applications. Thousand Oaks, CA: Sage Publications.

Dillman, D. A. (2007). Mail and internet surveys: The tailored design method. New York: John Wiley & Sons, Inc.

Eason, S. (1991). Why generalizability theory yields better results than classical tests theory: A primer with concrete examples. In B. Thompson (Ed.), Advances in educational research: Substantive findings, methodological developments (Vol. 1, pp. 83-98). Greenwich, CT: JAI Press.

Forsyth, B. H., & Lessler, J. T. (1991). Cognitive laboratory methods: A taxonomy. In P.P. Biemer, R.M. Groves, L.E. Lysberg, N.A. Mathiowetz, S. Sudman (Eds.), Measurement errors in surveys (pp. 393-418). New York, NY: Wiley.

Glanz, K., Rimer, B. K., & Lewis, F. M., eds. (2002). Health Behavior and Health Education, 3rd ed. San Francisco, CA: Jossey-Bass.

Goodwin, L. D. (2002). Changing conception of measurement validity: An update on the new Standards. Journal of Nursing Education, 41(3), 100-106.

Goodwin, L. D., Goodwin, W.L. (1999). Measurement myths and misconceptions. School Psychology Quarterly, 14(4), 408-427.

Goodson, P. (2010). Theory in health promotion practice: Thinking outside the box. Boston, MA: Jones and Bartlett.

Guion, R. M. (1980). On Trinitarian doctrines of validity. Professional Psychology, 11, 385-398.

Hogan, T. P., Agnello, J. (2004). An empirical study of reporting practices concerning measurement validity. Educational and Psychological Measurement, 64(4), 802-812.

Huck, S. W. (2004). Reading statistics and research, 4th ed. Boston: Pearson.

Jacoby, J. (1978). Consumer research: A state of the art review. Journal of Marketing, 42(April), 87-96.

Joint Committee for Standards on Educational Evaluation. (1994). The program evaluation standards: How to assess evaluations of educational programs. Newbury Park, CA: Sage.

Kane, M. T. (2001). Current concerns in validity theory. Journal of Educational Measurement, 38(4), 319-342.

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-103). New York, NY: American Council on Education and Macmillan.

National Cancer Institute. (2005). Theory at a Glance: A guide for health promotion practice, 2nd ed. U.S. Department of Health & Human Services: National Institutes of Health. NIH Publication No. 05-3896.

Nunnally, J. C. (1978). Psychometric theory, 2nd ed. New York, NY: McGraw-Hill.

O'Rourke, T., & O'Rourke, D. (2001). The ordering and wording of questionnaire items: Part 1. American Journal of Health Studies, 17(3).

Posavac, E. J., Carey, R. G. (2003). Program evaluation: Methods and case studies, 6th ed. Upper Saddle River, NJ: Prentice Hall.

Rowley, G. L. (1976). The reliability of observational measures. American Educational Research Journal, 13, 51-59.

Shepard, L. A. (1993). Evaluating test validity. Review of Research in Education, 19, 405-450.

Standards for Educational and Psychological Testing. 1999. Washington, DC: American Educational Research Association, American Psychological Association, and National Council on Measurement in Education.

Stellefson, M.L, Hanik, B., Chaney, B.H, & Chaney, J.D. (2009). Factor retention in EFA: Strategies for health behavior researchers. American Journal of Health Behavior, 33(5), 587-599.

Stevens, J. (1986). Applied multivariate statistics for the social sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.

Thompson, B. (1992). Two and one-half decades of leadership in measurement and evaluation. Journal of Counseling and Development, 70, 434-438.

Thompson, B. (1994, April). Common methodology mistakes in dissertations, revisited. Paper presented at the annual meeting of the American Educational Research Association, New Orleans. (ERIC Document Reproduction Services No. ED 368 771).

Thompson, B. (1999). Five methodological errors in educational research: A pantheon of statistical significance and other faux pas. In Advances in Social Science Methodology (Vol. 5, pp. 23-86). Greenwich, CT: JAI Press.

Thompson, B. (2003). Score reliability: Contemporary thinking on reliability issues. Thousand Oaks, CA: Sage.

Thompson, B. (2004). Exploratory and confirmatory factor analysis. Washington, DC: American Psychological Association.

Thyer, B. A. (1992). Promoting evaluation research in the field of family preservation. In E.S. Morton & R.K. Grigsby (eds.), Advancing family preservation practice (pp. 131-149). Newbury Park, CA: Sage.

Adam E. Barry, PhD Elizabeth H. Chaney, PhD, MCHES Michael L. Stellefson, PhD J. Don Chaney, PhD, CHES

Adam E. Barry, PhD, is affiliated with the University of Florida. Elizabeth H. Chaney, PhD, MCHES, is affiliated with the University of Florida. Michael L. Stellefson, PhD, is affiliated with the University of Florida. J. Don Chaney, PhD, CHES, is affiliated with the University of Florida. Corresponding Author: Adam E. Barry, PhD, Department of Health Education & Behavior, PO Box 118210, Gainesville, FL 32611-8210, Office: (352) 392-0583, Fax: (352) 392-1909, Email:
Table 1. Scalar Response Scale Options

* Strongly agree to strongly disagree

* Very favorable to very unfavorable

* Excellent to poor

* Extremely satisfied to extremely dissatisfied

* Complete success to complete failure

* A scale of 1-7, 1-10, or 1-100, where 1 means lowest possible quality and 7, 10, or 100 mean highest possible quality

Note: Based on recommendations by Dillman (2007).

Table 2--Strategies for Increasing a Scale's Logical Flow

Create visual navigational cues

--Increase font size

--Increase color brightness or shading

--Use spacing (Dillman, 2007)

Create transitions

--If there is a noticeable change between topics, provide a lead-in sentence to set the stage for what is to come (O'Rourke & O'Rourke, 2001).

Provide directions when needed

--Instructions for each new section should be provided exactly where that information will be needed, not at the beginning of a scale (Dillman, 2007).

Use screening questions

--Use screening questions at the beginning of a new section to prevent individuals from being subjected to numerous questions that may not be applicable (e.g., asking a non-drinker or abstainer about past thirty-day alcohol use).

* For example, in order to establish drinking status,
use a screening item such as "Do you ever consume
alcohol (in any amount)?"

--Screening items will reduce respondent burden and
minimize non-response (O'Rourke & O'Rourke, 2001).

Table 3--Basic Item Development Recommendations

1. Use uncomplicated, easy-to-understand wording that is appropriate for respondents' reading level

2. Use as few words as necessary to pose the question
(i.e. avoid exceptionally lengthy items)

3. Use complete sentences

4. Avoid using multiple negatives in the stem

5. Avoid asking respondents to say yes in order to mean no

6. Avoid the use of colloquialisms or fashionable
expressions that may become dated in the foreseeable
future (e.g., Do you believe drinking alcohol is 'off
the chain?')

7. Avoid double-barreled questions, which assess two or more factors (e.g., Do you want to be rich and famous?)

8. Do not ask questions that respondents cannot
conceivably answer in a reliable fashion (e.g., How
often do your friends engage in sexual relations while
under the influence of alcohol?)

9. Do not ask questions that you believe respondents will
be unwilling to answer truthfully

10. Ensure that all potential respondents will be able to
choose an appropriate response

11. Eliminate check-all-that-apply formats

Note: Based on recommendations by Clark & Watson (1995), Comrey (1988), DeVellis (2003), and Dillman (2007).

Table 4--Diagnostic Questions for Each Item

1. Does the question require an answer?

2. Will respondents have an accurate, ready-made answer
for each question?

3. Can respondents correctly recall and report past behaviors?
4. Will respondents be willing to provide the requested information?
5. Are there any barriers impeding a respondent's
motivation to respond?

6. Will a respondent's interpretation of the response
scale be influenced by factors other than the
wording provided?

Note: Based on recommendations by Dillman (2007).