Building the case for large scale behavioral education adoptions.
Behaviorally-designed educational programs are often based on a
research tradition that is not widely understood by potential users of
the programs. Though the data may be sound and the prediction of
outcomes for individual learners quite good, those advocating adoption
of behaviorally-designed educational programs may need to do more in
order to convince school districts to adopt large-scale implementations
of their programs. This article provides an example of a successful
approach that suggests quickly gathering local evidence using more
familiar evidence-based and experiential methods may increase the
likelihood of program adoption. The actual report to the large urban
district is included.
Keywords: summative, formative, evidence, pilot programs
|Subject:||Educational programs (Case studies)|
Layng, Zachary R.
Layng, T.V. Joe
|Publication:||Name: The Behavior Analyst Today Publisher: Behavior Analyst Online Audience: Academic Format: Magazine/Journal Subject: Psychology and mental health Copyright: COPYRIGHT 2012 Behavior Analyst Online ISSN: 1539-4352|
|Issue:||Date: Wntr, 2012 Source Volume: 13 Source Issue: 1|
Working with school districts requires the ability to quickly adapt
to the needs of each district and to provide as much useful information
as possible to enable them to make decisions. Districts feel most
comfortable when they can review evidence of program effectiveness
obtained from their own schools. Accordingly, it is necessary to be able
to quickly respond to requests and to provide data that may not meet
"gold standard" requirements, but that do help provide a basis
for sound decision making. This article provides a case study of such an
effort, and may help to elucidate how in-district data may be quickly
generated and combined with other evidence to strengthen the case for adoption.
The program being considered for a district-wide summer school adoption was Headsprout® Reading Comprehension (HRC). HRC is an online program designed to directly teach children how to comprehend text. It provides instruction in literal, inferential, main idea, and derived word meaning comprehension, as well as vocabulary instruction using stimulus-equivalence-like procedures. Learners interact with the 50-episode (lesson) program for about 30 minutes a day. As with all Headsprout programs, HRC underwent extensive user testing during the course of its development: more than 120 learners, one at a time, interacted with the various program segments, providing data on their effectiveness and occasions for revisions. HRC, first its components and then the entire program, was revised and retested until nearly all learners met criterion. As of this writing, more than 35,000 online learners have provided, and continue to provide, data for further evaluation and revision. An overview of HRC and its development can be found in Leon et al. (2011). A detailed description of the contingency analytic foundations of HRC is provided by Layng, Sota, & Leon (2011), of the analysis that determined its content by Sota, Leon, & Layng (2011), and of the programming and testing of the repertoires and relations involved by Leon, Layng, & Sota (2011). The methods Headsprout employs in the design, development, and testing of all its programs have been described by Layng, Twyman, & Stikeleather (2003), Twyman et al. (2004), and Layng, Stikeleather, & Twyman (2006).
Program evaluation can be of two kinds: formative and summative. In formative evaluation, criteria are established for learner performance. The program, or a component thereof, is tested to determine whether learners reach the specified criteria. If learner behavior does not meet those criteria, the program is revised and retested until nearly all criteria are met (Layng et al., 2006; Markle, 1967; Twyman et al., 2004). In summative evaluation, evidence is gathered after the program has been developed in an attempt to determine program effectiveness. Summative evaluation often employs pretest versus posttest comparisons or group comparisons of various types.
As noted by Layng et al. (2006):
Layng et al. (2006) go on to describe a 3 X 3 matrix that illustrates the relation between the evidence types found in formative and summative evaluation, and the results of their intersection (see Table). The columns comprise types of summative evaluation, and the rows types of formative evaluation. For summative evaluation, the columns are: A. Experiential Assessment, B. Evidence-Based Assessment, and C. Scientific Control Group Research & Assessment. For formative evaluation, the rows are: 1. Experiential--Program Development, 2. Evidence Based--Program Development, and 3. Scientific--Program Development. Each cell, at the intersection of a row and a column, describes the implications for programmatic decision-making.
Since HRC was developed within a rigorous scientific development process (see Twyman et al., 2004), we are able to predict group performance based on individual performance (see Table, Row 3). For example, Layng, Twyman, & Stikeleather (2004) were able to show that data obtained from single-subject research in the Headsprout laboratory were nearly identical to the data obtained for about 32,000 learners accessing the program over the Internet in uncontrolled settings. Accordingly, any additional summative data augment the laboratory research and development outcomes and can contribute to district decision making by supplying locally derived data, even though the data fall short of the summative evaluation gold standard of randomized control group comparisons. Whereas highly controlled scientific summative evaluation can be time consuming, difficult to implement with fidelity, and expensive, shorter-duration efforts, often a pilot program that falls into columns A and B of the 3 X 3 matrix, can be used to suggest the potential effectiveness of an educational product in a particular setting. When such data are obtained with a program that has gone through a scientific formative development process, even more confidence may be warranted than would otherwise be provided by either formative or summative data alone. When experiential and evidence-based information confirms the scientific program development data (e.g., Twyman, Layng, & Layng, 2011), potential purchasers feel even more comfortable making large purchase decisions. When those data come from their own schools, the comfort is even greater. This report describes an effort to rapidly provide a major school district with data from Columns A and B (experiential and evidence-based data), which could be used to augment the data obtained from Row 3 (scientific program development). Though edited for reasons of confidentiality, the report below is essentially what was provided to district decision makers.
[FIGURE 1 OMITTED]
* PILOT OVERVIEW
The goal of this pilot was to demonstrate the potential effectiveness of HRC over a six-week summer school program. To simulate a summer school setting, a group of 31 third graders from an elementary school designated by the district was chosen to participate in four 45-minute sessions each week over six weeks. During these sessions, students were provided access to HRC's 50 online lessons, called "episodes," via an Internet-connected computer, along with a short series of worksheets designed to solidify the application of learned strategies to written tests of reading comprehension. These worksheets were chosen from Headsprout Reading Comprehension: Transfer and Extension activities. Sessions were conducted on Mondays and Tuesdays after school, and on Wednesdays and Thursdays in place of the scheduled reading class. Of the 31 students, three did not read at a mid-2nd grade level, the minimum reading level required for placement into the Reading Comprehension program, so they were instead placed into Headsprout Early Reading to improve their basic reading skills. A student's readiness for HRC is determined by a measure of their oral reading fluency on a passage leveled at DRA Level 28.
[FIGURE 2 OMITTED]
To estimate the effectiveness of Headsprout Reading Comprehension, students completed a pretest and a posttest. The pretest included comprehension questions following a short passage, questions on "resources" including maps and tables of contents, and vocabulary items. The grade level of the pretest passage and questions was 2.8 (see Appendix A). After the six-week period, students were given a posttest. This posttest included two long-passage sections from third-grade and fourth-grade state sample booklets, an alternate form of the resources section of the pretest, and vocabulary items. The state sample tests were chosen to estimate grade levels 3.6 (the third-grade test) and 4.6 (the fourth-grade test). The pretest was administered over approximately 30 minutes. The posttest was administered over approximately one hour.
Over the six-week period, students engaged with the program on 24 days. The class completed an average of 29.7 episodes; the median number of episodes completed was 29 of the 50 in the program. All statistical analyses use only data from the 22 students who had both pretest and posttest data and who qualified for Reading Comprehension on the first day of the pilot. Those 22 students completed an average of 33.4 episodes, with a median of 32 (see Figure 2).
After 24 days of instruction, the students showed substantial estimated gains. On the pretest, the estimated mean grade-level equivalence was 3.13. On the third-grade state sample section, the mean grade-level equivalence was 3.86, an estimated 7-month gain. On the fourth-grade state sample section, the students scored an average grade-level equivalence of 4.49, an estimated gain of over one year (see Figure 1).
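The arithmetic behind these estimated gains can be sketched as follows. This is a minimal illustration, not part of the original report. It assumes the usual grade-equivalent convention in which one grade level spans ten school months (so 0.1 grade level is one month), and the names `MONTHS_PER_GRADE_LEVEL` and `gain_in_months` are hypothetical.

```python
# Mean grade-level equivalents reported for the pilot.
PRETEST_MEAN = 3.13
POSTTEST_GRADE3_MEAN = 3.86
POSTTEST_GRADE4_MEAN = 4.49

# Assumption: one grade level corresponds to ten school months.
MONTHS_PER_GRADE_LEVEL = 10


def gain_in_months(pre: float, post: float) -> float:
    """Return the grade-equivalent gain expressed in school months."""
    return round((post - pre) * MONTHS_PER_GRADE_LEVEL, 1)


grade3_gain = gain_in_months(PRETEST_MEAN, POSTTEST_GRADE3_MEAN)
grade4_gain = gain_in_months(PRETEST_MEAN, POSTTEST_GRADE4_MEAN)

print(grade3_gain)  # 7.3 months, reported as "an estimated 7-month gain"
print(grade4_gain)  # 13.6 months, reported as "over one year"
```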
We performed one-tailed t-tests to compare the posttest scores to the pretest scores, including an adjustment for the two months of instruction the students would have received with or without the addition of Headsprout. Even after adding 0.2 grade level to each student's pretest score, the gains were statistically significant for both the third-grade state sample section results (p<.01) and the fourth-grade state sample section results (p<.01).
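A comparison of this kind might look like the sketch below. The report does not state which form of t-test was used, so the paired (dependent-samples) form is an assumption here, chosen because the same students took both tests. The scores are hypothetical and are not the pilot data. The function name `paired_t_statistic` is also hypothetical, and the critical value is the standard one-tailed table value for alpha = .01 with 7 degrees of freedom.

```python
import math
import statistics

# Hypothetical grade-equivalent scores for eight students (NOT the pilot data).
pretest = [2.8, 3.0, 3.1, 3.3, 2.9, 3.4, 3.2, 3.5]
posttest = [3.6, 3.9, 3.7, 4.1, 3.5, 4.0, 4.2, 4.1]

# Two months of ordinary instruction, expressed in grade levels.
EXPECTED_GROWTH = 0.2


def paired_t_statistic(pre, post, adjustment=0.0):
    """Return (t, df) for the paired differences post - (pre + adjustment)."""
    diffs = [b - (a + adjustment) for a, b in zip(pre, post)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


t_stat, df = paired_t_statistic(pretest, posttest, EXPECTED_GROWTH)

# One-tailed critical value for alpha = .01 with df = 7 (standard t table).
T_CRITICAL_01_DF7 = 2.998
significant = df == 7 and t_stat > T_CRITICAL_01_DF7

print(f"t({df}) = {t_stat:.2f}; exceeds the .01 critical value: {significant}")
```

Adding the expected growth to each pretest score before differencing makes the test more conservative: the gain must exceed what ordinary instruction alone would be expected to produce.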
[FIGURE 3 OMITTED]
One question that immediately arises is: how can the same students perform at a grade level of 3.86 and minutes later achieve a grade level of 4.49? We believe that our observations from the posttest can account for this difference. During the third-grade section of the exam, many students were observed attempting to answer the questions quickly, sometimes neglecting the careful application of the strategies learned in Headsprout Reading Comprehension. When students attempted the fourth-grade section, however, they were observed frequently referring back to the passage (a primary strategy), using their fingers to guide careful reading of all answer choices, and even writing next to each question which of the four comprehension strategies applied to it (see Figure 3).
We believe that the challenge of the mid-fourth-grade material prompted the students to apply the strategies more consistently. This is evidenced in the data showing that students who scored the lowest on the pretest had posttest scores equivalent to those of the students who scored the highest on the pretest, outperforming the students in the middle quartiles. We believe that the students who scored the lowest on the pretest applied the Headsprout Reading Comprehension strategies to both sections equally and that the highest-achieving students tended to perform well as a rule, while the students in the middle quartiles felt it necessary to apply the strategies only to the more challenging sections.
The students averaged a 30% increase in score on the vocabulary section. A directional Mann-Whitney test determined that the 30% increase in vocabulary scores was also significant (p<.05). The vocabulary tested was the same from pretest to posttest.
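The statistic underlying a Mann-Whitney test can be sketched as below. This is an illustration only: the scores are hypothetical, not the pilot data, and the function name `mann_whitney_u` is likewise hypothetical. The sketch computes the U statistic by pairwise comparison, counting ties as one half. Judging significance would additionally require a table of critical values or a normal approximation, which is not shown.

```python
def mann_whitney_u(sample_x, sample_y):
    """Return (U_x, U_y): U_x counts the (x, y) pairs in which x exceeds y,
    with each tie counted as one half; U_x + U_y equals len(x) * len(y)."""
    u_x = 0.0
    for x in sample_x:
        for y in sample_y:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    u_y = len(sample_x) * len(sample_y) - u_x
    return u_x, u_y


# Hypothetical vocabulary scores (items correct), NOT the pilot data.
pretest_vocab = [4, 5, 5, 6, 7]
posttest_vocab = [6, 7, 8, 8, 9]

u_post, u_pre = mann_whitney_u(posttest_vocab, pretest_vocab)
print(u_post, u_pre)  # 23.0 2.0
```

For a directional test of improvement, a small U for the pretest sample (here 2.0 out of a possible 25) is the evidence that posttest scores tend to be higher.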
While scores improved on the "using resources" section, the gains were not shown to be significant (one-tailed, p>.05). We believe this is due to discrepancies between testing conditions; the resources section was conducted during the last 10 minutes of an hour-long posttest session, when classroom interruptions and distractions were at their highest frequency.
* TEACHER AND STUDENT FEEDBACK
There were significant challenges posed by a frequent rotation of homeroom instructors; however, the class's reading teacher, who was the learners' homeroom teacher the previous school year, was a consistent presence and instructed the students during the two daytime sessions.
The primary benefit of HRC, according to the teacher, was a dramatic increase in her students' academic confidence. She observed that students not only feel that they are answering questions correctly, but also know why and how they are successful. "When [students are successful], they know that they earned it," she said.
Having taught the same students for over a year, she knows the students very well. She noted that one of her "quiet voices" in the class would always rub his ear when nervous and facing a challenging problem. In one instance, she asked him to tell her the answer to a comprehension story question, which he successfully did. She then challenged him, asking, "Why is that the answer?" To her surprise, this student calmly, confidently provided her with the precise strategy for answering the question and explained how he applied it. When volunteering this story, Ms. Brown said, "I was proud to see that in [him]."
It has been an absolute pleasure working with the instructors and staff at the school, and an even greater pleasure working with the students as they progressed through the program. When a handful of students began approaching the last few episodes, they began asking what they would do next. They were told that they could read just about anything, including newspapers, articles, or poetry. Two days later, one student seemed in a particular hurry to finish the last two episodes and was asked why she was rushing through. She stated, evidently remembering the previous conversation, "I want to finish so I can understand poetry. Will you bring some for me?"
Her ambition is not isolated. On the Monday following the end of the pilot, the students who finished all 50 episodes unanimously and without prompting asked if they could "use the online dictionary to look up new words." They seemed just as excited to instead work in groups of two and three on short-passage story questions from Headsprout Reading Comprehension: Transfer and Extension.
* CHALLENGES TO IMPLEMENTATION
On the first day of the pilot, the homeroom teacher trained to use Headsprout Reading Comprehension with her class was injured and unable to return to work for the entirety of the pilot. During her absence, a series of temporary substitute teachers and two long-term substitute teachers rotated through the classroom. Each was trained in Headsprout Reading Comprehension within 1-2 days of Mimio-Headsprout being notified, but the gaps in the instructional support were significant, with over half of the pilot being unsupervised by a trained teaching professional. In many cases, the volunteer computer-lab manager was the only support for Headsprout after school.
For two weeks in January, students were unable to access the computer lab due to school-wide computerized testing, and had only two after-school sessions each week. Additionally, the pilot was interrupted by the two-week holiday break. A number of students were provided with resets to previous episodes, the typical intervention for below-average episode performance.
The report has been well received, and as of this writing HRC is in the final round of consideration for district-wide implementation. Though they do not meet a "gold standard" for summative research, estimates of likely outcomes, combined with seeing and talking with learners, teachers, and principals, can be extraordinarily important to districts deciding whether or not to make large investments in behaviorally-based programs. This process has been replicated elsewhere, and the program is seeing adoption by districts throughout the country. Developers of behaviorally-based educational programs that have been painstakingly developed through careful control-analysis preparations may find that purchasing decisions are more affected by locally derived experiential and evidence-based outcomes than by the more rigorous single-subject research. While we do not advocate making decisions only on the basis of column A (Experiential Assessment) or column B (Evidence-Based Assessment), these can be useful when combined with row 3 (Scientific--Program Development).
EXAMPLE OF STUDENT WORK ON THE FIRST PAGE OF THE PRETEST
Having a dog can be lots of fun. But there is work, too.
A dog must be fed at regular times every day. Some good times are 7:00 in the morning and 5:00 in the evening.
A dog also needs regular exercise. Taking a dog for a walk once every day is a very good idea.
A dog can be fun to play with. Play is one thing that makes a dog a good friend.
Another thing that makes a dog a good friend is loyalty. In good times and bad times, a dog will stay by your side.
Regular meals, exercise, fun, and friendship: these things make a dog happy.
And guess what? They make a person happy, too!
1. How often is it a good idea to take a dog for a walk?
() once every hour
() once every day
(*) once every week
2. Why are dogs good friends for people?
() Dogs can be loyal and fun.
(*) Some dogs don't bark very much.
() Dogs can be fed regular meals.
3. What is this part of the story mostly about?
() how to take a dog for a walk
() feeding a dog at regular times
(*) how to keep a dog happy
4. What does "loyalty" most likely mean?
(*) staying friends through ups and downs
() staying at a king or queen's castle
() staying up late at night
FINAL PERFORMANCE REPORT FOR HEADSPROUT READING COMPREHENSION STORY QUESTIONS
* REFERENCES
Layng, T. V. J., Stikeleather, G., & Twyman, J. S. (2006). Scientific formative evaluation: The role of individual learners in generating and predicting successful educational outcomes. In: Subotnik, R. & Walberg, H. (Eds.), The scientific basis of educational productivity. Charlotte, NC: Information Age Publishing.
Layng, T. V. J., Twyman, J. S., & Stikeleather, G. (2003). Headsprout Early Reading: Reliably teaching children to read. Behavioral Technology Today, 3, 7-20.
Layng, T. V. J., Twyman, J. S., & Stikeleather, G. (2004). Engineering discovery learning: The contingency adduction of some precursors of textual responding. The Analysis of Verbal Behavior, 20, 99-109.
Layng, T. V. J., Sota, M., & Leon, M. (2011). Thinking through text comprehension I: Foundation and guiding relations. The Behavior Analyst Today, 12, 1-10.
Leon, M., Ford, V., Shimizu, H., Stretz, A., Thompson, J., Sota, M., Twyman, J. S., & Layng, T. V. J. (2011). Comprehension by design: Teaching young learners to comprehend what they read. Performance Improvement Journal, 50, 40-47.
Leon, M., Layng, T. V. J., & Sota, M. (2011). Thinking through text comprehension III: The programing of verbal and investigative repertoires. The Behavior Analyst Today, 12, 11-20.
Markle, S. M. (1967). Empirical testing of programs. In P. C. Lange (Ed.), Programmed instruction: Sixty-sixth yearbook of the National Society for the Study of Education: 2 (pp. 104-138). Chicago: University of Chicago Press.
Sidman, M. (1960). Tactics of scientific research. Oxford, England: Basic Books.
Sota, M., Leon, M., & Layng, T. V. J. (2011). Thinking through text comprehension II: Analysis of verbal and investigative repertoires. The Behavior Analyst Today, 12, 21-33.
Twyman, J. S., Layng, T. V. J., Stikeleather, G., & Hobbins, K. (2004). A non-linear approach to curriculum design: The role of behavior analysis in building an effective reading program. In: W. L. Heward et al. (Eds.), Focus on behavior analysis in education, Vol. 3. Upper Saddle River, NJ: Merrill/Prentice Hall.
Twyman, J. S., Layng, T. V. J., & Layng, Z. R. (2011). The likelihood of instructionally beneficial, trivial, or negative results for kindergarten and first grade learners who partially complete Headsprout® Early Reading. Behavioral Technology Today, 6, 1-19.
Zachary R. Layng & T. V. Joe Layng
The authors would like to thank Melinda Sota for her critical comments and editorial contribution.
* AUTHOR CONTACT INFORMATION
840 W. Wrightwood Ave.
Chicago, IL 60614
T.V. JOE LAYNG
4705 S Dakota St.
Seattle, WA 98118
Whereas group designs, typically the basis for summative evaluation, are readily accepted as providing scientific evidence for program effectiveness, single subject designs typically form the basis for formative evaluation. While both group and single subject designs are descended from highly successful scientific traditions, and both may provide equally rigorous and informative results, single subject design is relatively less understood. Both do, however, differ in the questions asked; one asks about the behavior of groups, the other asks about the behavior of individuals.
Table. 3 X 3 Matrix. The level of rigor for each type of evaluation is indicated by the letters A-C for summative evaluation, with column C representing the most rigorous; the numbers 1-3 indicate the level of rigor for each type of formative evaluation, with row 3 representing the most rigorous. Cell 3-C represents the most rigorous intersection of formative & summative evaluation.

Columns: Approaches to Summative Evaluation (Basis for Outcomes Assessment)
A. Experiential--Assessment
B. Evidence Based--Assessment
C. Scientific--Controlled Group Research & Assessment

Rows: Approaches to Formative Evaluation (Basis for Program Revision)

1. Experiential--Program Development
Cell 1-A (Experiential--Assessment): Cannot predict group or individual performance. Works or not with groups or individuals purely subjective; a matter of opinion argued on point of view--a matter of social agreement.
Cell 1-B (Evidence Based--Assessment): Provides some indication that the program may be effective with a group; but cannot confidently predict group or individual performance.
Cell 1-C (Scientific--Controlled Group Research & Assessment): Can confidently predict group performance, but cannot predict individual's performance (Sidman, 1960). If works or not, not clear what program elements, alone or together, are responsible.

2. Evidence Based--Program Development
Cell 2-A (Experiential--Assessment): If limited tryouts, may indicate that the program might work with those tested; but cannot confidently predict group or individual performance. Still primarily a matter of social agreement, but has some validity by relation to past research and perhaps limited tryouts.
Cell 2-B (Evidence Based--Assessment): Provides some indication that the program may be effective with a group; but cannot confidently predict group or individual performance. If works, not really clear why; if it does not work, can lead to re-evaluation of principles or the way they were applied. Not clear where the problem is.
Cell 2-C (Scientific--Controlled Group Research & Assessment): Can confidently predict group performance; but cannot confidently predict individual's performance. If works or not, not clear what program elements, alone or together, are responsible, but can lead to reconsideration of principles or the way they were applied.

3. Scientific--Program Development
Cell 3-A (Experiential--Assessment): Able to predict group performance based on individual performance; and can confidently predict individual's performance. Since program able to predict individual's performance, some prediction of group performance implied; may have some validity by relation to past research.
Cell 3-B (Evidence Based--Assessment): Able to predict group performance based on individual performance; and can confidently predict individual's performance. If doesn't work, issues are in transfer--able to identify and isolate variables to change and revise for retest. Individual data not lost and can be analyzed in relation to outcome.
Cell 3-C (Scientific--Controlled Group Research & Assessment): Can confidently predict group performance; and can confidently predict individual's performance. If doesn't work, issues are in differences in formative criteria & summative measurement instruments--able to identify and isolate variables to change and revise criteria & program for retest, or to revise summative measurement instruments. Individual data not lost and can be analyzed in relation to outcome.