Assessment is a valuable complement to instruction. It tells us who has learned and who hasn’t so educators can determine what’s most likely to help a student grow and succeed in the future.
With nearly half a century invested in psychometric research—the study of educational and psychological measurement—I think it’s important to look back at the development of modern assessments to see what trends are likely to shape and improve assessment and instruction in the future.
A brief history of adaptive testing
I happened to attend the University of Minnesota at a fortuitous time for an aspiring psychometrician. My graduate school mentor, David Weiss, was a counseling psychologist. Counselors rely heavily on psychological tests, which led Weiss to become an expert in test theory and practice, and to found the Psychometric Methods graduate degree program at the university. It was 1971, and Dr. Weiss was exploring the use of computers to administer tests of aptitude and vocational interests.
When I joined the graduate program, Weiss was in final negotiations with the Office of Naval Research on a contract to research computerized adaptive testing (CAT), which was more of a concept than a reality at the time. I was lucky enough to be invited to join the team. I believe he chose me because, in my graduate school application, I had independently described what is now called adaptive testing, not realizing that people had already begun researching and even practicing it.
An early example of adaptive testing was the Stanford-Binet intelligence test. Around since the early 20th century, it was administered differently from anything else at the time. Designed to assess examinees ranging in age from early childhood to adulthood, the Stanford-Binet consists of a battery of several distinct fixed-form subtests at each mental age level. The examiner uses available information about the examinee to start testing at an age level lower than the examinee's expected mental age, then administers subtests at higher and lower age levels until a basal age level and a ceiling age level are established. Subtests below the basal or above the ceiling need not be administered, and the score is based on performance within the levels that were administered.
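To make the basal/ceiling idea concrete, here is a minimal sketch, in Python, of that style of adaptive administration. It only illustrates the shortcut described above; it is not the published Stanford-Binet procedure, and the level structure, starting point, and pass/fail criterion are all simplified assumptions.

```python
# Simplified sketch of basal/ceiling adaptive administration (illustrative only).
# "levels" is an ordered list of age-level subtests, easiest to hardest;
# "start" is an index a bit below the examinee's expected mental age;
# "passes" administers one level's subtest and returns True if the examinee passes.

def administer_by_levels(levels, start, passes):
    results = {}

    def take(level):
        # Administer each level's subtest at most once.
        if level not in results:
            results[level] = passes(levels[level])
        return results[level]

    # Move downward until the examinee passes a level: the basal level.
    basal = start
    while basal > 0 and not take(basal):
        basal -= 1

    # Move upward until the examinee fails a level: the ceiling level.
    ceiling = basal
    while take(ceiling) and ceiling + 1 < len(levels):
        ceiling += 1

    # Levels below the basal are credited and levels above the ceiling are
    # skipped; scoring uses only the subtests actually administered.
    return basal, ceiling, results
```

Because levels outside the basal-to-ceiling range are never administered, the examinee answers only the subtests that actually carry information about their level.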
The adaptive nature of the test meant that you could reach a reasonably accurate score much faster than with a traditional test, but relying on a human administrator introduced judgment and bias, which created errors in the measurement. Using computers to administer the test could eliminate that source of bias and error, but computers were not available when the first several versions of the test were developed.
By the early 1970s, researchers were interested in the concept of adaptive testing; Weiss was one of the first to have some significant funding behind his research.
Weiss acquired what was called a “mini-computer”—it still took up most of a 9’ x 12’ room—that could drive four or five terminals for test-takers at a time and eventually could be hooked up to a few dozen. With 50,000 students at the university, we had a rich source of experimental subjects, and so we were off, trying to validate the concept of CAT and the many varied approaches to it.
Pros and cons of CAT
The big advantage of CAT, or for that matter non-computerized adaptive testing, is that it can be done in about half the time it takes to administer a conventional test that measures knowledge or skills with comparable accuracy. Of course, in educational testing applications, that means less time taken away from curriculum and instruction.
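The time savings come from the adaptive logic itself: the test spends few items on questions that are far too easy or far too hard for a given student, and items pitched near a student's level are the most informative ones. As a rough illustration only, and not the algorithm behind any particular product, the sketch below runs a bare-bones CAT under a Rasch model: pick the unused item whose difficulty is closest to the current ability estimate, update the estimate, and stop once it is precise enough. The item bank, estimation method, and stopping rule are all simplified assumptions.

```python
import math

def prob_correct(theta, b):
    # Rasch (one-parameter logistic) model: chance of a correct response.
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def run_cat(item_bank, answer, max_items=30, target_se=0.3):
    # item_bank: list of item difficulties; answer(i) returns True/False
    # for the examinee's response to item i. Both are illustrative stand-ins.
    theta = 0.0                      # provisional ability estimate
    used, responses = [], []

    for _ in range(max_items):
        # Choose the unused item closest in difficulty to the current estimate;
        # under the Rasch model, that is the maximum-information item.
        candidates = [i for i in range(len(item_bank)) if i not in used]
        if not candidates:
            break
        nxt = min(candidates, key=lambda i: abs(item_bank[i] - theta))
        used.append(nxt)
        responses.append(answer(nxt))

        # Re-estimate ability with a few Newton-Raphson steps on the likelihood.
        for _ in range(10):
            p = [prob_correct(theta, item_bank[i]) for i in used]
            grad = sum((1.0 if r else 0.0) - pi for r, pi in zip(responses, p))
            info = sum(pi * (1.0 - pi) for pi in p)
            if info < 1e-6:
                break
            theta = max(-4.0, min(4.0, theta + grad / info))  # keep estimate bounded

        # Stop once the standard error of the estimate is small enough.
        p = [prob_correct(theta, item_bank[i]) for i in used]
        info = sum(pi * (1.0 - pi) for pi in p)
        if info > 0 and 1.0 / math.sqrt(info) <= target_se:
            break

    return theta, used
```

Because every item is pitched near the student's current estimate, the precision target is usually reached with far fewer items than a fixed-form test needs, which is where the time savings come from.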
A full battery of educational accountability tests can take days to administer to kids in a classroom. Adaptive testing now being used in those settings has reduced testing time considerably: tests that took three days before CAT now take one-and-a-half or two days.
Another big advantage of adaptive testing in education is that the scores are available immediately. There may be organizational delays, such as waiting for all students to take the test, but adaptive test scores can be available as soon as students are done. In contrast, scores on conventional, printed accountability tests often are not reported to schools until summer break, due to logistics such as shipping answer documents to a central scanning facility, scoring them, and printing and communicating results back to users.
A weakness of adaptive tests, or at least a common complaint about them, is that they are almost exclusively multiple-choice, which limits what can be asked. Critics say the format doesn't present a realistic environment that mirrors how knowledge is put to use in the real world, and thus doesn't capture everything a student knows.
One alternative is performance measurement, in which you create a realistic task, usually a complex one involving numerous aspects of performance, so that students can attempt to show what they know and can do. This is far more complicated and time-consuming. Students take about 18 minutes on average to answer all 34 questions on a Star Reading assessment; the average Star Math assessment takes 22 minutes. In that same amount of time, a student might complete only one or two performance measurement tasks, and a full assessment likely includes several.
Performance measurements yield rich information, but despite decades of controversy over the question, the fact is that there is not a great deal of difference between performance measurement and CAT in terms of which students are identified as high, middle, and low achievers.
Gathering data to make tests fairer
One promising trend in testing is less a new trend than the opportunity to pursue an old one with better data. All of us who have been involved in research and testing have lamented the difficulty of getting data about the kids who take our tests, as distinct from the data we collect in the course of the test itself.
We are constantly either planning or conducting what we call validity research, in which we develop and analyze evidence that tells us how well our tests are working, either as predictors of how kids will do a little further down the road or as estimates of how they're doing right now on things that are important but that we can't test directly.
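For a sense of what that evidence looks like in practice, predictive validity is usually summarized as a correlation between scores now and an outcome measured later, such as a state test taken months afterward. The snippet below is a toy illustration with made-up numbers, not results from any actual study.

```python
def pearson_r(xs, ys):
    # Pearson correlation between two equal-length lists of scores.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: adaptive-test scores in the fall and a criterion
# measure the following spring, for the same students.
fall_scores = [480, 510, 530, 555, 600, 620, 640]
spring_outcomes = [470, 505, 540, 560, 590, 615, 650]
print(pearson_r(fall_scores, spring_outcomes))  # closer to 1.0 means a stronger predictor
```

The same kind of coefficient, computed against a current measure we care about but can't test directly, is what the second kind of validity evidence looks like.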