Student assessment trends: A brief history and a bright future
vice president and chief psychometrician, Renaissance
Assessment is a valuable complement to instruction. It tells us who has learned and who hasn’t so educators can determine what’s most likely to help a student grow and succeed in the future.
With nearly half a century invested in psychometric research—the study of educational and psychological measurement—I think it’s important to look back at the development of modern assessments to see what trends are likely to shape and improve assessment and instruction in the future.
A brief history of adaptive testing
I happened to attend the University of Minnesota at a fortuitous time for an aspiring psychometrician. My graduate school mentor, David Weiss, was a counseling psychologist. Counselors rely heavily on psychological tests, which led Weiss to become an expert in test theory and practice, and to found the Psychometric Methods graduate degree program at the university. It was 1971, and Dr. Weiss was exploring the use of computers to administer aptitude tests and vocational interest inventories.
As I joined the graduate program, Weiss was in final negotiations with the Office of Naval Research on a contract to research computerized adaptive testing (CAT), which was more of a concept than a reality at the time. I was lucky enough to be invited to join the team. I believe he chose me because, in my graduate school application, I had independently conceived of what is now called adaptive testing, not realizing people had already begun researching and even practicing it.
An early example of adaptive testing was the Stanford-Binet intelligence test. Around since the early 20th century, it was administered differently than anything else at the time. Designed to assess examinees ranging in age from early childhood to adult, the Stanford-Binet consists of a battery of several distinct fixed-form subtests at each mental age level. The examiner uses available information about the examinee to start testing at an age level lower than the examinee’s expected mental age, then proceeds to administer subtests at different age levels until a basal age level and a ceiling age level are established. Subtests below the basal or above the ceiling level need not be administered, and the test score is based on performance within the levels that were used.
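The basal-and-ceiling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Stanford-Binet administration rules: it assumes the basal is the highest level at which every subtest is passed and the ceiling is the lowest level at which every subtest is failed. The function name `find_basal_and_ceiling`, the `administer` callback, and the simulated examinee are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def find_basal_and_ceiling(
    levels: List[int],
    start_level: int,
    administer: Callable[[int], List[bool]],
) -> Tuple[int, int, Dict[int, List[bool]]]:
    """Administer only the age levels needed to bracket the examinee.

    levels      -- age levels in ascending order
    start_level -- level chosen by the examiner (below the expected mental age)
    administer  -- hypothetical callback: gives the subtests at one level and
                   returns one pass/fail result per subtest
    """
    results: Dict[int, List[bool]] = {}
    start = levels.index(start_level)

    # Work downward until every subtest at a level is passed (the basal).
    basal = start
    while True:
        level = levels[basal]
        results[level] = administer(level)
        if all(results[level]) or basal == 0:
            break
        basal -= 1

    # Work upward from just above the basal until every subtest at a level
    # is failed (the ceiling). Levels already given are not repeated.
    ceiling = min(basal + 1, len(levels) - 1)
    while True:
        level = levels[ceiling]
        if level not in results:
            results[level] = administer(level)
        if not any(results[level]) or ceiling == len(levels) - 1:
            break
        ceiling += 1

    return levels[basal], levels[ceiling], results


def simulated_examinee(level: int) -> List[bool]:
    """Made-up examinee: passes everything below level 7, nothing above 9."""
    partial = {
        7: [True, True, True, False],
        8: [True, True, False, False],
        9: [True, False, False, False],
    }
    if level in partial:
        return partial[level]
    return [level < 7] * 4


all_levels = list(range(2, 15))  # 13 hypothetical age levels
basal, ceiling, administered = find_basal_and_ceiling(
    all_levels, start_level=7, administer=simulated_examinee
)
print(basal, ceiling, sorted(administered))
```

In this made-up case only 5 of the 13 levels are administered, which is where the time savings of an adaptive procedure come from.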
The adaptive nature of the test meant that you could reach a reasonably accurate score much faster than with a traditional test, but the necessity of a human administrator introduced human judgment and bias, which created errors in the measurement. Using computers to administer the test could eliminate that source of bias and error, but computers were not available when the first several versions of the test were developed.
By the early 1970s, researchers were interested in the concept of adaptive testing; Weiss was one of the first to have some significant funding behind his research.
Weiss acquired what was called a “mini-computer”—it still took up most of a 9’ x 12’ room—that could drive four or five terminals for test-takers at a time and eventually could be hooked up to a few dozen. With 50,000 students at the university, we had a rich source of experimental subjects, and so we were off, trying to validate the concept of CAT and the many varied approaches to it.
Pros and cons of CAT
The big advantage of CAT, or for that matter of non-computerized adaptive testing, is that it can be done in about half the time that it takes to administer a comparably accurate conventional test of knowledge or skills. Of course, in educational testing applications, that means less time taken away from curriculum and instruction.
A full battery of educational accountability tests can take days to administer to kids in a classroom. Adaptive testing now being used in those settings has reduced testing time considerably: tests that took three days before CAT now take one-and-a-half or two days.
Another big advantage of adaptive testing in education is that the scores are available immediately. There may be organizational delays, such as waiting for all students to take the test, but adaptive test scores can be available as soon as students are done. In contrast, conventional, printed accountability test scores often are not reported to schools until summer break, due to logistics such as shipping answer documents to a central scanning facility, scoring them, and printing and communicating results back to the users.
A weakness of adaptive tests, or at least a common complaint about them, is that they are almost exclusively multiple-choice. Of course, that poses limitations in what can be asked. Critics say that it doesn’t present a realistic environment that mirrors how knowledge is put to use in the real world and thus doesn’t capture everything a student knows.
One alternative is a performance measurement in which you create a realistic task, usually a complex one that involves numerous performance aspects, so that students can attempt to show what they know and can do. This is far more complicated and time-consuming. Students take about 18 minutes on average to answer all 34 questions on a Star Reading assessment; the average Star Math assessment takes 22 minutes. In that same amount of time, a student might complete one or two performance measurement tasks, and there are likely several in a full assessment.
Those performance measurements yield rich information. The question has been controversial for decades, but the fact is that there is not a great deal of difference between performance measurement and CATs in terms of which students are identified as high, middle, and low achievers.
Gathering data to make tests fairer
One promising trend in testing is less a new trend than the opportunity to pursue an old one with better data. All of us who have been involved in research and testing have lamented the difficulty of getting data on the kids who take our tests. That is different from the data that we collect as students take the test itself.
We are constantly either planning or conducting what we call validity research, where we develop and analyze evidence that tells us how well our tests are working, either as predictors of how kids will do a little further down the road or as estimates of how they’re doing currently on things that are important but that we can’t test directly.
We want to make sure that our tests are fairly assessing all the various demographic subgroups. All test publishers use differential item functioning analysis to identify questions that may discriminate against one group or another, but it requires a lot of data. Comparing boys and girls is an obvious choice because the data set is so big. But we only get data on which test-takers are boys or girls on about half of the assessments delivered, and we get even less response on information such as who is an English Learner, who has a disability, or who is a member of a racial or ethnic subgroup.
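To make the idea of differential item functioning analysis concrete, here is a minimal sketch of one widely used approach, the Mantel-Haenszel procedure. The text above does not say which method any given publisher uses, so treat the choice of method as an assumption. Test-takers are first grouped into strata by total score, so that the two groups are compared only against peers of similar overall ability. The function names and all of the counts below are invented for illustration.

```python
import math
from typing import List, Tuple

# One stratum of test-takers with similar total scores:
# (reference correct, reference incorrect, focal correct, focal incorrect)
Stratum = Tuple[int, int, int, int]


def mantel_haenszel_odds_ratio(strata: List[Stratum]) -> float:
    """Common odds ratio across score strata; 1.0 means no evidence of DIF."""
    numerator = 0.0
    denominator = 0.0
    for ref_right, ref_wrong, focal_right, focal_wrong in strata:
        total = ref_right + ref_wrong + focal_right + focal_wrong
        if total == 0:
            continue  # an empty stratum contributes nothing
        numerator += ref_right * focal_wrong / total
        denominator += ref_wrong * focal_right / total
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these counts")
    return numerator / denominator


def mh_delta(odds_ratio: float) -> float:
    """Rescale to the delta metric: -2.35 * ln(odds ratio); 0 means no DIF."""
    return -2.35 * math.log(odds_ratio)


# Invented counts: in each stratum both groups answer correctly at the same rate.
fair_item = [(30, 10, 30, 10), (20, 20, 20, 20)]
# Invented counts: the focal group does worse than matched reference peers.
flagged_item = [(30, 10, 20, 20), (20, 20, 10, 30)]

fair_alpha = mantel_haenszel_odds_ratio(fair_item)
flagged_alpha = mantel_haenszel_odds_ratio(flagged_item)
print(fair_alpha, mh_delta(fair_alpha))
print(flagged_alpha, mh_delta(flagged_alpha))
```

Because every stratum has to contain enough test-takers from both groups, the estimate is only as good as the demographic data behind it, which is why missing group information is such an obstacle.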
It has historically been up to users to decide if they want to share those data with us or not. Most schools choose not to. In part, that’s because the information is not directly helpful to them. It doesn’t help them assess students, so they don’t see much use in entering it.
To get those data, we have to go to each school or district individually and try to talk them into providing them. But this is protected personal information and schools are, quite reasonably, not eager to let those data go willy-nilly. Of course, we don’t need data that would identify students, just the demographic information, but the concern from schools is understandable and appropriate.
As schools see more and more utility in sharing data through secure systems with appropriate privacy protections in place, those data are becoming increasingly available. I’m optimistic that our recent acquisition of Schoolzilla will help us refine our assessments because the schools using it have already seen some benefit in sharing those data, and it’s already designed to separate the information we need from anything identifying the student personally.
Everyone’s trying to crack this nut, so I’m looking forward to seeing the progress we’ll make on it at Renaissance and in the field more broadly.
The future of assessment
Schools typically test kids at the beginning of the year to screen who’s high, who’s low, and who ought to get special treatment, and then at the end of the year to determine who learned and who didn’t. More frequent but less time-consuming assessments throughout the year can help guide differentiation and instruction. In cases that require frequent progress-monitoring, our Star Assessments can be used monthly or even as often as weekly, although three or four assessments throughout the year should be enough to help teachers make decisions about individual students’ instruction. The trend toward using assessments to guide instruction is pretty well developed here at Renaissance, but it continues to grow around the country. I think that is the direction that things will increasingly go in the coming years.
Kids will be grouped, and students will be treated similarly within their group, but differently across groups, in an effort to bring everyone to the same point of competency. It probably won’t fully succeed in bringing every student to the point we’d like to see them—it has never worked before in all the ways that have been tried—but I think it’s a much more promising approach that we should pursue vigorously in the near future.
Another approach that I think we’re going to see more of in the long term is embedded assessments. These are tests that are folded into instruction so seamlessly as to be indistinguishable from it. In theory, students won’t even know they’re being assessed, and the results should be available to inform instruction almost immediately.
This concept is new enough that it needs to be validated more. There will be some surprises (both happy and disappointing) as it’s developed and refined, but we’re likely to see a great deal of evolution on embedded assessments.
Increasingly, artificial intelligence (AI) applications are in development or used in educational settings, whether for assessment or instruction or a combination of the two. As with embedded assessments, the field still has a lot of shaking out to do, but it’s promising.
Again, I am optimistic while maintaining a healthy skepticism about changes like this. When new technologies come along, they need to be explored just as computer-adaptive testing was explored. CAT succeeded. A lot of other technologies have fallen by the wayside for one reason or another, and we’ll see that happen again with some of the possibilities AI opens for us.
We need to be appropriately conscious of the fact that most exciting innovations will fall short for one reason or another. But that’s “old guy talk” that we don’t want to hear from the younger researchers exploring these new avenues. We need to encourage those innovators so they pursue the work that will uncover technologies and methodologies as effective as CAT.