Item Response Theory and the Digital SAT
A straightforward explanation of how the scoring system works.
The new digital SAT uses Item Response Theory (IRT). How does it work? Are some questions worth more? If there is no curve, how do you get a score? The math behind it is a bit complicated, but I think there is a pretty intuitive way to understand it.
I basically think of IRT as a rating system. It uses a series of battles, with questions on one side and students on the other, to estimate the strength of every player. If a question ‘loses’ (the student answers correctly), its rating goes down and the student’s rating goes up. If the question ‘wins’, then its rating goes up and the student’s rating goes down. The more battles a player has, the more accurate the rating becomes.
It also has a very cool self-correcting mechanism: when there is a very uneven match (say, a very highly rated question going up against a student with a very low rating) and the highly rated player wins, very little happens to either player’s rating. That’s the expected outcome. However, if the low-rated student wins that match (an unexpected outcome), then the student’s rating goes up by a lot and the question’s rating goes down by a lot. After many battles, there are very few unexpected outcomes – you can keep playing if you want, but you basically know who will win every time.
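If you like seeing things in code, here is a minimal sketch of that kind of surprise-driven update, written in the Elo style. The K constant and the 400-point scale are the usual chess conventions, chosen purely for illustration – they are not anything specific to how the SAT is scored.

```python
# A minimal Elo-style update. The winner gains, the loser drops, and the size
# of the move depends on how surprising the result was. K and the 400-point
# scale are the standard chess conventions, used here purely for illustration.
K = 32

def expected_win_prob(rating_a, rating_b):
    """Predicted chance that player A beats player B."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update(student, question, student_won):
    """Return new (student, question) ratings after one 'battle'."""
    p = expected_win_prob(student, question)
    surprise = (1 if student_won else 0) - p   # near zero for expected results
    return student + K * surprise, question - K * surprise

print(update(1000, 1400, student_won=True))    # big upset: both ratings move a lot
print(update(1400, 1000, student_won=True))    # expected win: barely anything changes
```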
It truly is an amazing system, one that has been adopted by chess (the Elo rating system) and by plenty of other games as well. To get a better sense of how it works, imagine we are holding a chess tournament. We have no idea who was a master schooling old men at age 9 and who doesn’t know how a knight moves. So we give everyone a rating of 1000 to start. I match up against my buddy Ed. Unfortunately, Ed embarrasses me. He now has a 1030, and I have a 970. After 100 games, Ed has dominated everyone in his path, achieving a 2000 rating. At that rating, when he beats a 1700 player, he only gains a point or two and the 1700 player only loses a point or two. Note that the difference between two players’ ratings is itself a prediction: a 300-point gap, if you work it through the formula, comes out to roughly an 85% chance that Ed will win. If he plays another player with a 2000 rating, he’ll have a 50% chance of victory. However, if Ed tries to play World Champion Magnus Carlsen (with a rating of 2859), he basically has no chance. (Sorry about it.)
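To make those percentages concrete, here is the standard Elo expected-score formula (the same logistic curve as in the sketch above) run on the matchups from this example:

```python
def expected_score(rating, opponent):
    """Standard Elo expected score for a player against a given opponent."""
    return 1 / (1 + 10 ** ((opponent - rating) / 400))

print(round(expected_score(2000, 1700), 2))   # ~0.85: Ed vs the 1700 player
print(round(expected_score(2000, 2000), 2))   # 0.5: a coin flip against an equal
print(round(expected_score(2000, 2859), 3))   # ~0.007: Ed vs Carlsen (sorry, Ed)
```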
When students take the digital SAT, they are facing off against questions that have already been through thousands of battles. The strengths of these questions are known with great precision, and that makes the outcome of each battle very meaningful. Imagine each question was effectively pegged to an SAT score: this one is a level 480 question, that one is a level 790 question. When a student ‘beats’ 8 questions that are all above level 700, that tells you something about the student. One right answer might be a lucky guess…but 8? That student almost certainly deserves a 700+ rating.
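To translate the analogy into actual IRT terms: under the simplest model (the Rasch, or one-parameter logistic, model), the chance of a correct answer depends only on the gap between the student’s ability and the question’s difficulty. The sketch below puts everything on a made-up SAT-like scale – the 100-point spread is my invention, not the College Board’s actual scaling – just to show why sweeping eight hard questions is such strong evidence.

```python
import math

# Rasch-style model on a made-up SAT-like scale. The 100-point spread below is
# an invented illustration, not the College Board's actual scaling.
def p_correct(student_level, question_level, scale=100):
    """Chance that a student at one level beats a question at another."""
    return 1 / (1 + math.exp(-(student_level - question_level) / scale))

hard_questions = [700, 710, 720, 730, 740, 750, 760, 770]   # eight 700+ questions

def p_sweep(student_level):
    """Chance the student answers every one of the hard questions correctly."""
    prob = 1.0
    for q in hard_questions:
        prob *= p_correct(student_level, q)
    return prob

print(p_sweep(550))                 # astronomically unlikely for a mid-level student
print(p_sweep(760))                 # still rare, but vastly more plausible
print(p_sweep(760) / p_sweep(550))  # tens of thousands of times more likely
```

The sweep is improbable for everyone, but it is tens of thousands of times more probable for the strong student – which is exactly why observing it pushes the estimate so far up.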
I just want to emphasize how incredibly powerful and accurate these systems are. If you didn’t care about the user experience (just hit students with the hardest questions they can handle every time), you could get a very good estimate within 20 questions.
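Here is a toy version of that “hardest questions they can handle” loop, assuming a Rasch model and a simple grid-search ability estimate. None of this is the actual SAT algorithm – it just shows how quickly an adaptive sequence of about 20 well-chosen questions can pin an ability down.

```python
import math
import random

def p_correct(theta, b):
    """Rasch model: chance a student of ability theta beats a question of difficulty b."""
    return 1 / (1 + math.exp(-(theta - b)))

def estimate_theta(responses, grid):
    """Grid-search maximum likelihood: the ability that best explains the answers so far."""
    def loglik(theta):
        return sum(math.log(p_correct(theta, b) if correct else 1 - p_correct(theta, b))
                   for b, correct in responses)
    return max(grid, key=loglik)

def run_adaptive_test(true_theta, item_bank, n_questions=20, seed=0):
    """Toy adaptive loop: always serve the unused question closest to the current estimate."""
    rng = random.Random(seed)
    grid = [x / 20 for x in range(-80, 81)]             # candidate abilities, -4 to +4
    bank, responses, estimate = list(item_bank), [], 0.0
    for _ in range(n_questions):
        b = min(bank, key=lambda d: abs(d - estimate))  # most informative question
        bank.remove(b)
        correct = rng.random() < p_correct(true_theta, b)
        responses.append((b, correct))
        estimate = estimate_theta(responses, grid)
    return estimate

bank = [x / 10 for x in range(-30, 31)]                   # difficulties from -3.0 to +3.0
print(run_adaptive_test(true_theta=1.3, item_bank=bank))  # a rough estimate of the true 1.3
```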
I’ve skated past all kinds of details – the different models you might use, the discrimination parameter, accounting for guessing – and I can try to explain those too if you are interested (feel free to email me), but I think this captures the essence of what IRT does.