In this position paper, I argue that standardized tests for elementary
science, such as SAT or Regents tests, are not very good benchmarks for measuring
the progress of artificial intelligence systems in understanding basic science.
The primary problem is that these tests are designed to test aspects of
knowledge and ability that are challenging for people; the aspects that are
challenging for AI systems are very different. In particular, standardized
tests do not test knowledge that is obvious to people, yet none of this knowledge
can be assumed in AI systems. Individual standardized tests also have specific
features that are not necessarily appropriate for an AI benchmark. I analyze
the Physics subject SAT in some detail and the New York State Regents Science
test more briefly. I also argue that the apparent advantages offered by using
standardized tests are mostly either minor or illusory. The one major real
advantage is that the significance is easily explained to the public; but I
argue that even this is a somewhat mixed blessing. I conclude by arguing,
first, that more appropriate collections of exam-style problems could be assembled,
and second, that there are better kinds of benchmarks than exam-style problems.
In an appendix I present a collection of sample exam-style problems that test
kinds of knowledge missing from the standardized tests.