Early Release
This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.
Baseline AI evaluator comparison
We tested this evaluator against a baseline that a time- and resource-strapped edtech company might create to measure grade-level appropriateness. The baseline's performance varied greatly with the model and temperature setting; it averaged ~50% accuracy across all configurations tested.
Accuracy
Performance
Grade band (source of truth) | Correct predictions (target or alternative grade band) | Total records | Accuracy |
---|---|---|---|
K-1 | 5 | 5 | 100% |
2-3 | 8 | 9 | 89% |
4-5 | 11 | 11 | 100% |
6-8 | 10 | 15 | 67% |
9-10 | 19 | 22 | 86% |
11-CCR | 17 | 24 | 71% |
Overall | 70 | 86 | 81% |
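
The accuracy figures above can be reproduced from the raw counts. The following is a minimal sketch, not part of the evaluator itself: the names `RESULTS`, `accuracy_pct`, and `summarize` are hypothetical, and it assumes accuracy is reported as correct predictions divided by total records, rounded to the nearest whole percent.

```python
# Counts copied from the table above: (correct predictions, total records).
# The variable and function names here are illustrative assumptions.
RESULTS = {
    "K-1": (5, 5),
    "2-3": (8, 9),
    "4-5": (11, 11),
    "6-8": (10, 15),
    "9-10": (19, 22),
    "11-CCR": (17, 24),
}


def accuracy_pct(correct: int, total: int) -> int:
    """Return accuracy as a percentage rounded to the nearest whole number."""
    return round(100 * correct / total)


def summarize(results):
    """Return per-band accuracy plus pooled totals and pooled accuracy."""
    per_band = {band: accuracy_pct(c, t) for band, (c, t) in results.items()}
    total_correct = sum(c for c, _ in results.values())
    total_records = sum(t for _, t in results.values())
    overall = accuracy_pct(total_correct, total_records)
    return per_band, total_correct, total_records, overall


if __name__ == "__main__":
    per_band, total_correct, total_records, overall = summarize(RESULTS)
    for band, pct in per_band.items():
        print(f"{band}: {pct}%")
    print(f"Overall: {total_correct}/{total_records} = {overall}%")
```

Note that the Overall row pools records across all grade bands (70 of 86) rather than averaging the six per-band percentages, so bands with more records, such as 9-10 and 11-CCR, weigh more heavily in the overall figure.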