Early Release

This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.

Foundation and approach

This evaluator was built on Google’s Gemini 2.5 Pro model. Its behavior was developed through an iterative prompt-engineering process carried out alongside pedagogical experts from Achievement Network (ANet).

Development data

The primary dataset used for development and validation contained 171 Common Core exemplar texts suitable for students in grades K-11. The Common Core dataset was randomly split into a training set of 85 texts and a test set of 86 texts. The training set was used exclusively during the experimentation phase, while the test set was reserved for the final accuracy measurement.
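
For illustration, the split can be pictured as a simple seeded shuffle of the exemplar texts. The sketch below is not the actual pipeline; the file name, record fields, and seed are all assumptions.

```python
import json
import random

# Minimal sketch of the train/test split (file name, fields, and seed are assumed).
with open("common_core_exemplars.json") as f:
    texts = json.load(f)  # 171 exemplar texts, each tagged with its official grade band

random.seed(42)          # fixed seed so the split is reproducible
random.shuffle(texts)

train_set = texts[:85]   # used only during the experimentation phase
test_set = texts[85:]    # 86 texts held out for the final accuracy measurement

print(len(train_set), len(test_set))  # 85 86
```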

Development and experimentation

Using LangSmith, we ran over 100 experiments to find the best configuration. These experiments tested different permutations of LLMs, prompt formulations, and temperature settings.
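
Conceptually, the sweep amounts to a grid search over configurations, with each configuration scored against the training set. The sketch below is purely illustrative: the model names, prompt variants, temperatures, and the `run_langsmith_experiment` helper are hypothetical stand-ins for the actual LangSmith runs.

```python
from itertools import product

def run_langsmith_experiment(model: str, prompt: str, temperature: float) -> float:
    """Placeholder for evaluating one configuration against the 85-text
    training set and logging the run to LangSmith; returns grade-band accuracy."""
    return 0.0

# Hypothetical configuration grid (names and values are assumed, not the real sweep).
models = ["gemini-2.5-pro", "gemini-2.0-flash"]
prompts = ["baseline_prompt", "rubric_prompt", "few_shot_prompt"]
temperatures = [0.0, 0.3, 0.7]

results = []
for model, prompt, temperature in product(models, prompts, temperatures):
    accuracy = run_langsmith_experiment(model=model, prompt=prompt, temperature=temperature)
    results.append({"model": model, "prompt": prompt,
                    "temperature": temperature, "accuracy": accuracy})

# Pick the configuration with the highest training-set accuracy.
best = max(results, key=lambda r: r["accuracy"])
print(best)
```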

Validation and scoring

The evaluator’s performance was validated using the Common Core benchmark dataset. Accuracy was measured as the percentage of texts in the test set for which the evaluator’s predicted grade band (either target or alternative) matched the official Common Core grade band. This resulted in a final accuracy of 81%. For an additional layer of human validation, 12 texts were randomly sampled from the CLEAR corpus. Literacy experts from ANet reviewed these texts to perform a soundness check on the evaluator’s grade band assignments and reasoning.
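
Concretely, a prediction counts as correct when either the target or the alternative grade band matches the official band. The sketch below shows this metric under assumed field names and example grade bands; it is not the evaluation harness itself.

```python
def grade_band_accuracy(predictions, gold_bands):
    """Fraction of texts whose predicted target or alternative grade band
    matches the official Common Core grade band.

    `predictions` is a list of (target_band, alternative_band) pairs and
    `gold_bands` is the matching list of official bands (field names assumed).
    """
    correct = sum(
        official in (target, alternative)
        for (target, alternative), official in zip(predictions, gold_bands)
    )
    return correct / len(gold_bands)

# Illustrative example: 2 of 3 predictions match, so accuracy ≈ 0.67.
preds = [("2-3", "4-5"), ("6-8", "4-5"), ("9-10", "11-12")]
gold = ["2-3", "4-5", "6-8"]
print(grade_band_accuracy(preds, gold))
```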