Running the evaluator


Early Release

This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.

Take a look at the evaluator files.

What you’ll need

To run the Vocabulary evaluator, you’ll need the following:

System and user prompts that create a baseline assumption for a student’s background knowledge.
System and user prompts that measure vocabulary complexity.
A set of models and temperature settings, one for each stage of the evaluator:

Background knowledge

Model: GPT-4o
Temp: 0

Complexity

Model: Gemini-2.5-pro
Temp: 0
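The stage settings above can be kept in one place as a small configuration table. This is only a sketch: the `StageConfig` class and the stage keys are hypothetical names chosen for illustration, and the model strings are copied as written on this page, so the identifiers your API provider expects may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Model and sampling temperature for one evaluator stage."""
    model: str
    temperature: float


# Stage keys are illustrative. Model names and temperatures are taken from
# the list above; provider-specific model identifiers may be spelled differently.
STAGES = {
    "background_knowledge": StageConfig(model="GPT-4o", temperature=0.0),
    "complexity": StageConfig(model="Gemini-2.5-pro", temperature=0.0),
}
```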

Running the evaluator

1. Make an API call to GPT-4o using the background knowledge prompts. This returns a string describing the background knowledge that students in the target grade level are likely to have.
2. Using the background knowledge produced in step 1, make a second API call to Gemini-2.5-pro using the complexity prompts.
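The two calls chain together: the output of the first becomes an input to the second. The sketch below shows that flow without committing to a particular SDK. Everything named here is an assumption for illustration: `call_model` is a hypothetical function you supply to wrap your provider's client, the `prompts` dictionary keys are invented, and the `{grade}`, `{passage}`, and `{background}` placeholders assume the user prompts are `str.format` templates, which the actual evaluator files may not be.

```python
from typing import Callable, Dict

# Hypothetical signature: (model, temperature, system_prompt, user_prompt) -> response text.
ModelCall = Callable[[str, float, str, str], str]


def run_evaluator(passage: str, grade: str, prompts: Dict[str, str],
                  call_model: ModelCall) -> str:
    """Run both stages and return the raw complexity response."""
    # Step 1: ask GPT-4o what background knowledge students at this grade
    # level are likely to have.
    background = call_model(
        "GPT-4o",
        0.0,
        prompts["background_system"],
        prompts["background_user"].format(grade=grade, passage=passage),
    )

    # Step 2: pass that background knowledge to Gemini-2.5-pro along with
    # the complexity prompts.
    return call_model(
        "Gemini-2.5-pro",
        0.0,
        prompts["complexity_system"],
        prompts["complexity_user"].format(
            grade=grade, passage=passage, background=background
        ),
    )
```

Passing `call_model` in as a parameter keeps the pipeline logic separate from any one vendor's client library and makes it possible to test the flow with a stub.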

Choosing test passages

We tested with texts that range between 130 and 205 words. We recommend evaluating passages that fall within this range.
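A quick pre-check can flag passages that fall outside the tested range before you spend API calls on them. This is a minimal sketch with two assumptions: words are counted by splitting on whitespace, and both ends of the 130 to 205 range are treated as inclusive. The function names are hypothetical.

```python
def word_count(passage: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(passage.split())


def in_tested_range(passage: str, low: int = 130, high: int = 205) -> bool:
    """Return True if the passage length falls within the tested range."""
    return low <= word_count(passage) <= high
```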

Run multiple times. We recommend running the evaluator 3 times for each text passage to smooth over any errors the LLM might make. If you are working with text intended for multiple grade levels, you may want to run the evaluator a few times using different grade levels.
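One way to combine repeated runs is a simple majority vote. This page does not say how the three results should be reconciled, so the approach below is an assumption, as is the idea that each run can be reduced to a single comparable label. `evaluate_once` is a hypothetical zero-argument function that performs one full evaluator run and returns that label.

```python
from collections import Counter
from typing import Callable, Hashable, List, Tuple


def run_repeated(evaluate_once: Callable[[], Hashable],
                 runs: int = 3) -> Tuple[Hashable, int, List[Hashable]]:
    """Run the evaluator several times and return the most common result.

    Returns (label, votes, all_results). On a tie, the label seen first wins.
    """
    results = [evaluate_once() for _ in range(runs)]
    label, votes = Counter(results).most_common(1)[0]
    return label, votes, results
```

If `votes` is lower than `runs`, the runs disagreed, which may be a sign that the passage deserves a closer manual look.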
