Evaluator accuracy

Early Release

This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.

Baseline AI evaluator comparison

We tested this evaluator against a baseline that a time- and resource-strapped edtech company might build to measure grade-level appropriateness. The baseline’s performance varied greatly with the model and temperature setting; it averaged ~50% accuracy across all configurations tested.

Accuracy

The Grade Level Appropriateness Evaluator is over 58% more accurate than the naive LLM baseline.

Performance

Comparison of the evaluator’s accuracy across grade bands.

Grade band (source of truth)	Correct predictions (target or alternative grade band)	Total records	Accuracy
K-1	5	5	100%
2-3	8	9	89%
4-5	11	11	100%
6-8	10	15	67%
9-10	19	22	86%
11-CCR	17	24	71%
Overall	70	86	81%

This table shows accuracy per grade band and where the evaluator is most likely to make mistakes. For example, the validation dataset included 22 Common Core exemplar texts for Grades 9-10. The evaluator labeled 19 of them correctly, which gives an accuracy of 86% for that grade band.
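The percentages in the table can be reproduced from the raw counts. The following is a minimal sketch, not the evaluation code itself: the names `RESULTS`, `accuracy_pct`, and `summarize` are hypothetical, and the counts are copied from the table above.

```python
# Counts copied from the performance table: grade band -> (correct, total).
# All names in this sketch are illustrative, not part of the evaluator.
RESULTS = {
    "K-1": (5, 5),
    "2-3": (8, 9),
    "4-5": (11, 11),
    "6-8": (10, 15),
    "9-10": (19, 22),
    "11-CCR": (17, 24),
}


def accuracy_pct(correct: int, total: int) -> int:
    """Return accuracy as a percentage rounded to the nearest whole number."""
    return round(100 * correct / total)


def summarize(results: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Compute per-band accuracy plus an overall figure pooled over all records."""
    rows = {band: accuracy_pct(c, t) for band, (c, t) in results.items()}
    total_correct = sum(c for c, _ in results.values())
    total_records = sum(t for _, t in results.values())
    rows["Overall"] = accuracy_pct(total_correct, total_records)
    return rows


if __name__ == "__main__":
    for band, pct in summarize(RESULTS).items():
        print(f"{band}: {pct}%")
```

Note that the overall figure pools all 86 records rather than averaging the six per-band percentages, so larger bands such as 11-CCR weigh more heavily in it.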

Baseline prompt

This is the prompt we used as a baseline to simulate what a time- and resource-strapped edtech organization might use when building an evaluator to determine grade-level appropriateness.

System prompt

You are an expert in English literature education for K-12.

Your job is to help evaluate the grade-level appropriateness of a given text.

You will be given a text and you should determine which grade level the text is appropriate for (grade levels include: K-1, 2-3, 4-5, 6-8, 9-10, 11-CCR)

IMPORTANT: You should pay attention to the vocabulary used, topics of the text, and readability of the text.

Please first reason out loud about the vocabulary complexity of the text and then provide an answer between grade level options: K-1, 2-3, 4-5, 6-8, 9-10, 11-CCR.

User prompt

Read the following text and tell me the appropriate grade band for the text.

Here is the text:\n
[BEGIN TEXT]
{text}
[END TEXT]\n

In your response, provide your answer from one the following grade level options based on your assessment:  K-1, 2-3, 4-5, 6-8, 9-10, 11-CCR.

{format_instructions}
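The user prompt has two placeholders, {text} and {format_instructions}. The sketch below shows one way they might be filled in and the model's answer validated. It is an illustration under stated assumptions, not the code used for the baseline: the helper names `build_user_prompt` and `parse_grade_band` are hypothetical, the literal \n markers are rendered as line breaks, and the JSON response shape with a `grade_band` field is assumed, since the source does not show what {format_instructions} contained.

```python
import json

# The six grade bands named in the baseline prompt.
GRADE_BANDS = ["K-1", "2-3", "4-5", "6-8", "9-10", "11-CCR"]

# User prompt copied from above, with the \n markers rendered as line breaks.
USER_PROMPT_TEMPLATE = """Read the following text and tell me the appropriate grade band for the text.

Here is the text:
[BEGIN TEXT]
{text}
[END TEXT]

In your response, provide your answer from one the following grade level options based on your assessment:  K-1, 2-3, 4-5, 6-8, 9-10, 11-CCR.

{format_instructions}"""


def build_user_prompt(text: str, format_instructions: str) -> str:
    """Fill the two placeholders in the user prompt template."""
    return USER_PROMPT_TEMPLATE.format(
        text=text, format_instructions=format_instructions
    )


def parse_grade_band(response: str) -> str:
    """Extract and validate the grade band from a JSON response.

    Assumes (hypothetically) that the model was told to reply with a JSON
    object containing a "grade_band" field.
    """
    band = json.loads(response).get("grade_band")
    if band not in GRADE_BANDS:
        raise ValueError(f"Unexpected grade band: {band!r}")
    return band
```

Validating the answer against the fixed list of bands means a malformed reply is rejected with an error instead of being silently scored as a wrong prediction.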
