Early Release

This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.

What you’ll see

The feedback from the evaluator is a JSON object that looks something like this:
{
  "reasoning": "1. **Quantitative Analysis**: The text has a word count of 141 [...] **Synthesis**: The quantitative Flesch-Kincaid score of 10.7 aligns perfectly with the qualitative analysis [...]",
  "grade": "9-10",
  "alternative_grade": "6-8",
  "scaffolding_needed": "Pre-teaching of vocabulary [...]"
}
The output includes the following pieces of information:
  • Grade: Target grade band for the text when used for independent reading.
  • Reasoning: The rationale behind the grade.
  • Alternative_grade: An additional grade band the content may be suitable for with additional support.
  • Scaffolding_needed: The supports required for the content to be appropriate for the alternative grade band.
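If you consume the evaluator programmatically, a small typed wrapper can make these fields easier to work with downstream. The sketch below is in Python and only assumes the JSON shape shown above; the helper itself is illustrative and not part of any SDK.

from dataclasses import dataclass
import json


@dataclass
class GradeLevelResult:
    """Typed view of one evaluator response; fields mirror the JSON above."""
    reasoning: str
    grade: str
    alternative_grade: str
    scaffolding_needed: str


def parse_result(raw: str) -> GradeLevelResult:
    """Parse the evaluator's JSON string into a GradeLevelResult."""
    data = json.loads(raw)
    return GradeLevelResult(
        reasoning=data["reasoning"],
        grade=data["grade"],
        alternative_grade=data["alternative_grade"],
        scaffolding_needed=data["scaffolding_needed"],
    )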

How to use the information

The Grade Level Appropriateness Evaluator helps you determine the grade band an AI-generated passage is suitable for. This can be valuable in several situations.

Feature development

Optimizing your LLM

Optimize your LLM-generated text by using the evaluator’s grade to validate that your prompts produce grade-appropriate content, then aggregate the detailed reasoning from multiple runs to diagnose and fix underlying text complexity issues at scale. For example, if you run the evaluator over a dataset of 1,000 Grade 4 passages from your product and find that 60% of them are scored as Grade 6-8, you can filter to those mismatches and mine their reasoning to identify the underlying issues, as sketched below.
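A minimal sketch of that workflow, assuming you have already collected parsed evaluator results (as in the wrapper above) for passages your product intended for Grade 4; the band label "3-5" is illustrative, since your intended grade may map onto a different band.

from collections import Counter

def summarize_grade_drift(results, intended_band="3-5"):
    """Summarize how often passages land outside the intended grade band and
    collect the reasoning text for the mismatches."""
    grade_counts = Counter(r.grade for r in results)
    mismatches = [r for r in results if r.grade != intended_band]
    return {
        "grade_distribution": dict(grade_counts),
        "mismatch_rate": len(mismatches) / len(results) if results else 0.0,
        "mismatch_reasoning": [r.reasoning for r in mismatches],
    }

From there, the collected reasoning strings can be reviewed manually or clustered with another LLM pass to pinpoint what is pushing the complexity upward.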

Suggesting scaffolds to your users

You can also use this evaluator to tailor your product: use the alternative grade and the scaffolding needed to automatically generate scaffolds that help educators reach a wider range of learners with your content.
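For example, you might turn the alternative grade and scaffolding notes into a follow-up prompt that drafts educator-facing supports. This is only a sketch; `call_llm` is a placeholder for whatever model call your product already makes.

def build_scaffold_prompt(passage, result):
    """Compose a follow-up prompt asking an LLM to draft concrete supports
    for the evaluator's alternative grade band."""
    return (
        f"This passage was judged suitable for grades {result.grade} for independent reading, "
        f"and for grades {result.alternative_grade} with additional support.\n"
        f"Suggested supports: {result.scaffolding_needed}\n\n"
        f"Passage:\n{passage}\n\n"
        f"Write classroom-ready scaffolds (e.g., vocabulary pre-teaching, sentence frames, "
        f"guiding questions) an educator could use with grade {result.alternative_grade} students."
    )

# scaffolds = call_llm(build_scaffold_prompt(passage, result))  # `call_llm` is a placeholder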

Maintaining performance

Continuously monitor the consistency of your LLM-generated content by regularly sampling your system’s output (e.g., daily or weekly) and running it through the evaluator. By tracking the aggregated grade bands and reasoning period over period, you can detect unintended shifts in output and verify that it remains consistent and reliable for users.
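As a sketch, a monitoring job might compute the grade-band distribution of each period’s sample and flag large shifts; the band labels and the 10-point threshold below are illustrative, not prescriptive.

def detect_band_shift(previous, current, threshold=0.10):
    """Compare two period-over-period grade-band distributions (band -> share of
    sampled outputs) and flag any band whose share moved more than `threshold`."""
    bands = set(previous) | set(current)
    return [
        (band, current.get(band, 0.0) - previous.get(band, 0.0))
        for band in bands
        if abs(current.get(band, 0.0) - previous.get(band, 0.0)) > threshold
    ]

# Example: last week's sample vs. this week's, as shares of evaluated outputs.
last_week = {"3-5": 0.82, "6-8": 0.18}
this_week = {"3-5": 0.61, "6-8": 0.39}
print(detect_band_shift(last_week, this_week))  # flags both bands (~0.21 shift each)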

Selecting a model

When comparing foundation models or assessing prompt strategies for content generation, use the evaluator’s grade band outputs to measure which configuration most accurately and consistently produces text at your desired complexity level. This helps you weigh the tradeoffs between speed, cost, and quality and select the best option for your use case.
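One way to frame the comparison, assuming you have already collected evaluator grades for each candidate model or prompt configuration, is to score accuracy against the intended band and consistency across samples. This is a sketch, not a prescribed metric.

from collections import Counter

def score_configuration(grades, intended_band):
    """Summarize one model/prompt configuration: accuracy is the share of samples
    landing in the intended band; consistency is the share in the single most
    common band, whichever band that is."""
    counts = Counter(grades)
    total = len(grades)
    return {
        "accuracy": counts.get(intended_band, 0) / total,
        "consistency": counts.most_common(1)[0][1] / total,
    }

# Hypothetical comparison of two configurations targeting the "3-5" band.
print(score_configuration(["3-5", "3-5", "6-8", "3-5"], "3-5"))  # {'accuracy': 0.75, 'consistency': 0.75}
print(score_configuration(["3-5", "6-8", "6-8", "6-8"], "3-5"))  # {'accuracy': 0.25, 'consistency': 0.75}

Paired with the latency and cost measurements from your own stack, these two numbers give a simple basis for the speed, cost, and quality tradeoff described above.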