Early Release

This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.
The Grade Level Appropriateness Evaluator was developed specifically for AI-generated informational texts for grades K–11, and its performance should be understood within that scope. The limitations below reflect the evaluator’s validated testing conditions, known performance boundaries, and intended usage guidelines.
  • The evaluator is calibrated for a specific configuration of model settings and prompt formats. Its reliability has not been established outside that configuration, and alternate settings or formats may produce inconsistent results.
  • It may not perform reliably on texts targeting grade levels outside the intended K–11 range.
  • The GLA validation dataset is drawn from Common Core Appendix B, with larger sample sizes from Grade 2 onward; performance estimates are correspondingly more precise for those grades.
  • The evaluator was tested on Common Core Appendix B exemplars, with additional expert validation on CLEAR Corpus texts. Performance has not been formally evaluated on passages longer than 1,200 words or on texts outside these datasets.
  • Some variability is inherent in LLM outputs. Occasional inconsistencies in sentence labeling or complexity scoring may occur between runs.
  • The evaluator is intended for exploratory use only. It is not validated for formal instructional placement, assessment, or other high-stakes educational decisions.
  • Results should be interpreted with human judgment, especially when informing curriculum development, educational interventions, or product design.
  • The evaluator is intended for de-identified, general-purpose text inputs only. Users should not submit student information, personally identifiable data, or any content subject to privacy regulations such as FERPA, COPPA, HIPAA, or GDPR.
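Several of the limits above (the K–11 range, the 1,200-word tested bound, and the de-identified-input requirement) can be screened for before a text is ever submitted. The sketch below is a hypothetical pre-flight check, not part of the evaluator itself: the grade encoding, word-count threshold handling, and regex PII patterns are illustrative assumptions, and a naive regex scan is not a substitute for FERPA, COPPA, HIPAA, or GDPR compliance review.

```python
import re

# Hypothetical pre-flight check for text submitted to the evaluator.
# The 1,200-word ceiling mirrors the tested upper bound noted above;
# grades are encoded as integers with K = 0. All names and patterns
# here are illustrative assumptions.

MAX_TESTED_WORDS = 1200
SUPPORTED_GRADES = range(0, 12)  # K (0) through grade 11

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),             # US SSN-like number
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),       # email address
    re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"), # US phone-like number
]

def preflight(text: str, target_grade: int) -> list[str]:
    """Return a list of warnings; an empty list means no issues were found."""
    warnings = []
    if target_grade not in SUPPORTED_GRADES:
        warnings.append(
            f"grade {target_grade} is outside the validated K-11 range"
        )
    word_count = len(text.split())
    if word_count > MAX_TESTED_WORDS:
        warnings.append(
            f"{word_count} words exceeds the tested {MAX_TESTED_WORDS}-word bound"
        )
    if any(p.search(text) for p in PII_PATTERNS):
        warnings.append("text appears to contain PII; de-identify before submitting")
    return warnings
```

A caller would run this screen first and hold back any text that produces warnings, keeping submissions within the evaluator's tested conditions.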