How evaluators are designed
| Concept | Definition |
|---|---|
| Evaluator | Evaluates AI-generated content by focusing on one dimension of a rubric. Example: The Vocabulary evaluator focuses on one dimension (Vocabulary) of the Qualitative Text Complexity rubric. |
| Dimension | Describes one specific attribute or criterion of AI-generated content. Example: Evaluators can look at Vocabulary, Sentence Structure, and other dimensions of a text when evaluating its complexity. |
| Rubric | Provides a framework for evaluating a concept (e.g., Qualitative Text Complexity) through multiple dimensions. Example: The Qualitative Text Complexity rubric looks at dimensions like Vocabulary and Sentence Structure to make a judgment on a text’s complexity. |
| Evaluators family | Collection of evaluators that evaluate dimensions of one rubric. Example: The Literacy evaluators family contains multiple evaluators that each assess one dimension of the Qualitative Text Complexity rubric. |
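The relationships in the table above can be sketched as a small data model. This is only an illustration: the class names (`Rubric`, `Evaluator`, `EvaluatorFamily`) and the `add` helper are hypothetical and are not part of any Learning Commons API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rubric:
    """A framework that evaluates one concept through several dimensions."""
    name: str
    dimensions: tuple[str, ...]


@dataclass(frozen=True)
class Evaluator:
    """Focuses on exactly one dimension of a rubric."""
    rubric: Rubric
    dimension: str

    def __post_init__(self) -> None:
        # An evaluator is only meaningful for a dimension its rubric defines.
        if self.dimension not in self.rubric.dimensions:
            raise ValueError(
                f"{self.dimension!r} is not a dimension of {self.rubric.name!r}"
            )


@dataclass
class EvaluatorFamily:
    """A collection of evaluators that all assess dimensions of one rubric."""
    name: str
    rubric: Rubric
    evaluators: list[Evaluator] = field(default_factory=list)

    def add(self, dimension: str) -> Evaluator:
        # Hypothetical helper: create an evaluator for one dimension and keep it.
        evaluator = Evaluator(self.rubric, dimension)
        self.evaluators.append(evaluator)
        return evaluator


# Names taken from the examples in the table.
rubric = Rubric("Qualitative Text Complexity", ("Vocabulary", "Sentence Structure"))
literacy = EvaluatorFamily("Literacy", rubric)
literacy.add("Vocabulary")
literacy.add("Sentence Structure")
```

In this sketch, asking the family for an evaluator on a dimension the rubric does not define raises a `ValueError`, which mirrors the rule that each evaluator covers one dimension of its rubric.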
How outputs are validated
| Concept | Definition |
|---|---|
| Evaluator dataset | Dataset rigorously annotated by human domain experts according to a rubric. These annotated datasets train and inform our “LLM as a judge” assessments, making our evaluators more accurate with every iteration. Example: We partnered with literacy experts to create the Literacy dataset, which we also used to develop our Literacy evaluators. |
| LLM as a judge | Evaluation method that automates human-like reasoning at scale. Learning Commons uses this evaluation method to develop our evaluators. We prompt LLMs to use our expert-annotated evaluator datasets to validate our evaluators’ outputs. |
| Baseline prompt | Simulates what an edtech developer might construct quickly without expert-annotated datasets, structured rubric alignment, or prompt optimization. Learning Commons uses baseline prompts to measure how much our expert-annotated evaluator datasets improve our evaluators’ performance. |
| Accuracy | Degree to which an evaluator aligns with our evaluator datasets, or how often an evaluator agrees with a human expert. Learn more about how accuracy is measured and our recommended use cases. |
How accuracy is measured
An evaluator’s accuracy is calculated on every evaluation run per input (single-run accuracy). Because LLM outputs are probabilistic and inherently non-deterministic, accuracy results may vary across evaluation runs (even with the same inputs).
Metrics
Not all evaluators will return all accuracy metrics.
| Accuracy metric | Definition |
|---|---|
| Overall accuracy | Percentage of predictions that exactly match the expert-annotated evaluator dataset. Some disagreements between our evaluators and human experts are expected, especially in qualitative domains and borderline cases. |
| Expert agreement rate | Percentage of reviewed cases where experts agree with evaluator output |
| Reasoning quality score | Expert rating of explanation quality (often on a numeric scale) |
| Baseline comparison accuracy | Performance relative to a minimal baseline prompt |
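As a rough illustration of overall accuracy, the sketch below counts exact matches against expert labels for two runs over the same inputs, then compares the result with a baseline prompt. The labels and outputs are made up. Averaging across runs and reporting the baseline comparison as a simple difference are assumptions made for this example, not the documented Learning Commons calculation.

```python
from statistics import mean


def overall_accuracy(predictions: list[str], expert_labels: list[str]) -> float:
    """Share of predictions that exactly match the expert annotation."""
    if len(predictions) != len(expert_labels):
        raise ValueError("predictions and expert_labels must have the same length")
    if not expert_labels:
        raise ValueError("at least one labeled example is required")
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return matches / len(expert_labels)


# Hypothetical expert annotations for four inputs.
expert = ["simple", "moderate", "complex", "moderate"]

# Two evaluation runs on the same inputs; outputs can differ between runs.
evaluator_runs = [
    ["simple", "moderate", "complex", "complex"],   # 3 of 4 match
    ["simple", "moderate", "complex", "moderate"],  # 4 of 4 match
]

# Hypothetical outputs from a minimal baseline prompt.
baseline = ["simple", "complex", "complex", "complex"]  # 2 of 4 match

run_accuracies = [overall_accuracy(run, expert) for run in evaluator_runs]
mean_accuracy = mean(run_accuracies)

baseline_accuracy = overall_accuracy(baseline, expert)
# Assumption: express the baseline comparison as a difference in accuracy.
improvement = mean_accuracy - baseline_accuracy

print(run_accuracies)     # [0.75, 1.0]
print(mean_accuracy)      # 0.875
print(baseline_accuracy)  # 0.5
print(improvement)        # 0.375
```

Reporting the per-run values alongside the mean keeps the run-to-run variation visible, so it is not hidden behind a single number.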
Use cases
| Recommended | Not recommended |
|---|---|
| Estimate expected agreement with expert judgment | Fully automate high-stakes decisions |
| Set appropriate thresholds for downstream logic | Assume consistent performance outside the validated scope |
| | Replace domain expertise where interpretation is critical |
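One way to apply these recommendations in downstream logic is to gate automation on a minimum accuracy and always send high-stakes decisions to a person. The function name `route`, the return values, and the `0.8` default threshold are hypothetical choices for this sketch. An appropriate threshold depends on the evaluator and the use case.

```python
def route(evaluator_accuracy: float, high_stakes: bool, min_accuracy: float = 0.8) -> str:
    """Decide whether an evaluator result can be used without human review."""
    if not 0.0 <= evaluator_accuracy <= 1.0:
        raise ValueError("evaluator_accuracy must be between 0 and 1")
    # High-stakes decisions are never fully automated, whatever the accuracy.
    if high_stakes:
        return "human_review"
    # Below the chosen threshold, expected expert agreement is too low to rely on.
    if evaluator_accuracy < min_accuracy:
        return "human_review"
    return "automated"


print(route(0.9, high_stakes=False))  # automated
print(route(0.9, high_stakes=True))   # human_review
print(route(0.7, high_stakes=False))  # human_review
```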