Core concepts - AI Developer Tools for Education

How evaluators are designed

Concept	Definition
Evaluator	Evaluates AI-generated content by focusing on one dimension of a rubric. Example: The Vocabulary evaluator focuses on one dimension (Vocabulary) of the Qualitative Text Complexity rubric.
Dimension	Describes one specific attribute or criterion of AI-generated content. Example: Evaluators can look at Vocabulary, Sentence Structure, and other dimensions of a text when evaluating its complexity.
Rubric	Provides a framework for evaluating a concept (e.g., Qualitative Text Complexity) through multiple dimensions. Example: The Qualitative Text Complexity rubric looks at dimensions like Vocabulary and Sentence Structure to make a judgment on a text’s complexity.
Evaluators family	Collection of evaluators that evaluate dimensions of one rubric. Example: The Literacy evaluators family contains multiple evaluators that each assess one dimension of the Qualitative Text Complexity rubric.
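
The relationships in the table above can be sketched as a small data model. This is an illustration only: the class names (`Dimension`, `Rubric`, `Evaluator`, `EvaluatorFamily`) and their fields are hypothetical and are not part of any Learning Commons API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Dimension:
    """One specific attribute of AI-generated content (hypothetical type)."""
    name: str


@dataclass
class Rubric:
    """A framework that evaluates a concept through multiple dimensions."""
    name: str
    dimensions: List[Dimension]


@dataclass
class Evaluator:
    """Focuses on exactly one dimension of one rubric."""
    name: str
    rubric: Rubric
    dimension: Dimension

    def __post_init__(self) -> None:
        # An evaluator only makes sense for a dimension its rubric defines.
        if self.dimension not in self.rubric.dimensions:
            raise ValueError(
                f"{self.dimension.name!r} is not a dimension of {self.rubric.name!r}"
            )


@dataclass
class EvaluatorFamily:
    """A collection of evaluators that all belong to the same rubric."""
    name: str
    rubric: Rubric
    evaluators: List[Evaluator] = field(default_factory=list)

    def add(self, evaluator: Evaluator) -> None:
        if evaluator.rubric != self.rubric:
            raise ValueError("Evaluator belongs to a different rubric")
        self.evaluators.append(evaluator)


if __name__ == "__main__":
    vocabulary = Dimension("Vocabulary")
    rubric = Rubric(
        "Qualitative Text Complexity",
        [vocabulary, Dimension("Sentence Structure")],
    )
    family = EvaluatorFamily("Literacy", rubric)
    family.add(Evaluator("Vocabulary evaluator", rubric, vocabulary))
    print([e.name for e in family.evaluators])
```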

How outputs are validated

Concept	Definition
Evaluator dataset	Dataset rigorously annotated by human domain experts according to a rubric. These annotated datasets train and inform our “LLM as a judge” assessments, making our evaluators more accurate with every iteration. Example: We partnered with literacy experts to create the Literacy dataset, which we also used to develop our Literacy evaluators.
LLM as a judge	Evaluation method that automates human-like reasoning at scale. Learning Commons uses this evaluation method to develop our evaluators. We prompt LLMs to use our expert-annotated evaluator datasets to validate our evaluators’ outputs.
Baseline prompt	Simulates what an edtech developer might construct quickly without expert-annotated datasets, structured rubric alignment, or prompt optimization. Learning Commons uses baseline prompts to measure how much our expert-annotated evaluator datasets improve our evaluators’ performance.
Accuracy	Degree to which an evaluator aligns with our evaluator datasets, or how often an evaluator agrees with a human expert. Learn more about How accuracy is measured and our recommended use cases.

How accuracy is measured

An evaluator’s accuracy is calculated per input on each evaluation run (single-run accuracy). Because LLM outputs are probabilistic, accuracy results may vary across evaluation runs, even with the same inputs.

To reduce your evaluators’ variability and increase their reliability in production:

Run the evaluator multiple times on the same input
Treat each output as a vote
Select the majority result (for example, 3 runs with majority selection)
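
The three steps above can be sketched as follows. This is a minimal sketch, not a Learning Commons implementation: `run_evaluator` stands in for whatever call returns one evaluator label for a text, and the helper names `majority_vote` and `evaluate_with_voting` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple


def majority_vote(labels: List[str]) -> Tuple[str, float]:
    """Return the most common label and the share of runs that chose it."""
    if not labels:
        raise ValueError("At least one evaluator output is required")
    counts = Counter(labels)
    # On a tie, most_common keeps the label that was seen first.
    top_label, top_count = counts.most_common(1)[0]
    return top_label, top_count / len(labels)


def evaluate_with_voting(
    run_evaluator: Callable[[str], str], text: str, runs: int = 3
) -> Tuple[str, float]:
    """Run the evaluator several times on the same input and take the majority."""
    labels = [run_evaluator(text) for _ in range(runs)]
    return majority_vote(labels)


if __name__ == "__main__":
    # Stub evaluator (assumption): replays fixed outputs to mimic run-to-run variation.
    canned_outputs = iter(["complex", "moderate", "complex"])
    label, share = evaluate_with_voting(lambda text: next(canned_outputs), "Sample passage")
    print(label, share)
```

An odd number of runs avoids ties when an evaluator returns one of two labels. With more than two possible labels a tie is still possible, so you may want to treat a low vote share as a signal for manual review.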

Metrics

Not all evaluators will return all accuracy metrics.

Accuracy metric	Definition
Overall accuracy	Percentage of predictions that exactly match the expert-annotated evaluator dataset. Some disagreements between our evaluators and human experts are expected, especially in qualitative domains and borderline cases.
Expert agreement rate	Percentage of reviewed cases where experts agree with evaluator output
Reasoning quality score	Expert rating of explanation quality (often on a numeric scale)
Baseline comparison accuracy	Performance relative to a minimal baseline prompt
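
As an illustration of two of these metrics, the sketch below computes overall accuracy as the percentage of exact matches against expert labels. The source does not specify how the baseline comparison is calculated, so the percentage-point difference used here is an assumption, as are the function names and the sample labels.

```python
from typing import Sequence


def overall_accuracy(predictions: Sequence[str], expert_labels: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the expert annotations."""
    if len(predictions) != len(expert_labels):
        raise ValueError("Predictions and expert labels must have the same length")
    if not expert_labels:
        raise ValueError("At least one labeled example is required")
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return 100.0 * matches / len(expert_labels)


def accuracy_gain_over_baseline(
    evaluator_predictions: Sequence[str],
    baseline_predictions: Sequence[str],
    expert_labels: Sequence[str],
) -> float:
    """Difference in overall accuracy, in percentage points (assumed definition)."""
    return overall_accuracy(evaluator_predictions, expert_labels) - overall_accuracy(
        baseline_predictions, expert_labels
    )


if __name__ == "__main__":
    # Made-up labels for illustration only; not real evaluation results.
    expert = ["slightly complex", "moderately complex", "very complex", "moderately complex"]
    evaluator = ["slightly complex", "moderately complex", "very complex", "very complex"]
    baseline = ["slightly complex", "very complex", "slightly complex", "very complex"]
    print(overall_accuracy(evaluator, expert))
    print(accuracy_gain_over_baseline(evaluator, baseline, expert))
```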

Use cases

Use accuracy metrics to inform implementation decisions, not to replace judgment.

Recommended	Not recommended
Estimate expected agreement with expert judgment	Fully automate high-stakes decisions
Set appropriate thresholds for downstream logic	Assume consistent performance outside the validated scope
	Replace domain expertise where interpretation is critical
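
One way to apply the guidance above in downstream logic is to let an evaluator's reported accuracy decide how much human review a workflow keeps. The function name, the 80% default threshold, and the two handling labels below are all hypothetical; choose values that fit your own risk tolerance and the evaluator's validated scope.

```python
def choose_handling(
    overall_accuracy_pct: float,
    min_accuracy_pct: float = 80.0,  # hypothetical threshold, not a recommended value
    high_stakes: bool = False,
) -> str:
    """Decide how to treat evaluator output in a downstream workflow."""
    if not 0.0 <= overall_accuracy_pct <= 100.0:
        raise ValueError("Accuracy must be a percentage between 0 and 100")
    # High-stakes decisions always keep a human in the loop, whatever the accuracy.
    if high_stakes or overall_accuracy_pct < min_accuracy_pct:
        return "human_review"
    return "automated_with_spot_checks"


if __name__ == "__main__":
    print(choose_handling(85.0))
    print(choose_handling(85.0, high_stakes=True))
    print(choose_handling(70.0))
```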
