How evaluators are designed

Concept | Definition

Evaluator

Evaluates AI-generated content by focusing on one dimension of a rubric
Example: The Vocabulary evaluator focuses on one dimension (vocabulary) of the Qualitative Text Complexity rubric.

Dimension

Describes one specific attribute or criterion of AI-generated content
Example: Evaluators can look at Vocabulary, Sentence Structure, and other dimensions of the text when evaluating its complexity.

Rubric

Provides a framework for evaluating a concept (e.g., Qualitative Text Complexity) through multiple dimensions

Illustration showing a sample rubric structure used by evaluators.

Example: The Qualitative Text Complexity rubric looks at dimensions like Vocabulary and Sentence Structure to make a judgment on a text’s complexity.

Evaluators family

Collection of evaluators that evaluate dimensions of one rubric
Example: The Literacy evaluators family contains multiple evaluators that each assess one dimension of the Qualitative Text Complexity rubric.
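
The relationships between these concepts can be sketched as a small data model. This is an illustrative sketch only: the class names, fields, and validation checks below are assumptions made for the example, not the Learning Commons API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One attribute of AI-generated content, e.g. Vocabulary."""
    name: str


@dataclass(frozen=True)
class Rubric:
    """A framework that evaluates one concept through several dimensions."""
    concept: str
    dimensions: tuple[Dimension, ...]


@dataclass(frozen=True)
class Evaluator:
    """Evaluates content on exactly one dimension of one rubric."""
    rubric: Rubric
    dimension: Dimension

    def __post_init__(self) -> None:
        # Hypothetical check: the dimension must belong to the rubric.
        if self.dimension not in self.rubric.dimensions:
            raise ValueError(
                f"{self.dimension.name!r} is not a dimension of {self.rubric.concept!r}"
            )


@dataclass(frozen=True)
class EvaluatorFamily:
    """A collection of evaluators that all assess dimensions of one rubric."""
    name: str
    evaluators: tuple[Evaluator, ...]

    def __post_init__(self) -> None:
        # Hypothetical check: every evaluator in a family shares the same rubric.
        if len({e.rubric for e in self.evaluators}) > 1:
            raise ValueError("all evaluators in a family must share one rubric")


# Names taken from the examples in the table above.
vocabulary = Dimension("Vocabulary")
sentence_structure = Dimension("Sentence Structure")
rubric = Rubric("Qualitative Text Complexity", (vocabulary, sentence_structure))
literacy = EvaluatorFamily(
    "Literacy", tuple(Evaluator(rubric, d) for d in rubric.dimensions)
)
```

Modeling an evaluator as a (rubric, dimension) pair keeps the "one evaluator, one dimension" rule explicit in the type itself.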

How outputs are validated

Concept | Definition

Evaluator dataset

Dataset rigorously annotated by human domain experts according to a rubric

These annotated datasets train and inform our “LLM as a judge” assessments, making our evaluators more accurate with every iteration.
Example: We partnered with literacy experts to create the Literacy dataset, which we also used to develop our Literacy evaluators.

LLM as a judge

Evaluation method that automates human-like reasoning at scale

Learning Commons uses this evaluation method to develop our evaluators. We prompt LLMs to use our expert-annotated evaluator datasets to validate our evaluators’ outputs.

Baseline prompt

Simulates what an edtech developer might construct quickly without expert-annotated datasets, structured rubric alignment, or prompt optimization

Learning Commons uses baseline prompts to measure how much our expert-annotated evaluator datasets improve our evaluators’ performance.

Accuracy

Degree to which an evaluator aligns with our evaluator datasets, or how often an evaluator agrees with a human expert

Learn more about how accuracy is measured and our recommended use cases.

How accuracy is measured

An evaluator’s accuracy is calculated for each input on every evaluation run (single-run accuracy). Because LLM outputs are non-deterministic, accuracy results may vary across evaluation runs, even with the same inputs.
To reduce your evaluators’ variability and increase their reliability in production:
  1. Run the evaluator multiple times on the same input
  2. Treat each output as a vote
  3. Select the majority result (for example, 3 runs with majority selection)
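
The voting steps above can be sketched as follows. This is a minimal sketch under stated assumptions: `majority_vote`, the `evaluate` callable, and the example labels are hypothetical stand-ins, not part of any Learning Commons API.

```python
from collections import Counter
from typing import Callable


def majority_vote(evaluate: Callable[[str], str], text: str, runs: int = 3) -> str:
    """Run `evaluate` on the same input `runs` times and return the most common label.

    `evaluate` is a hypothetical stand-in for one evaluator call.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Each output counts as one vote.
    votes = Counter(evaluate(text) for _ in range(runs))
    # On a tie, Counter.most_common returns the label that was seen first.
    label, _count = votes.most_common(1)[0]
    return label


# Stub evaluator that returns a fixed sequence of hypothetical labels.
outputs = iter(["Moderately complex", "Very complex", "Moderately complex"])
result = majority_vote(lambda _text: next(outputs), "sample passage")
# result == "Moderately complex" (2 of 3 votes)
```

An odd number of runs avoids ties when the evaluator has two possible labels; with more labels, ties remain possible and may warrant an explicit tie-breaking rule.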

Metrics

Not all evaluators will return all accuracy metrics.
Accuracy metric | Definition

Overall accuracy

Percentage of predictions that exactly match the expert-annotated evaluator dataset
Some disagreements between our evaluators and human experts are expected, especially in qualitative domains and borderline cases.

Expert agreement rate

Percentage of reviewed cases where experts agree with evaluator output

Reasoning quality score

Expert rating of explanation quality (often on a numeric scale)

Baseline comparison accuracy

Performance relative to a minimal baseline prompt
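
As an illustration, overall accuracy and a baseline comparison could be computed as shown below. The function names and sample labels are hypothetical, and expressing the baseline comparison as a difference in percentage points is an assumption; the source does not specify the formula.

```python
def overall_accuracy(predictions: list[str], expert_labels: list[str]) -> float:
    """Percentage of predictions that exactly match the expert-annotated labels."""
    if len(predictions) != len(expert_labels):
        raise ValueError("predictions and expert_labels must have the same length")
    if not expert_labels:
        raise ValueError("at least one labeled example is required")
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return 100.0 * matches / len(expert_labels)


def baseline_gain(evaluator_accuracy: float, baseline_accuracy: float) -> float:
    """Assumed comparison: evaluator accuracy minus baseline accuracy, in percentage points."""
    return evaluator_accuracy - baseline_accuracy


# Hypothetical labels for four inputs.
expert_labels = ["A", "B", "A", "C"]
evaluator_predictions = ["A", "B", "A", "A"]  # 3 of 4 match
baseline_predictions = ["A", "A", "A", "A"]   # 2 of 4 match

evaluator_acc = overall_accuracy(evaluator_predictions, expert_labels)  # 75.0
baseline_acc = overall_accuracy(baseline_predictions, expert_labels)    # 50.0
gain = baseline_gain(evaluator_acc, baseline_acc)                       # 25.0
```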

Use cases

Use accuracy metrics to inform implementation decisions, not to replace judgment.
Recommended:
  - Estimate expected agreement with expert judgment
  - Set appropriate thresholds for downstream logic

Not recommended:
  - Fully automate high-stakes decisions
  - Assume consistent performance outside the validated scope
  - Replace domain expertise where interpretation is critical