Evaluator accuracy measures how closely an evaluator’s scores align with curated or human-annotated validation datasets. Accuracy values aren’t a guarantee of correctness for every input.
Accuracy values are based on a single run per input. Because evaluators
rely on large language models (LLMs), results can vary slightly between runs.
What accuracy measures
In practical terms, accuracy indicates how often the evaluator produces the same result as a qualified expert would. In qualitative domains, some disagreement is expected, especially for borderline cases.
How we determine accuracy
Accuracy is calculated using a structured process:
- Define the rubric.
- Create and annotate a benchmark dataset. Domain experts and educators label representative data according to the rubric.
- Build the evaluator based on a development dataset.
- Run the evaluator on a separate test dataset annotated by experts to determine accuracy.
Baseline prompt comparison
Some evaluators are also compared to a baseline prompt. A baseline prompt simulates what an edtech developer might construct quickly without:
- Expert annotation
- Structured rubric alignment
- Prompt optimization
Types of accuracy metrics
You may see the following measurements:

| Measurement | Definition |
|---|---|
| Overall accuracy | Percentage of predictions that exactly match expert-annotated labels. |
| Expert agreement rate | Percentage of reviewed cases where experts agree with evaluator output. |
| Reasoning quality score | Expert rating of explanation quality (often on a numeric scale). |
| Baseline comparison accuracy | Performance relative to a minimal benchmark prompt. |
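
Overall accuracy, as defined in the table, can be read as a simple exact-match rate against expert labels. The following is a minimal sketch in Python, assuming one label per input. The function name `overall_accuracy` and the label values are hypothetical, and the data is made up for illustration rather than taken from any real benchmark:

```python
from typing import Sequence


def overall_accuracy(predicted: Sequence[str], expert: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the expert-annotated labels."""
    if len(predicted) != len(expert):
        raise ValueError("predicted and expert must have the same length")
    if not expert:
        raise ValueError("cannot compute accuracy on an empty dataset")
    matches = sum(p == e for p, e in zip(predicted, expert))
    return matches / len(expert)


# Hypothetical labels, for illustration only (not real benchmark data).
expert_labels = ["meets", "below", "meets", "exceeds"]
evaluator_labels = ["meets", "below", "exceeds", "exceeds"]
baseline_labels = ["meets", "meets", "exceeds", "exceeds"]

evaluator_acc = overall_accuracy(evaluator_labels, expert_labels)  # 3 of 4 match
baseline_acc = overall_accuracy(baseline_labels, expert_labels)    # 2 of 4 match
print(f"evaluator: {evaluator_acc:.2f}, baseline: {baseline_acc:.2f}")
```

Scoring the evaluator and a baseline prompt against the same expert labels is one way to express a baseline comparison; the exact comparison method used for published figures may differ.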
Single-run accuracy
Accuracy values reflect one execution per input. Because LLM outputs are probabilistic, identical inputs may produce different results across runs. To increase reliability in production:
- Run the evaluator multiple times on the same input.
- Treat each output as a vote.
- Select the majority result (for example, 3 runs with majority selection).
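
The voting steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not a prescribed implementation: `majority_vote` is a hypothetical helper, and `evaluate` stands in for whatever function calls your evaluator and returns a label.

```python
from collections import Counter
from typing import Callable


def majority_vote(evaluate: Callable[[str], str], text: str, runs: int = 3) -> str:
    """Run `evaluate` on the same input `runs` times and return the most common label."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Each run's output counts as one vote.
    votes = Counter(evaluate(text) for _ in range(runs))
    # On a tie, Counter.most_common returns the label that was seen first.
    label, _count = votes.most_common(1)[0]
    return label


# Stand-in evaluator that replays canned outputs to mimic run-to-run variation.
canned = iter(["meets", "below", "meets"])
result = majority_vote(lambda _text: next(canned), "student response", runs=3)
print(result)
```

An odd number of runs avoids ties when the evaluator has only two possible labels. With more labels a tie is still possible, so you may want an explicit tie-breaking rule instead of relying on first-seen order.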
How to use the accuracy metrics in your work
Use accuracy to:
- Estimate expected agreement with expert judgment.
- Set appropriate thresholds for downstream logic.
Don’t use accuracy to:
- Fully automate high-stakes decisions.
- Assume consistent performance outside the validated scope.
- Replace domain expertise where interpretation is critical.