What accuracy measures
Accuracy measures agreement between human annotators (the benchmark) and the evaluator’s outputs on a test dataset that was not used to build or tune the evaluator. In practical terms, accuracy indicates how often the evaluator produces the same result as a qualified expert would. In qualitative domains, some disagreement is expected, especially for borderline cases.
How we determine accuracy
Accuracy is calculated using a structured process:
- Define the rubric.
- Create and annotate a benchmark dataset.
  Domain experts and educators label representative data according to the rubric.
- Build the evaluator based on a development dataset.
- Run the evaluator on a separate test dataset annotated by experts to determine accuracy.
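The final step above can be sketched in a few lines. This is a minimal illustration, assuming the evaluator is a function from an input to a label and that expert annotations exist for a held-out test set; the toy evaluator, items, and labels are all invented for the demo.

```python
def accuracy(evaluator, test_items, expert_labels):
    """Fraction of test items where the evaluator matches the expert label."""
    matches = sum(
        1 for item, expected in zip(test_items, expert_labels)
        if evaluator(item) == expected
    )
    return matches / len(test_items)

# Toy stand-in evaluator: classifies feedback as "positive" or "negative".
def toy_evaluator(text):
    return "positive" if "good" in text else "negative"

# Hypothetical held-out test set with expert labels.
items = ["good work", "needs revision", "good structure", "unclear"]
labels = ["positive", "negative", "positive", "positive"]

print(accuracy(toy_evaluator, items, labels))  # 3 of 4 match -> 0.75
```

The key point is that `items` and `labels` come from the test dataset, which the evaluator never saw during development.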
Baseline prompt comparison
Some evaluators are also compared to a baseline prompt. A baseline prompt simulates what an edtech developer might construct quickly without:
- Expert annotation
- Structured rubric alignment
- Prompt optimization
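To make the contrast concrete, here is a hypothetical pair of prompts; neither text appears in this document, and both are purely illustrative of the difference in construction effort.

```python
# A quick, unoptimized prompt with no rubric alignment (baseline).
baseline_prompt = "Is this student essay good? Answer yes or no."

# A rubric-aligned prompt of the kind built with expert annotation
# and iteration. The rubric criteria below are invented for illustration.
rubric_prompt = """You are grading a student essay against this rubric:
1. Thesis: states a clear, arguable claim.
2. Evidence: cites at least two relevant sources.
3. Organization: paragraphs follow a logical order.
Label each criterion 'meets', 'partially meets', or 'does not meet',
then give an overall label with a one-sentence justification."""
```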
Types of accuracy metrics
You may see the following measurements:
| Measurement | Definition |
|---|---|
| Overall accuracy | Percentage of predictions that exactly match expert-annotated labels. |
| Expert agreement rate | Percentage of reviewed cases where experts agree with evaluator output. |
| Directional accuracy | Percentage of cases where the evaluator identifies the correct general classification direction, even if the exact category differs. |
| Reasoning quality score | Expert rating of explanation quality (often on a numeric scale). |
| Baseline comparison accuracy | Performance relative to a minimal benchmark prompt. |
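Two of the metrics above, overall accuracy and directional accuracy, can be sketched from paired (evaluator output, expert label) records. The label set, the mapping from fine-grained labels to a coarse "direction", and the sample records are all illustrative assumptions.

```python
records = [
    # (evaluator_label, expert_label)
    ("strong_pass", "strong_pass"),
    ("weak_pass",   "strong_pass"),   # wrong category, same "pass" direction
    ("fail",        "fail"),
    ("weak_pass",   "fail"),          # wrong category and wrong direction
]

def overall_accuracy(pairs):
    """Exact matches between evaluator and expert labels."""
    return sum(pred == gold for pred, gold in pairs) / len(pairs)

def direction(label):
    # Assumed mapping from fine-grained labels to a coarse direction.
    return "pass" if label.endswith("pass") else "fail"

def directional_accuracy(pairs):
    """Matches after collapsing labels to their general direction."""
    return sum(direction(p) == direction(g) for p, g in pairs) / len(pairs)

print(overall_accuracy(records))      # 2/4 exact matches -> 0.5
print(directional_accuracy(records))  # 3/4 correct direction -> 0.75
```

Note how directional accuracy is never lower than overall accuracy: every exact match also matches in direction.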
Single-run accuracy
Accuracy values reflect one execution per input. Because LLM outputs are probabilistic, identical inputs may produce different results across runs. To increase reliability in production:
- Run the evaluator multiple times on the same input.
- Treat each output as a vote.
- Select the majority result (for example, 3 runs with majority selection).
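The steps above can be sketched as a simple majority-vote wrapper. The flaky evaluator below simulates a probabilistic LLM with a seeded random draw; in practice you would call the real evaluator.

```python
import random
from collections import Counter

def majority_vote(evaluator, item, runs=3):
    """Run the evaluator several times and return the most common label."""
    votes = Counter(evaluator(item) for _ in range(runs))
    label, _count = votes.most_common(1)[0]
    return label

# Simulated probabilistic evaluator that answers "pass" about 2/3 of the time.
rng = random.Random(42)
def flaky_evaluator(item):
    return "pass" if rng.random() < 2 / 3 else "fail"

print(majority_vote(flaky_evaluator, "essay-1", runs=3))
```

Using an odd number of runs (such as 3) avoids ties for binary labels.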
How to use the accuracy metrics in your work
Use accuracy to:
- Estimate expected agreement with expert judgment.
- Set appropriate thresholds for downstream logic.
Do not use accuracy to:
- Fully automate high-stakes decisions.
- Assume consistent performance outside the validated scope.
- Replace domain expertise where interpretation is critical.
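One way to act on these guidelines is to gate downstream automation on the measured accuracy. This is a minimal sketch; the 0.9 threshold and the routing labels are illustrative policy choices, not values from this document.

```python
# Illustrative policy threshold chosen by the team, not a recommended value.
MIN_ACCURACY_FOR_AUTOMATION = 0.9

def routing_policy(measured_accuracy):
    """Decide whether evaluator output can be used without human review."""
    if measured_accuracy >= MIN_ACCURACY_FOR_AUTOMATION:
        return "auto_apply_with_spot_checks"
    return "route_to_human_review"

print(routing_policy(0.93))  # auto_apply_with_spot_checks
print(routing_policy(0.81))  # route_to_human_review
```

Even above the threshold, spot checks keep a human in the loop for high-stakes cases.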