Evaluator accuracy tells you how often an evaluator agrees with expert human judgment on benchmark data. Accuracy values are based on a single run per input. Because evaluators rely on large language models (LLMs), results can vary slightly between runs. Accuracy is a measure of alignment with expert judgment under defined conditions. It is not a guarantee of correctness for every input.

What accuracy measures

Accuracy measures agreement between human annotators (the benchmark) and the evaluator’s outputs on a test dataset that was not used to build or tune the evaluator. In practical terms, accuracy indicates how often the evaluator produces the same result as a qualified expert would. In qualitative domains, some disagreement is expected, especially for borderline cases.
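Agreement of this kind reduces to a simple ratio: matching labels divided by total labels. A minimal sketch, with illustrative labels (the `pass`/`fail` rubric here is an assumption, not a real benchmark):

```python
# Minimal sketch: accuracy as agreement between expert labels and
# evaluator outputs on a held-out test set. Labels are illustrative.
expert_labels    = ["pass", "fail", "pass", "pass", "fail", "pass"]
evaluator_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]

matches = sum(e == v for e, v in zip(expert_labels, evaluator_labels))
accuracy = matches / len(expert_labels)
print(f"{accuracy:.0%}")  # 5 of 6 labels match -> 83%
```

Note that one disagreement on a borderline case is enough to move a small benchmark several percentage points, which is why test datasets need to be representative.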

How we determine accuracy

Accuracy is calculated using a structured process:
  1. Define the rubric.
  2. Create and annotate a benchmark dataset.
    Domain experts and educators label representative data according to the rubric.
  3. Build the evaluator using a development dataset.
  4. Run the evaluator on a separate test dataset annotated by experts to determine accuracy.

Baseline prompt comparison

Some evaluators are also compared to a baseline prompt. A baseline prompt simulates what an edtech developer might construct quickly without:
  • Expert annotation
  • Structured rubric alignment
  • Prompt optimization
We compare baseline performance to the final evaluator to measure the performance improvement provided by expert-driven development.
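The comparison itself is a straightforward difference between two accuracy values. A sketch with made-up numbers (the 0.71 and 0.92 figures are assumptions for illustration, not measured results):

```python
# Minimal sketch: comparing a tuned evaluator against a baseline prompt.
# Accuracy values here are illustrative, not measured results.
baseline_accuracy = 0.71   # quick prompt, no expert annotation or rubric
evaluator_accuracy = 0.92  # expert-annotated, rubric-aligned, optimized

improvement = evaluator_accuracy - baseline_accuracy
print(f"Absolute improvement: {improvement:.0%}")  # 21 percentage points
```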

Types of accuracy metrics

You may see the following measurements:
  • Overall accuracy: Percentage of predictions that exactly match expert-annotated labels.
  • Expert agreement rate: Percentage of reviewed cases where experts agree with the evaluator's output.
  • Directional accuracy: Percentage of cases where the evaluator identifies the correct general classification direction, even if the exact category differs.
  • Reasoning quality score: Expert rating of explanation quality, often on a numeric scale.
  • Baseline comparison accuracy: Performance relative to a minimal benchmark prompt.
Not all evaluators report all metrics.
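The difference between overall and directional accuracy is easiest to see on an ordered rubric. A sketch, assuming a hypothetical 4-point scale and a midpoint that splits it into a "positive" and "negative" direction:

```python
# Minimal sketch: overall vs. directional accuracy on an ordered rubric.
# The 4-point scale and its midpoint are assumptions for illustration.
SCALE = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}
MIDPOINT = 2.5  # scores above this count as the "positive" direction

expert    = ["poor", "fair", "good", "excellent", "fair"]
evaluator = ["poor", "good", "good", "good", "poor"]

# Overall accuracy: exact category matches only.
exact = sum(e == v for e, v in zip(expert, evaluator))
# Directional accuracy: same side of the midpoint, exact match or not.
directional = sum(
    (SCALE[e] > MIDPOINT) == (SCALE[v] > MIDPOINT)
    for e, v in zip(expert, evaluator)
)
print(exact / len(expert))        # 0.4: only 2 of 5 exact matches
print(directional / len(expert))  # 0.8: 4 of 5 on the correct side
```

An evaluator can have modest overall accuracy but strong directional accuracy, which may be sufficient when downstream logic only branches on the general direction.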

Single-run accuracy

Accuracy values reflect one execution per input. Because LLM outputs are probabilistic, identical inputs may produce different results across runs. To increase reliability in production:
  • Run the evaluator multiple times on the same input.
  • Treat each output as a vote.
  • Select the majority result (for example, 3 runs with majority selection).
This approach reduces variability and increases stability.
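The voting scheme above can be sketched in a few lines. `run_evaluator` is a hypothetical stand-in for a single evaluator call, not a real API:

```python
from collections import Counter

# Minimal sketch: majority voting over repeated single runs.
# run_evaluator is a hypothetical stand-in for one evaluator call.
def majority_vote(run_evaluator, item, runs=3):
    votes = [run_evaluator(item) for _ in range(runs)]
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Simulated runs: the evaluator "flips" once, but the majority holds.
outputs = iter(["pass", "fail", "pass"])
result = majority_vote(lambda item: next(outputs), "student answer")
print(result)  # "pass" wins 2 votes to 1
```

An odd number of runs avoids ties for binary outputs; for multi-class outputs you may still want a tie-breaking rule or a fallback to human review.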

How to use the accuracy metrics in your work

Use accuracy to:
  • Estimate expected agreement with expert judgment.
  • Set appropriate thresholds for downstream logic.
Do not use accuracy alone to:
  • Fully automate high-stakes decisions.
  • Assume consistent performance outside the validated scope.
  • Replace domain expertise where interpretation is critical.
Accuracy is a controlled measurement of alignment. It should inform implementation decisions, not replace judgment.
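One way to keep accuracy informative rather than decisive is to gate on it: auto-accept only above a threshold and route everything else to a person. A sketch, where the 0.85 threshold and the routing policy are assumptions, not guidance from any specific evaluator:

```python
# Minimal sketch: threshold-based routing for downstream logic.
# The 0.85 threshold and routing labels are illustrative assumptions.
REVIEW_THRESHOLD = 0.85

def route(evaluator_score: float) -> str:
    """Auto-accept high-confidence results; queue the rest for humans."""
    return "auto-accept" if evaluator_score >= REVIEW_THRESHOLD else "human-review"

print(route(0.91))  # auto-accept
print(route(0.62))  # human-review
```

The threshold itself should come from the evaluator's measured accuracy and the cost of errors in your application, and high-stakes paths should keep a human in the loop regardless of the score.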