Evaluator accuracy tells you how often an evaluator agrees with expert human judgment on benchmark data. Accuracy values are based on a single run per input. Because evaluators rely on large language models (LLMs), results can vary slightly between runs. Accuracy is a measure of alignment with expert judgment under defined conditions. It is not a guarantee of correctness for every input.

What accuracy measures

Accuracy measures agreement between human annotators (the benchmark) and the evaluator’s outputs on a test dataset that was not used to build or tune the evaluator. In practical terms, accuracy indicates how often the evaluator produces the same result as a qualified expert would. In qualitative domains, some disagreement is expected, especially for borderline cases.
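Agreement of this kind reduces to a simple ratio: matching labels divided by total labels. A minimal sketch, with illustrative labels (the `pass`/`fail` rubric here is an assumption, not a real benchmark):

```python
# Minimal sketch: accuracy as agreement between expert labels and
# evaluator outputs on a held-out test set. Labels are illustrative.
expert_labels    = ["pass", "fail", "pass", "pass", "fail", "pass"]
evaluator_labels = ["pass", "fail", "fail", "pass", "fail", "pass"]

matches = sum(e == v for e, v in zip(expert_labels, evaluator_labels))
accuracy = matches / len(expert_labels)
print(f"{accuracy:.0%}")  # 5 of 6 labels match -> 83%
```

Note that one disagreement on a borderline case is enough to move a small benchmark several percentage points, which is why test datasets need to be representative.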

How we determine accuracy

Accuracy is calculated using a structured process:
  1. Define the rubric.
  2. Create and annotate a benchmark dataset.
    Domain experts and educators label representative data according to the rubric.
  3. Build the evaluator using a development dataset.
  4. Run the evaluator on a separate test dataset annotated by experts to determine accuracy.

Baseline prompt comparison

Some evaluators are also compared to a baseline prompt. A baseline prompt simulates what an edtech developer might construct quickly without:
  • Expert annotation
  • Structured rubric alignment
  • Prompt optimization
We compare baseline performance to the final evaluator to measure the performance improvement provided by expert-driven development.
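The comparison itself is a straightforward difference between two accuracy values. A sketch with made-up numbers (the 0.71 and 0.92 figures are assumptions for illustration, not measured results):

```python
# Minimal sketch: comparing a tuned evaluator against a baseline prompt.
# Accuracy values here are illustrative, not measured results.
baseline_accuracy = 0.71   # quick prompt, no expert annotation or rubric
evaluator_accuracy = 0.92  # expert-annotated, rubric-aligned, optimized

improvement = evaluator_accuracy - baseline_accuracy
print(f"Absolute improvement: {improvement:.0%}")  # 21 percentage points
```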

Types of accuracy metrics

You may see the following measurements:
  • Overall accuracy: Percentage of predictions that exactly match expert-annotated labels.
  • Expert agreement rate: Percentage of reviewed cases where experts agree with the evaluator's output.
  • Directional accuracy: Percentage of cases where the evaluator identifies the correct general classification direction, even if the exact category differs.
  • Reasoning quality score: Expert rating of explanation quality, often on a numeric scale.
  • Baseline comparison accuracy: Performance relative to a minimal benchmark prompt.
Not all evaluators report all metrics.
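The difference between overall and directional accuracy is easiest to see on an ordered rubric. A sketch, assuming a hypothetical 4-point scale and a midpoint that splits it into a "positive" and "negative" direction:

```python
# Minimal sketch: overall vs. directional accuracy on an ordered rubric.
# The 4-point scale and its midpoint are assumptions for illustration.
SCALE = {"poor": 1, "fair": 2, "good": 3, "excellent": 4}
MIDPOINT = 2.5  # scores above this count as the "positive" direction

expert    = ["poor", "fair", "good", "excellent", "fair"]
evaluator = ["poor", "good", "good", "good", "poor"]

# Overall accuracy: exact category matches only.
exact = sum(e == v for e, v in zip(expert, evaluator))
# Directional accuracy: same side of the midpoint, exact match or not.
directional = sum(
    (SCALE[e] > MIDPOINT) == (SCALE[v] > MIDPOINT)
    for e, v in zip(expert, evaluator)
)
print(exact / len(expert))        # 0.4: only 2 of 5 exact matches
print(directional / len(expert))  # 0.8: 4 of 5 on the correct side
```

An evaluator can have modest overall accuracy but strong directional accuracy, which may be sufficient when downstream logic only branches on the general direction.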

Single-run accuracy

Accuracy values reflect one execution per input. Because LLM outputs are probabilistic, identical inputs may produce different results across runs. To increase reliability in production:
  • Run the evaluator multiple times on the same input.
  • Treat each output as a vote.
  • Select the majority result (for example, 3 runs with majority selection).
This approach reduces variability and increases stability.
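The voting scheme above can be sketched in a few lines. `run_evaluator` is a hypothetical stand-in for a single evaluator call, not a real API:

```python
from collections import Counter

# Minimal sketch: majority voting over repeated single runs.
# run_evaluator is a hypothetical stand-in for one evaluator call.
def majority_vote(run_evaluator, item, runs=3):
    votes = [run_evaluator(item) for _ in range(runs)]
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Simulated runs: the evaluator "flips" once, but the majority holds.
outputs = iter(["pass", "fail", "pass"])
result = majority_vote(lambda item: next(outputs), "student answer")
print(result)  # "pass" wins 2 votes to 1
```

An odd number of runs avoids ties for binary outputs; for multi-class outputs you may still want a tie-breaking rule or a fallback to human review.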

How to use the accuracy metrics in your work

Use accuracy to:
  • Estimate expected agreement with expert judgment.
  • Set appropriate thresholds for downstream logic.
Do not use accuracy alone to:
  • Fully automate high-stakes decisions.
  • Assume consistent performance outside the validated scope.
  • Replace domain expertise where interpretation is critical.
Accuracy is a controlled measurement of alignment. It should inform implementation decisions, not replace judgment.
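One way to keep accuracy informative rather than decisive is to gate on it: auto-accept only above a threshold and route everything else to a person. A sketch, where the 0.85 threshold and the routing policy are assumptions, not guidance from any specific evaluator:

```python
# Minimal sketch: threshold-based routing for downstream logic.
# The 0.85 threshold and routing labels are illustrative assumptions.
REVIEW_THRESHOLD = 0.85

def route(evaluator_score: float) -> str:
    """Auto-accept high-confidence results; queue the rest for humans."""
    return "auto-accept" if evaluator_score >= REVIEW_THRESHOLD else "human-review"

print(route(0.91))  # auto-accept
print(route(0.62))  # human-review
```

The threshold itself should come from the evaluator's measured accuracy and the cost of errors in your application, and high-stakes paths should keep a human in the loop regardless of the score.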