# Accuracy

> Understand how evaluator accuracy is defined, measured, and used as a decision-support tool.

Evaluator accuracy tells you how often an evaluator agrees with expert human judgment on benchmark data. It isn't a guarantee of correctness for every input.

Accuracy values are based on **a single run per input**. Because evaluators rely on large language models (LLMs), results can vary slightly between runs.

## What accuracy measures

In practical terms, accuracy indicates how often the evaluator produces the same result as a qualified expert would. In qualitative domains, some disagreement is expected, especially for borderline cases.

## How we determine accuracy

Accuracy is calculated using a structured process:

1. [**Define the rubric.**](/evaluators/literacy-evaluators/subject-matter-knowledge/rubric)
2. **Create and annotate a benchmark dataset.**\
   Domain experts and educators label representative data according to the rubric.
3. **Build the evaluator based on a development dataset.**
4. **Run the evaluator on a separate test dataset annotated by experts to determine accuracy.**
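The final step can be sketched in a few lines. This is a minimal illustration, assuming accuracy is the share of evaluator predictions that exactly match the expert labels on the test dataset. The function name `overall_accuracy` and the label values are hypothetical and are not part of any published API.

```python
from typing import Sequence


def overall_accuracy(predictions: Sequence[str], expert_labels: Sequence[str]) -> float:
    """Return the fraction of predictions that exactly match the expert labels."""
    if len(predictions) != len(expert_labels):
        raise ValueError("predictions and expert_labels must have the same length")
    if not expert_labels:
        raise ValueError("cannot compute accuracy on an empty dataset")
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return matches / len(expert_labels)


# Hypothetical test-set labels, for illustration only.
expert_labels = ["meets", "exceeds", "below", "meets"]
predictions = ["meets", "meets", "below", "meets"]

accuracy = overall_accuracy(predictions, expert_labels)  # 3 of 4 predictions match
print(f"Overall accuracy: {accuracy:.0%}")
```

Keeping the test dataset separate from the development dataset matters here: scoring on data used to build the evaluator would likely overstate agreement.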

## Baseline prompt comparison

Some evaluators are also compared to a **baseline prompt**.

A baseline prompt simulates what an edtech developer might construct quickly without:

* Expert annotation
* Structured rubric alignment
* Prompt optimization

We compare baseline performance to the final evaluator to measure the performance improvement provided by expert-driven development.
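One way to picture this comparison is to score both the baseline prompt and the final evaluator against the same expert labels and report the difference. The sketch below assumes exact-match accuracy as the metric. All names and data are hypothetical.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], expert_labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the expert labels."""
    if len(predictions) != len(expert_labels) or not expert_labels:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == e for p, e in zip(predictions, expert_labels)) / len(expert_labels)


# Hypothetical outputs on the same expert-annotated test set.
expert_labels = ["meets", "below", "exceeds", "meets"]
baseline_predictions = ["meets", "meets", "meets", "meets"]     # quick, unoptimized prompt
evaluator_predictions = ["meets", "below", "meets", "meets"]    # final evaluator

baseline_acc = accuracy(baseline_predictions, expert_labels)    # 2 of 4 match
evaluator_acc = accuracy(evaluator_predictions, expert_labels)  # 3 of 4 match

# Difference in accuracy, expressed as a fraction (0.25 = 25 percentage points).
improvement = evaluator_acc - baseline_acc
print(f"Baseline: {baseline_acc:.0%}, evaluator: {evaluator_acc:.0%}, gain: {improvement:+.0%}")
```

Both systems are scored on identical inputs and labels, so the difference reflects the evaluator's development process and not a change in the data.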

## Types of accuracy metrics

You may see the following measurements:

| Measurement                  | Definition                                                              |
| :--------------------------- | :---------------------------------------------------------------------- |
| Overall accuracy             | Percentage of predictions that exactly match expert-annotated labels.   |
| Expert agreement rate        | Percentage of reviewed cases where experts agree with evaluator output. |
| Reasoning quality score      | Expert rating of explanation quality (often on a numeric scale).        |
| Baseline comparison accuracy | Performance relative to a minimal baseline prompt.                      |

Not all evaluators report all metrics.

## Single-run accuracy

Accuracy values reflect **one execution per input**. Because LLM outputs are probabilistic, identical inputs may produce different results across runs.

To increase reliability in production:

* Run the evaluator multiple times on the same input.
* Treat each output as a vote.
* Select the majority result (for example, 3 runs with majority selection).

Majority voting reduces run-to-run variability, so the final result is more stable than any single run.
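The voting steps above can be sketched as follows. This is a minimal example, assuming the evaluator can be wrapped in a function that takes an input and returns a label. `run_evaluator`, `stub_evaluator`, and the labels are hypothetical stand-ins, not a real client API.

```python
from collections import Counter
from typing import Callable, Tuple


def majority_vote(
    run_evaluator: Callable[[str], str], text: str, runs: int = 3
) -> Tuple[str, float]:
    """Run the evaluator `runs` times and return (majority label, vote share)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    votes = Counter(run_evaluator(text) for _ in range(runs))
    # most_common(1) returns the label with the most votes; on a tie,
    # the label that was seen first wins.
    label, count = votes.most_common(1)[0]
    return label, count / runs


# Stub that replays fixed outputs so the example is deterministic.
# In production this would call the actual evaluator (assumption).
_fake_outputs = iter(["pass", "fail", "pass"])


def stub_evaluator(text: str) -> str:
    return next(_fake_outputs)


label, share = majority_vote(stub_evaluator, "sample student response", runs=3)
print(label, f"{share:.0%}")
```

An odd number of runs avoids ties when there are only two possible labels. With more than two labels a tie is still possible, so the vote share is returned alongside the label to make close calls visible.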

## How to use the accuracy metrics in your work

Use accuracy to:

* Estimate expected agreement with expert judgment.
* Set appropriate thresholds for downstream logic.

Do not use accuracy alone to:

* Fully automate high-stakes decisions.
* Assume consistent performance outside the validated scope.
* Replace domain expertise where interpretation is critical.

Accuracy is a controlled measurement of alignment. It should inform implementation decisions, not replace judgment.
