# Core concepts

> Learn the fundamental concepts of evaluators, including what they are, how they work, and the key terminology for using them effectively.

## How evaluators are designed

| Concept | Definition |
| :--- | :--- |
| <h3 id="evaluator" style={{ margin: 0, fontSize: "inherit !important" }}>[Evaluator](#evaluator)</h3> | Evaluates AI-generated content by focusing on one [dimension](#dimension) of a [rubric](#rubric)<br /><Info>**Example:** The [Vocabulary evaluator](/evaluators/literacy-evaluators/vocabulary) focuses on one dimension (vocabulary) of the Qualitative Text Complexity rubric.</Info> |
| <h3 id="dimension" style={{ margin: 0, fontSize: "inherit !important" }}>[Dimension](#dimension)</h3> | Describes one specific attribute or criterion of AI-generated content<br /><Info>**Example:** Evaluators can look at Vocabulary, Sentence Structure, and other dimensions of a text when evaluating its complexity.</Info> |
| <h3 id="rubric" style={{ margin: 0, fontSize: "inherit !important" }}>[Rubric](#rubric)</h3> | Provides a framework for evaluating a concept (e.g., Qualitative Text Complexity) through multiple [dimensions](#dimension)<br /><br /><Frame><img src="https://mintcdn.com/czi-60a2a443/UtkD6p7UkcR_E40q/images/evaluators/rubrics-and-evaluators_rubric-example.svg?fit=max&auto=format&n=UtkD6p7UkcR_E40q&q=85&s=e4cc1dfed7bae54c53a959ffe98d1460" alt="Illustration showing a sample rubric structure used by evaluators." width="2276" height="801" data-path="images/evaluators/rubrics-and-evaluators_rubric-example.svg" /></Frame><br /><Info>**Example:** The Qualitative Text Complexity rubric looks at dimensions like Vocabulary and Sentence Structure to make a judgment on a text's complexity.</Info> |
| <h3 id="evaluators-family" style={{ margin: 0, fontSize: "inherit !important" }}>[Evaluators family](#evaluators-family)</h3> | Collection of evaluators that evaluate [dimensions](#dimension) of one [rubric](#rubric)<br /><Info>**Example:** The [Literacy evaluators family](/evaluators/literacy-evaluators/introduction) contains multiple evaluators that each assess one dimension of the Qualitative Text Complexity rubric.</Info> |

## How outputs are validated

| Concept                                                                                                                       | Definition                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| :---------------------------------------------------------------------------------------------------------------------------- | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <h3 id="evaluator-dataset" style={{ margin: 0, fontSize: "inherit !important" }}>[Evaluator dataset](#evaluator-dataset)</h3> | [Dataset](/evaluators/dataset/introduction) rigorously annotated by human domain experts according to a [rubric](#rubric)<br /><br />These annotated datasets train and inform our "LLM as a judge" assessments, making our evaluators more accurate with every iteration.<br /><Info>**Example:** We partnered with literacy experts to create the [Literacy dataset](/evaluators/dataset/literacy-dataset), which we also used to develop our [Literacy evaluators](/evaluators/literacy-evaluators/introduction).</Info> |
| <h3 id="llm-as-a-judge" style={{ margin: 0, fontSize: "inherit !important" }}>[LLM as a judge](#llm-as-a-judge)</h3>          | Evaluation method that automates human-like reasoning at scale<br /><br /><Note>Learning Commons uses this evaluation method to develop our evaluators. We prompt LLMs to use our expert-annotated [evaluator datasets](#evaluator-dataset) to validate our evaluators' outputs.</Note>                                                                                                                                                                                                                                     |
| <h3 id="baseline-prompt" style={{ margin: 0, fontSize: "inherit !important" }}>[Baseline prompt](#baseline-prompt)</h3>       | Simulates what an edtech developer might construct quickly without expert-annotated datasets, structured rubric alignment, or prompt optimization<br /><br /><Note>Learning Commons uses baseline prompts to measure how much our expert-annotated [evaluator datasets](/evaluators/dataset/introduction) improve our evaluators' performance.</Note>                                                                                                                                                                       |
| <h3 id="accuracy" style={{ margin: 0, fontSize: "inherit !important" }}>[Accuracy](#accuracy)</h3>                            | Degree to which an evaluator aligns with our [evaluator datasets](/evaluators/dataset/introduction), or how often an evaluator agrees with a human expert<br /><br /><Note>Learn more about [How accuracy is measured](#how-accuracy-is-measured) and our recommended [use cases](#use-cases).</Note>                                                                                                                                                                                                                       |

## How accuracy is measured

An evaluator's accuracy is calculated for each input on every evaluation run (**single-run accuracy**).

Because LLM outputs are probabilistic, accuracy results may vary across evaluation runs, even with the same inputs.

<Tip>
  To reduce your evaluators' variability and increase their reliability in production:

  1. Run the evaluator multiple times on the same input
  2. Treat each output as a vote
  3. Select the majority result (for example, 3 runs with majority selection)
</Tip>
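
The voting steps above can be sketched as follows. This is a minimal illustration, not the Learning Commons implementation: `majority_vote`, `fake_evaluator`, and the label strings are hypothetical names, and the stub stands in for whatever call actually runs your evaluator.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def majority_vote(
    run_evaluator: Callable[[str], str], text: str, runs: int = 3
) -> tuple[str, float]:
    """Run the evaluator `runs` times and return (winning label, vote share)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Each run's output counts as one vote.
    votes = [run_evaluator(text) for _ in range(runs)]
    # most_common(1) returns the label with the most votes; if several labels
    # tie, the one seen first wins, so prefer an odd number of runs.
    label, count = Counter(votes).most_common(1)[0]
    return label, count / runs


# Hypothetical stub that imitates run-to-run variation with canned outputs.
_canned = cycle(["moderately complex", "very complex", "moderately complex"])


def fake_evaluator(text: str) -> str:
    return next(_canned)


label, share = majority_vote(fake_evaluator, "Sample passage", runs=3)
print(label, round(share, 2))  # moderately complex 0.67
```

The vote share is returned alongside the label because a narrow majority (2 of 3) may deserve different handling than a unanimous result.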

### Metrics

<Note>Not all evaluators will return all accuracy metrics.</Note>

| Accuracy metric                                                                                                                                                | Definition                                                                                                                                                                                                                                                 |
| :------------------------------------------------------------------------------------------------------------------------------------------------------------- | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| <h3 id="overall-accuracy" style={{ margin: 0, fontSize: "inherit !important" }}>[Overall accuracy](#overall-accuracy)</h3>                                     | Percentage of predictions that exactly match the expert-annotated [evaluator dataset](#evaluator-dataset) <Note>Some disagreements between our evaluators and human experts are expected, especially in qualitative domains and borderline cases.</Note>   |
| <h3 id="expert-agreement-rate" style={{ margin: 0, fontSize: "inherit !important" }}>[Expert agreement rate](#expert-agreement-rate)</h3>                      | Percentage of reviewed cases where experts agree with evaluator output                                                                                                                                                                                     |
| <h3 id="reasoning-quality-score" style={{ margin: 0, fontSize: "inherit !important" }}>[Reasoning quality score](#reasoning-quality-score)</h3>                | Expert rating of explanation quality (often on a numeric scale)                                                                                                                                                                                            |
| <h3 id="baseline-comparison-accuracy" style={{ margin: 0, fontSize: "inherit !important" }}>[Baseline comparison accuracy](#baseline-comparison-accuracy)</h3> | Performance relative to a minimal [baseline prompt](#baseline-prompt)                                                                                                                                                                                      |
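
To make the first and last metrics concrete, here is a minimal sketch of overall accuracy as an exact-match percentage. The function names and sample labels are hypothetical. This page does not state how performance "relative to" a baseline prompt is computed, so expressing it as a difference in percentage points is an assumption made only for illustration.

```python
from typing import Sequence


def overall_accuracy(predictions: Sequence[str], expert_labels: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the expert annotations."""
    if len(predictions) != len(expert_labels):
        raise ValueError("predictions and expert_labels must have the same length")
    if not predictions:
        raise ValueError("at least one prediction is required")
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return 100.0 * matches / len(predictions)


def baseline_gain(
    evaluator_preds: Sequence[str],
    baseline_preds: Sequence[str],
    expert_labels: Sequence[str],
) -> float:
    """Assumed comparison: evaluator accuracy minus baseline accuracy, in points."""
    return overall_accuracy(evaluator_preds, expert_labels) - overall_accuracy(
        baseline_preds, expert_labels
    )


# Made-up labels for four annotated inputs.
expert = ["A", "B", "B", "C"]
evaluator = ["A", "B", "C", "C"]
baseline = ["A", "C", "C", "C"]

print(overall_accuracy(evaluator, expert))         # 75.0
print(baseline_gain(evaluator, baseline, expert))  # 25.0
```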

### Use cases

<Warning>
  Use accuracy metrics to inform implementation decisions, not to replace
  judgment.
</Warning>

| Recommended                                      | Not recommended                                           |
| :----------------------------------------------- | :-------------------------------------------------------- |
| Estimate expected agreement with expert judgment | Fully automate high-stakes decisions                      |
| Set appropriate thresholds for downstream logic  | Assume consistent performance outside the validated scope |
|                                                  | Replace domain expertise where interpretation is critical |
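
One way to apply a threshold in downstream logic while respecting the table above is sketched below. Everything here is hypothetical: the `EvaluatorReport` type, the `next_step` function, the step names, and the threshold value are illustrative choices, not part of any Learning Commons API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluatorReport:
    name: str
    overall_accuracy: float  # percent agreement with expert annotations


def next_step(report: EvaluatorReport, min_accuracy: float, high_stakes: bool) -> str:
    """Decide how to use an evaluator's output for one decision."""
    # High-stakes decisions are never fully automated, whatever the accuracy.
    if high_stakes:
        return "human_review"
    # Below the chosen threshold, a person still reviews the result.
    if report.overall_accuracy < min_accuracy:
        return "human_review"
    return "automated_with_spot_checks"


# Hypothetical report and threshold, for illustration only.
report = EvaluatorReport(name="example-evaluator", overall_accuracy=82.0)
print(next_step(report, min_accuracy=80.0, high_stakes=False))
print(next_step(report, min_accuracy=80.0, high_stakes=True))
```

The threshold itself is a product decision: choose it from the accuracy reported for the validated scope, and revisit it if your inputs drift outside that scope.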
