Evaluator last updated June 24, 2026.
The Student Response Anchor feedback evaluator assesses whether a piece of feedback is clearly based on the student’s specific response — i.e., referencing, building on, or responding to the student’s own idea, wording, or use of evidence.
In short, it assesses whether the feedback demonstrates an accurate understanding of what the student wrote.
At a glance
| | |
|---|---|
| Input type | Teacher feedback on a student writing response |
| Supported grades | 8–9 |
| Considers | - Specific reference to the student’s work (the feedback names or builds on the student’s particular idea, wording, or evidence)<br>- Whether the feedback is generic or template-based (a stock phrase that could apply to any response)<br>- Whether feedback is based on an accurate understanding and interpretation of the student’s response<br>- Whether the feedback addresses something the student did not actually say or attempt |
Model and prompt
See Quickstart to run the evaluator.
| | |
|---|---|
| Model used | GPT-5.4 |
| Temperature | 1.0 |
| Optimization | DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels |
| Prompts | Available via the Learning Commons Portal |
GPT-5.4 was selected as the shipped model because it had the best mean
performance across the Feedback Quality Evaluator Suite. Other models and
configurations will produce different results and may be less accurate.
Inputs must be de-identified. Do not submit student PII or any regulated or
sensitive personal information.
| Input | Description | Required |
|---|---|---|
| Student text | Student’s constructed-response writing | Yes |
| Feedback text | Teacher or AI-generated feedback to be evaluated | Yes |
```json
{
  "student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
  "feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
```
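
The two required inputs can be assembled and checked before submission. The following is a minimal sketch, assuming the field names shown in the example above; `build_payload` is a hypothetical helper, not part of the evaluator, and it neither calls the evaluator nor checks that the text is de-identified.

```python
import json

# Field names taken from the example input; the helper itself is hypothetical.
REQUIRED_FIELDS = ("student_text", "feedback_text")


def build_payload(student_text: str, feedback_text: str) -> str:
    """Return a JSON string with both required inputs, rejecting blank values."""
    payload = {"student_text": student_text, "feedback_text": feedback_text}
    for name in REQUIRED_FIELDS:
        value = payload[name]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be non-empty text")
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

De-identification still has to happen upstream; this check only catches missing or empty fields.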
Output
| Field | Description |
|---|---|
| Anchored in student response score | Binary judgment (1 / 0). A 1 applies when feedback references, builds on, or responds to the student’s own idea, wording, or use of evidence and demonstrates accurate understanding of what they wrote. A 0 signals feedback that is generic or template-like, built on a misinterpretation, or addressed to something the student did not say or attempt. |
| Reasoning | High-level summary of why the feedback received this judgment |
| Key features | One entry per factor driving the judgment (specific reference to the student’s work, not generic or template-based, engagement based on accurate understanding), each marked met (1) / not met (0) with a justification |
| Proposed adjustment | Suggested moves to anchor the feedback more clearly in the student’s response, for developers iterating on prompts or for teachers using the feedback instructionally |
```json
{
  "reasoning": "The student's specific claim is that AI-powered pets are a good alternative because \"they could help around the house.\" The teacher feedback directly echoes that exact idea by saying \"the AI pets could help around the house,\" which shows a clear connection to the student's actual wording and reasoning. The follow-up suggestion to add other details from the article is tied to that existing claim rather than being completely generic. The feedback also accurately understands the student's point and does not misread or shift it to a different argument. Therefore, the feedback is sufficiently anchored in the student response.",
  "key_features": {
    "specific_reference_to_student_work": {
      "met": 1,
      "justification": "The feedback explicitly repeats the student's reason that AI pets \"could help around the house,\" directly referencing the student's actual claim."
    },
    "not_generic_or_template_based": {
      "met": 1,
      "justification": "Although it includes a common suggestion to add details, it is not purely generic because it first identifies the student's specific idea about helping around the house."
    },
    "engage_based_on_accurate_understanding": {
      "met": 1,
      "justification": "The feedback accurately understands and responds to the student's stated reason rather than introducing a different idea."
    }
  },
  "proposed_adjustment": "No adjustment needed; the feedback already meets the criterion. It could be even stronger by asking for a specific example of how the pets help around the house.",
  "anchored_in_student_response_score": 1
}
```
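
Downstream code will usually want to confirm that an output has the expected shape before using it. The sketch below assumes the field names in the example output above form a stable schema, which the source does not guarantee; `AnchorResult` and `parse_result` are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass

# Field names copied from the example output; treating them as a fixed
# schema is an assumption of this sketch.
TOP_LEVEL_FIELDS = (
    "reasoning",
    "key_features",
    "proposed_adjustment",
    "anchored_in_student_response_score",
)


@dataclass
class AnchorResult:
    score: int
    reasoning: str
    proposed_adjustment: str
    features: dict[str, int]  # feature name -> met (1) / not met (0)


def parse_result(raw: str) -> AnchorResult:
    """Parse one evaluator output and check that every judgment is binary."""
    data = json.loads(raw)
    missing = [f for f in TOP_LEVEL_FIELDS if f not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    score = data["anchored_in_student_response_score"]
    features = {name: entry["met"] for name, entry in data["key_features"].items()}
    for label, value in [("score", score), *features.items()]:
        if value not in (0, 1):
            raise ValueError(f"{label} must be 0 or 1, got {value!r}")
    return AnchorResult(score, data["reasoning"], data["proposed_adjustment"], features)
```

Failing loudly on an unexpected shape is a design choice here: a silently misparsed score would be harder to notice than a raised error.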
Interpreting results
| Output | What it means | How to use it |
|---|---|---|
| Score + Reasoning | Whether the feedback is genuinely based on this student’s response, with a synopsis of why it passed or failed the anchoring threshold | Evaluate whether your generated feedback is responding to this student’s actual work or to a generic version of the task. Validate that your prompts produce responsive feedback; aggregate reasoning across runs to detect drift toward safe, generic phrasing |
| Key features + Proposed adjustment | Whether a 0 stems from genericness or from a misreading, plus concrete moves to fix it | Pinpoint and correct the specific failure mode. For example, require the model to quote or restate a specific element of the student’s response before offering guidance |
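
Aggregating across runs, as the table suggests, can be as simple as tallying scores and unmet key features. This is a rough sketch that assumes each run has already been loaded into a dict shaped like the example output; `summarize_runs` is a hypothetical helper, not something the evaluator provides.

```python
from collections import Counter


def summarize_runs(outputs: list[dict]) -> dict:
    """Return the share of anchored feedback and a count of unmet key features."""
    total = len(outputs)
    anchored = sum(o["anchored_in_student_response_score"] for o in outputs)
    unmet = Counter(
        name
        for o in outputs
        for name, entry in o["key_features"].items()
        if entry["met"] == 0
    )
    return {
        "anchored_rate": anchored / total if total else 0.0,
        "unmet_counts": dict(unmet),
    }
```

A falling `anchored_rate` together with a rising count for `not_generic_or_template_based` would be one sign of drift toward stock phrasing, while a rising count for `engage_based_on_accurate_understanding` would point to misreadings instead.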
Accuracy and validation
This evaluator is provided as Early access. Reported metrics come from a small
held-out test split (24 examples) with wide confidence intervals and should be
read as directional. Validation testing is ongoing.
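
To get a feel for how wide the interval is on 24 examples, one option is a Wilson score interval. The sketch below is illustrative only: the source does not say which interval method was used, and the count of 19 correct out of 24 is inferred here from the 79% accuracy reported for this evaluator rather than stated in the source.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumed count: 19 of 24 correct, inferred from the reported accuracy.
low, high = wilson_interval(19, 24)
```

Under those assumptions the interval runs from roughly 0.60 to 0.91, which is why the reported figures are best treated as directional.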
We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.
| Metric | Description | Result |
|---|---|---|
| Accuracy | How accurately the evaluator’s anchored-in-student-response score matches expert annotations on the held-out test split. | 79% (naive baseline: 75%) |
| Macro-F1 | Macro-F1 averaged over both classes on the held-out test split. | 79% (naive baseline: 73%) |
| Model selection signal | Mean optimized performance of GPT-5.4 across the five feedback dimensions, which drove model selection. | GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions |
| Dataset source | Expert-annotated Quill.org student-response and teacher-feedback pairs used for validation. | Quill.org classroom writing data (64 labeled pairs; 20 train / 20 validation / 24 test), annotated by Leanlab Education |
On this dimension, GEPA optimization improved GPT-5.4 over its naive baseline
on the held-out test set (+4 points accuracy, +6 points macro-F1).
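
For readers who want to reproduce these two metrics on their own labeled pairs, accuracy and macro-F1 for a binary score can be computed as follows. This is a plain-Python sketch of the standard definitions, not the project's evaluation code, and the labels in the example are made up for illustration.

```python
def accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Fraction of predictions that match the expert label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: list[int], y_pred: list[int], cls: int) -> float:
    """F1 score treating `cls` as the positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: list[int], y_pred: list[int], classes=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up labels, for illustration only.
expert = [1, 1, 1, 0]
evaluator = [1, 1, 0, 0]
acc = accuracy(expert, evaluator)
mf1 = macro_f1(expert, evaluator)
```

Because macro-F1 weights both classes equally, it penalizes an evaluator that does well only on the more common label, which accuracy alone can hide.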
Evaluator release history
| Date | Changed |
|---|---|
| June 24, 2026 | First release |