Student Response Anchor - AI Developer Tools for Education

Evaluator last updated June 24, 2026. The Student Response Anchor feedback evaluator assesses whether a piece of feedback is clearly based on the student’s specific response: that is, whether it references, builds on, or responds to the student’s own idea, wording, or use of evidence. In short, it assesses whether the feedback demonstrates an accurate understanding of what the student wrote.

At a glance


Input type	Teacher feedback on a student writing response
Supported grades	8–9
Considers	Specific reference to the student’s work (the feedback names or builds on the student’s particular idea, wording, or evidence); whether the feedback is generic or template-based (a stock phrase that could apply to any response); whether the feedback is based on an accurate understanding and interpretation of the student’s response; whether the feedback addresses something the student did not actually say or attempt

Model and prompt

See Quickstart to run the evaluator.


Model used	GPT-5.4
Temperature	1.0
Optimization	DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels
Prompts	Available via the Learning Commons Portal

GPT-5.4 was selected as the shipped model because it had the best mean performance across the Feedback Quality Evaluator Suite. Other models and configurations will produce different results and may have lower accuracy.

Inputs

Inputs must be de-identified. Do not submit student PII or any regulated or sensitive personal information.

Requirement	Supported	Required
Student text	Student’s constructed-response writing	Yes
Feedback text	Teacher or AI-generated feedback to be evaluated	Yes

Example input

{
  "student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
  "feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
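As a minimal sketch of preparing this input, the Python below builds the two-field payload and rejects missing or empty text. The function name `build_payload` is hypothetical and not part of the evaluator. The sketch does not perform de-identification, which must happen before this step.

```python
import json

# Field names taken from the example input above.
REQUIRED_FIELDS = ("student_text", "feedback_text")


def build_payload(student_text, feedback_text):
    """Return a JSON string in the documented input shape.

    Hypothetical helper: it only checks that both required fields are
    non-empty strings. It does NOT de-identify the text.
    """
    payload = {"student_text": student_text, "feedback_text": feedback_text}
    for name in REQUIRED_FIELDS:
        value = payload[name]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be non-empty text")
    return json.dumps(payload, ensure_ascii=False)


if __name__ == "__main__":
    print(build_payload(
        "Some people think AI-powered pets are a good alternative to real pets.",
        "You're right, the AI pets could help around the house.",
    ))
```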

Output

Field	Description
Anchored in student response score	Binary judgment (1 / 0). A 1 applies when feedback references, builds on, or responds to the student’s own idea, wording, or use of evidence and demonstrates accurate understanding of what they wrote. A 0 signals feedback that is generic or template-like, built on a misinterpretation, or addressed to something the student did not say or attempt.
Reasoning	High-level summary of why the feedback received this judgment
Key features	One entry per factor driving the judgment (specific reference to the student’s work, not generic or template-based, engagement based on accurate understanding), each marked met (1) / not met (0) with a justification
Proposed adjustment	Suggested moves to anchor the feedback more clearly in the student’s response, for developers iterating on prompts or for teachers using the feedback instructionally

Example output

{
  "reasoning": "The student's specific claim is that AI-powered pets are a good alternative because \"they could help around the house.\" The teacher feedback directly echoes that exact idea by saying \"the AI pets could help around the house,\" which shows a clear connection to the student's actual wording and reasoning. The follow-up suggestion to add other details from the article is tied to that existing claim rather than being completely generic. The feedback also accurately understands the student's point and does not misread or shift it to a different argument. Therefore, the feedback is sufficiently anchored in the student response.",
  "key_features": {
    "specific_reference_to_student_work": {
      "met": 1,
      "justification": "The feedback explicitly repeats the student's reason that AI pets \"could help around the house,\" directly referencing the student's actual claim."
    },
    "not_generic_or_template_based": {
      "met": 1,
      "justification": "Although it includes a common suggestion to add details, it is not purely generic because it first identifies the student's specific idea about helping around the house."
    },
    "engage_based_on_accurate_understanding": {
      "met": 1,
      "justification": "The feedback accurately understands and responds to the student's stated reason rather than introducing a different idea."
    }
  },
  "proposed_adjustment": "No adjustment needed; the feedback already meets the criterion. It could be even stronger by asking for a specific example of how the pets help around the house.",
  "anchored_in_student_response_score": 1
}
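A result shaped like the example above can be checked before it is used downstream. The sketch below assumes the field names shown in the example output; the helper `parse_result` is hypothetical. It confirms that the score and each key-feature flag are 0 or 1 and collects the features marked not met.

```python
import json

# Key-feature names as they appear in the example output above.
FEATURES = (
    "specific_reference_to_student_work",
    "not_generic_or_template_based",
    "engage_based_on_accurate_understanding",
)


def parse_result(raw):
    """Validate an evaluator result and list key features marked not met."""
    result = json.loads(raw)
    score = result["anchored_in_student_response_score"]
    if score not in (0, 1):
        raise ValueError(f"unexpected score: {score!r}")
    unmet = []
    for name in FEATURES:
        met = result["key_features"][name]["met"]
        if met not in (0, 1):
            raise ValueError(f"unexpected 'met' value for {name}: {met!r}")
        if met == 0:
            unmet.append(name)
    return {
        "anchored": score == 1,
        "unmet_features": unmet,
        "reasoning": result["reasoning"],
        "proposed_adjustment": result["proposed_adjustment"],
    }
```

The sketch reports the score and the key features separately; it does not assume any fixed rule for how the features combine into the final score.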

Interpreting results

Output	What it means	How to use it
Score + Reasoning	Whether the feedback is genuinely based on this student’s response, with a synopsis of why it passed or failed the anchoring threshold	Check whether your generated feedback responds to this student’s actual work or to a generic version of the task. Validate that your prompts produce responsive feedback, and aggregate reasoning across runs to detect drift toward safe, generic phrasing.
Key features + Proposed adjustment	Whether a 0 stems from genericness or from a misreading, plus concrete moves to fix it	Pinpoint and correct the specific failure mode. For example, require the model to quote or restate a specific element of the student’s response before offering guidance.
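One way to aggregate results across runs is sketched below. It assumes each result is a dictionary in the output shape documented above; the function `summarize_runs` is hypothetical. It reports the share of feedback scored as anchored and counts how often each key feature was marked not met, which helps separate genericness from misreading.

```python
from collections import Counter


def summarize_runs(results):
    """Summarize a list of evaluator results (dicts in the documented shape)."""
    total = len(results)
    anchored = sum(r["anchored_in_student_response_score"] for r in results)
    # Count every key feature marked not met (0) across all runs.
    unmet = Counter(
        name
        for r in results
        for name, feature in r["key_features"].items()
        if feature["met"] == 0
    )
    return {
        "runs": total,
        "anchored_rate": anchored / total if total else 0.0,
        "unmet_counts": dict(unmet),
    }
```

A falling `anchored_rate` together with a rising count for `not_generic_or_template_based` would suggest drift toward stock phrasing, while a rising count for `engage_based_on_accurate_understanding` would point to misreadings.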

Accuracy and validation

This evaluator is provided as Early access. Reported metrics come from a small held-out test split (24 examples) with wide confidence intervals and should be read as directional. Validation testing is ongoing.
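To illustrate how wide an interval on 24 examples can be, the sketch below computes a 95% Wilson score interval. The count of 19 correct out of 24 is an assumption chosen because it rounds to about 79%; the per-example counts are not given here, and the Wilson method is one common choice rather than the method used for the reported metrics.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Assumed count: 19 of 24 correct (about 79%).
    low, high = wilson_interval(19, 24)
    print(f"{low:.2f} to {high:.2f}")
```

Under that assumption the interval spans roughly 0.60 to 0.91, which is why differences of a few points on this split should be read as directional.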

We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.

Metric	Description	Result
Accuracy	Share of held-out test examples where the evaluator’s anchored-in-student-response score matches the expert annotation.	79% (naive baseline: 75%)
Macro-F1	Unweighted mean of the per-class F1 scores (classes 1 and 0) on the held-out test split.	79% (naive baseline: 73%)
Model selection signal	Mean optimized performance of GPT-5.4 across the five feedback dimensions, which drove model selection.	GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions
Dataset source	Expert-annotated Quill.org student-response and teacher-feedback pairs used for validation.	Quill.org classroom writing data (64 labeled pairs; 20 train / 20 validation / 24 test), annotated by Leanlab Education
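For readers comparing their own runs against expert labels, accuracy and macro-F1 for binary scores can be computed as sketched below. The function names and the toy labels are illustrative only and are not drawn from the validation dataset.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the expert label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    expert = [1, 1, 0, 0]     # toy expert annotations
    evaluator = [1, 0, 0, 0]  # toy evaluator scores
    print(accuracy(expert, evaluator), macro_f1(expert, evaluator))
```

Because macro-F1 weights both classes equally, it penalizes an evaluator that does well on the majority class but poorly on the minority class more than accuracy does.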

On this dimension, GEPA optimization improved GPT-5.4 over its naive baseline on the held-out test set (+4 points accuracy, +6 points macro-F1).

Evaluator release history

Date	Changed
June 24, 2026	First release
