Evaluator last updated June 24, 2026.

The Student Response Anchor feedback evaluator assesses whether a piece of feedback is clearly based on the student’s specific response, that is, whether it references, builds on, or responds to the student’s own idea, wording, or use of evidence. In short, it assesses whether the feedback demonstrates an accurate understanding of what the student wrote.

At a glance

Input type: Teacher feedback on a student writing response
Supported grades: 8–9
Considers:
  • Specific reference to the student’s work (the feedback names or builds on the student’s particular idea, wording, or evidence)
  • Whether the feedback is generic or template-based (a stock phrase that could apply to any response)
  • Whether feedback is based on an accurate understanding and interpretation of the student’s response
  • Whether the feedback addresses something the student did not actually say or attempt

Model and prompt

See Quickstart to run the evaluator.
Model used: GPT-5.4
Temperature: 1.0
Optimization: DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels
Prompts: Available via the Learning Commons Portal
GPT-5.4 was selected as the shipped model because it had the best mean performance across the Feedback Quality Evaluator Suite. Other models and configurations will produce different results and may have lower accuracy.

Inputs

Inputs must be de-identified. Do not submit student PII or any regulated or sensitive personal information.
Requirement | Supported | Required
Student text | Student’s constructed-response writing | Yes
Feedback text | Teacher or AI-generated feedback to be evaluated | Yes
Example input
{
  "student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
  "feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
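
A minimal sketch of assembling and checking this payload before submission. The helper name `build_input` is hypothetical and not part of the evaluator; it only enforces what the Inputs table states (both fields are required) and assumes each field is a non-empty string. It does not de-identify text, which must happen before this step.

```python
import json


def build_input(student_text: str, feedback_text: str) -> dict:
    """Return an input payload with the two required fields.

    Hypothetical helper for illustration; it is not part of the evaluator.
    """
    payload = {"student_text": student_text, "feedback_text": feedback_text}
    for name, value in payload.items():
        # Assumption: an empty or whitespace-only string counts as missing.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be a non-empty string")
    return payload


payload = build_input(
    "Some people think AI-powered pets are a good alternative to real pets "
    "because they could help around the house etc.",
    "You're right, the AI pets could help around the house. Can you find some "
    "other details from the article that you could add to make your claim stronger?",
)
print(json.dumps(payload, indent=2))
```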

Output

Field | Description
Anchored in student response score | Binary judgment (1 / 0). A 1 applies when feedback references, builds on, or responds to the student’s own idea, wording, or use of evidence and demonstrates accurate understanding of what they wrote. A 0 signals feedback that is generic or template-like, built on a misinterpretation, or addressed to something the student did not say or attempt.
Reasoning | High-level summary of why the feedback received this judgment
Key features | One entry per factor driving the judgment (specific reference to the student’s work, not generic or template-based, engagement based on accurate understanding), each marked met (1) / not met (0) with a justification
Proposed adjustment | Suggested moves to anchor the feedback more clearly in the student’s response, for developers iterating on prompts or for teachers using the feedback instructionally
Example output
{
  "reasoning": "The student's specific claim is that AI-powered pets are a good alternative because \"they could help around the house.\" The teacher feedback directly echoes that exact idea by saying \"the AI pets could help around the house,\" which shows a clear connection to the student's actual wording and reasoning. The follow-up suggestion to add other details from the article is tied to that existing claim rather than being completely generic. The feedback also accurately understands the student's point and does not misread or shift it to a different argument. Therefore, the feedback is sufficiently anchored in the student response.",
  "key_features": {
    "specific_reference_to_student_work": {
      "met": 1,
      "justification": "The feedback explicitly repeats the student's reason that AI pets \"could help around the house,\" directly referencing the student's actual claim."
    },
    "not_generic_or_template_based": {
      "met": 1,
      "justification": "Although it includes a common suggestion to add details, it is not purely generic because it first identifies the student's specific idea about helping around the house."
    },
    "engage_based_on_accurate_understanding": {
      "met": 1,
      "justification": "The feedback accurately understands and responds to the student's stated reason rather than introducing a different idea."
    }
  },
  "proposed_adjustment": "No adjustment needed; the feedback already meets the criterion. It could be even stronger by asking for a specific example of how the pets help around the house.",
  "anchored_in_student_response_score": 1
}
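
A minimal sketch of reading one result, assuming the output arrives as a JSON string with the field names shown in the example output. The function `unmet_features` and the abbreviated failing result are hypothetical and made up for illustration; they are not real evaluator output.

```python
import json

# Key-feature names as they appear in the example output.
EXPECTED_FEATURES = (
    "specific_reference_to_student_work",
    "not_generic_or_template_based",
    "engage_based_on_accurate_understanding",
)


def unmet_features(result: dict) -> list[str]:
    """Return the key features marked not met (0) in one evaluator result."""
    score = result["anchored_in_student_response_score"]
    if score not in (0, 1):
        raise ValueError(f"expected a binary score, got {score!r}")
    features = result["key_features"]
    return [name for name in EXPECTED_FEATURES if features[name]["met"] == 0]


# Abbreviated, made-up result for a generic piece of feedback (illustration only).
raw = """
{
  "reasoning": "The feedback could apply to any response.",
  "key_features": {
    "specific_reference_to_student_work": {"met": 0, "justification": "No reference."},
    "not_generic_or_template_based": {"met": 0, "justification": "Stock phrase."},
    "engage_based_on_accurate_understanding": {"met": 1, "justification": "No misreading."}
  },
  "proposed_adjustment": "Quote the student's reason before giving advice.",
  "anchored_in_student_response_score": 0
}
"""
result = json.loads(raw)
print(result["anchored_in_student_response_score"], unmet_features(result))
```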

Interpreting results

Output | What it means | How to use it
Score + Reasoning | Whether the feedback is genuinely based on this student’s response, with a synopsis of why it passed or failed the anchoring threshold | Evaluate whether your generated feedback is responding to this student’s actual work or to a generic version of the task. Validate that your prompts produce responsive feedback; aggregate reasoning across runs to detect drift toward safe, generic phrasing.
Key features + Proposed adjustment | Whether a 0 stems from genericness or from a misreading, plus concrete moves to fix it | Pinpoint and correct the specific failure mode, for example by requiring the model to quote or restate a specific element of the student’s response before offering guidance.
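
One way to aggregate results across runs, as the table suggests, is to track the share of feedback scored 1 and to tally which key features are most often unmet. The sketch below assumes results shaped like the example output; `summarize_runs`, `fake_result`, and the four sample results are hypothetical and made up for illustration.

```python
from collections import Counter


def summarize_runs(results: list[dict]) -> tuple[float, Counter]:
    """Return the share of results scored 1 and a tally of unmet key features."""
    if not results:
        raise ValueError("no results to summarize")
    anchored = sum(r["anchored_in_student_response_score"] for r in results)
    unmet = Counter(
        name
        for r in results
        for name, feature in r["key_features"].items()
        if feature["met"] == 0
    )
    return anchored / len(results), unmet


def fake_result(score: int, specific: int, not_generic: int, accurate: int) -> dict:
    """Build a stripped-down result; all values are made up for illustration."""
    return {
        "anchored_in_student_response_score": score,
        "key_features": {
            "specific_reference_to_student_work": {"met": specific},
            "not_generic_or_template_based": {"met": not_generic},
            "engage_based_on_accurate_understanding": {"met": accurate},
        },
    }


runs = [
    fake_result(1, 1, 1, 1),  # anchored
    fake_result(0, 0, 0, 1),  # generic
    fake_result(0, 0, 0, 1),  # generic
    fake_result(0, 1, 1, 0),  # misreading
]
rate, unmet = summarize_runs(runs)
print(f"anchored rate: {rate:.2f}")
print(unmet.most_common())
```

A falling anchored rate alongside a rising count for `not_generic_or_template_based` would point to drift toward generic phrasing, while a rising count for `engage_based_on_accurate_understanding` would point to misreadings.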

Accuracy and validation

This evaluator is provided as Early access. Reported metrics come from a small held-out test split (24 examples) with wide confidence intervals and should be read as directional. Validation testing is ongoing.
We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.
Metric | Description | Result
Accuracy | How accurately the evaluator’s anchored-in-student-response score matches expert annotations on the held-out test split. | 79% (naive baseline: 75%)
Macro-F1 | Macro-F1 averaged over both classes on the held-out test split. | 79% (naive baseline: 73%)
Model selection signal | Mean optimized performance of GPT-5.4 across the 5 feedback dimensions, which drove model selection. | GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions
Dataset source | Expert-annotated Quill.org student-response and teacher-feedback pairs used for validation. | Quill.org classroom writing data (64 labeled pairs; 20 train / 20 validation / 24 test), annotated by Leanlab Education
On this dimension, GEPA optimization improved GPT-5.4 over its naive baseline on the held-out test set (+4 points accuracy, +6 points macro-F1).
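
To illustrate why a 24-example test split yields wide confidence intervals, the sketch below computes a Wilson score interval for the reported accuracy. It rests on two assumptions that are not stated in the source: that 79% corresponds to 19 of 24 correct, and that a Wilson interval is a reasonable choice here. The source does not say how its intervals were computed.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Assumption: 79% of 24 examples corresponds to 19 correct (19 / 24 is about 0.79).
low, high = wilson_interval(19, 24)
print(f"accuracy {19 / 24:.2f}, approx. 95% interval {low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 60% to 91%, which contains the 75% naive baseline, so the reported improvement is best read as directional.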

Evaluator release history

Date | Changed
June 24, 2026 | First release