Evaluator last updated June 24, 2026.

The Appropriateness feedback evaluator assesses whether a piece of feedback correctly identifies whether a student’s response has met the task goal and therefore needs no further revision.

At a glance

Input type: Teacher feedback on a student writing response
Supported grades: 8–9
Considers
  • Accurate task assessment (whether the response meets the task’s relevant requirements)
  • Revision-need identification (distinguishing between feedback that should direct revision and feedback that should signal “move on”)
  • Appropriate signal when the task is complete (signaling the student can move on when the goal is met)

Model and prompt

See Quickstart to run the evaluator.
Model used: GPT-5.4
Temperature: 1.0
Optimization: DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels
Prompts: Available via the Learning Commons Portal
GPT-5.4 was selected as the shipped model for its best mean performance across the Feedback Quality Evaluator Suite. Other models and configurations will produce different results and may have lower accuracy.

Inputs

Inputs must be de-identified. Do not submit student PII or any regulated or sensitive personal information.
| Requirement | Supported | Required |
| --- | --- | --- |
| Student text | Student’s constructed-response writing | Yes |
| Feedback text | Teacher or AI-generated feedback to be evaluated | Yes |
Example input
{
  "student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
  "feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
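A minimal sketch of assembling a payload shaped like the example input before submitting it. The helper name `build_input` is hypothetical and not part of the evaluator; only the two field names come from the example above. It checks that both required fields are non-empty strings and does not attempt de-identification.

```python
import json

# Field names taken from the example input; both are listed as required.
REQUIRED_FIELDS = ("student_text", "feedback_text")


def build_input(student_text: str, feedback_text: str) -> dict:
    """Hypothetical helper: build a payload and verify the required fields."""
    payload = {"student_text": student_text, "feedback_text": feedback_text}
    for field in REQUIRED_FIELDS:
        value = payload[field]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{field} is required and must be a non-empty string")
    return payload


payload = build_input(
    "Some people think AI-powered pets are a good alternative to real pets "
    "because they could help around the house etc.",
    "You're right, the AI pets could help around the house. Can you find some "
    "other details from the article that you could add to make your claim stronger?",
)
print(json.dumps(payload, indent=2))
```

De-identification still has to happen before this step; the check above only guards against missing or empty fields.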

Output

| Field | Description |
| --- | --- |
| Appropriate feedback score | Binary judgment (1 / 0). A 1 applies when the feedback’s stance matches the student’s actual need: it directs revision when the response doesn’t meet the goal, or signals “move on” when the response does. A 0 signals feedback that misses needed revision (telling the student they’re done when they aren’t) or asks for revision when the response already meets the task goal. |
| Reasoning | High-level summary of why the feedback received this judgment |
| Key features | One entry per factor driving the judgment (accurate task assessment, revision-need identification, appropriate signal when the task is complete), each marked met (1) or not met (0) with a justification |
| Proposed adjustment | Suggested moves to align the feedback’s stance with the task goal, for developers iterating on prompts or for teachers using the decision instructionally |
Example output
{
  "reasoning": "The relevant task goal is to produce a claim with at least one relevant supporting reason or piece of evidence. The student does make a clear claim: AI-powered pets are a good alternative to real pets. The support given is that they could help around the house, which is a relevant reason, but it is very vague and weakened by \"etc.,\" so it does not function as sufficiently specific support on its own. That means the response still needs revision to meet the claim-and-support goal. The teacher feedback correctly recognizes the basic idea as relevant, but then clearly asks the student to add other details from the article to strengthen the claim. That is a clear revision signal and matches the fact that the student needs to revise rather than move on.",
  "key_features": {
    "accurate_task_assessment": {
      "met": 1,
      "justification": "The feedback treats the response as needing improvement, which matches the student text because the support is too vague (\"help around the house etc.\") to fully meet the claim-and-evidence goal."
    },
    "revision_need_identification": {
      "met": 1,
      "justification": "The teacher clearly signals revision is needed by asking the student to add more details from the article, which is the correct next step for this incomplete response."
    },
    "appropriate_signal_when_task_complete": {
      "met": 0,
      "justification": "This feature is not applicable because the student response does not yet meet the relevant task requirements, so a move-on signal would not be appropriate."
    }
  },
  "proposed_adjustment": "The feedback already meets the criterion. It could be even stronger by naming the problem more directly, for example: \"Your claim is clear, but your reason is too vague right now. Add one or two specific details from the article to explain how AI pets help around the house.\"",
  "appropriate_feedback_score": 1
}
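A short sketch of reading a result like the one above, assuming the evaluator output arrives as a JSON string with the same keys. The `summarize` helper is hypothetical, and the long text fields are abbreviated here for space. Note that in the example a feature can be marked 0 because it does not apply, while the overall score is still 1, so the sketch reports unmet features alongside the score rather than inferring one from the other.

```python
import json

# Abbreviated copy of the example output; text fields shortened for space.
raw_output = """
{
  "reasoning": "(abbreviated)",
  "key_features": {
    "accurate_task_assessment": {"met": 1, "justification": "(abbreviated)"},
    "revision_need_identification": {"met": 1, "justification": "(abbreviated)"},
    "appropriate_signal_when_task_complete": {"met": 0, "justification": "(abbreviated)"}
  },
  "proposed_adjustment": "(abbreviated)",
  "appropriate_feedback_score": 1
}
"""


def summarize(result: dict) -> dict:
    """Hypothetical helper: pull out the score and any features marked 0."""
    features = result["key_features"]
    return {
        "appropriate": result["appropriate_feedback_score"] == 1,
        "unmet_features": [
            name for name, feature in features.items() if feature["met"] == 0
        ],
    }


summary = summarize(json.loads(raw_output))
print(summary)
```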

Interpreting results

| Output | What it means | How to use it |
| --- | --- | --- |
| Score + Reasoning | Whether the feedback’s revise / move-on decision matches the student’s actual need, with a synopsis of why it passed or failed | Evaluate whether your generated feedback is solving the correct problem. Validate that your prompts respond to actual student need rather than defaulting to a fixed stance (always affirm, or always find something to fix) |
| Key features + Proposed adjustment | Whether a 0 stems from missing needed revision or from asking for unnecessary revision, plus concrete moves to fix it | Pinpoint and correct the specific failure mode. Require the model to assess whether the response meets the task goal before deciding to direct revision; target the specific failure direction |
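One way to apply this across a batch is to tally how often feedback passes and which key features are marked unmet among the failures. The sketch below assumes each result is a dict with the output keys shown earlier; the `tally` helper and the two sample results are illustrative only, not real evaluator output.

```python
from collections import Counter


def tally(results: list) -> dict:
    """Hypothetical helper: pass rate plus unmet-feature counts among score-0 results."""
    if not results:
        return {"pass_rate": None, "unmet_in_failures": {}}
    scores = [r["appropriate_feedback_score"] for r in results]
    unmet = Counter(
        name
        for r in results
        if r["appropriate_feedback_score"] == 0
        for name, feature in r["key_features"].items()
        if feature["met"] == 0
    )
    return {"pass_rate": sum(scores) / len(scores), "unmet_in_failures": dict(unmet)}


# Made-up results for illustration; justifications omitted.
sample_results = [
    {
        "appropriate_feedback_score": 1,
        "key_features": {
            "accurate_task_assessment": {"met": 1},
            "revision_need_identification": {"met": 1},
            "appropriate_signal_when_task_complete": {"met": 0},
        },
    },
    {
        "appropriate_feedback_score": 0,
        "key_features": {
            "accurate_task_assessment": {"met": 0},
            "revision_need_identification": {"met": 0},
            "appropriate_signal_when_task_complete": {"met": 1},
        },
    },
]

report = tally(sample_results)
print(report)
```

A pass rate near 0 or 1 on a varied set of responses, or one feature dominating the failures, may point to a prompt that defaults to a fixed stance; the justifications are still needed to confirm the direction of the failure.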

Accuracy and validation

This evaluator is provided as Early access. Reported metrics come from a small held-out test split (20 examples) with wide confidence intervals and should be read as directional. Validation testing is ongoing.
We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.
| Metric | Description | Result |
| --- | --- | --- |
| Accuracy | How accurately the evaluator’s appropriate-feedback score matches expert annotations on the held-out test split. | 70% (naive baseline: 50%) |
| Macro-F1 | Macro-F1 averaged over both classes on the held-out test split. | 70% (naive baseline: 45%) |
| Model selection signal | Mean optimized performance of GPT-5.4 across the five feedback dimensions, which drove model selection. | GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions |
| Dataset source | Expert-annotated Quill.org student-response and teacher-feedback pairs used for validation. | Quill.org classroom writing data (52 labeled pairs; 16 train / 16 validation / 20 test), annotated by Leanlab Education |
On this dimension, GEPA optimization improved GPT-5.4 substantially over its naive baseline on the held-out test set (+20 points accuracy, +25 points macro-F1) — the largest optimization gain in the suite.
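To make "wide confidence intervals" concrete, the sketch below computes a 95% Wilson score interval for the reported accuracy. It assumes that 70% on 20 test examples corresponds to 14 correct judgments and that accuracy is a simple proportion over the split; the interval method is an illustrative choice, not necessarily the one used in validation.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 70% accuracy on the 20-example test split means 14 of 20 correct.
low, high = wilson_interval(14, 20)
print(f"{low:.2f} to {high:.2f}")  # prints "0.48 to 0.85"
```

Under these assumptions the interval spans roughly 48% to 85%, and its lower end sits close to the 50% naive baseline, which is why the reported metrics are best read as directional.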

Evaluator release history

| Date | Changed |
| --- | --- |
| June 24, 2026 | First release |