Evaluator last updated June 24, 2026.
The Appropriateness feedback evaluator assesses whether a piece of feedback correctly judges if a student’s response has met the task goal, and therefore whether it directs further revision only when revision is actually needed.
At a glance
| | |
|---|---|
| Input type | Teacher feedback on a student writing response |
| Supported grades | 8–9 |
| Considers | - Accurate task assessment (whether the response meets the task’s relevant requirements)<br>- Revision-need identification (distinguishing between feedback that should direct revision and feedback that should signal “move on”)<br>- Appropriate signal when the task is complete (signaling the student can move on when the goal is met) |
Model and prompt
See Quickstart to run the evaluator.
| | |
|---|---|
| Model used | GPT-5.4 |
| Temperature | 1.0 |
| Optimization | DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels |
| Prompts | Available via the Learning Commons Portal |
GPT-5.4 was selected as the shipped model because it had the best mean
performance across the Feedback Quality Evaluator Suite. Other models and
configurations will produce different results and may be less accurate.
Inputs must be de-identified. Do not submit student PII or any regulated or
sensitive personal information.
| Requirement | Supported | Required |
|---|---|---|
| Student text | Student’s constructed-response writing | Yes |
| Feedback text | Teacher or AI-generated feedback to be evaluated | Yes |
{
"student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
"feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
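A minimal sketch of assembling this payload in Python. The helper name `build_input` is hypothetical and not part of the evaluator's API; it only enforces what the table above states, namely that both fields are required. De-identification is assumed to have happened before this step.

```python
import json


def build_input(student_text: str, feedback_text: str) -> dict:
    """Build the evaluator input. Hypothetical helper, not an official API."""
    payload = {"student_text": student_text, "feedback_text": feedback_text}
    for field, value in payload.items():
        # Both fields are marked Required, so reject missing or blank values.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{field} is required and must be non-empty text")
    return payload


example = build_input(
    "Some people think AI-powered pets are a good alternative to real pets.",
    "Can you find some other details from the article to support your claim?",
)
print(json.dumps(example, indent=2))
```

Failing fast on a blank field keeps an incomplete pair from reaching the model, where it would otherwise produce a judgment with nothing to judge.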
Output
| Field | Description |
|---|---|
| Appropriate feedback score | Binary judgment (1 / 0). A 1 means the feedback’s stance matches the student’s actual need: it directs revision when the response does not meet the task goal, or signals “move on” when it does. A 0 means the feedback either misses needed revision (telling the student they are done when they are not) or asks for revision when the response already meets the task goal. |
| Reasoning | High-level summary of why the feedback received this judgment |
| Key features | One entry per factor driving the judgment (accurate task assessment, revision-need identification, appropriate signal when the task is complete), each marked met (1) / not met (0) with a justification |
| Proposed adjustment | Suggested moves to align the feedback’s stance with the task goal, for developers iterating on prompts or for teachers using the decision instructionally |
{
"reasoning": "The relevant task goal is to produce a claim with at least one relevant supporting reason or piece of evidence. The student does make a clear claim: AI-powered pets are a good alternative to real pets. The support given is that they could help around the house, which is a relevant reason, but it is very vague and weakened by \"etc.,\" so it does not function as sufficiently specific support on its own. That means the response still needs revision to meet the claim-and-support goal. The teacher feedback correctly recognizes the basic idea as relevant, but then clearly asks the student to add other details from the article to strengthen the claim. That is a clear revision signal and matches the fact that the student needs to revise rather than move on.",
"key_features": {
"accurate_task_assessment": {
"met": 1,
"justification": "The feedback treats the response as needing improvement, which matches the student text because the support is too vague (\"help around the house etc.\") to fully meet the claim-and-evidence goal."
},
"revision_need_identification": {
"met": 1,
"justification": "The teacher clearly signals revision is needed by asking the student to add more details from the article, which is the correct next step for this incomplete response."
},
"appropriate_signal_when_task_complete": {
"met": 0,
"justification": "This feature is not applicable because the student response does not yet meet the relevant task requirements, so a move-on signal would not be appropriate."
}
},
"proposed_adjustment": "The feedback already meets the criterion. It could be even stronger by naming the problem more directly, for example: \"Your claim is clear, but your reason is too vague right now. Add one or two specific details from the article to explain how AI pets help around the house.\"",
"appropriate_feedback_score": 1
}
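Before using a result downstream, it can help to check that it has the shape shown above. The following is a sketch only: `validate_result` and `EXPECTED_FEATURES` are hypothetical names, the field names are copied from the example output, and the string values in `sample` are abbreviated placeholders.

```python
EXPECTED_FEATURES = {
    "accurate_task_assessment",
    "revision_need_identification",
    "appropriate_signal_when_task_complete",
}


def validate_result(result: dict) -> dict:
    """Check a result against the fields in the example output (assumed schema)."""
    for key in ("reasoning", "key_features", "proposed_adjustment",
                "appropriate_feedback_score"):
        if key not in result:
            raise ValueError(f"missing field: {key}")
    if result["appropriate_feedback_score"] not in (0, 1):
        raise ValueError("appropriate_feedback_score must be 0 or 1")
    if set(result["key_features"]) != EXPECTED_FEATURES:
        raise ValueError("unexpected key_features entries")
    for name, feature in result["key_features"].items():
        if feature.get("met") not in (0, 1):
            raise ValueError(f"{name}.met must be 0 or 1")
    return result


# Abbreviated copy of the example output; long strings replaced by placeholders.
sample = {
    "reasoning": "(abbreviated)",
    "key_features": {
        "accurate_task_assessment": {"met": 1, "justification": "(abbreviated)"},
        "revision_need_identification": {"met": 1, "justification": "(abbreviated)"},
        "appropriate_signal_when_task_complete": {"met": 0, "justification": "(abbreviated)"},
    },
    "proposed_adjustment": "(abbreviated)",
    "appropriate_feedback_score": 1,
}
validate_result(sample)
```

Note that the example has an overall score of 1 even though one key feature is marked 0: the justification says that feature is not applicable because the response still needs revision. A check that requires every feature to be 1 would therefore reject valid passing results.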
Interpreting results
| Output | What it means | How to use it |
|---|---|---|
| Score + Reasoning | Whether the feedback’s revise / move-on decision matches the student’s actual need, with a synopsis of why it passed or failed | Evaluate whether your generated feedback is solving the correct problem. Validate that your prompts respond to actual student need rather than defaulting to a fixed stance (always affirm, or always find something to fix) |
| Key features + Proposed adjustment | Whether a 0 stems from missing needed revision or from asking for unnecessary revision, plus concrete moves to fix it | Pinpoint and correct the specific failure mode. Require the model to assess whether the response meets the task goal before deciding to direct revision; target the specific failure direction |
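One way to apply the table above across a batch is to compute the pass rate and, for failing items only, count which key features were not met. This is a sketch under assumptions: `summarize` is a hypothetical helper, and the two results in `batch` are made-up illustrative data, not real evaluator output.

```python
from collections import Counter


def summarize(results: list[dict]) -> dict:
    """Pass rate plus, for score-0 items only, how often each feature was not met."""
    failures = [r for r in results if r["appropriate_feedback_score"] == 0]
    unmet = Counter(
        name
        for r in failures
        for name, feature in r["key_features"].items()
        if feature["met"] == 0
    )
    pass_rate = 1 - len(failures) / len(results) if results else 0.0
    return {"pass_rate": pass_rate, "unmet_in_failures": dict(unmet)}


# Made-up illustrative results (justifications omitted for brevity).
batch = [
    {
        "appropriate_feedback_score": 1,
        "key_features": {
            "accurate_task_assessment": {"met": 1},
            "revision_need_identification": {"met": 1},
            "appropriate_signal_when_task_complete": {"met": 0},
        },
    },
    {
        "appropriate_feedback_score": 0,
        "key_features": {
            "accurate_task_assessment": {"met": 0},
            "revision_need_identification": {"met": 0},
            "appropriate_signal_when_task_complete": {"met": 0},
        },
    },
]
print(summarize(batch))
```

Counting unmet features only among failures is a deliberate choice here: a passing result can still carry a 0 on a feature that did not apply, so counting across all results would overstate how often that feature is the problem.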
Accuracy and validation
This evaluator is provided as Early access. Reported metrics come from a small
held-out test split (20 examples) with wide confidence intervals and should be
read as directional. Validation testing is ongoing.
We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.
| Metric | Description | Result |
|---|---|---|
| Accuracy | How accurately the evaluator’s appropriate-feedback score matches expert annotations on the held-out test split. | 70% (naive baseline: 50%) |
| Macro-F1 | Unweighted mean of the per-class F1 scores (score 1 and score 0) on the held-out test split. | 70% (naive baseline: 45%) |
| Model selection signal | Mean optimized performance of GPT-5.4 across the 5 feedback dimensions, which drove model selection. | GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions |
| Dataset source | Expert-annotated Quill.org student-response and teacher-feedback pairs used for validation. | Quill.org classroom writing data (52 labeled pairs; 16 train / 16 validation / 20 test), annotated by Leanlab Education |
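For readers unfamiliar with the two metrics in the table, the sketch below computes accuracy and macro-F1 for binary labels in plain Python. The function names and the four-item `gold` / `pred` lists are illustrative assumptions, not the evaluation code or data behind the reported numbers.

```python
def accuracy(gold, pred):
    """Fraction of items where the prediction equals the expert label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_for_class(gold, pred, cls):
    """F1 for one class, written as 2*TP / (2*TP + FP + FN)."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold, pred, classes=(0, 1)):
    """Unweighted mean of per-class F1, so both classes count equally."""
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)


gold = [1, 1, 0, 0]  # toy expert labels
pred = [1, 0, 0, 0]  # toy evaluator scores
print(accuracy(gold, pred))            # 0.75
print(round(macro_f1(gold, pred), 3))  # 0.733
```

Because macro-F1 weights both classes equally, an evaluator that leans toward one score is penalized even when overall accuracy looks acceptable.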
On this dimension, GEPA optimization improved GPT-5.4 substantially over its
naive baseline on the held-out test set: accuracy rose from 50% to 70%
(+20 points) and macro-F1 from 45% to 70% (+25 points). This is the largest
optimization gain in the suite.
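To make "wide confidence intervals" concrete, the sketch below computes a 95% Wilson score interval for the reported accuracy. Two assumptions are ours, not the source's: that 70% on 20 test examples corresponds to 14 correct, and that a Wilson interval is a reasonable choice for a sample this small. The documentation does not say which interval method was used.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Assumed count: 70% of 20 held-out examples = 14 correct.
low, high = wilson_interval(14, 20)
print(f"{low:.2f} to {high:.2f}")  # 0.48 to 0.85
```

Under these assumptions the interval runs from roughly 48% to 85%, and its lower end sits just below the 50% naive baseline, which is why the reported gain is best read as directional.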
Evaluator release history
| Date | Changed |
|---|---|
| June 24, 2026 | First release |