Appropriateness - AI Developer Tools for Education

Evaluator last updated June 24, 2026. The Appropriateness feedback evaluator assesses whether a piece of feedback correctly judges if a student’s response has met the task goal, and therefore whether the student should revise further or move on.

At a glance


Input type	Teacher feedback on a student writing response
Supported grades	8–9
Considers	Accurate task assessment (whether the response meets the task’s relevant requirements); revision-need identification (distinguishing between feedback that should direct revision and feedback that should signal “move on”); appropriate signal when the task is complete (signaling the student can move on when the goal is met)

Model and prompt

See Quickstart to run the evaluator.


Model used	GPT-5.4
Temperature	1.0
Optimization	DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels
Prompts	Available via the Learning Commons Portal

GPT-5.4 was selected as the shipped model because it had the best mean performance across the Feedback Quality Evaluator Suite. Other models and configurations will produce different results and may have lower accuracy.

Inputs

Inputs must be de-identified. Do not submit student PII or any regulated or sensitive personal information.

Input	Description	Required
Student text	Student’s constructed-response writing	Yes
Feedback text	Teacher or AI-generated feedback to be evaluated	Yes

Example input

{
  "student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
  "feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
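As a minimal sketch of preparing this input, the Python below builds the JSON payload and rejects missing or empty fields. The function name `build_input` is hypothetical and is not part of the evaluator; only the two field names come from the example above. It does not check for PII, so de-identification still has to happen before this step.

```python
import json


def build_input(student_text: str, feedback_text: str) -> str:
    """Return a JSON payload holding the two required fields.

    Hypothetical helper: only the field names are taken from the
    documented example input.
    """
    fields = {"student_text": student_text, "feedback_text": feedback_text}
    for name, value in fields.items():
        # Both inputs are required, so reject non-strings and blank text.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be non-empty text")
    # ensure_ascii=False keeps curly quotes and other non-ASCII text readable.
    return json.dumps(fields, ensure_ascii=False)
```

How the payload is then submitted depends on the Quickstart setup and is not shown here.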

Output

Field	Description
Appropriate feedback score	Binary judgment (1 / 0). A 1 applies when the feedback’s stance matches the student’s actual need: it directs revision when the response doesn’t meet the goal, or signals “move on” when the response does. A 0 applies when the feedback misses needed revision (telling the student they’re done when they aren’t) or asks for revision when the response already meets the task goal.
Reasoning	High-level summary of why the feedback received this judgment
Key features	One entry per factor driving the judgment (accurate task assessment, revision-need identification, appropriate signal when the task is complete), each marked met (1) / not met (0) with a justification
Proposed adjustment	Suggested moves to align the feedback’s stance with the task goal, for developers iterating on prompts or for teachers using the decision instructionally

Example output

{
  "reasoning": "The relevant task goal is to produce a claim with at least one relevant supporting reason or piece of evidence. The student does make a clear claim: AI-powered pets are a good alternative to real pets. The support given is that they could help around the house, which is a relevant reason, but it is very vague and weakened by \"etc.,\" so it does not function as sufficiently specific support on its own. That means the response still needs revision to meet the claim-and-support goal. The teacher feedback correctly recognizes the basic idea as relevant, but then clearly asks the student to add other details from the article to strengthen the claim. That is a clear revision signal and matches the fact that the student needs to revise rather than move on.",
  "key_features": {
    "accurate_task_assessment": {
      "met": 1,
      "justification": "The feedback treats the response as needing improvement, which matches the student text because the support is too vague (\"help around the house etc.\") to fully meet the claim-and-evidence goal."
    },
    "revision_need_identification": {
      "met": 1,
      "justification": "The teacher clearly signals revision is needed by asking the student to add more details from the article, which is the correct next step for this incomplete response."
    },
    "appropriate_signal_when_task_complete": {
      "met": 0,
      "justification": "This feature is not applicable because the student response does not yet meet the relevant task requirements, so a move-on signal would not be appropriate."
    }
  },
  "proposed_adjustment": "The feedback already meets the criterion. It could be even stronger by naming the problem more directly, for example: \"Your claim is clear, but your reason is too vague right now. Add one or two specific details from the article to explain how AI pets help around the house.\"",
  "appropriate_feedback_score": 1
}
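One way to consume this output is to parse it into a small typed structure and check that every judgment is binary. The sketch below assumes the evaluator returns JSON shaped like the example above; `EvaluatorResult` and `parse_result` are hypothetical names introduced for illustration.

```python
import json
from dataclasses import dataclass

# Key feature names as they appear in the example output.
FEATURES = (
    "accurate_task_assessment",
    "revision_need_identification",
    "appropriate_signal_when_task_complete",
)


@dataclass
class EvaluatorResult:
    score: int
    reasoning: str
    features_met: dict
    proposed_adjustment: str


def parse_result(raw: str) -> EvaluatorResult:
    """Parse one evaluator response and validate its binary fields."""
    data = json.loads(raw)
    score = data["appropriate_feedback_score"]
    if score not in (0, 1):
        raise ValueError(f"unexpected score: {score!r}")
    features_met = {}
    for name in FEATURES:
        met = data["key_features"][name]["met"]
        if met not in (0, 1):
            raise ValueError(f"unexpected 'met' value for {name}: {met!r}")
        features_met[name] = met
    return EvaluatorResult(
        score=score,
        reasoning=data["reasoning"],
        features_met=features_met,
        proposed_adjustment=data["proposed_adjustment"],
    )
```

Note that in the example output a feature can be marked 0 because it does not apply, even though the overall score is 1, so individual `met` values should be read alongside their justifications.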

Interpreting results

Output	What it means	How to use it
Score + Reasoning	Whether the feedback’s revise / move-on decision matches the student’s actual need, with a synopsis of why it passed or failed	Evaluate whether your generated feedback is solving the correct problem. Validate that your prompts respond to actual student need rather than defaulting to a fixed stance (always affirm, or always find something to fix)
Key features + Proposed adjustment	Whether a 0 stems from missing needed revision or from asking for unnecessary revision, plus concrete moves to fix it	Pinpoint and correct the specific failure mode. Require the model to assess whether the response meets the task goal before deciding to direct revision; target the specific failure direction
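To apply the guidance in the table above across a batch, it can help to tally which key features are unmet among the score-0 results. The following is a rough sketch, assuming each result is a dict shaped like the example output; `summarize_failures` is a hypothetical helper, and the tally is only a starting point for reading the justifications, not a definitive diagnosis of failure direction.

```python
from collections import Counter


def summarize_failures(results):
    """Return (pass_rate, Counter of unmet features among score-0 results).

    `results` is a list of dicts shaped like the documented example output.
    """
    failures = [r for r in results if r["appropriate_feedback_score"] == 0]
    unmet = Counter(
        name
        for r in failures
        for name, feature in r["key_features"].items()
        if feature["met"] == 0
    )
    # Avoid dividing by zero on an empty batch.
    pass_rate = 1 - len(failures) / len(results) if results else None
    return pass_rate, unmet
```

A batch in which one unmet feature dominates may suggest the prompt is defaulting to a fixed stance, which is the pattern the table recommends checking for.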

Accuracy and validation

This evaluator is provided as Early access. Reported metrics come from a small held-out test split (20 examples) with wide confidence intervals and should be read as directional. Validation testing is ongoing.

We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.

Metric	Description	Result
Accuracy	How accurately the evaluator’s appropriate-feedback score matches expert annotations on the held-out test split.	70% (naive baseline: 50%)
Macro-F1	Macro-F1 averaged over both classes on the held-out test split.	70% (naive baseline: 45%)
Model selection signal	Mean optimized performance of GPT-5.4 across the 5 feedback dimensions, which drove model selection.	GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions
Dataset source	Expert-annotated Quill.org student-response and teacher-feedback pairs used for validation.	Quill.org classroom writing data (52 labeled pairs; 16 train / 16 validation / 20 test), annotated by Leanlab Education

On this dimension, GEPA optimization improved GPT-5.4 substantially over its naive baseline on the held-out test set (+20 points accuracy, +25 points macro-F1), the largest optimization gain in the suite.
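To give a sense of how wide the confidence interval is at this sample size, the sketch below computes a 95% Wilson score interval for the reported accuracy. It assumes that 70% on the 20-example test split corresponds to 14 correct judgments, and the Wilson method is chosen here for illustration; the source does not state which interval method was used.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Assumption: 70% accuracy on 20 test examples means 14 correct.
low, high = wilson_interval(14, 20)
print(f"{low:.2f} to {high:.2f}")  # prints "0.48 to 0.85"
```

Under these assumptions the interval spans roughly 37 percentage points and its lower end sits close to the 50% naive baseline, which is consistent with treating the reported metrics as directional.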

Evaluator release history

Date	Changed
June 24, 2026	First release
