Reference documentation for the Manageability feedback evaluator.
Evaluator last updated June 24, 2026.
The Manageability feedback evaluator assesses whether a piece of feedback is focused enough that a student can realistically process and use it.
Manageable feedback is concise, clear, and prioritizes no more than 2 issues, so that a student reading it knows immediately what to do first.
DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels
Prompts: Available via the Learning Commons Portal
GPT-5.4 was selected as the shipped model because it had the best mean
performance across the Feedback Quality Evaluator Suite. Other models and
configurations will produce different results and may be less accurate.
Inputs must be de-identified. Do not submit student PII or any regulated or
sensitive personal information.
| Requirement | Supported | Required |
| --- | --- | --- |
| Student text | Student’s constructed-response writing | Yes |
| Feedback text | Teacher or AI-generated feedback to be evaluated | Yes |
Example input
{
  "student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
  "feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
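The two required fields can be assembled and checked before submission. The following is a minimal sketch, not the evaluator's own client: `build_payload` is a hypothetical helper, and only the field names `student_text` and `feedback_text` come from the example input. It does not de-identify anything; that step must happen before the text reaches this function.

```python
import json

# Field names taken from the example input; everything else here is illustrative.
REQUIRED_FIELDS = ("student_text", "feedback_text")


def build_payload(student_text: str, feedback_text: str) -> str:
    """Return a JSON string with both required fields, rejecting empty values.

    Hypothetical helper: it only checks presence, it does not remove PII.
    """
    payload = {"student_text": student_text, "feedback_text": feedback_text}
    for name in REQUIRED_FIELDS:
        value = payload[name]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be a non-empty string")
    # ensure_ascii=False keeps typographic quotes and accents readable.
    return json.dumps(payload, ensure_ascii=False)
```

Rejecting empty strings early avoids spending an evaluator call on an input that is missing a required field.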
Binary judgment (1 / 0). A 1 applies when feedback is focused enough that a student could realistically process and act on it in a single revision attempt — generally four sentences or fewer, addressing no more than two clearly prioritized issues — or when the response required no revision and the feedback stays appropriately brief. A 0 signals feedback that is too dense, too long, or raises too many issues at once, or feedback so thin it omits a small fix that would have completed the response.
Reasoning: High-level summary of why the feedback received this judgment
Key features: One entry per factor driving the judgment (length, number of distinct issues, clear priority, whether the student knows the next step), each marked met (1) / not met (0) with a justification
Proposed adjustment: Suggested moves to bring the feedback into a manageable range, for developers iterating on prompts or for teachers using the feedback instructionally
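The "four sentences or fewer" guideline can be approximated locally as a cheap pre-check. This sketch is an assumption-laden heuristic, not the evaluator's logic: `count_sentences` and `within_length_guideline` are hypothetical names, and the regex split miscounts abbreviations such as "etc." in the middle of a sentence. It covers length only and says nothing about the number of issues or their priority.

```python
import re

# "Generally four sentences or fewer", as described for a judgment of 1.
MAX_SENTENCES = 4


def count_sentences(text: str) -> int:
    """Rough count: split wherever ., ! or ? is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


def within_length_guideline(feedback_text: str) -> bool:
    """True when the rough sentence count is at or under the guideline."""
    return count_sentences(feedback_text) <= MAX_SENTENCES
```

Applied to the example feedback text, the count is two sentences, which agrees with the `length` justification in the example output.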
Example output
{
  "reasoning": "The student response gives a very minimal claim: AI-powered pets are a good alternative because they could help around the house, but it ends with \"etc.\" and does not include enough supporting evidence. The teacher feedback responds to that main gap by affirming the existing idea and asking for additional details from the article to strengthen the claim. For manageability, the feedback is short and focused on one main revision move: add more supporting details. It does not introduce multiple separate problems or a long list of changes. The priority is fairly clear because the only issue raised is strengthening the claim with evidence from the article. A student would likely know the next step is to go back to the article and add another supporting detail or details. Therefore, this feedback is manageable in amount and scope.",
  "key_features": {
    "length": {
      "met": 1,
      "justification": "The feedback is only two sentences, so it is comfortably within the usual manageable length guideline."
    },
    "number_of_distinct_issues": {
      "met": 1,
      "justification": "The feedback raises one distinct issue: the student needs to add more details from the article to support the claim."
    },
    "clear_priority": {
      "met": 1,
      "justification": "Because only one issue is raised, the priority is clear: strengthen the claim by adding textual details."
    },
    "student_knows_next_step": {
      "met": 1,
      "justification": "The student is directly asked to find other details from the article to make the claim stronger, so the immediate next step is apparent."
    }
  },
  "proposed_adjustment": "No adjustment needed; the feedback already appears manageable.",
  "manageable_score": 1
}
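An output shaped like the example above can be read with the standard `json` module. This is a minimal sketch that assumes the response arrives as a JSON string with the same keys as the example (`key_features`, `met`, `manageable_score`); the function names are hypothetical.

```python
import json


def is_manageable(raw_output: str) -> bool:
    """True when the evaluator's binary judgment is 1."""
    return json.loads(raw_output)["manageable_score"] == 1


def unmet_features(raw_output: str) -> list[str]:
    """Names of the key features marked not met (0)."""
    result = json.loads(raw_output)
    return [
        name
        for name, feature in result["key_features"].items()
        if feature["met"] == 0
    ]
```

For the example output, `is_manageable` returns `True` and `unmet_features` returns an empty list, since all four features are marked met.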
Whether the feedback is focused enough for the student to process and act on, with a synopsis of why it passed or failed
Evaluate whether your generated feedback sets students up to revise. Validate that your prompts produce focused feedback; aggregate reasoning across runs to detect drift toward longer, more exhaustive feedback.
Key features + Proposed adjustment
Whether a 0 stems from too much feedback or too little, plus concrete moves to fix it
Pinpoint and correct the specific failure mode. Require the model to prioritize the one or two highest-impact issues and cap the total it raises; target the specific failure direction.
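Aggregating across runs, as suggested above, can start with a pass rate and a count of which key features fail most often. The sketch below assumes each result is an already-parsed output dictionary with the keys from the example output; `summarize_runs` is a hypothetical helper, not part of the evaluator.

```python
from collections import Counter


def summarize_runs(results: list[dict]) -> dict:
    """Pass rate and per-feature failure counts over parsed evaluator outputs."""
    if not results:
        raise ValueError("results must not be empty")
    failures = Counter()
    for result in results:
        for name, feature in result["key_features"].items():
            if feature["met"] == 0:
                failures[name] += 1
    passed = sum(1 for result in results if result["manageable_score"] == 1)
    return {"pass_rate": passed / len(results), "failures": dict(failures)}
```

A falling pass rate alongside a growing `length` or `number_of_distinct_issues` count between prompt versions would be one sign of drift toward longer, more exhaustive feedback.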
This evaluator is provided as Early access. Reported metrics come from a small
held-out test split (15 examples) with wide confidence intervals and should be
read as directional. Validation testing is ongoing.
On this dimension, GEPA optimization improved GPT-5.4 markedly over its naive
baseline on the held-out test set (+26 points accuracy, +52 points macro-F1).
We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.
| Metric | Result |
| --- | --- |
| Accuracy | 93% (naive baseline: 67%) |
| Macro-F1 | 92% (naive baseline: 40%) |
GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions
Dataset source: Quill.org classroom writing data (39 labeled pairs; 12 train / 12 validation / 15 test), annotated by Leanlab Education
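To see why metrics from a 15-example test split are only directional, it helps to put an interval around the reported accuracy. The sketch below uses the Wilson score interval, a standard method chosen here for illustration; the source does not say how its confidence intervals were computed. It also assumes that 93% corresponds to 14 of 15 test examples judged correctly, which is an inference from the rounding and not a figure stated in the source.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Assumption: 93% accuracy on 15 examples means 14 correct.
low, high = wilson_interval(14, 15)
print(f"{low:.2f} to {high:.2f}")
```

Under these assumptions the interval runs from roughly 0.70 to 0.99, a span of nearly 30 points, which is consistent with reading the reported results as directional.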