Evaluator last updated June 24, 2026.

The Manageability feedback evaluator assesses whether a piece of feedback is focused enough that a student can realistically process and use it. Manageable feedback is concise, clear, and prioritizes no more than 2 issues, so that a student reading it knows immediately what to do first.

At a glance

Input type: Teacher feedback on a student writing response
Supported grades: 8–9
Considers
  • Generally no more than 4 sentences
  • Generally no more than 2 issues raised, with a clear priority order
  • Whether it includes small fixes that would have led to a complete success
  • Whether the student would know immediately what to do first and could act in one revision attempt
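
The length guideline in the list above can be approximated locally before a call to the evaluator. The sketch below is only a rough pre-check and is not how the evaluator reaches its judgment, which comes from the model. `count_sentences` and `exceeds_length_guideline` are hypothetical helper names, and the punctuation-based split is a naive assumption that will miscount abbreviations such as "etc." in mid-sentence.

```python
import re

# "Generally no more than 4 sentences", taken from the list above.
MAX_SENTENCES = 4


def count_sentences(text: str) -> int:
    """Naive sentence count: split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([p for p in parts if p])


def exceeds_length_guideline(feedback_text: str) -> bool:
    """True when the feedback is longer than the usual length guideline."""
    return count_sentences(feedback_text) > MAX_SENTENCES


feedback = (
    "You're right, the AI pets could help around the house. "
    "Can you find some other details from the article that you could add "
    "to make your claim stronger?"
)
print(count_sentences(feedback), exceeds_length_guideline(feedback))
```

A check like this can flag obviously long feedback cheaply, but it says nothing about the number of issues raised or their priority, so it does not replace the evaluator.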

Model and prompt

See Quickstart to run the evaluator.
Model used: GPT-5.4
Temperature: 1.0
Optimization: DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels
Prompts: Available via the Learning Commons Portal
GPT-5.4 was selected as the shipped model for its best mean performance across the Feedback Quality Evaluator Suite. Other models and configurations will produce different results and may have lower accuracy.

Inputs

Inputs must be de-identified. Do not submit student PII or any regulated or sensitive personal information.
Requirement | Supported | Required
Student text | Student’s constructed-response writing | Yes
Feedback text | Teacher or AI-generated feedback to be evaluated | Yes
Example input
{
  "student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
  "feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
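
Both fields are required, so it can help to validate a payload before submitting it. The following is a minimal sketch: `build_payload` is a hypothetical helper, not part of the evaluator, and it only checks that each field is a non-empty string. It does not perform de-identification, which must happen before this step.

```python
import json

# Field names taken from the example input above.
REQUIRED_FIELDS = ("student_text", "feedback_text")


def build_payload(student_text: str, feedback_text: str) -> str:
    """Return the input as a JSON string, rejecting missing or blank fields."""
    payload = {"student_text": student_text, "feedback_text": feedback_text}
    for name in REQUIRED_FIELDS:
        value = payload[name]
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"'{name}' is required and must be a non-empty string")
    # ensure_ascii=False keeps curly quotes and other non-ASCII text readable.
    return json.dumps(payload, ensure_ascii=False, indent=2)


print(build_payload(
    "Some people think AI-powered pets are a good alternative to real pets.",
    "Can you find some other details from the article to support your claim?",
))
```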

Output

Field | Description
Manageable score | Binary judgment (1 / 0). A 1 applies when feedback is focused enough that a student could realistically process and act on it in a single revision attempt (generally four sentences or fewer, addressing no more than two clearly prioritized issues), or when the response required no revision and the feedback stays appropriately brief. A 0 signals feedback that is too dense, too long, or raises too many issues at once, or feedback so thin it omits a small fix that would have completed the response.
Reasoning | High-level summary of why the feedback received this judgment
Key features | One entry per factor driving the judgment (length, number of distinct issues, clear priority, whether the student knows the next step), each marked met (1) / not met (0) with a justification
Proposed adjustment | Suggested moves to bring the feedback into a manageable range, for developers iterating on prompts or for teachers using the feedback instructionally
Example output
{
  "reasoning": "The student response gives a very minimal claim: AI-powered pets are a good alternative because they could help around the house, but it ends with \"etc.\" and does not include enough supporting evidence. The teacher feedback responds to that main gap by affirming the existing idea and asking for additional details from the article to strengthen the claim. For manageability, the feedback is short and focused on one main revision move: add more supporting details. It does not introduce multiple separate problems or a long list of changes. The priority is fairly clear because the only issue raised is strengthening the claim with evidence from the article. A student would likely know the next step is to go back to the article and add another supporting detail or details. Therefore, this feedback is manageable in amount and scope.",
  "key_features": {
    "length": {
      "met": 1,
      "justification": "The feedback is only two sentences, so it is comfortably within the usual manageable length guideline."
    },
    "number_of_distinct_issues": {
      "met": 1,
      "justification": "The feedback raises one distinct issue: the student needs to add more details from the article to support the claim."
    },
    "clear_priority": {
      "met": 1,
      "justification": "Because only one issue is raised, the priority is clear: strengthen the claim by adding textual details."
    },
    "student_knows_next_step": {
      "met": 1,
      "justification": "The student is directly asked to find other details from the article to make the claim stronger, so the immediate next step is apparent."
    }
  },
  "proposed_adjustment": "No adjustment needed; the feedback already appears manageable.",
  "manageable_score": 1
}
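
An output in this shape can be reduced to the score plus the list of features that were not met. The sketch below assumes the field names shown in the example output. `summarize_result` is a hypothetical helper, and the sample result is made up for illustration rather than produced by the evaluator.

```python
import json

# Feature keys as they appear in the example output above.
EXPECTED_FEATURES = (
    "length",
    "number_of_distinct_issues",
    "clear_priority",
    "student_knows_next_step",
)


def summarize_result(raw: str) -> dict:
    """Parse one evaluator output and list the key features that were not met."""
    result = json.loads(raw)
    score = result["manageable_score"]
    if score not in (0, 1):
        raise ValueError(f"unexpected manageable_score: {score!r}")
    features = result.get("key_features", {})
    # A feature that is absent from the output is treated as not met.
    unmet = [
        name for name in EXPECTED_FEATURES
        if features.get(name, {}).get("met") != 1
    ]
    return {
        "manageable": score == 1,
        "unmet_features": unmet,
        "proposed_adjustment": result.get("proposed_adjustment", ""),
    }


# Made-up failing result, for illustration only.
sample = json.dumps({
    "reasoning": "The feedback raises five separate issues.",
    "key_features": {
        "length": {"met": 0, "justification": "Seven sentences."},
        "number_of_distinct_issues": {"met": 0, "justification": "Five issues."},
        "clear_priority": {"met": 0, "justification": "No ordering given."},
        "student_knows_next_step": {"met": 1, "justification": "First step is stated."},
    },
    "proposed_adjustment": "Keep the two highest-impact issues.",
    "manageable_score": 0,
})
print(summarize_result(sample))
```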

Interpreting results

Output | What it means | How to use it
Score + Reasoning | Whether the feedback is focused enough for the student to process and act on, with a synopsis of why it passed or failed | Evaluate whether your generated feedback sets students up to revise. Validate that your prompts produce focused feedback; aggregate reasoning across runs to detect drift toward longer, more exhaustive feedback.
Key features + Proposed adjustment | Whether a 0 stems from too much feedback or too little, plus concrete moves to fix it | Pinpoint and correct the specific failure mode. Require the model to prioritize the one or two highest-impact issues and cap the total it raises; target the specific failure direction.
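
One way to aggregate across runs is to track the pass rate and count which key features fail most often among the 0-scored results. This is a sketch under assumptions: `aggregate` is a hypothetical helper, the results are assumed to be already-parsed outputs in the shape of the example output, and the three sample results are invented.

```python
from collections import Counter


def aggregate(results: list[dict]) -> dict:
    """Pass rate over all runs, plus unmet-feature counts among failing runs."""
    runs = len(results)
    passed = sum(1 for r in results if r.get("manageable_score") == 1)
    unmet = Counter(
        name
        for r in results
        if r.get("manageable_score") == 0
        for name, feature in r.get("key_features", {}).items()
        if feature.get("met") == 0
    )
    return {
        "runs": runs,
        "pass_rate": passed / runs if runs else 0.0,
        "unmet_counts": dict(unmet),
    }


# Invented results, for illustration only.
results = [
    {"manageable_score": 1, "key_features": {"length": {"met": 1}}},
    {"manageable_score": 0, "key_features": {"length": {"met": 0},
                                             "clear_priority": {"met": 0}}},
    {"manageable_score": 0, "key_features": {"length": {"met": 0},
                                             "clear_priority": {"met": 1}}},
]
print(aggregate(results))
```

A rising count for `length` over successive prompt versions would be one signal of drift toward longer, more exhaustive feedback.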

Accuracy and validation

This evaluator is provided as Early access. Reported metrics come from a small held-out test split (15 examples) with wide confidence intervals and should be read as directional. Validation testing is ongoing.
On this dimension, GEPA optimization improved GPT-5.4 markedly over its naive baseline on the held-out test set (+26 points accuracy, +52 points macro-F1).
We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.
Metric | Result
Accuracy | 93% (naive baseline: 67%)
Macro-F1 | 92% (naive baseline: 40%)
Dataset source | Quill.org classroom writing data (39 labeled pairs; 12 train / 12 validation / 15 test), annotated by Leanlab Education
GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions.
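
To make "wide confidence intervals" concrete, the sketch below computes accuracy, macro-F1, and a 95% Wilson score interval for a proportion. Two assumptions are worth flagging. First, 14 correct out of 15 is inferred here from the rounded 93% figure and is not stated in the source. Second, the label vectors are hypothetical and only show how accuracy and macro-F1 can diverge on a small, imbalanced split; they are not the evaluator's test data and make no claim about how the naive baseline behaved.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, defined as 0.0 when the class is never seen or predicted."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1, so the minority class counts equally."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for k successes in n trials (z=1.96 for about 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical labels: 10 positives, 5 negatives, and a predictor that always says 1.
y_true = [1] * 10 + [0] * 5
y_pred = [1] * 15
print(round(accuracy(y_true, y_pred), 2), round(macro_f1(y_true, y_pred), 2))

# Assumed count: 14 of 15 correct, inferred from the rounded 93% accuracy.
low, high = wilson_interval(14, 15)
print(round(low, 2), round(high, 2))
```

Under these assumptions the interval runs from roughly 0.70 to 0.99, which is why the reported figures are best read as directional.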

Evaluator release history

Date | Changed
June 24, 2026 | First release