Manageability - AI Developer Tools for Education

Evaluator last updated June 24, 2026. The Manageability feedback evaluator assesses whether a piece of feedback is focused enough that a student can realistically process and use it. Manageable feedback is concise, clear, and prioritizes no more than 2 issues, so that a student reading it knows immediately what to do first.

At a glance


Input type	Teacher feedback on a student writing response
Supported grades	8–9
Considers	Generally no more than 4 sentences; generally no more than 2 issues raised, with a clear priority order; whether it includes small fixes that would have led to a complete success; whether the student would know immediately what to do first and could act in one revision attempt
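As a rough illustration of the length guideline only, the sketch below counts sentences in a piece of feedback before it is sent to the evaluator. It is not the evaluator's method: the function name `count_sentences`, the constant `MAX_SENTENCES`, and the punctuation-based splitting rule are all assumptions made for this example, and the evaluator itself weighs the other factors listed above.

```python
import re

# Assumed threshold, taken from the "generally no more than 4 sentences" guideline.
MAX_SENTENCES = 4


def count_sentences(text: str) -> int:
    """Naive sentence count: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


feedback = (
    "You're right, the AI pets could help around the house. "
    "Can you find some other details from the article that you could add "
    "to make your claim stronger?"
)

sentence_count = count_sentences(feedback)
within_length_guideline = sentence_count <= MAX_SENTENCES
print(sentence_count, within_length_guideline)
```

A check like this can only flag obviously long feedback cheaply; it says nothing about the number of issues, their priority, or whether the next step is clear.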

Model and prompt

See Quickstart to run the evaluator.


Model used	GPT-5.4
Temperature	1.0
Optimization	DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels
Prompts	Available via the Learning Commons Portal

GPT-5.4 was selected as the shipped model because it had the best mean performance across the Feedback Quality Evaluator Suite. Other models and configurations will produce different results and may be less accurate.

Inputs

Inputs must be de-identified. Do not submit student PII or any regulated or sensitive personal information.

Requirement	Supported	Required
Student text	Student’s constructed-response writing	Yes
Feedback text	Teacher or AI-generated feedback to be evaluated	Yes

Example input

{
  "student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
  "feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
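A minimal sketch of assembling this input in Python, assuming only the two required fields shown above. The helper name `build_manageability_input` is hypothetical and is not part of any published SDK; how the payload is actually submitted is covered by the Quickstart, not here.

```python
import json


def build_manageability_input(student_text: str, feedback_text: str) -> dict:
    """Return the input object, rejecting missing or blank required fields."""
    fields = {"student_text": student_text, "feedback_text": feedback_text}
    for name, value in fields.items():
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be a non-empty string")
    return fields


payload = build_manageability_input(
    student_text=(
        "Some people think AI-powered pets are a good alternative to real pets "
        "because they could help around the house etc."
    ),
    feedback_text=(
        "You're right, the AI pets could help around the house. Can you find "
        "some other details from the article that you could add to make your "
        "claim stronger?"
    ),
)
print(json.dumps(payload, indent=2))
```

This check covers presence only. De-identification still has to happen before the text reaches this step.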

Output

Field	Description
Manageable score	Binary judgment (1 / 0). A 1 applies when feedback is focused enough that a student could realistically process and act on it in a single revision attempt — generally four sentences or fewer, addressing no more than two clearly prioritized issues — or when the response required no revision and the feedback stays appropriately brief. A 0 signals feedback that is too dense, too long, or raises too many issues at once, or feedback so thin it omits a small fix that would have completed the response.
Reasoning	High-level summary of why the feedback received this judgment
Key features	One entry per factor driving the judgment (length, number of distinct issues, clear priority, whether the student knows the next step), each marked met (1) / not met (0) with a justification
Proposed adjustment	Suggested moves to bring the feedback into a manageable range, for developers iterating on prompts or for teachers using the feedback instructionally

Example output

{
  "reasoning": "The student response gives a very minimal claim: AI-powered pets are a good alternative because they could help around the house, but it ends with \"etc.\" and does not include enough supporting evidence. The teacher feedback responds to that main gap by affirming the existing idea and asking for additional details from the article to strengthen the claim. For manageability, the feedback is short and focused on one main revision move: add more supporting details. It does not introduce multiple separate problems or a long list of changes. The priority is fairly clear because the only issue raised is strengthening the claim with evidence from the article. A student would likely know the next step is to go back to the article and add another supporting detail or details. Therefore, this feedback is manageable in amount and scope.",
  "key_features": {
    "length": {
      "met": 1,
      "justification": "The feedback is only two sentences, so it is comfortably within the usual manageable length guideline."
    },
    "number_of_distinct_issues": {
      "met": 1,
      "justification": "The feedback raises one distinct issue: the student needs to add more details from the article to support the claim."
    },
    "clear_priority": {
      "met": 1,
      "justification": "Because only one issue is raised, the priority is clear: strengthen the claim by adding textual details."
    },
    "student_knows_next_step": {
      "met": 1,
      "justification": "The student is directly asked to find other details from the article to make the claim stronger, so the immediate next step is apparent."
    }
  },
  "proposed_adjustment": "No adjustment needed; the feedback already appears manageable.",
  "manageable_score": 1
}
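One way to consume this output is to pull out the key features that were not met, so a 0 can be traced to a specific factor. The sketch below assumes the field names shown in the example output; the function name `unmet_features` and the abridged sample result are illustrative only.

```python
def unmet_features(result: dict) -> list:
    """Return the names of key features marked not met (met == 0)."""
    return [
        name
        for name, feature in result["key_features"].items()
        if feature["met"] == 0
    ]


# Abridged, hypothetical result in the same shape as the example output.
sample_result = {
    "reasoning": "Short and focused on one revision move.",
    "key_features": {
        "length": {"met": 1, "justification": "Two sentences."},
        "number_of_distinct_issues": {"met": 1, "justification": "One issue."},
        "clear_priority": {"met": 1, "justification": "Only one issue raised."},
        "student_knows_next_step": {"met": 1, "justification": "Direct request."},
    },
    "proposed_adjustment": "No adjustment needed.",
    "manageable_score": 1,
}

print(sample_result["manageable_score"], unmet_features(sample_result))
```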

Interpreting results

Output	What it means	How to use it
Score + Reasoning	Whether the feedback is focused enough for the student to process and act on, with a synopsis of why it passed or failed	Evaluate whether your generated feedback sets students up to revise. Validate that your prompts produce focused feedback; aggregate reasoning across runs to detect drift toward longer, more exhaustive feedback
Key features + Proposed adjustment	Whether a 0 stems from too much feedback or too little, plus concrete moves to fix it	Pinpoint and correct the specific failure mode. Require the model to prioritize the one or two highest-impact issues and cap the total it raises; target the specific failure direction
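To watch for drift across many runs, the scores and unmet key features can be aggregated. The following is a sketch under the assumption that each result follows the output shape documented above; `summarize_runs` and the sample results are hypothetical.

```python
from collections import Counter


def summarize_runs(results: list) -> dict:
    """Compute the share of manageable results and count unmet key features."""
    total = len(results)
    passed = sum(result["manageable_score"] for result in results)
    unmet = Counter(
        name
        for result in results
        for name, feature in result["key_features"].items()
        if feature["met"] == 0
    )
    return {
        "pass_rate": passed / total if total else 0.0,
        "unmet_counts": dict(unmet),
    }


# Hypothetical results from three evaluator runs (justifications omitted).
runs = [
    {"manageable_score": 1,
     "key_features": {"length": {"met": 1}, "number_of_distinct_issues": {"met": 1}}},
    {"manageable_score": 0,
     "key_features": {"length": {"met": 0}, "number_of_distinct_issues": {"met": 0}}},
    {"manageable_score": 0,
     "key_features": {"length": {"met": 0}, "number_of_distinct_issues": {"met": 1}}},
]

summary = summarize_runs(runs)
print(summary)
```

A falling pass rate combined with a rising count for `length` or `number_of_distinct_issues` would suggest that generated feedback is becoming longer or more exhaustive.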

Accuracy and validation

This evaluator is provided as Early access. Reported metrics come from a small held-out test split (15 examples) with wide confidence intervals and should be read as directional. Validation testing is ongoing. On this dimension, GEPA optimization improved GPT-5.4 markedly over its naive baseline on the held-out test set (+26 points accuracy, +52 points macro-F1).
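To give a sense of how wide the interval is with 15 examples, the sketch below computes a 95% Wilson score interval. It assumes the reported 93% accuracy corresponds to 14 of 15 test examples classified correctly. That count is inferred from the percentage, not stated in the source, and the source does not say which interval method was used.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 14 correct out of 15 (about 93%).
low, high = wilson_interval(14, 15)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 70% to 99%, which is why the reported figures are best read as directional.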

We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.

Metric	Result
Accuracy	93% (naive baseline: 67%)
Macro-F1	92% (naive baseline: 40%)
Cross-dimension comparison	GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions
Dataset source	Quill.org classroom writing data (39 labeled pairs; 12 train / 12 validation / 15 test), annotated by Leanlab Education
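The results above report both accuracy and macro-F1. The sketch below shows how the two can diverge on a binary label, using an invented 15-example set and a predictor that always outputs 1. The labels and predictions are hypothetical, chosen only for illustration, and are not the evaluator's test data.

```python
def f1_for_class(y_true: list, y_pred: list, cls: int) -> float:
    """F1 for one class, defined as 0.0 when the class has no TP, FP or FN."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: list, y_pred: list, classes=(0, 1)) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical labels: 10 manageable (1) and 5 not manageable (0).
y_true = [1] * 10 + [0] * 5
# Hypothetical predictor that always answers 1.
y_pred = [1] * 15

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred), 3))
```

Because macro-F1 averages the per-class scores without weighting, a predictor that never identifies one class is penalized heavily even when its accuracy looks acceptable.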

Evaluator release history

Date	Changed
June 24, 2026	First release
