Actionability - AI Developer Tools for Education

Evaluator last updated June 24, 2026. The Actionability feedback evaluator assesses whether a piece of feedback gives the student a clear, usable next step they can reasonably act on without additional clarification.

At a glance


Input type	Teacher feedback on a student writing response
Supported grades	8–9
Considers	Presence of a directive verb (e.g., add, replace, clarify, explain, revise) or focused question that points somewhere specific; whether feedback goes beyond evaluative comments (e.g., “good job,” “this is incomplete”); clarity of the target (is it clear what should be revised?); specificity of the next move (a concrete directive rather than a vague instruction); whether the student could reasonably act without further clarification or more information; whether a next step is warranted at all (a response that didn’t require revision should not draw a forced directive)

Model and prompt

See Quickstart to run the evaluator.


Model used	GPT-5.4
Temperature	1.0
Optimization	DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels
Prompts	Available via the Learning Commons Portal

GPT-5.4 was selected as the shipped model because it had the best mean performance across the Feedback Quality Evaluator Suite. Other models and configurations will produce different results and may be less accurate.

Inputs

Inputs must be de-identified. Do not submit student PII or any regulated or sensitive personal information.

Input	Description	Required
Student text	Student’s constructed-response writing	Yes
Feedback text	Teacher or AI-generated feedback to be evaluated	Yes

Example input

{
  "student_text": "Some people think AI-powered pets are a good alternative to real pets because they could help around the house etc.",
  "feedback_text": "You're right, the AI pets could help around the house. Can you find some other details from the article that you could add to make your claim stronger?"
}
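
As a minimal sketch of preparing this input, the Python below assembles the two required fields and rejects empty values before serializing. The helper name `build_payload` is hypothetical and not part of the evaluator; how the payload is actually submitted is covered by the Quickstart, and de-identification still has to happen before this step.

```python
import json

# Field names taken from the example input above.
REQUIRED_FIELDS = ("student_text", "feedback_text")


def build_payload(student_text: str, feedback_text: str) -> str:
    """Return a JSON string with both required fields (hypothetical helper)."""
    payload = {"student_text": student_text, "feedback_text": feedback_text}
    for name in REQUIRED_FIELDS:
        value = payload[name]
        # Both inputs are marked as required, so reject missing or blank text.
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} is required and must be a non-empty string")
    return json.dumps(payload, ensure_ascii=False)
```

Checking the fields locally keeps malformed pairs out of an evaluation run, where they would otherwise be scored as if they were real feedback.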

Output

Field	Description
Actionable revision score	Binary judgment (1 / 0). A 1 applies when feedback names a concrete action the student can take — add, replace, clarify, explain, ask, revise — or appropriately omits a next step when the response required no revision. A 0 signals feedback that is purely evaluative, too vague to act on, or too thin to direct revision.
Reasoning	High-level summary of why the feedback received this judgment
Key features	One entry per factor driving the judgment (directive verb or focused question, clarity of the target, specificity of the next move, whether the student could reasonably act without further clarification), each marked met (1) / not met (0) with a justification
Proposed adjustment	Suggested moves to strengthen the next step, for developers iterating on prompts or for teachers using the directive instructionally

Example output

{
  "reasoning": "The student response includes a concrete, relevant idea: AI-powered pets \"could help around the house.\" Although the phrase \"etc.\" is vague, the helping-around-the-house idea is a promising reason that could be developed with evidence. The teacher feedback explicitly acknowledges that specific idea by saying, \"You're right, the AI pets could help around the house.\" This is a direct, response-specific acknowledgment of a strength or valid point in the student's writing. The follow-up question then invites the student to strengthen the claim by adding more details from the article, which builds on that identified strength rather than ignoring it. Because the acknowledgment is specific, grounded in the student's actual wording, and framed as a next step for developing the idea, the feedback meets the criterion.",
  "key_features": {
    "presence_of_praise": {
      "met": 1,
      "justification": "The feedback explicitly affirms a specific student idea with \"You're right, the AI pets could help around the house.\" This is authentic acknowledgment rather than generic praise."
    },
    "specificity": {
      "met": 1,
      "justification": "The acknowledgment names the particular idea from the student response: that AI pets could help around the house. It is not vague or interchangeable with feedback on any essay."
    },
    "anchoring_to_evidence": {
      "met": 1,
      "justification": "The feedback is clearly tied to wording that appears in the student text, specifically the claim about helping around the house. It is grounded in the student's actual response."
    },
    "process_vs_trait_framing": {
      "met": 1,
      "justification": "The feedback focuses on the student's claim and a revision move—adding details from the article to strengthen it—rather than praising a fixed trait. This frames the strength as something to build on through writing process."
    }
  },
  "proposed_adjustment": "No adjustment needed; the feedback already meets the criterion. If desired, the teacher could make it even stronger by naming one especially useful detail from the article as a model for expansion.",
  "acknowledges_strength_score": 1
}
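
A result shaped like the example above could be reduced to the pieces a pipeline usually needs, as in the sketch below. It assumes the score sits under a single top-level key ending in `_score`, as in the example; the exact key name is not confirmed here, and `summarize_result` is a hypothetical helper rather than part of the evaluator.

```python
import json
from typing import Any


def summarize_result(raw: str) -> dict[str, Any]:
    """Extract the score, unmet key features, and proposed adjustment."""
    result = json.loads(raw)
    # Assumption: exactly one top-level key ends in "_score".
    score_keys = [key for key in result if key.endswith("_score")]
    if len(score_keys) != 1:
        raise ValueError("expected exactly one top-level *_score key")
    score = result[score_keys[0]]
    if score not in (0, 1):
        raise ValueError("score must be the binary judgment 0 or 1")
    # Each key feature is marked met (1) or not met (0).
    unmet = [
        name
        for name, feature in result.get("key_features", {}).items()
        if feature.get("met") == 0
    ]
    return {
        "score": score,
        "unmet_features": unmet,
        "proposed_adjustment": result.get("proposed_adjustment", ""),
    }
```

Rejecting anything other than 0 or 1 surfaces malformed model output early instead of letting it skew downstream counts.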

Interpreting results

Output	What it means	How to use it
Score + Reasoning	Whether the feedback gives a clear, usable next step, with a synopsis of why it passed or failed the actionability threshold	Evaluate whether your generated feedback is actually driving revision. Validate that your prompts produce feedback students can act on; aggregate reasoning across runs to detect drift toward describe-only, evaluative feedback
Key features + Proposed adjustment	The exact pattern driving the judgment (e.g., evaluative-only with no directive) and concrete moves to fix it	Pinpoint and correct the specific failure. Require the model to include a directive verb or focused question before closing; surface 0-rated outputs to flag where students may need teacher follow-up
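
The aggregation described in the table could look roughly like the following sketch. It assumes each result is already parsed into a dictionary with a `key_features` mapping, and that the caller passes the name of the score key; `aggregate_runs` and its return fields are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import Any


def aggregate_runs(results: list[dict[str, Any]], score_key: str) -> dict[str, Any]:
    """Summarize a batch of evaluator outputs (hypothetical helper)."""
    if not results:
        raise ValueError("results must not be empty")
    scores = [result[score_key] for result in results]
    # Count how often each key feature was marked not met (0).
    unmet_counts = Counter(
        name
        for result in results
        for name, feature in result.get("key_features", {}).items()
        if feature.get("met") == 0
    )
    return {
        # Share of feedback items judged actionable (score 1).
        "pass_rate": sum(scores) / len(scores),
        "unmet_counts": dict(unmet_counts),
        # Indices of 0-rated outputs, where students may need teacher follow-up.
        "needs_follow_up": [i for i, score in enumerate(scores) if score == 0],
    }
```

Comparing `pass_rate` and `unmet_counts` between prompt versions is one way to notice drift toward evaluative-only feedback.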

Accuracy and validation

This evaluator is provided as Early access. Reported metrics come from a small held-out test split (19 examples) with wide confidence intervals and should be read as directional. Validation testing is ongoing.

We assessed performance against an expert-annotated dataset of Quill.org student-response and teacher-feedback pairs, labeled by Leanlab Education using the Productive Coaching rubric.

Metric	Description	Result
Accuracy	How accurately the evaluator’s actionable-revision score matches expert annotations on the held-out test split.	95% (naive baseline: 95%)
Macro-F1	Macro-F1 averaged over both classes on the held-out test split.	94% (naive baseline: 94%)
Model selection signal	Mean optimized performance of GPT-5.4 across the 5 feedback dimensions, which drove model selection.	GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions
Dataset source	Expert-annotated Quill.org student-response and teacher-feedback pairs used for validation.	Quill.org classroom writing data (45 labeled pairs; 13 train / 13 validation / 19 test), annotated by Leanlab Education

On this dimension, GEPA optimization did not improve GPT-5.4 over its naive baseline on the held-out test set — both reached 95% accuracy. GPT-5.4 was selected for its strong mean performance across the full suite rather than its margin on this single dimension.
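
To see why a 19-example test split yields wide confidence intervals, the sketch below computes a 95% Wilson score interval for a proportion. It assumes 18 of 19 test examples were scored correctly, which is an inference from the reported 95% accuracy and not a figure stated in this document; the Wilson method is chosen here for illustration and is not necessarily the one used in validation.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0 or not 0 <= successes <= total:
        raise ValueError("need total > 0 and 0 <= successes <= total")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Assumption: 18 correct out of 19 (about 95% accuracy).
low, high = wilson_interval(18, 19)
print(f"{low:.2f} to {high:.2f}")
```

Under that assumption the interval spans roughly 0.75 to 0.99, which is why the reported accuracy is best read as directional.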

Evaluator release history

Date	Changed
June 24, 2026	First release
