Evaluator last updated June 24, 2026.
The Strength Acknowledgement feedback evaluator assesses whether a piece of feedback names something specific and authentic that the student did well.
At a glance
| | |
|---|---|
| Input type | Teacher feedback on a student writing response |
| Supported grades | 8–9 |
| Considers | - Whether praise is authentic or reflexive<br>- Whether feedback is specific or generic<br>- Whether feedback is anchored to evidence (i.e., a feature of the student’s response)<br>- Whether it uses process-vs.-trait framing (i.e., praising what the student did rather than a fixed ability)<br>- Whether acknowledgment is warranted (e.g., “IDK” should not draw false praise) |
Model and prompt
See Quickstart to run the evaluator.
| | |
|---|---|
| Model used | GPT-5.4 |
| Temperature | 1.0 |
| Optimization | DSPy + GEPA (Genetic-Pareto) prompt optimization against expert-annotated labels |
| Prompts | Available via the Learning Commons Portal |
GPT-5.4 was selected as the shipped model because it had the best mean
performance across the five feedback-suite dimensions. Other models and
configurations will produce different results and may have lower accuracy.
Inputs must be de-identified. Do not submit student PII or any regulated or
sensitive personal information.
| Input | Accepted content | Required |
|---|---|---|
| Student text | Student’s constructed-response writing | Yes |
| Feedback text | Teacher or AI-generated feedback to be evaluated | Yes |
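As a rough illustration of the input requirements above, the following sketch checks that both required fields are present and flags a couple of obvious identifier patterns before submission. The field names `student_text` and `feedback_text` are hypothetical (the actual input schema is not shown here), and the regular expressions are a crude screen, not a substitute for real de-identification.

```python
import re

# Hypothetical field names; the evaluator's real input schema may differ.
REQUIRED_FIELDS = ("student_text", "feedback_text")

# Crude patterns for obvious identifiers. This is NOT a de-identification tool.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def check_inputs(record: dict) -> list:
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty required field: {field}")
            continue
        if EMAIL_RE.search(value):
            problems.append(f"possible email address in {field}")
        if PHONE_RE.search(value):
            problems.append(f"possible phone number in {field}")
    return problems
```

A record that returns an empty list has only passed this minimal screen; names, school identifiers, and other sensitive details still need a proper de-identification step.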
Output
| Field | Description |
|---|---|
| Acknowledges strength score | Binary judgment (1 / 0). A 1 applies when feedback names a specific, accurate strength tied to what the student wrote — or appropriately refrains from praise when the response warrants none. A 0 signals generic, unanchored, or inaccurate praise. |
| Reasoning | High-level summary of why the feedback received this judgment |
| Key features | One entry per factor driving the judgment (presence of praise, specificity, anchoring to evidence, process-vs.-trait framing), each marked met (1) / not met (0) with a justification |
| Proposed adjustment | Suggested moves to strengthen the feedback’s acknowledgment, for developers iterating on prompts |
{
"reasoning": "The student response includes a concrete, relevant idea: AI-powered pets could help around the house. Although the phrase etc. is vague, the helping-around-the-house idea is a promising reason that could be developed with evidence. The teacher feedback explicitly acknowledges that specific idea by saying, You're right, the AI pets could help around the house. This is a direct, response-specific acknowledgment of a strength or valid point in the student's writing. The follow-up question then invites the student to strengthen the claim by adding more details from the article, which builds on that identified strength rather than ignoring it. Because the acknowledgment is specific, grounded in the student's actual wording, and framed as a next step for developing the idea, the feedback meets the criterion.",
"key_features": {
"presence_of_praise": {
"met": 1,
"justification": "The feedback explicitly affirms a specific student idea with \"You're right, the AI pets could help around the house.\" This is authentic acknowledgment rather than generic praise."
},
"specificity": {
"met": 1,
"justification": "The acknowledgment names the particular idea from the student response: that AI pets could help around the house. It is not vague or interchangeable with feedback on any essay."
},
"anchoring_to_evidence": {
"met": 1,
"justification": "The feedback is clearly tied to wording that appears in the student text, specifically the claim about helping around the house. It is grounded in the student's actual response."
},
"process_vs_trait_framing": {
"met": 1,
"justification": "The feedback focuses on the student's claim and a revision move—adding details from the article to strengthen it—rather than praising a fixed trait. This frames the strength as something to build on through writing process."
}
},
"proposed_adjustment": "No adjustment needed; the feedback already meets the criterion. If desired, the teacher could make it even stronger by naming one especially useful detail from the article as a model for expansion.",
"acknowledges_strength_score": 1
}
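One way to consume a result like the example above is sketched below. It assumes the evaluator returns a JSON string with exactly the field names shown in that example (`acknowledges_strength_score`, `key_features`, `met`, `proposed_adjustment`); the `summarize` helper and the abbreviated `SAMPLE` payload are illustrative and not part of the evaluator.

```python
import json

# Abbreviated copy of the example output, with justifications shortened.
SAMPLE = """
{
  "reasoning": "The feedback affirms a specific idea from the student response.",
  "key_features": {
    "presence_of_praise": {"met": 1, "justification": "Affirms a specific idea."},
    "specificity": {"met": 1, "justification": "Names the particular idea."},
    "anchoring_to_evidence": {"met": 1, "justification": "Tied to student wording."},
    "process_vs_trait_framing": {"met": 1, "justification": "Focuses on a revision move."}
  },
  "proposed_adjustment": "No adjustment needed.",
  "acknowledges_strength_score": 1
}
"""


def summarize(raw: str) -> dict:
    """Parse one evaluator result and list the key features marked not met."""
    result = json.loads(raw)
    score = result["acknowledges_strength_score"]
    if score not in (0, 1):
        raise ValueError(f"expected a binary score, got {score!r}")
    unmet = [
        name
        for name, feature in result["key_features"].items()
        if feature["met"] == 0
    ]
    return {
        "passed": score == 1,
        "unmet_features": unmet,
        "proposed_adjustment": result.get("proposed_adjustment", ""),
    }


summary = summarize(SAMPLE)
print(summary["passed"], summary["unmet_features"])
```

Keeping the unmet feature names separate from the overall score makes it easier to route a failing result to the right prompt fix.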
Interpreting results
| Output | What it means | How to use it |
|---|---|---|
| Score + Reasoning | Whether the feedback authentically names a specific strength, with a synopsis of why it passed or failed the acknowledgment threshold | Validate that your AI-generated feedback is building student awareness, not just adding polite filler. Ensure that it includes specific, earned praise. Aggregate reasoning across runs to detect drift toward generic and/or polite openers. |
| Key features + Proposed adjustment | The exact pattern driving the judgment (e.g., stock praise plus an immediate directive) and concrete moves to fix it | Pinpoint and correct the specific failure mode. Adjust prompts to require the model to name a specific feature of the student’s work before any directive, and target the missing feature directly. |
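The aggregation suggested in the table can be sketched as follows: compute the pass rate over a batch of results and count which key features most often fail. This assumes each result is a parsed dictionary with the field names from the output example; the `aggregate` helper is hypothetical.

```python
from collections import Counter


def aggregate(results: list) -> tuple:
    """Return (pass_rate, Counter of key features marked not met)."""
    if not results:
        raise ValueError("need at least one result to aggregate")
    passes = sum(r["acknowledges_strength_score"] for r in results)
    unmet = Counter(
        name
        for r in results
        for name, feature in r["key_features"].items()
        if feature["met"] == 0
    )
    return passes / len(results), unmet


# Toy batch of two results, trimmed to the fields used here.
batch = [
    {
        "acknowledges_strength_score": 1,
        "key_features": {"specificity": {"met": 1}, "anchoring_to_evidence": {"met": 1}},
    },
    {
        "acknowledges_strength_score": 0,
        "key_features": {"specificity": {"met": 0}, "anchoring_to_evidence": {"met": 1}},
    },
]

pass_rate, unmet_counts = aggregate(batch)
print(pass_rate, unmet_counts.most_common(1))
```

A falling pass rate alongside a rising `specificity` count across prompt versions would be one signal of drift toward generic openers.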
Accuracy and validation
This evaluator is provided as Early access. Reported metrics come from small
held-out test splits with wide confidence intervals and should be read as
directional. Validation testing is ongoing.
We assessed performance against an expert-annotated dataset of Quill.org
student-response and teacher-feedback pairs, labeled by Leanlab Education using
the Productive Coaching rubric.
| Metric | Description | Result |
|---|---|---|
| Accuracy | How accurately the evaluator’s acknowledges-strength score matches expert annotations on the held-out test split. | 71% (naive baseline: 74%) |
| Macro-F1 | Macro-F1 averaged over both classes on the held-out test split. | 69% (naive baseline: 73%) |
| Model selection signal | Mean optimized performance of GPT-5.4 across the 5 feedback dimensions, which drove model selection. | GPT-5.4 had the best mean optimized macro-F1 (≈0.81) and accuracy (≈0.82) across the five feedback dimensions |
| Dataset source | Expert-annotated Quill.org student-response and teacher-feedback pairs used for validation. | Quill.org classroom writing data (82 labeled pairs; 24 train / 24 validation / 34 test), annotated by Leanlab Education |
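For readers unfamiliar with the two headline metrics, here is a minimal sketch of how accuracy and macro-F1 are conventionally computed for binary labels. It is not the evaluation code used for the reported numbers, and the label lists are toy values invented for illustration.

```python
def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the expert labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: list, y_pred: list, cls: int) -> float:
    """F1 for one class, treating that class as the positive label."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: list, y_pred: list, classes=(0, 1)) -> float:
    """Unweighted mean of per-class F1, so both classes count equally."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels, not drawn from the validation dataset.
expert = [1, 1, 1, 0, 0, 1]
evaluator = [1, 1, 0, 0, 1, 1]

print(accuracy(expert, evaluator))   # 4 of 6 labels match
print(macro_f1(expert, evaluator))   # mean of F1 for class 1 (0.75) and class 0 (0.5)
```

Because macro-F1 weights both classes equally, it penalizes an evaluator that does well only on the more common label, which accuracy alone can hide.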
On this dimension, GEPA optimization did not improve GPT-5.4 over its naive
baseline on the held-out test set. GPT-5.4 was selected for its strong mean
performance across the full suite rather than its margin on this single
dimension.
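To make "wide confidence intervals" concrete, the sketch below computes a 95% Wilson score interval for the reported accuracy. It assumes the 71% figure corresponds to 24 correct out of the 34 test pairs (an inference from rounding, not a number stated in this document), and the Wilson interval is one common choice among several.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: 24 of 34 test items correct, which rounds to the reported 71%.
low, high = wilson_interval(24, 34)
print(f"{low:.2f} to {high:.2f}")
```

Under that assumption the interval runs from roughly 54% to 83%. The 74% naive baseline would correspond to 25 of 34, a one-item difference, which is consistent with reading these results as directional.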
Evaluator release history
| Date | Changed |
|---|---|
| June 24, 2026 | First release (v0.1). |