The problem
Edtech developers often wonder if the AI-generated feedback they’re delivering to students is actually good coaching. Feedback can be warm but generic, accurate but overwhelming, clear but off-task.
Feedback quality encompasses multiple dimensions, making a single quality score for a piece of feedback misleading. As AI-generated feedback enters classrooms at scale, these dimensions can fail independently and quietly.
What we’re building
Our feedback evaluators surface feedback quality as a multidimensional profile rather than a single verdict. Edtech developers can see which qualities a piece of AI-generated feedback exhibits, which ones it is missing, and what to adjust. This helps them measure, monitor, and improve the quality of educators’ feedback to students at scale.
Our feedback evaluators are anchored in the Productive Coaching rubric developed by Quill.org and Leanlab Education, in partnership with Anastasiya A. Lipnevich.
| Evaluator | Description |
|---|---|
| Strength Acknowledgement | Determines whether feedback names something specific and authentic the student did well, distinct from generic or unanchored praise |
| Actionability | Determines whether feedback gives a clear, usable next step the student can act on without additional clarification |
| Student Response Anchor | Determines whether feedback is clearly based on the student’s specific response |
| Appropriateness | Determines whether feedback correctly identifies whether the student needed to revise |
| Manageability | Determines whether the amount of feedback is manageable for the student |
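As a rough illustration of what a profile (as opposed to a single score) might look like in code, here is a minimal Python sketch. Only the five evaluator names come from the table above. The `DimensionResult` class, the `missing_dimensions` helper, the example rationales, and the choice to treat every dimension as a binary 0/1 are hypothetical simplifications, not the evaluators' actual output format.

```python
from dataclasses import dataclass

# Evaluator names are taken from the table above. Everything else in this
# sketch (class, fields, helper, sample rationales) is hypothetical.
DIMENSIONS = (
    "Strength Acknowledgement",
    "Actionability",
    "Student Response Anchor",
    "Appropriateness",
    "Manageability",
)


@dataclass(frozen=True)
class DimensionResult:
    score: int      # assumed binary: 1 = quality present, 0 = quality missing
    rationale: str  # short explanation accompanying the score


def missing_dimensions(profile: dict[str, DimensionResult]) -> list[str]:
    """Return the dimensions a piece of feedback fails, in table order."""
    return [name for name in DIMENSIONS if profile[name].score == 0]


# Made-up profile for one piece of feedback: strong on three dimensions,
# weak on two. A single averaged score would hide which two.
profile = {name: DimensionResult(1, "Illustrative pass.") for name in DIMENSIONS}
profile["Actionability"] = DimensionResult(0, "No concrete next step is given.")
profile["Manageability"] = DimensionResult(0, "Asks for several revisions at once.")

print(missing_dimensions(profile))
```

Keeping the per-dimension results separate is what lets a developer see what to adjust, since an aggregate of the same data would report only that something is off.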
Scope and limitations
LLM scores can vary across runs, especially on borderline cases. We recommend keeping a human in the loop and treating outputs as directional signals rather than definitive judgments.
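One way to act on this, sketched below in Python, is to score the same piece of feedback several times and route low-agreement cases to a person. The `summarize_runs` function and the `min_agreement` threshold of 0.8 are assumptions made for illustration; they are not part of the evaluators, and a suitable threshold would need to be chosen for your own data.

```python
from collections import Counter


def summarize_runs(scores: list[int], min_agreement: float = 0.8) -> dict:
    """Summarize repeated binary scores from one evaluator on one piece of feedback.

    `min_agreement` is a hypothetical cutoff: when the share of runs that
    agree with the majority falls below it, the case is flagged for review.
    """
    if not scores:
        raise ValueError("need at least one score")
    majority, count = Counter(scores).most_common(1)[0]
    agreement = count / len(scores)
    return {
        "majority": majority,
        "agreement": agreement,
        "needs_human_review": agreement < min_agreement,
    }


# Illustrative inputs only: five runs that agree, and five that split 3 to 2.
stable = summarize_runs([1, 1, 1, 1, 1])
borderline = summarize_runs([1, 0, 1, 0, 1])
print(stable["needs_human_review"], borderline["needs_human_review"])
```

The majority value is still only a directional signal. The flag simply marks where a human reviewer's attention is most likely to be needed.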
| Limitation | Details |
|---|---|
| Validated grade band | Evaluators validated for grades 8-9 only |
| Validated task type | Limited to short claim-with-evidence “because” completions from Quill.org – generalization to other prompts, genres, or subject areas is unverified |
| Binary output | Some feedback evaluators (such as Actionability and Student Response Anchor) reduce a nuanced judgment to a binary 0 or 1 output, plus a rationale. |
Feedback evaluator outputs should not be used for high-stakes applications like grading, assessment, or placement decisions.