The problem

Edtech developers often wonder if the AI-generated feedback they’re delivering to students is actually good coaching. Feedback can be warm but generic, accurate but overwhelming, clear but off-task. Feedback quality encompasses multiple dimensions, making a single quality score for a piece of feedback misleading. As AI-generated feedback enters classrooms at scale, these dimensions can fail independently and quietly.

What we’re building

Our feedback evaluators surface feedback quality as a multidimensional profile rather than a single verdict. Edtech developers can see which qualities a piece of AI-generated feedback exhibits, which it’s missing, and what to adjust. This helps them measure, monitor, and improve the quality of educators’ feedback to students at scale. Our feedback evaluators are anchored in the Productive Coaching rubric developed by Quill.org and Leanlab Education, in partnership with Anastasiya A. Lipnevich.

| Evaluator | Description |
| --- | --- |
| Strength Acknowledgement | Determines whether feedback names something specific and authentic the student did well, distinct from generic or unanchored praise |
| Actionability | Determines whether feedback gives a clear, usable next step the student can act on without additional clarification |
| Student Response Anchor | Determines whether feedback is clearly based on the student’s specific response |
| Appropriateness | Determines whether feedback correctly identifies whether the student needed to revise |
| Manageability | Determines whether the amount of feedback is manageable for the student |
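
To make the idea of a profile concrete, here is a minimal sketch of how a developer might hold per-dimension results and read off which qualities are present or missing. The dimension names come from the table above; the `DimensionResult` class, the `summarize_profile` function, and the assumption that every dimension returns a binary score are hypothetical and are not the evaluators' actual output format.

```python
from dataclasses import dataclass

# Dimension names are taken from the evaluator table; the data shapes are hypothetical.
DIMENSIONS = (
    "Strength Acknowledgement",
    "Actionability",
    "Student Response Anchor",
    "Appropriateness",
    "Manageability",
)


@dataclass(frozen=True)
class DimensionResult:
    score: int      # assumed binary for illustration: 1 = present, 0 = missing
    rationale: str  # the evaluator's explanation for the score


def summarize_profile(profile: dict[str, DimensionResult]) -> dict[str, list[str]]:
    """Group dimensions by outcome instead of collapsing them into one score."""
    unknown = sorted(set(profile) - set(DIMENSIONS))
    if unknown:
        raise ValueError(f"Unknown dimensions: {unknown}")
    summary: dict[str, list[str]] = {"exhibited": [], "missing": [], "not_evaluated": []}
    for name in DIMENSIONS:
        result = profile.get(name)
        if result is None:
            summary["not_evaluated"].append(name)
        elif result.score == 1:
            summary["exhibited"].append(name)
        else:
            summary["missing"].append(name)
    return summary


# Made-up results for one piece of feedback (illustrative only).
example = {
    "Strength Acknowledgement": DimensionResult(0, "Praise is generic."),
    "Actionability": DimensionResult(1, "Names a concrete next step."),
    "Student Response Anchor": DimensionResult(1, "Quotes the student's sentence."),
}
print(summarize_profile(example))
```

Keeping the dimensions separate means a weak result on one (here, generic praise) stays visible even when the others look fine.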

Scope and limitations

Remember that LLM scores can vary across runs, especially on borderline cases. We recommend keeping a human in the loop and treating outputs as directional signals rather than definitive judgments.

| Limitation | Details |
| --- | --- |
| Validated grade band | Evaluators validated for grades 8-9 only |
| Validated task type | Limited to short claim-with-evidence “because” completions from Quill.org; generalization to other prompts, genres, or subject areas is unverified |
| Binary output | Some feedback evaluators (such as Actionability and Student Response Anchor) reduce a nuanced judgment to a binary 0 or 1 output, plus a rationale |

Feedback evaluator outputs should not be used for high-stakes applications like grading, assessment, or placement decisions.
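
One way to act on the run-to-run variability noted above is to score the same feedback several times and send disagreements to a person. The sketch below assumes a binary evaluator whose repeated scores have already been collected. The `aggregate_runs` function and the `min_agreement` threshold of 0.8 are hypothetical choices for illustration, not part of the evaluators themselves.

```python
from collections import Counter


def aggregate_runs(scores: list[int], min_agreement: float = 0.8) -> dict:
    """Summarize repeated binary scores and flag low-agreement cases for review.

    `scores` holds 0/1 outputs from repeated runs of one evaluator on the same
    feedback. `min_agreement` is an illustrative threshold, not a validated value.
    """
    if not scores:
        raise ValueError("Need at least one score")
    if any(s not in (0, 1) for s in scores):
        raise ValueError("Scores must be 0 or 1")
    majority, votes = Counter(scores).most_common(1)[0]
    agreement = votes / len(scores)
    return {
        "majority": majority,      # arbitrary on an exact tie, which is flagged anyway
        "agreement": agreement,    # share of runs that match the majority score
        "needs_human_review": agreement < min_agreement,
    }


# Stable case: every run agrees, so nothing is flagged.
print(aggregate_runs([1, 1, 1, 1, 1]))
# Borderline case: runs disagree, so a human should take a look.
print(aggregate_runs([1, 0, 1, 0, 1]))
```

The majority score remains a directional signal; the review flag is what keeps a human in the loop for borderline cases.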