Introduction - AI Developer Tools for Education

The problem

Edtech developers often wonder if the AI-generated feedback they’re delivering to students is actually good coaching. Feedback can be warm but generic, accurate but overwhelming, clear but off-task.

Feedback quality encompasses multiple dimensions, making a single quality score for a piece of feedback misleading. As AI-generated feedback enters classrooms at scale, these dimensions can fail independently and quietly.

What we’re building

Our feedback evaluators surface feedback quality as a multidimensional profile, rather than a verdict. Edtech developers can see the qualities that a piece of AI-generated feedback exhibits, which it’s missing, and what to adjust. This helps them measure, monitor, and improve the quality of educators’ feedback to students at scale.

Our feedback evaluators are anchored in the Productive Coaching rubric developed by Quill.org and Leanlab Education, in partnership with Anastasiya A. Lipnevich.

Evaluator	Description
Strength Acknowledgement	Determines whether feedback names something specific and authentic the student did well, distinct from generic or unanchored praise
Actionability	Determines whether feedback gives a clear, usable next step the student can act on without additional clarification
Student Response Anchor	Determines whether feedback is clearly based on the student’s specific response
Appropriateness	Determines whether feedback correctly identifies whether the student needed to revise
Manageability	Determines whether the amount of feedback is manageable for the student
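
One way to picture a multidimensional profile is as a mapping from evaluator name to a result. The sketch below is illustrative only: the `EvaluatorResult` class, the `missing_qualities` helper, and the example scores and rationales are hypothetical and do not reflect the evaluators' actual API or output. It also assumes, for simplicity, that every evaluator returns a binary score plus a rationale.

```python
from dataclasses import dataclass

# Evaluator names come from the table above; everything else is hypothetical.
EVALUATORS = (
    "Strength Acknowledgement",
    "Actionability",
    "Student Response Anchor",
    "Appropriateness",
    "Manageability",
)


@dataclass(frozen=True)
class EvaluatorResult:
    score: int      # assumed binary: 1 = quality present, 0 = quality absent
    rationale: str  # short explanation returned alongside the score


def missing_qualities(profile: dict[str, EvaluatorResult]) -> list[str]:
    """Return the evaluators that scored 0, in the order of the table above."""
    return [
        name
        for name in EVALUATORS
        if name in profile and profile[name].score == 0
    ]


# Made-up profile for feedback that is warm but generic.
profile = {
    "Strength Acknowledgement": EvaluatorResult(0, "Praise is generic."),
    "Actionability": EvaluatorResult(1, "Gives a clear next step."),
    "Student Response Anchor": EvaluatorResult(0, "Does not cite the response."),
    "Appropriateness": EvaluatorResult(1, "Correctly asks for a revision."),
    "Manageability": EvaluatorResult(1, "One suggestion only."),
}

print(missing_qualities(profile))
# prints ['Strength Acknowledgement', 'Student Response Anchor']
```

Averaging these five scores would give 0.6, which hides which two dimensions failed. Keeping the per-evaluator results is what makes the profile actionable.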

Scope and limitations

Remember that LLM scores can vary across runs, especially on borderline cases. We recommend keeping a human in the loop and treating outputs as directional signals rather than definitive judgments.
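
As a rough illustration of treating scores as directional, the sketch below summarizes repeated binary scores from a single evaluator on the same piece of feedback and flags low-agreement cases for human review. The `summarize_runs` function and the 0.8 agreement threshold are assumptions made for this example. They are not part of the evaluators, and the threshold is not a validated cutoff.

```python
from collections import Counter


def summarize_runs(scores: list[int], min_agreement: float = 0.8) -> dict:
    """Summarize repeated binary scores for one evaluator on one piece of feedback.

    `min_agreement` is an arbitrary illustrative threshold, not a validated value.
    """
    if not scores:
        raise ValueError("need at least one score")
    counts = Counter(scores)
    majority_score, votes = counts.most_common(1)[0]
    agreement = votes / len(scores)
    return {
        "majority_score": majority_score,
        "agreement": agreement,
        # Low agreement suggests a borderline case worth a human look.
        "needs_human_review": agreement < min_agreement,
    }


# Hypothetical scores from five runs of the same evaluator.
stable = summarize_runs([1, 1, 1, 1, 1])
borderline = summarize_runs([1, 0, 1, 1, 0])
print(stable["needs_human_review"], borderline["needs_human_review"])
# prints False True
```

Routing only the low-agreement cases to a reviewer is one possible design. Depending on the stakes, sampling some high-agreement cases for review may also be worthwhile.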

Limitation	Details
Validated grade band	Evaluators validated for grades 8-9 only
Validated task type	Limited to short claim-with-evidence “because” completions from Quill.org; generalization to other prompts, genres, or subject areas is unverified
Binary output	Some feedback evaluators (such as Actionability and Student Response Anchor) reduce a nuanced judgment to a binary 0 or 1 output, plus a rationale

Feedback evaluator outputs should not be used for high-stakes applications like grading, assessment, or placement decisions.
