Math Alignment - AI Developer Tools for Education

Evaluator last updated June 22, 2026. The Math Alignment evaluator checks a math question against a standard’s individual learning components — not just against the standard’s label. It reports which of a standard’s components the question actually measures.

The Math Alignment evaluator judges whether a question is the right math for a standard. The Math Visual Correctness evaluator (coming soon) judges whether a math visual is mathematically correct.

At a glance


Input type	Math assessment question
Supported subject	Math only
Supported grades	K–12
Jurisdictions	Multi-State/CCSS, all 50 states, plus DC

Model and prompt

For instructions on running the evaluator, see Quickstart.


Model used	claude-haiku-4-5-20251001
Temperature	0 (fixed)
Prompts	View prompts ↗

The prompt was optimized with GEPA (Genetic-Pareto) ↗ via the DSPy framework ↗ and is tuned for the model and temperature listed above; results may vary with other models or parameters.

Inputs

Inputs must be de-identified. Do not submit student PII or any regulated or sensitive personal information.

Input	Description	Required
Math question	Full assessment prompt/instructions students see, K–12 math	Yes
Jurisdiction	Multi-State/CCSS, a state adoption, or DC	Yes
Standard(s)	One or more standard codes from the selected jurisdiction	Yes
Grade	Filters the standards list	No
Coarse filter	Pre-screens relevance before full evaluation, useful at scale	No

The evaluator supports 3 modes:

Mode	Description
Single check	One question against one standard
Batch evaluation	Set of question-standard pairs (e.g. tagging validation), run as a full question × standard cross-product
By-grade evaluation	Question bank against every standard for a grade, for whole-grade coverage analysis
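To make the cross-product in Batch evaluation concrete, here is a minimal sketch of how the number of question-standard pairs grows. The question IDs and the choice of standard codes are illustrative assumptions, and the snippet only enumerates pairs; it does not call the evaluator.

```python
from itertools import product

# Hypothetical inputs: the question IDs and the chosen codes are illustrative only.
questions = ["q1", "q2", "q3"]
standards = ["3.MD.C.7", "3.OA.A.1"]

# Batch evaluation: every question is paired with every standard.
batch_pairs = list(product(questions, standards))

# Single check is the smallest case: one question against one standard.
single_pair = (questions[0], standards[0])

print(len(batch_pairs))  # 3 questions x 2 standards = 6 pairs
```

Because the pair count is the product of the two list sizes, it grows quickly for large question banks, which is where the optional coarse filter is likely to help.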

Use Batch and By-grade evaluation to surface which standards (and which learning components) are fully, partially, or not covered by a given question bank.
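One way to turn Batch or By-grade results into a coverage summary is sketched below. It assumes each result is a dict using the output field names documented on this page, that a learning component counts as covered when at least one question aligns to it, and that component descriptions are stable enough to use as keys. The function name, the coverage labels, and the `STD.*` codes are hypothetical.

```python
from collections import defaultdict


def summarize_coverage(results):
    """Label each standard as fully, partially, or not covered by a bank."""
    covered = defaultdict(set)  # statementCode -> descriptions of covered components
    totals = {}                 # statementCode -> totalCount
    for result in results:
        code = result["statementCode"]
        totals[code] = result["totalCount"]
        # Assumption: a coarse-filtered result may carry no component entries.
        for component in result.get("learningComponents", []):
            if component["aligned"]:
                covered[code].add(component["description"])

    summary = {}
    for code, total in totals.items():
        hit = len(covered[code])
        if total > 0 and hit == total:
            summary[code] = "fully covered"
        elif hit > 0:
            summary[code] = "partially covered"
        else:
            summary[code] = "not covered"
    return summary


# Hypothetical results for two questions against standard STD.1, one against STD.2.
bank_results = [
    {"statementCode": "STD.1", "totalCount": 2, "learningComponents": [
        {"description": "component A", "aligned": True},
        {"description": "component B", "aligned": False}]},
    {"statementCode": "STD.1", "totalCount": 2, "learningComponents": [
        {"description": "component A", "aligned": False},
        {"description": "component B", "aligned": True}]},
    {"statementCode": "STD.2", "totalCount": 1, "learningComponents": [
        {"description": "component C", "aligned": False}]},
]

coverage = summarize_coverage(bank_results)
print(coverage)
```

In this sketch no single question fully covers STD.1, but the two questions together do, which is the bank-level view this kind of analysis is meant to give.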

Output

The evaluator reduces alignment to a binary judgment (plus rationale) per learning component, and is not validated for grading, assessment, or placement decisions. Treat outputs as directional signals, and keep a human in the loop, especially for borderline cases.

Field	Description
`statementCode`	The standard evaluated.
`learningComponents`	One entry per learning component of the standard.
`learningComponents[].description`	Learning component description, sourced from the Knowledge Graph.
`learningComponents[].reasoning`	What the component requires, what the question actually asks, and why that is or is not alignment.
`learningComponents[].aligned`	`true` if aligned, `false` if not.
`learningComponents[].feedback`	Revision guidance when not aligned; brief confirmation when aligned.
`alignedCount`	Count of learning components the question aligns to.
`totalCount`	Total learning components for the standard.
`coarseFiltered`	`true` if the standard was excluded by the coarse filter (batch/by-grade modes only).
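As a rough illustration of the fields above, the sketch below models one result as Python typed dictionaries and checks that the counts agree with the per-component entries. The field names come from the table; the Python types, the example values, the `STD.1` code, and the `check_counts` helper are assumptions made for illustration, and the actual response may differ (for example, `coarseFiltered` may be absent in Single check mode).

```python
from typing import List, TypedDict


class LearningComponentResult(TypedDict):
    description: str
    reasoning: str
    aligned: bool
    feedback: str


class AlignmentResult(TypedDict):
    statementCode: str
    learningComponents: List[LearningComponentResult]
    alignedCount: int
    totalCount: int
    coarseFiltered: bool


def check_counts(result: AlignmentResult) -> bool:
    """Return True if the counts match the per-component entries."""
    components = result["learningComponents"]
    aligned = sum(1 for c in components if c["aligned"])
    return result["totalCount"] == len(components) and result["alignedCount"] == aligned


# Hypothetical result with placeholder text; not real evaluator output.
example: AlignmentResult = {
    "statementCode": "STD.1",
    "learningComponents": [
        {"description": "component A", "reasoning": "placeholder",
         "aligned": True, "feedback": "placeholder"},
        {"description": "component B", "reasoning": "placeholder",
         "aligned": False, "feedback": "placeholder"},
    ],
    "alignedCount": 1,
    "totalCount": 2,
    "coarseFiltered": False,
}

print(check_counts(example))  # True
```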

Interpreting results

The evaluator assesses standards alignment based on how many of a standard’s learning components a question meets:

Level	Meaning
Fully aligned	The question meets every learning component of the standard.
Partially aligned	The question meets some, but not all, learning components.
Not aligned	The question meets none of the standard’s learning components.

Example: A question that asks students to find the area of a rectangle by multiplying its side lengths is commonly tagged to Common Core 3.MD.C.7. However, that question meets only one of the 4 learning components that make up that standard. At the parent-code level the question looks aligned; at the learning-component level, it covers a quarter of the standard.
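The three levels are not listed among the output fields, so a client would likely derive them from `alignedCount` and `totalCount`. The following is a minimal sketch of that mapping under that assumption, applied to the 3.MD.C.7 example (1 of 4 components met). The function name is hypothetical.

```python
def alignment_level(aligned_count: int, total_count: int) -> str:
    """Map the two counts to the level names used in the table above."""
    if total_count <= 0:
        raise ValueError("total_count must be positive")
    if not 0 <= aligned_count <= total_count:
        raise ValueError("aligned_count must be between 0 and total_count")
    if aligned_count == total_count:
        return "Fully aligned"
    if aligned_count > 0:
        return "Partially aligned"
    return "Not aligned"


# The rectangle-area question meets 1 of the 4 components of 3.MD.C.7.
level = alignment_level(1, 4)
share = 1 / 4
print(level, f"{share:.0%}")  # Partially aligned 25%
```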

Accuracy and validation

This evaluator is provided as Early access. Comprehensive accuracy measures are still evolving, and validation testing is ongoing.

The prompt was optimized using GEPA ↗ via DSPy ↗ on a stratified split of 2,011 question-learning component pairs (724 train / 482 validation / 805 test). These were drawn from Illustrative Mathematics v.360 ↗ cool-down questions and annotated by 3 human experts. The GEPA-optimized prompt was evaluated on the held-out test split, and separately reviewed by math experts on a sniff test of brand-new questions.

Metric	Result
Accuracy (held-out test set)	73% against expert-annotated pairs
Sniff test pass rate (answer + reasoning)	78% (29 of 37 sampled outputs passed; threshold 60%)
Dataset source	Illustrative Mathematics 360 (CC-BY-NC 4.0)
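The figures above can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers reported in this section; the variable names are illustrative.

```python
# Split sizes reported for the 2,011 question-learning component pairs.
train, validation, test = 724, 482, 805
total = train + validation + test
print(total)  # 2011

# Share of each split, rounded to whole percentages.
shares = {name: f"{n / total:.0%}"
          for name, n in [("train", train), ("validation", validation), ("test", test)]}
print(shares)  # {'train': '36%', 'validation': '24%', 'test': '40%'}

# Sniff test: 29 of 37 sampled outputs passed, against a 60% threshold.
passed, sampled = 29, 37
pass_rate = passed / sampled
print(f"{pass_rate:.0%}", pass_rate >= 0.60)  # 78% True
```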

The evaluator was designed and validated on Illustrative Mathematics 360 cool-down questions, which are not evenly distributed across grades K–12 — performance may vary by grade. The evaluation set also may not fully represent the range of math inputs developers could submit, so brand-new or unusual inputs carry more risk of poor results.

Evaluator release history

Date	Changes
June 30, 2026	First release

