Evaluator last updated June 22, 2026. The Math Alignment evaluator checks a math question against a standard’s individual learning components — not just against the standard’s label. It reports which of a standard’s components the question actually measures.
The Math Alignment evaluator judges whether a question is the right math for a standard. The Math Visual Correctness evaluator (coming soon) judges whether a math visual is mathematically correct.

At a glance

Input type: Math assessment question
Supported subject: Math only
Supported grades: K–12
Jurisdictions: Multi-State/CCSS, all 50 states, plus DC

Model and prompt

For instructions on running the evaluator, see Quickstart.
Model used: claude-haiku-4-5-20251001
Temperature: 0 (fixed)
Prompts: View prompts
The prompt was optimized with GEPA (Genetic-Pareto) via the DSPy framework and is tuned for the model and temperature listed above; results may vary with other models or parameters.

Inputs

Inputs must be de-identified. Do not submit student PII or any regulated or sensitive personal information.
| Input | Description | Required |
| --- | --- | --- |
| Math question | Full assessment prompt/instructions students see, K–12 math | Yes |
| Jurisdiction | Multi-State/CCSS, a state adoption, or DC | Yes |
| Standard(s) | One or more standard codes from the selected jurisdiction | Yes |
| Grade | Filters the standards list | No |
| Coarse filter | Pre-screens relevance before full evaluation, useful at scale | No |
The evaluator supports 3 modes:
| Mode | Description |
| --- | --- |
| Single check | One question against one standard |
| Batch evaluation | Set of question-standard pairs (e.g. tagging validation), run as a full question × standard cross-product |
| By-grade evaluation | Question bank against every standard for a grade, for whole-grade coverage analysis |
Use Batch and By-grade evaluation to surface which standards (and which learning components) are fully, partially, or not covered by a given question bank.
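As a rough illustration of what a full question × standard cross-product means for Batch evaluation, the sketch below pairs every question with every standard. The question IDs, the second standard code, and the `build_pairs` helper are hypothetical and are not part of the evaluator's interface.

```python
from itertools import product

def build_pairs(questions, standards):
    """Return every (question, standard) pair: the full cross-product."""
    return list(product(questions, standards))

# Hypothetical identifiers, for illustration only.
questions = ["q1", "q2", "q3"]
standards = ["3.MD.C.7", "3.OA.A.1"]

pairs = build_pairs(questions, standards)
print(len(pairs))  # 3 questions x 2 standards = 6 evaluations
```

Because the number of evaluations grows as questions × standards, the optional coarse filter is likely most useful in exactly this kind of run.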

Output

The evaluator reduces alignment to a binary judgment (plus rationale) per learning component, and is not validated for grading, assessment, or placement decisions. Treat outputs as directional signals, and keep a human in the loop, especially for borderline cases.
| Field | Description |
| --- | --- |
| statementCode | The standard evaluated. |
| learningComponents | One entry per learning component of the standard. |
| learningComponents[].description | Learning component description, sourced from the Knowledge Graph. |
| learningComponents[].reasoning | What the component requires, what the question actually asks, and why that is or is not alignment. |
| learningComponents[].aligned | true if aligned, false if not. |
| learningComponents[].feedback | Revision guidance when not aligned; brief confirmation when aligned. |
| alignedCount | Count of learning components the question aligns to. |
| totalCount | Total learning components for the standard. |
| coarseFiltered | true if the standard was excluded by the coarse filter (batch/by-grade modes only). |
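A minimal sketch of how a result with these fields might be handled, assuming it arrives as a plain dictionary. The field names come from the table above; every value is a placeholder, the two helper functions are hypothetical, and the real response shape may differ (for example when `coarseFiltered` is true).

```python
# Illustrative result only: field names follow the table above,
# but all values are placeholders rather than real evaluator output.
result = {
    "statementCode": "3.MD.C.7",
    "learningComponents": [
        {
            "description": f"(placeholder component {i})",
            "reasoning": "(placeholder)",
            "aligned": i == 1,  # only the first component is marked aligned
            "feedback": f"(placeholder feedback {i})",
        }
        for i in range(1, 5)
    ],
    "alignedCount": 1,
    "totalCount": 4,
    "coarseFiltered": False,
}

def counts_are_consistent(res):
    """Check that the summary counts match the per-component entries."""
    components = res["learningComponents"]
    aligned = sum(1 for c in components if c["aligned"])
    return res["alignedCount"] == aligned and res["totalCount"] == len(components)

def unaligned_feedback(res):
    """Collect revision guidance for components the question does not meet."""
    return [c["feedback"] for c in res["learningComponents"] if not c["aligned"]]

print(counts_are_consistent(result))
print(unaligned_feedback(result))
```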

Interpreting results

The evaluator assesses standards alignment based on how many of a standard’s learning components a question meets:
| Level | Meaning |
| --- | --- |
| Fully aligned | The question meets every learning component of the standard. |
| Partially aligned | The question meets some, but not all, learning components. |
| Not aligned | The question meets none of the standard’s learning components. |
Example: A question that asks students to find the area of a rectangle by multiplying its side lengths is commonly tagged to Common Core 3.MD.C.7. However, that question meets only one of the 4 learning components that make up that standard. At the parent-code level the question looks aligned; at the learning-component level, it covers a quarter of the standard.
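The three levels can be derived from the `alignedCount` and `totalCount` output fields. The sketch below is one way to do that; the `alignment_level` function is a hypothetical helper, not something the evaluator provides.

```python
def alignment_level(aligned_count, total_count):
    """Map component counts to the alignment level described above."""
    if total_count <= 0 or not 0 <= aligned_count <= total_count:
        raise ValueError("counts must satisfy 0 <= aligned_count <= total_count, total_count > 0")
    if aligned_count == total_count:
        return "Fully aligned"
    if aligned_count == 0:
        return "Not aligned"
    return "Partially aligned"

# The rectangle-area example: 1 of 4 learning components met.
level = alignment_level(1, 4)
coverage = 1 / 4
print(level, coverage)
```

For the 3.MD.C.7 example this gives "Partially aligned" with a coverage of 0.25, matching the "quarter of the standard" reading.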

Accuracy and validation

This evaluator is provided as Early access. Comprehensive accuracy measures are still evolving, and validation testing is ongoing.
The prompt was optimized using GEPA via DSPy on a stratified split of 2,011 question-learning component pairs (724 train / 482 validation / 805 test). These were drawn from Illustrative Mathematics v.360 cool-down questions and annotated by 3 human experts. The GEPA-optimized prompt was evaluated on the held-out test split, and separately reviewed by math experts in a sniff test of brand-new questions.
| Metric | Result |
| --- | --- |
| Accuracy (held-out test set) | 73% against expert-annotated pairs |
| Sniff test pass rate (answer + reasoning) | 78% (29 of 37 sampled outputs passed; threshold 60%) |
| Dataset source | Illustrative Mathematics 360 (CC-BY-NC 4.0) |
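The reported figures can be checked with simple arithmetic. The snippet below uses only the numbers stated above; the variable names are illustrative.

```python
# Split sizes as reported: train / validation / test.
train, validation, test = 724, 482, 805
total = train + validation + test  # should match the reported 2,011 pairs

# Share of each split, as a fraction of all pairs.
shares = {name: n / total for name, n in
          [("train", train), ("validation", validation), ("test", test)]}

# Sniff test: 29 of 37 sampled outputs passed, against a 60% threshold.
passed, sampled, threshold = 29, 37, 0.60
pass_rate = passed / sampled

print(total)
print({name: round(share, 2) for name, share in shares.items()})
print(round(pass_rate * 100), pass_rate >= threshold)
```

The split works out to roughly 36% train, 24% validation, and 40% test, and 29 of 37 rounds to the reported 78%.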
The evaluator was designed and validated on Illustrative Mathematics 360 cool-down questions, which are not evenly distributed across grades K–12 — performance may vary by grade. The evaluation set also may not fully represent the range of math inputs developers could submit, so brand-new or unusual inputs carry more risk of poor results.

Evaluator release history

| Date | Changes |
| --- | --- |
| June 30, 2026 | First release |