We started with the vocabulary element of the SCASS rubric's language features dimension.
Language features

Slightly complex
  • Conventionality: Explicit, literal, straightforward, easy to understand.
  • Vocabulary: Contemporary, familiar, conversational language.
  • Sentence structure: Mainly simple sentences.

Moderately complex
  • Conventionality: Largely explicit and easy to understand, with some occasions for more complex meaning.
  • Vocabulary: Mostly contemporary, familiar, conversational; rarely overly academic.
  • Sentence structure: Primarily simple and compound sentences, with some complex structures.

Very complex
  • Conventionality: Fairly complex; contains some abstract, ironic, and/or figurative language.
  • Vocabulary: Fairly complex language that is sometimes unfamiliar, archaic, subject-specific, or overly academic.
  • Sentence structure: Many complex sentences with several subordinate phrases or clauses and transition words.

Exceedingly complex
  • Conventionality: Dense and complex; contains considerable abstract, ironic, and/or figurative language.
  • Vocabulary: Complex, generally unfamiliar, archaic, subject-specific, or overly academic language; may be ambiguous or purposefully misleading.
  • Sentence structure: Mainly complex sentences with several subordinate clauses or phrases and transition words; sentences often contain multiple concepts.
Taken from the SCASS rubric for qualitative text complexity from Student Achievement Partners (SAP).

Creating a machine-readable rubric

The SCASS rubric is a great starting point; however, the definitions of each dimension leave room for interpretation because of words like fairly, mostly, and overly. In addition, educators apply judgments when using the rubric that the original does not spell out. For example, educators know that if the surrounding content of a text helps a student work out a vocabulary word's meaning, that word contributes less to the text's complexity. Through several rounds of co-design with six experts from Student Achievement Partners (SAP) and Achievement Network (ANET) who support student literacy development, we created a framing for each level of complexity that both humans and machines can apply, which lets the evaluator score texts consistently and accurately.
1 (Slightly complex)
  • Rubric: Vocabulary that is almost entirely not complex: contemporary, conversational, and/or familiar. A very low proportion of complex words (archaic, subject-specific, academic) is acceptable; it does not need to be zero.
  • Summary: Overall, vocabulary is easy to understand and does not impede comprehension of the bulk of the text (including the main idea and supporting claims).

2 (Moderately complex)
  • Rubric: Vocabulary that is mostly not complex: contemporary, conversational, and/or familiar. A low proportion of complex words (archaic, subject-specific, academic) is acceptable.
  • Summary: Overall, vocabulary generally allows students to comprehend the bulk of the text with little difficulty, though there may be occasional pauses for clarification.

3 (Very complex)
  • Rubric: Vocabulary that is often complex: unfamiliar, archaic, subject-specific, and/or overly academic.
  • Summary: Overall, vocabulary often presents challenges that may slow down comprehension but does not completely block comprehension of the bulk of the text.

4 (Exceedingly complex)
  • Rubric: Vocabulary that is mostly complex: unfamiliar, archaic, subject-specific, and/or overly academic; may be ambiguous or purposefully misleading.
  • Summary: Overall, vocabulary is so complex that it makes comprehension of the bulk of the text very challenging and requires careful effort to interpret.
Adapted from the SCASS rubric for qualitative text complexity from Student Achievement Partners (SAP)
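
To make the "machine-readable" part concrete, here is a minimal sketch of how a rubric like the one above can be encoded as structured data and rendered into a scoring prompt. The field names, wording, and prompt format below are illustrative assumptions, not our production prompt.

```python
# Illustrative encoding of the reworked vocabulary rubric (level 1 shown; the
# other levels follow the same shape). Field names are hypothetical.
VOCAB_RUBRIC = {
    1: {
        "label": "Slightly complex",
        "rubric": (
            "Vocabulary that is almost entirely not complex: contemporary, "
            "conversational, and/or familiar. A very low proportion of complex "
            "words (archaic, subject-specific, academic) is acceptable."
        ),
        "summary": (
            "Vocabulary is easy to understand and does not impede comprehension "
            "of the bulk of the text."
        ),
    },
    # ... levels 2-4 follow the same structure ...
}

def render_rubric_prompt(text: str) -> str:
    """Render the rubric levels plus the passage into a single scoring prompt."""
    lines = ["Score the passage's vocabulary complexity on a 1-4 scale:"]
    for score, level in VOCAB_RUBRIC.items():
        lines.append(f"{score} ({level['label']}): {level['rubric']}")
    lines.append("\nPassage:\n" + text)
    lines.append("\nAnswer with a single integer from 1 to 4.")
    return "\n".join(lines)
```

Keeping the rubric as data rather than hard-coding it into prose makes it easy to revise level definitions with the experts without rewriting the evaluator itself.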

Human annotation

We then created a full, reliable benchmark dataset based on more than 580 text passages from the CLEAR corpus, which we then annotated for vocabulary complexity. The dataset consists of informational texts with a Flesch–Kincaid grade level below 9. This allowed us to capture texts that were roughly appropriate for students in grades 3 and 4, while also including a few more difficult texts rated "Very complex" or "Exceedingly complex." The dataset is composed of two parts:
  • Expert-annotated data: 
    • ~180 texts were annotated by at least two pedagogical experts from SAP and ANET.
    • If the two experts provided different scores, a third expert would also provide a score. 
    • In total, we worked with eight pedagogical experts from SAP and ANET, all of whom had prior experience in literacy or curriculum development.
  • Educator-annotated data: 
    • ~400 texts were annotated by educators who had passed a pre-test.
    • Each text was scored independently by three of these educators.
    • Educators were also given "honeypot" texts to score: texts on which two or more experts had agreed. We used these to track each educator's agreement with the experts.
    • We used the Dawid–Skene model to aggregate the three educator scores for each text into a final label, weighting each educator by their estimated reliability (a minimal sketch of this aggregation appears after this list).
    • We worked with 21 educators.
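
As a reference point for how the educator labels can be aggregated, here is a small, self-contained Dawid–Skene implementation (the classic EM formulation). This is a sketch under simplified assumptions, not our exact aggregation code; item and annotator identifiers in the toy example are hypothetical.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """labels: dict mapping item_id -> list of (annotator_id, label) pairs,
    with labels in {0, ..., n_classes - 1}. Returns item_id -> class posterior."""
    items = list(labels)
    annotators = sorted({a for pairs in labels.values() for a, _ in pairs})
    item_idx = {it: i for i, it in enumerate(items)}
    ann_idx = {a: i for i, a in enumerate(annotators)}
    n_items, n_ann = len(items), len(annotators)

    # Initialise item-class posteriors with a soft majority vote.
    T = np.zeros((n_items, n_classes))
    for it, pairs in labels.items():
        for _, lab in pairs:
            T[item_idx[it], lab] += 1.0
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and per-annotator confusion matrices.
        priors = T.sum(axis=0) / n_items
        conf = np.full((n_ann, n_classes, n_classes), 1e-6)  # smoothing
        for it, pairs in labels.items():
            for a, lab in pairs:
                conf[ann_idx[a], :, lab] += T[item_idx[it]]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute posteriors given priors and confusion matrices.
        logT = np.tile(np.log(priors + 1e-12), (n_items, 1))
        for it, pairs in labels.items():
            for a, lab in pairs:
                logT[item_idx[it]] += np.log(conf[ann_idx[a], :, lab])
        T = np.exp(logT - logT.max(axis=1, keepdims=True))
        T /= T.sum(axis=1, keepdims=True)

    return {it: T[item_idx[it]] for it in items}

# Toy usage: three hypothetical educators scoring two passages on the 1-4 scale
# (stored internally as 0-3).
example = {
    "passage_1": [("educator_a", 1), ("educator_b", 1), ("educator_c", 2)],
    "passage_2": [("educator_a", 0), ("educator_b", 0), ("educator_c", 0)],
}
posteriors = dawid_skene(example, n_classes=4)
final_scores = {it: int(p.argmax()) + 1 for it, p in posteriors.items()}
print(final_scores)
```

Compared with a plain majority vote, Dawid–Skene down-weights annotators whose scores systematically diverge, which is why it pairs naturally with the honeypot-based agreement tracking described above.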

Evaluator creation

Once we had a high-quality dataset, we developed the vocabulary evaluator to follow the annotators' approach. We used DSPy's GEPA (Genetic-Pareto) optimizer to run prompt optimization on the training set; the optimization explored more than 1,000 combinations of prompts, LLMs, and temperature settings before arriving at the final prompt with the highest accuracy. We then wrapped the final evaluator in LangChain for compatibility with the rest of our system.
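
For readers unfamiliar with this setup, here is a minimal sketch of what a DSPy program plus GEPA optimization can look like for this task. It is an assumption-laden illustration, not our shipped evaluator: the model names, signature fields, metric, and optimizer arguments are placeholders, and the exact GEPA arguments may differ across DSPy versions.

```python
import dspy

class VocabularyComplexity(dspy.Signature):
    """Rate the vocabulary complexity of a passage on the 1-4 rubric."""
    passage: str = dspy.InputField(desc="The text passage to evaluate.")
    score: int = dspy.OutputField(desc="1 (slightly) to 4 (exceedingly) complex.")

def agreement_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Exact-match agreement with the human consensus label; GEPA can also use
    # richer textual feedback, which we omit here for brevity.
    return float(int(pred.score) == int(gold.score))

# Hypothetical model choices for illustration only.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", temperature=0.0))
program = dspy.ChainOfThought(VocabularyComplexity)

trainset = [
    dspy.Example(passage="The cat sat on the mat.", score=1).with_inputs("passage"),
    # ... examples built from the annotated benchmark ...
]

optimizer = dspy.GEPA(
    metric=agreement_metric,
    auto="light",                            # search-budget preset
    reflection_lm=dspy.LM("openai/gpt-4o"),  # model that proposes prompt edits
)
optimized_program = optimizer.compile(program, trainset=trainset, valset=trainset)
print(optimized_program(passage="Photosynthesis converts light into energy.").score)
```

The sketch uses exact agreement as the optimization metric; a softer measure such as within-one-level agreement could be substituted, depending on how strictly the evaluator needs to match the human labels.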