
Original Qualitative Text Complexity rubric (SAP)

We started with the Qualitative Text Complexity rubric (SAP) and modified it to be more usable for annotation. Then, after data collection for grades 3-4, we performed extensive statistical analysis to identify the key sentence features that impact annotators’ scores. This resulted in a machine-compatible rubric, which we used to build our final evaluator.
Language features
| Complexity level | Conventionality | Vocabulary | Sentence structure |
| --- | --- | --- | --- |
| Slightly complex | Explicit, literal, straightforward, easy to understand. | Contemporary, familiar, conversational language. | Mainly simple sentences. |
| Moderately complex | Largely explicit and easy to understand, with some occasions for more complex meaning. | Mostly contemporary, familiar, conversational; rarely overly academic. | Primarily simple and compound sentences, with some complex structures. |
| Very complex | Fairly complex; contains some abstract, ironic, and/or figurative language. | Fairly complex language that is sometimes unfamiliar, archaic, subject-specific, or overly academic. | Many complex sentences with several subordinate phrases or clauses and transition words. |
| Exceedingly complex | Dense and complex; contains considerable abstract, ironic, and/or figurative language. | Complex, generally unfamiliar, archaic, subject-specific, or overly academic language; may be ambiguous or purposefully misleading. | Mainly complex sentences with several subordinate clauses or phrases and transition words; sentences often contain multiple concepts. |
Taken from the Qualitative Text Complexity rubric ↗ by Student Achievement Partners (SAP).

Modified Qualitative Text Complexity rubric with assumptions for annotation

With substantial input from experts, we created an adapted rubric with accompanying assumptions for annotators to use. Annotators provided their scores based on this rubric.

Modified rubric

| Complexity level | Sentence structure |
| --- | --- |
| Slightly complex | Mainly simple sentences; few sentences contain multiple concepts. |
| Moderately complex | Primarily simple and compound sentences, with some complex (or compound-complex) constructions; some sentences contain multiple concepts. |
| Very complex | Many complex (or compound-complex) constructions with several subordinate phrases or clauses and transition words; many sentences contain multiple concepts. |
| Exceedingly complex | Mainly complex (or compound-complex) constructions with several subordinate clauses, phrases, and transition words; most sentences contain multiple concepts. |

Annotation assumptions

  • Student assumption: The student is on grade level and proficient in all core content areas, including reading fluency, comprehension, science, and social studies. The student is moving through a common progression of topics, is fluent in English, and is in the middle of the academic year.
  • Text assumption: The text is for independent reading/work, without direct instruction.
  • What to score vs. ignore: When scoring sentence structure complexity, ignore all other text features such as vocabulary, background knowledge, and topic. A text may be less readable for a student because of vocabulary and background knowledge but still be composed of mostly simple sentences; in that case, the sentence structure is still Slightly Complex.
  • Relative complexity: Sentence structure complexity is relative to the grade level of the student.

Expert annotation

We created a full, reliable benchmark dataset based on 500+ text passages from the CLEAR corpus ↗, which we then annotated for sentence structure complexity. The dataset consists of informational texts with a Flesch–Kincaid grade level below 9. This allowed us to capture texts appropriate for students in grades 3 and 4, while also including a few more difficult texts that earned “Very Complex” and “Exceedingly Complex” ratings. The dataset is composed of two parts:
  • Expert-annotated data
    • ~80 texts (160 rows of data) were annotated by at least two pedagogical experts from SAP and ANET.
    • If the two experts disagreed, a third expert provided an additional score.
    • In total, we worked with eight pedagogical experts from SAP and ANET, all of whom had prior experience in literacy or curriculum development.
  • Educator-annotated data
    • ~400 texts (800 rows of data) were annotated by three educators who had passed a pre-test.
    • Educators were also given “honeypot” texts to score: texts on which at least two experts had agreed, which we used to track each educator’s agreement with the experts.
    • We used the Dawid–Skene model to calibrate each text’s final score from the educators’ annotations (see the sketch after this list).
    • In total, we worked with 21 educators.
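
For concreteness, here is a minimal sketch of Dawid–Skene aggregation via expectation–maximization, which jointly infers a posterior over each text’s true label and a confusion matrix per annotator. This is an illustrative implementation, not our production pipeline; the input format (one (item, annotator, label) triple per annotation) and all names are assumptions.

```python
import numpy as np

def dawid_skene(items, annotators, labels, n_classes, n_iter=100, tol=1e-6):
    """EM for the Dawid-Skene model: returns per-item label posteriors T
    and per-annotator confusion matrices."""
    items, annotators, labels = map(np.asarray, (items, annotators, labels))
    n_items, n_annot = items.max() + 1, annotators.max() + 1

    # Initialize posteriors with per-item vote proportions.
    T = np.zeros((n_items, n_classes))
    for i, l in zip(items, labels):
        T[i, l] += 1
    T /= T.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and annotator confusion matrices,
        # weighted by the current posteriors.
        priors = np.clip(T.mean(axis=0), 1e-9, None)
        conf = np.full((n_annot, n_classes, n_classes), 1e-9)
        for i, a, l in zip(items, annotators, labels):
            conf[a, :, l] += T[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: recompute posteriors in log space for numerical stability.
        log_T = np.tile(np.log(priors), (n_items, 1))
        for i, a, l in zip(items, annotators, labels):
            log_T[i] += np.log(conf[a, :, l])
        new_T = np.exp(log_T - log_T.max(axis=1, keepdims=True))
        new_T /= new_T.sum(axis=1, keepdims=True)

        if np.abs(new_T - T).max() < tol:
            return new_T, conf
        T = new_T
    return T, conf

# Example: three educators score two texts on the 4-level rubric (0-3).
T, conf = dawid_skene(
    items=[0, 0, 0, 1, 1, 1],
    annotators=[0, 1, 2, 0, 1, 2],
    labels=[1, 1, 2, 0, 0, 0],
    n_classes=4,
)
final_scores = T.argmax(axis=1)  # calibrated label per text
```

Taking `T.argmax(axis=1)` gives the calibrated label for each text, while the learned confusion matrices indicate how reliably each educator reproduces each rubric level.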

Machine-compatible rubric development

After we finalized the benchmark dataset, we conducted extensive data analysis:
  • We calculated the F-statistic for each of 30+ sentence features, which allowed us to identify the features that matter most for a text’s sentence structure complexity.
  • We used tree-based models to identify the thresholds that made a text fall within a particular category (e.g., average words per sentence < 12 for “Slightly Complex”).
This analysis led to our machine-compatible rubric.
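
The sketch below illustrates both analysis steps on a toy dataset. The feature names, the DataFrame layout, and the synthetic data are placeholders, not our actual schema; only the method (one-way ANOVA F-statistics to rank features, a shallow decision tree to read off category thresholds) mirrors the analysis described above.

```python
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy stand-in for the annotated benchmark: one row per text,
# numeric sentence features plus a complexity label.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "avg_words_per_sentence": rng.normal(14, 4, 400),
    "subordinate_clauses_per_sentence": rng.normal(0.8, 0.3, 400),
})
df["complexity"] = pd.qcut(
    df["avg_words_per_sentence"], 4,
    labels=["slightly", "moderately", "very", "exceedingly"])

# Step 1: rank features by their one-way ANOVA F-statistic
# across the four complexity levels.
feature_cols = [c for c in df.columns if c != "complexity"]
f_stats = {}
for col in feature_cols:
    groups = [g[col].to_numpy()
              for _, g in df.groupby("complexity", observed=True)]
    f_stats[col] = f_oneway(*groups).statistic
top_features = sorted(f_stats, key=f_stats.get, reverse=True)

# Step 2: fit a shallow tree on the top features; its split points
# suggest rubric thresholds (e.g., average words per sentence < 12
# for "Slightly Complex").
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20)
tree.fit(df[top_features], df["complexity"])
print(export_text(tree, feature_names=top_features))
```

Reading the printed tree top-down yields candidate thresholds of the kind cited above, which can then be written into a machine-compatible rubric.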

Evaluator creation

Finally, we developed the Sentence Structure Evaluator. We tested different permutations of prompts, models, and temperature settings, looked for patterns in the errors, and tuned our approach accordingly. In total, we ran over 150 experiments across 4 models, more than 20 prompts, and a range of temperature settings to define this evaluator.
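
A stripped-down version of that experiment loop might look like the following. The model names, the prompt registry, and the `call_model` wrapper are hypothetical stand-ins for whichever LLM client is in use; the point is the grid over configurations and the agreement metric scored against the benchmark labels.

```python
from itertools import product

MODELS = ["model-a", "model-b"]               # placeholder model IDs
TEMPERATURES = [0.0, 0.3, 0.7]
RUBRIC_PROMPTS = {"v1": "...", "v2": "..."}   # elided prompt variants

def run_experiments(texts, gold_labels, call_model):
    """Score every (model, prompt, temperature) combination against the
    benchmark and return configurations sorted by agreement."""
    results = []
    for model, (prompt_id, prompt), temp in product(
            MODELS, RUBRIC_PROMPTS.items(), TEMPERATURES):
        preds = [
            call_model(model=model, prompt=prompt, text=t, temperature=temp)
            for t in texts
        ]
        agreement = sum(p == g for p, g in zip(preds, gold_labels)) / len(gold_labels)
        results.append({"model": model, "prompt": prompt_id,
                        "temperature": temp, "agreement": agreement})
    return sorted(results, key=lambda r: -r["agreement"])
```

Between rounds, the error patterns of each configuration (which rubric levels it confused, and on which texts) informed the next prompt revision, per the process described above.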