Original Qualitative Text Complexity rubric (SAP)
We started with the Qualitative Text Complexity rubric (SAP) and modified it to be more usable for annotation. Then, after data collection for grades 3-4, we performed extensive statistical analysis to identify the key sentence features that impact annotators' scores. This resulted in a machine-compatible rubric, which we used to build our final evaluator.

Language features

| | Conventionality | Vocabulary | Sentence structure |
|---|---|---|---|
| Slightly complex | Explicit, literal, straightforward, easy to understand. | Contemporary, familiar, conversational language. | Mainly simple sentences. |
| Moderately complex | Largely explicit and easy to understand, with some occasions for more complex meaning. | Mostly contemporary, familiar, conversational; rarely overly academic. | Primarily simple and compound sentences, with some complex structures. |
| Very complex | Fairly complex; contains some abstract, ironic, and/or figurative language. | Fairly complex language that is sometimes unfamiliar, archaic, subject-specific, or overly academic. | Many complex sentences with several subordinate phrases or clauses and transition words. |
| Exceedingly complex | Dense and complex; contains considerable abstract, ironic, and/or figurative language. | Complex, generally unfamiliar, archaic, subject-specific, or overly academic language; may be ambiguous or purposefully misleading. | Mainly complex sentences with several subordinate clauses or phrases and transition words; sentences often contain multiple concepts. |
Modified Qualitative Text Complexity rubric with assumptions for annotation
With substantial input from experts, we created an adapted rubric with accompanying assumptions for annotators to use. Annotators provided their scores based on this rubric.

Modified rubric
| Slightly complex | Moderately complex | Very complex | Exceedingly complex |
|---|---|---|---|
| Mainly simple sentences; few sentences contain multiple concepts. | Primarily simple and compound sentences, with some complex (or compound-complex) constructions; some sentences contain multiple concepts. | Many complex (or compound-complex) constructions with several subordinate phrases or clauses and transition words; many sentences contain multiple concepts. | Mainly complex (or compound-complex) constructions with several subordinate clauses, phrases, and transition words; most sentences contain multiple concepts. |
Annotation assumptions
| Student assumption | Text assumption | What to score vs ignore | Sentence structure complexity is relative |
|---|---|---|---|
| The student is on grade level and proficient in all core content areas, including reading fluency, comprehension, science, & social studies. The student is moving through a common progression of topics. The student is fluent in English. The student is in the middle of the academic year. | The text is for independent reading/work, without direct instruction. | When scoring Sentence Structure complexity, please ignore all other text features such as vocabulary, background knowledge, topic, etc. A text may be less readable for a student because of vocabulary and background knowledge, but still be composed of mostly simple sentences. In this case, the sentence structure is still Slightly Complex. | Sentence Structure complexity is relative to the grade level of the student. |
Expert annotation
We created a full, reliable benchmark dataset based on 500+ text passages from the CLEAR corpus, each annotated for sentence structure complexity. The dataset consists of informational texts with a Flesch–Kincaid grade level below 9. This allowed us to capture texts that were appropriate for students in grades 3 and 4, while also including a few more difficult texts with "Very Complex" and "Exceedingly Complex" ratings. The dataset is composed of two parts:

- Expert-annotated data
- ~80 texts (160 rows of data) were annotated by at least two pedagogical experts from SAP and ANET.
- If the two experts provided different scores, a third expert would also provide a score.
- In total, we worked with eight pedagogical experts from SAP and ANET, all of whom had prior experience in literacy or curriculum development.
- Educator-annotated data
- ~400 texts (800 rows of data) were annotated by three educators who had passed a pre-test.
- Educators were also given some "honeypot" texts to score. These are texts on which at least two experts agreed on a score, and we used them to track each educator's agreement with the experts.
- We used the Dawid-Skene model to calibrate the final score for educators.
- In total, we worked with 21 educators.
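The Flesch–Kincaid filter used to select passages can be sketched in a few lines. This is an illustrative implementation with a naive vowel-group syllable counter, not the exact tooling we used:

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels.
    This overcounts silent e's and undercounts some diphthongs."""
    word = word.lower().strip(".,;:!?\"'")
    groups, prev_vowel = 0, False
    for ch in word:
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade level: 0.39*(words/sentences)
    + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)
```

A passage passes the filter when `fk_grade(text) < 9`; production readability tools use dictionary-based syllable counts and will differ slightly.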
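The Dawid-Skene calibration step can be sketched as an expectation-maximization loop that learns one confusion matrix per educator and re-weights their votes accordingly. The item and annotator names below are illustrative, and our production calibration may differ in its initialization and stopping rules:

```python
from collections import defaultdict

CATEGORIES = [1, 2, 3, 4]  # Slightly ... Exceedingly Complex

def dawid_skene(labels, n_iter=20, smooth=0.01):
    """labels: {item: {annotator: category}} -> {item: calibrated category}."""
    items = list(labels)
    annotators = sorted({a for scores in labels.values() for a in scores})
    # Initialise per-item class probabilities from raw vote shares.
    T = {}
    for i in items:
        counts = defaultdict(float)
        for c in labels[i].values():
            counts[c] += 1.0
        total = sum(counts.values())
        T[i] = {c: counts[c] / total for c in CATEGORIES}
    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per annotator.
        prior = {c: sum(T[i][c] for i in items) / len(items) for c in CATEGORIES}
        conf = {}
        for a in annotators:
            conf[a] = {}
            seen = [i for i in items if a in labels[i]]
            for c in CATEGORIES:
                denom = sum(T[i][c] for i in seen) + smooth * len(CATEGORIES)
                conf[a][c] = {
                    k: (sum(T[i][c] for i in seen if labels[i][a] == k) + smooth)
                       / denom
                    for k in CATEGORIES
                }
        # E-step: posterior class probabilities for each item.
        for i in items:
            post = {}
            for c in CATEGORIES:
                p = prior[c]
                for a, k in labels[i].items():
                    p *= conf[a][c][k]
                post[c] = p
            z = sum(post.values()) or 1.0
            T[i] = {c: post[c] / z for c in CATEGORIES}
    return {i: max(T[i], key=T[i].get) for i in items}
```

Unlike simple majority vote, this down-weights an educator whose scores systematically disagree with the consensus.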
Machine-compatible rubric development
After we finalized the benchmark dataset, we conducted extensive data analysis:

- We calculated the F-statistic for 30+ sentence features, which allowed us to identify the sentence features most important to a text's sentence structure complexity.
- We used tree-based models to identify the thresholds that made a text fall within a particular category (e.g., average words per sentence < 12 for “Slightly Complex”).
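The two analysis steps above can be sketched as follows. The feature values and cutoff in the usage note are made up for illustration, and we used full tree-based models rather than a single-split search:

```python
def f_statistic(groups):
    """One-way ANOVA F-statistic for one sentence feature: variance of the
    feature between rubric categories divided by variance within them."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def best_threshold(values, in_category):
    """Exhaustive search for the single cutoff on one feature that best
    separates one rubric category -- a stand-in for a tree split."""
    best_t, best_errors = None, float("inf")
    for t in sorted(set(values)):
        errors = sum((v < t) != flag for v, flag in zip(values, in_category))
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t
```

For example, `f_statistic([[8, 9, 10], [20, 21, 22]])` is large because the two categories are well separated on that feature, and `best_threshold` applied to average words per sentence recovers a cutoff like the one quoted for "Slightly Complex".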