Early Release
This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.
- We calculated the F-statistic for 30+ sentence features, which allowed us to identify the features most predictive of a text’s sentence structure complexity.
- We used tree-based models to identify the thresholds that made a text fall within a particular category (e.g., average words per sentence < 12 for “Slightly Complex”); see the sketch after this list.
- Finally, we put it all together by testing 150+ combinations of 3 main variables:
  - System & user prompts
  - Models (including several from OpenAI and Google Gemini)
  - Temperature settings
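For readers who want to reproduce the feature analysis, here is a minimal sketch in Python. The file name, column names, and tree depth are illustrative assumptions, not our actual pipeline: it ranks features with an ANOVA F-test and reads category thresholds off a shallow decision tree.

```python
import pandas as pd
from sklearn.feature_selection import f_classif
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical annotated dataset: one row per text, numeric sentence features
# plus a human-assigned complexity label (file and column names are placeholders).
df = pd.read_csv("annotated_texts.csv")
feature_cols = [c for c in df.columns if c != "complexity_label"]
X, y = df[feature_cols], df["complexity_label"]

# Rank features by their ANOVA F-statistic against the complexity label.
f_scores, _ = f_classif(X, y)
for name, score in sorted(zip(feature_cols, f_scores), key=lambda t: t[1], reverse=True)[:10]:
    print(f"{name}: F = {score:.1f}")

# Fit a shallow decision tree and read off the split thresholds
# (e.g., "avg_words_per_sentence <= 12").
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_cols))
```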
Reported accuracy
We report the accuracy of our evaluator on a subset of our annotated dataset (i.e., the test data). Our dataset contains 1000+ annotated rows for sentence structure. To avoid over-fitting our prompts to our existing dataset, we split it into two parts (a minimal sketch of this split follows the list):
- Training data: 60% of our dataset was used for data analysis and to test out different prompts.
- Test data: 40% of our dataset was used for the final accuracy reporting.
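The sketch below reuses the hypothetical DataFrame from the previous example; stratifying on the label is an assumption about how the split was balanced, not a description of our exact procedure.

```python
from sklearn.model_selection import train_test_split

# 60% for data analysis and prompt iteration, 40% held out for final reporting.
# Stratifying on the complexity label keeps the levels balanced across splits
# (an assumption; the original split method isn't specified).
train_df, test_df = train_test_split(
    df, test_size=0.40, stratify=df["complexity_label"], random_state=42
)
```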
Accuracy comparison
The Sentence Structure evaluator is, on average, 26% more accurate than the naive LLM baseline. To calculate this metric, we created a performance baseline with a naive LLM prompt. This prompt represents what an EdTech company might build without this evaluator. The baseline performance varied greatly depending on the model and temperature setting used. In the example below, we show the performance using the same model and temperature setting as our final evaluator.
Overall Accuracy
Across multiple runs, the baseline accuracy on the test data averaged 42%, while the Sentence Structure evaluator averaged 54%.
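As a concrete reading of these numbers, here is a minimal sketch of the accuracy comparison, using placeholder labels and assuming “more accurate” means relative improvement over the baseline’s accuracy:

```python
from sklearn.metrics import accuracy_score

# Placeholder labels for a handful of test texts (illustrative only).
human_labels    = ["Very", "Exceedingly", "Moderately", "Slightly", "Very"]
baseline_preds  = ["Moderately", "Very", "Moderately", "Slightly", "Moderately"]
evaluator_preds = ["Very", "Very", "Moderately", "Slightly", "Very"]

baseline_acc  = accuracy_score(human_labels, baseline_preds)
evaluator_acc = accuracy_score(human_labels, evaluator_preds)

# Relative improvement over the baseline (assumed interpretation of "X% more accurate").
gain = (evaluator_acc - baseline_acc) / baseline_acc
print(f"baseline={baseline_acc:.0%}, evaluator={evaluator_acc:.0%}, gain={gain:.0%}")
```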
Accuracy for Grade 3
The Sentence Structure evaluator was, on average, 39% more accurate than the naive LLM baseline at assessing Grade 3 sentence structure. Across multiple runs, the baseline accuracy on the test data for Grade 3 averaged 39%, while the Sentence Structure evaluator averaged 53%.
Accuracy for Grade 4
The Sentence Structure evaluator was, on average, 20% more accurate than the naive LLM baseline at assessing Grade 4 sentence structure. Across multiple runs, the baseline accuracy on the test data for Grade 4 averaged 45%, while the Sentence Structure evaluator averaged 54%.
Confusion matrix comparison
When an evaluator isn’t as accurate as we’d like, it is better for it to be “almost” right rather than very wrong. One way to assess this is by looking at the evaluator’s confusion matrix, which shows where an evaluator is most accurate and where it makes the most mistakes. When assessing a confusion matrix, consider the two metrics below (a sketch for computing them follows this list):
- Within-one accuracy: the percentage of times that the evaluator is accurate within one complexity level (indicated by the greenish regions).
- Critical error rate: the percentage of times that the evaluator is wrong by more than one complexity level (indicated by the red regions).
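A minimal sketch for computing both metrics from a confusion matrix, assuming rows are annotator labels and columns are evaluator labels, ordered from Slightly to Exceedingly Complex; all counts except the last row are made up so the example runs.

```python
import numpy as np

def within_one_and_critical(confusion: np.ndarray) -> tuple[float, float]:
    """Return (within-one accuracy, critical error rate) for a square confusion matrix."""
    rows, cols = np.indices(confusion.shape)
    total = confusion.sum()
    within_one = confusion[np.abs(rows - cols) <= 1].sum() / total
    critical = confusion[np.abs(rows - cols) > 1].sum() / total
    return within_one, critical

# Rows/columns: Slightly, Moderately, Very, Exceedingly Complex. The last row
# matches the baseline's Exceedingly Complex counts quoted below; the other
# rows are placeholders.
example = np.array([
    [20,  5,  0,  0],
    [ 8, 15,  7,  0],
    [ 1, 10, 18,  6],
    [ 0, 15, 17,  3],
])
print(within_one_and_critical(example))
```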
Baseline evaluator confusion matrix
This chart gives insight into accuracy per complexity level for the naive LLM, as well as where it is most likely to make mistakes. It is clear that the naive LLM tends to under-score more complex texts. For example, the test data included 35 texts that the annotators labeled Exceedingly Complex. The naive LLM labeled:
- 0 texts as Slightly Complex
- 15 texts as Moderately Complex
- 17 texts as Very Complex
- 3 texts as Exceedingly Complex
Sentence Structure Evaluator confusion matrix
This chart gives insight into accuracy per complexity level for the Sentence Structure evaluator, as well as where it is most likely to make mistakes. The Sentence Structure evaluator doesn’t perform as well as the baseline on Moderately Complex sentences, but performs better for every other category. For example, the test data included 35 texts that the annotators labeled Exceedingly Complex. The evaluator labeled:
- 0 texts as Slightly Complex
- 1 text as Moderately Complex
- 19 texts as Very Complex
- 15 texts as Exceedingly Complex