Early Release
This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.
- We calculated the F-statistic for 30+ sentence features, which allowed us to identify the features most predictive of a text’s sentence structure complexity.
- We used tree-based models to identify the thresholds that made a text fall within a particular category (e.g., average words per sentence < 12 for “Slightly Complex”); see the sketch after this list.
- Finally, we put it all together by testing 150+ combinations of 3 main variables:
  - System & user prompts
  - Models (including several from OpenAI and Google Gemini)
  - Temperature settings
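For readers who want to reproduce the feature analysis, here is a minimal sketch in Python. The file name, column names, and tree depth are illustrative assumptions, not our actual pipeline: it ranks features with an ANOVA F-test and reads category thresholds off a shallow decision tree.

```python
import pandas as pd
from sklearn.feature_selection import f_classif
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical annotated dataset: one row per text, numeric sentence features
# plus a human-assigned complexity label (file and column names are placeholders).
df = pd.read_csv("annotated_texts.csv")
feature_cols = [c for c in df.columns if c != "complexity_label"]
X, y = df[feature_cols], df["complexity_label"]

# Rank features by their ANOVA F-statistic against the complexity label.
f_scores, _ = f_classif(X, y)
for name, score in sorted(zip(feature_cols, f_scores), key=lambda t: t[1], reverse=True)[:10]:
    print(f"{name}: F = {score:.1f}")

# Fit a shallow decision tree and read off the split thresholds
# (e.g., "avg_words_per_sentence <= 12").
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=feature_cols))
```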
Reported accuracy
We report the accuracy of our evaluator on a subset of our annotated dataset (i.e., the test data). Our dataset contains 1000+ annotated rows for sentence structure. To avoid over-fitting our prompts to our existing dataset, we split it into two parts (a minimal sketch of this split follows the list):
- Training data: 60% of our dataset was used for data analysis and to test out different prompts.
- Test data: 40% of our dataset was used for the final accuracy reporting.
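The sketch below reuses the hypothetical DataFrame from the previous example; stratifying on the label is an assumption about how the split was balanced, not a description of our exact procedure.

```python
from sklearn.model_selection import train_test_split

# 60% for data analysis and prompt iteration, 40% held out for final reporting.
# Stratifying on the complexity label keeps the levels balanced across splits
# (an assumption; the original split method isn't specified).
train_df, test_df = train_test_split(
    df, test_size=0.40, stratify=df["complexity_label"], random_state=42
)
```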
Accuracy comparison
The Sentence Structure evaluator is, on average, 26% more accurate than the naive LLM baseline. To calculate this metric, we created a performance baseline with a naive LLM prompt. This prompt represents what an EdTech company might build without this evaluator. The baseline performance varied greatly depending on the model and temperature setting used. In the example below, we show the performance using the same model and temperature setting as our final evaluator.
Overall Accuracy
Across multiple runs, the baseline accuracy on the test data averaged 42%, while the Sentence Structure evaluator averaged 54%.
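As a concrete reading of these numbers, here is a minimal sketch of the accuracy comparison, using placeholder labels and assuming “more accurate” means relative improvement over the baseline’s accuracy:

```python
from sklearn.metrics import accuracy_score

# Placeholder labels for a handful of test texts (illustrative only).
human_labels    = ["Very", "Exceedingly", "Moderately", "Slightly", "Very"]
baseline_preds  = ["Moderately", "Very", "Moderately", "Slightly", "Moderately"]
evaluator_preds = ["Very", "Very", "Moderately", "Slightly", "Very"]

baseline_acc  = accuracy_score(human_labels, baseline_preds)
evaluator_acc = accuracy_score(human_labels, evaluator_preds)

# Relative improvement over the baseline (assumed interpretation of "X% more accurate").
gain = (evaluator_acc - baseline_acc) / baseline_acc
print(f"baseline={baseline_acc:.0%}, evaluator={evaluator_acc:.0%}, gain={gain:.0%}")
```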
Accuracy for Grade 3
The Sentence Structure evaluator was, on average, 39% more accurate than the naive LLM baseline at assessing Grade 3 sentence structure. Across multiple runs, the baseline accuracy on the test data for Grade 3 averaged 39%, while the Sentence Structure evaluator averaged 53%.
Accuracy for Grade 4
The Sentence Structure evaluator was, on average, 20% more accurate than the naive LLM baseline at assessing Grade 4 sentence structure. Across multiple runs, the baseline accuracy on the test data for Grade 4 averaged 45%, while the Sentence Structure evaluator averaged 54%.
Confusion matrix comparison
When an evaluator isn’t as accurate as we’d like, it is better for it to be “almost” right rather than very wrong. One way to assess this is by looking at the evaluator’s confusion matrix, which shows where an evaluator is most accurate and where it makes the most mistakes. When assessing a confusion matrix, consider the two metrics below (a sketch for computing them follows this list):
- Within-one accuracy: the percentage of times that the evaluator is accurate within one complexity level (indicated by the greenish regions).
- Critical error rate: the percentage of times that the evaluator is wrong by more than one complexity level (indicated by the red regions).
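A minimal sketch for computing both metrics from a confusion matrix, assuming rows are annotator labels and columns are evaluator labels, ordered from Slightly to Exceedingly Complex; all counts except the last row are made up so the example runs.

```python
import numpy as np

def within_one_and_critical(confusion: np.ndarray) -> tuple[float, float]:
    """Return (within-one accuracy, critical error rate) for a square confusion matrix."""
    rows, cols = np.indices(confusion.shape)
    total = confusion.sum()
    within_one = confusion[np.abs(rows - cols) <= 1].sum() / total
    critical = confusion[np.abs(rows - cols) > 1].sum() / total
    return within_one, critical

# Rows/columns: Slightly, Moderately, Very, Exceedingly Complex. The last row
# matches the baseline's Exceedingly Complex counts quoted below; the other
# rows are placeholders.
example = np.array([
    [20,  5,  0,  0],
    [ 8, 15,  7,  0],
    [ 1, 10, 18,  6],
    [ 0, 15, 17,  3],
])
print(within_one_and_critical(example))
```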
Baseline evaluator confusion matrix
This chart gives insight into accuracy per complexity level for the naive LLM, as well as where it is most likely to make mistakes. It is clear that the naive LLM tends to under-score more complex texts. For example, the test data included 35 texts that the annotators labeled Exceedingly Complex. The naive LLM labeled:
- 0 texts as Slightly Complex
- 15 texts as Moderately Complex
- 17 texts as Very Complex
- 3 texts as Exceedingly Complex
Sentence Structure Evaluator confusion matrix
This chart gives insight into accuracy per complexity level for the Sentence Structure evaluator, as well as where it is most likely to make mistakes. The Sentence Structure evaluator doesn’t perform as well as the baseline on Moderately Complex sentences, but performs better for every other category. For example, the test data included 35 texts that the annotators labeled Exceedingly Complex. The evaluator labeled:
- 0 texts as Slightly Complex
- 1 text as Moderately Complex
- 19 texts as Very Complex
- 15 texts as Exceedingly Complex