Early Release
This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.
We continue to iterate on:
- System prompts
- Models (including several from OpenAI, Google Gemini, Anthropic, and Meta)
- Temperature settings
- Few-shot learning examples
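As a rough illustration of what sweeping over these dimensions can look like, here is a minimal sketch of a grid search. All prompts, model names, and settings in it are hypothetical placeholders rather than the configurations we actually used:

```python
from itertools import product

# Hypothetical experiment grid: the concrete prompts, model identifiers, and
# settings used for the Vocabulary Evaluator are not published here.
SYSTEM_PROMPTS = {
    "baseline": "Rate the vocabulary complexity of the following text.",
    "evaluator_v2": "You are a grade-level vocabulary complexity rater...",
}
MODELS = ["gpt-4o", "gemini-1.5-pro", "claude-3-5-sonnet", "llama-3.1-70b"]
TEMPERATURES = [0.0, 0.3, 0.7]
FEW_SHOT_COUNTS = [0, 3, 5]

def score_configuration(prompt: str, model: str, temperature: float, shots: int) -> float:
    """Stub: run the prompt against the training split and return accuracy
    versus the human annotations. Replace with a real model call."""
    return 0.0

results = []
for (name, prompt), model, temp, shots in product(
    SYSTEM_PROMPTS.items(), MODELS, TEMPERATURES, FEW_SHOT_COUNTS
):
    results.append({
        "prompt": name,
        "model": model,
        "temperature": temp,
        "few_shot_examples": shots,
        "accuracy": score_configuration(prompt, model, temp, shots),
    })
```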
Reported accuracy
We report the accuracy of our evaluator on the test split of our annotated dataset. The dataset contains 580+ annotated rows for vocabulary. To avoid over-fitting our prompts to our existing dataset, we split it into two parts:
- Training data: 60% of the dataset, used for data analysis and for trying out different prompts
- Test data: 40% of the dataset, used for the final accuracy reporting
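For reproducibility, a split like this is typically done once with a fixed random seed. A minimal sketch, assuming a simple shuffle-and-cut approach (the seed and helper below are illustrative, not our actual pipeline):

```python
import random

def split_dataset(rows, train_fraction=0.6, seed=42):
    """Shuffle once with a fixed seed, then keep 60% for prompt development
    and hold out 40% for final accuracy reporting."""
    rng = random.Random(seed)
    shuffled = list(rows)   # avoid mutating the caller's data
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# With 580+ annotated rows, this leaves roughly 230+ rows in the held-out test split.
train_rows, test_rows = split_dataset(range(580))
```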
Overall accuracy
We created a performance baseline with a naive LLM prompt. This prompt represents what an edtech company might build without this evaluator. Baseline performance varied widely depending on the model and temperature setting, averaging about 39% accuracy, whereas the Vocabulary Evaluator prompts achieved 52% accuracy on the same test split of our dataset.
Accuracy for Grade 3
The evaluator was, on average, 28% more accurate (in relative terms) than the baseline LLM at assessing Grade 3 vocabulary complexity. Across multiple runs, baseline accuracy on the Grade 3 test data is 39%, compared to 50% for the evaluator.
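Here, "more accurate in relative terms" means the accuracy gain divided by the baseline accuracy. A quick check of the Grade 3 figures:

```python
baseline_accuracy = 0.39   # baseline LLM, Grade 3 test data
evaluator_accuracy = 0.50  # Vocabulary Evaluator, same data

relative_improvement = (evaluator_accuracy - baseline_accuracy) / baseline_accuracy
print(f"{relative_improvement:.0%}")  # -> 28%
```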
Accuracy for Grade 4
The evaluator was, on average, 42% more accurate (in relative terms) than the baseline LLM at assessing Grade 4 vocabulary complexity. Across multiple runs, baseline accuracy on the Grade 4 test data is 38%, compared to 42% for the evaluator.
Confusion matrix comparison
Below are two charts showing how the Baseline AI and the Vocabulary Evaluator performed relative to human annotation. This kind of chart is often called a confusion matrix. It shows where the predictions from the Baseline AI and the Vocabulary Evaluator are most accurate and where they make the most mistakes, which can help you understand the bias and risks of using the evaluator. The confusion matrices are measured on the test split of our dataset, which contains ~234 texts. (A short sketch of how such a matrix is computed appears after the two lists below.)
Baseline LLM confusion matrix
For example, of the texts that human annotators rated exceedingly complex, the baseline LLM classified:
- 1 text as slightly complex (3 level difference)
- 13 as moderately complex (2 level difference)
- 45 as very complex (1 level difference)
- 3 as exceedingly complex (correct)
Vocabulary Evaluator confusion matrix
Of those same texts, the Vocabulary Evaluator classified:
- 1 text as slightly complex (3 level difference)
- 1 as moderately complex (2 level difference)
- 23 as very complex (1 level difference)
- 37 as exceedingly complex (correct)
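The bullet counts above correspond to the "exceedingly complex" row of each chart. As an illustration of how such a matrix is built from human labels and model predictions, here is a minimal sketch using scikit-learn; the tooling and the four-level label set are assumptions based on the lists above:

```python
from sklearn.metrics import confusion_matrix

# Complexity levels inferred from the lists above; the evaluator's full label
# set may contain additional levels.
LABELS = ["slightly complex", "moderately complex",
          "very complex", "exceedingly complex"]

def build_confusion_matrix(human_labels, model_predictions):
    """Rows are human annotations, columns are model predictions."""
    return confusion_matrix(human_labels, model_predictions, labels=LABELS)

# Reconstructing just the "exceedingly complex" row reported for the baseline LLM:
human_labels = ["exceedingly complex"] * 62
model_predictions = (["slightly complex"] * 1
                     + ["moderately complex"] * 13
                     + ["very complex"] * 45
                     + ["exceedingly complex"] * 3)
print(build_confusion_matrix(human_labels, model_predictions))
# The last row of the printed matrix is [ 1 13 45  3]
```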