Early Release
This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.
We continue to iterate on:
- System prompts
- Models (including several from OpenAI, Google Gemini, Anthropic, and Meta)
- Temperature settings
- Few-shot learning examples
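As a rough illustration of what sweeping over these dimensions can look like, here is a minimal sketch of a grid search. All prompts, model names, and settings in it are hypothetical placeholders rather than the configurations we actually used:

```python
from itertools import product

# Hypothetical experiment grid: the concrete prompts, model identifiers, and
# settings used for the Vocabulary Evaluator are not published here.
SYSTEM_PROMPTS = {
    "baseline": "Rate the vocabulary complexity of the following text.",
    "evaluator_v2": "You are a grade-level vocabulary complexity rater...",
}
MODELS = ["gpt-4o", "gemini-1.5-pro", "claude-3-5-sonnet", "llama-3.1-70b"]
TEMPERATURES = [0.0, 0.3, 0.7]
FEW_SHOT_COUNTS = [0, 3, 5]

def score_configuration(prompt: str, model: str, temperature: float, shots: int) -> float:
    """Stub: run the prompt against the training split and return accuracy
    versus the human annotations. Replace with a real model call."""
    return 0.0

results = []
for (name, prompt), model, temp, shots in product(
    SYSTEM_PROMPTS.items(), MODELS, TEMPERATURES, FEW_SHOT_COUNTS
):
    results.append({
        "prompt": name,
        "model": model,
        "temperature": temp,
        "few_shot_examples": shots,
        "accuracy": score_configuration(prompt, model, temp, shots),
    })
```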
Reported accuracy
We report the accuracy of our evaluator on the test split of our annotated dataset. The dataset contains 580+ annotated rows for vocabulary. To avoid over-fitting our prompts to our existing dataset, we split it into two parts:
- Training data: 60% of the dataset, used for data analysis and for trying out different prompts
- Test data: 40% of the dataset, used for the final accuracy reporting
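For reproducibility, a split like this is typically done once with a fixed random seed. A minimal sketch, assuming a simple shuffle-and-cut approach (the seed and helper below are illustrative, not our actual pipeline):

```python
import random

def split_dataset(rows, train_fraction=0.6, seed=42):
    """Shuffle once with a fixed seed, then keep 60% for prompt development
    and hold out 40% for final accuracy reporting."""
    rng = random.Random(seed)
    shuffled = list(rows)   # avoid mutating the caller's data
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# With 580+ annotated rows, this leaves roughly 230+ rows in the held-out test split.
train_rows, test_rows = split_dataset(range(580))
```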
Overall accuracy
We created a performance baseline with a naive LLM prompt. This prompt represents what an edtech company might build without this evaluator. Baseline performance varied widely depending on the model and temperature setting, averaging about 39% accuracy, whereas the Vocabulary Evaluator prompts achieved 52% accuracy on the same test split of our dataset.
Accuracy for Grade 3
The evaluator was, on average, 28% more accurate (in relative terms) than the baseline LLM at assessing Grade 3 vocabulary complexity. Across multiple runs, baseline accuracy on the Grade 3 test data is 39%, compared to 50% for the evaluator.
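Here, "more accurate in relative terms" means the accuracy gain divided by the baseline accuracy. A quick check of the Grade 3 figures:

```python
baseline_accuracy = 0.39   # baseline LLM, Grade 3 test data
evaluator_accuracy = 0.50  # Vocabulary Evaluator, same data

relative_improvement = (evaluator_accuracy - baseline_accuracy) / baseline_accuracy
print(f"{relative_improvement:.0%}")  # -> 28%
```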
Accuracy for Grade 4
The evaluator was, on average, 42% more accurate (in relative terms) than the baseline LLM at assessing Grade 4 vocabulary complexity. Across multiple runs, baseline accuracy on the Grade 4 test data is 38%, compared to 42% for the evaluator.
Confusion matrix comparison
Below are two charts showing how the Baseline AI and the Vocabulary Evaluator performed relative to human annotation. This kind of chart is often called a confusion matrix. It shows where the predictions from the Baseline AI and the Vocabulary Evaluator are most accurate and where they make the most mistakes, which can help you understand the bias and risks of using the evaluator. The confusion matrices are measured on the test split of our dataset, which contains ~234 texts. (A short sketch of how such a matrix is computed appears after the two lists below.)
Baseline LLM confusion matrix
For example, of the texts that human annotators rated exceedingly complex, the baseline LLM classified:
- 1 text as slightly complex (3 level difference)
- 13 as moderately complex (2 level difference)
- 45 as very complex (1 level difference)
- 3 as exceedingly complex (correct)
Vocabulary Evaluator confusion matrix
Of those same texts, the Vocabulary Evaluator classified:
- 1 text as slightly complex (3 level difference)
- 1 as moderately complex (2 level difference)
- 23 as very complex (1 level difference)
- 37 as exceedingly complex (correct)
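The bullet counts above correspond to the "exceedingly complex" row of each chart. As an illustration of how such a matrix is built from human labels and model predictions, here is a minimal sketch using scikit-learn; the tooling and the four-level label set are assumptions based on the lists above:

```python
from sklearn.metrics import confusion_matrix

# Complexity levels inferred from the lists above; the evaluator's full label
# set may contain additional levels.
LABELS = ["slightly complex", "moderately complex",
          "very complex", "exceedingly complex"]

def build_confusion_matrix(human_labels, model_predictions):
    """Rows are human annotations, columns are model predictions."""
    return confusion_matrix(human_labels, model_predictions, labels=LABELS)

# Reconstructing just the "exceedingly complex" row reported for the baseline LLM:
human_labels = ["exceedingly complex"] * 62
model_predictions = (["slightly complex"] * 1
                     + ["moderately complex"] * 13
                     + ["very complex"] * 45
                     + ["exceedingly complex"] * 3)
print(build_confusion_matrix(human_labels, model_predictions))
# The last row of the printed matrix is [ 1 13 45  3]
```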