Early Release

This evaluator reflects early-stage work. We’re continuously improving its accuracy and reliability.

Requirements

To run the Sentence Structure Evaluator, you need the Python code, which runs the entire workflow. The code includes several components:
  • Two Python functions that conduct sentence analysis (see the sketch after this list). These functions work together as follows:
    • The first function calculates statistics such as the number of sentences and words using the Python library textstat. It then feeds these statistics into an LLM, which uses the ground truth data to compute additional text features, such as the number of sentences of each sentence type and the number of subordinate clauses.
    • The second function deterministically normalizes those statistics. We split this step out because the LLM can produce inconsistent results for normalized statistics.
  • The prompts that assign complexity. These take the text, grade level, and sentence statistics as input and produce a score for the text’s sentence structure complexity at a particular grade.
  • A model and temperature setting. We found the following to be the most accurate:
    • Model: gpt-4o
    • Temperature: 0
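
As a rough illustration of how the two analysis functions might fit together, here is a minimal sketch. The function names, prompt text, and returned JSON keys are assumptions for illustration, not the evaluator's actual code, and the OpenAI call assumes the openai>=1.0 Python SDK.

```python
import json

import textstat
from openai import OpenAI  # assumes the openai>=1.0 Python SDK

client = OpenAI()


def compute_text_statistics(text: str) -> dict:
    """Illustrative step 1: deterministic counts from textstat, then an LLM pass
    for features textstat cannot produce (sentence types, subordinate clauses)."""
    stats = {
        "sentence_count": textstat.sentence_count(text),
        "word_count": textstat.lexicon_count(text, removepunct=True),
    }
    # Placeholder prompt: the evaluator ships its own prompts for this step.
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        response_format={"type": "json_object"},  # assumption: JSON mode is acceptable here
        messages=[{
            "role": "user",
            "content": (
                "Count the sentences of each type (simple, compound, complex) and "
                "the subordinate clauses in the passage below. Respond in JSON.\n\n"
                + text
            ),
        }],
    )
    stats.update(json.loads(response.choices[0].message.content))
    return stats


def normalize_text_statistics(stats: dict) -> dict:
    """Illustrative step 2: deterministic normalization, kept outside the LLM
    because the LLM can be inconsistent at this arithmetic."""
    sentences = max(stats["sentence_count"], 1)
    return {
        **stats,
        "words_per_sentence": stats["word_count"] / sentences,
        "subordinate_clauses_per_sentence": stats.get("subordinate_clauses", 0) / sentences,
    }
```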

Running the evaluator

  1. We recommend copying the sentence structure evaluation workflow into your notebook environment (e.g., a Jupyter notebook or a Databricks notebook).
  2. Then, provide all data to be evaluated as a dataframe, and run the workflow.
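
For example, assuming the copied workflow exposes a top-level entry point (hypothetically named evaluate_sentence_structure here) that accepts a pandas dataframe, the call might look like this:

```python
import pandas as pd

# Hypothetical column layout; adapt it to whatever the copied workflow expects.
passages = pd.DataFrame({
    "text": ["Volcanoes form where melted rock rises through the Earth's crust. ..."],
    "grade_level": [3],
})

# `evaluate_sentence_structure` stands in for the workflow's entry point.
results = evaluate_sentence_structure(passages)
```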
If you prefer to run by function:
  1. Run the Python function that calculates initial text statistics.
  2. Run the Python function that normalizes text statistics from step 1.
  3. Using the statistics produced in step 2, make an API call to gpt-4o using the prompts that assign complexity.
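
Put together, a single by-function pass might look like the sketch below, reusing the illustrative helpers (and the client) from the Requirements section. The complexity prompt shown here is only a placeholder indicating which inputs are passed in, not the evaluator's actual prompts.

```python
# `passage_text` and `grade_level` are the inputs for one passage under evaluation.
raw_stats = compute_text_statistics(passage_text)          # step 1
normalized_stats = normalize_text_statistics(raw_stats)    # step 2

# Step 3: complexity assignment with the evaluator's prompts (placeholder text here).
complexity_response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[{
        "role": "user",
        "content": (
            f"Grade level: {grade_level}\n"
            f"Sentence statistics: {normalized_stats}\n\n"
            f"Passage:\n{passage_text}\n\n"
            "Rate the sentence structure complexity of this passage for the given grade."
        ),
    }],
)
complexity_rating = complexity_response.choices[0].message.content
```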

Choosing test passages

As you choose texts to test, keep in mind:
  • Personal data: Use only anonymous, general-purpose, informational text. Never use student information, real-world personal data, or any content subject to privacy regulations such as FERPA, COPPA, HIPAA, or GDPR.
  • Grade level: This evaluator was tested on AI-generated informational texts for grades 3-4. It might not be as reliable for texts meant for lower or higher grades.
  • Passage length: We tested with texts ranging from approximately 100 to 200 words (maximum 1,200 characters). We recommend evaluating passages that fall within this range.
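
If you want to screen candidate passages programmatically, a simple check like the following (an illustrative helper, not part of the evaluator) flags texts outside the tested range:

```python
def within_tested_range(text: str, min_words: int = 100, max_words: int = 200,
                        max_chars: int = 1200) -> bool:
    """Return True when a passage falls inside the range the evaluator was tested on."""
    word_count = len(text.split())
    return min_words <= word_count <= max_words and len(text) <= max_chars
```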
Run multiple times. We recommend repeating these steps 3 times per text passage to smooth over any errors the LLM might make.

Run sentence analysis in two functions and complexity assignment in a separate prompt. We split the evaluation into multiple stages because the evaluator can get confused when the stages are combined.

Apply a voting mechanism. Because LLMs can return different results for the same input, even when their temperature settings are low, we recommend running the evaluation 3 times for each input and treating each complexity rating as a “vote” towards the true rating. You may want to try a higher number of runs for even greater confidence in the LLM’s output.
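
A minimal majority-vote sketch, assuming a scoring function (hypothetically named assign_complexity) that returns one complexity rating per call:

```python
from collections import Counter


def vote_on_complexity(text: str, grade_level: int, runs: int = 3) -> str:
    """Run the complexity assignment several times and keep the most common rating."""
    ratings = [assign_complexity(text, grade_level) for _ in range(runs)]
    most_common_rating, _ = Counter(ratings).most_common(1)[0]
    return most_common_rating
```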