Our evaluators let you get specific about the characteristics of your LLM-generated text. Let's say you're working on a feature designed to evaluate a student's vocabulary. You might want the vocabulary to be more challenging while keeping the sentence structure less challenging.
You can set targets for how closely sentence structure and vocabulary need to match the grade-level appropriateness of the generated text. Then run the Sentence Structure Evaluator and Vocabulary Evaluator on sample texts generated by your LLM to make sure they fall within acceptable ranges on both dimensions.

Or maybe you're working on a feature designed to help students with reading aloud, where the focus is on building prosody (rhythm and flow) without stumbling over complex or unfamiliar words. Here you'll want to optimize for different parameters of text complexity in the vocabulary the LLM generates. It might look like this: evaluating Vocabulary against a lower grade level, or choosing to accept a lower rating as defined by the evaluator, when you test your text samples.
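To make this concrete, here's a minimal sketch of how those checks could look in code. It's illustrative only: the `EvaluatorScores` shape, the grade-level numbers, and the pass/fail rules are assumptions standing in for whatever your evaluator API actually returns and whatever targets you set for your product.

```python
from dataclasses import dataclass

@dataclass
class EvaluatorScores:
    # Hypothetical shape: grade-level estimates returned by the Vocabulary
    # and Sentence Structure Evaluators for one generated text.
    vocabulary_grade: float
    sentence_structure_grade: float

def passes_vocabulary_feature(scores: EvaluatorScores, target_grade: int) -> bool:
    """Vocabulary-focused feature: vocabulary should stretch above the
    target grade while sentence structure stays at or below it."""
    return (scores.vocabulary_grade >= target_grade
            and scores.sentence_structure_grade <= target_grade)

def passes_read_aloud_feature(scores: EvaluatorScores, target_grade: int) -> bool:
    """Read-aloud feature: keep vocabulary at or below the target grade so
    students can focus on prosody instead of decoding unfamiliar words."""
    return scores.vocabulary_grade <= target_grade

# Scores you might have collected for one LLM-generated sample.
sample = EvaluatorScores(vocabulary_grade=6.2, sentence_structure_grade=4.8)
print(passes_vocabulary_feature(sample, target_grade=5))  # True
print(passes_read_aloud_feature(sample, target_grade=5))  # False
```

Notice that the same sample can pass one feature's bar and fail the other's, which is exactly the point: the targets belong to the feature, not to the text.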
The last thing you want is your AI generating output you don't expect. Sometimes models drift, or you introduce something new that subtly changes how your system behaves. Language is highly nuanced, and without the right safeguards in place, it's easy for unexpected content to go undetected until support tickets start coming in.

For this reason, we recommend setting up regular regression tests that measure the consistency of LLM-generated text, confirming it stays as expected period over period. Some of our partners run these tests weekly and others daily, but the premise is the same: ensuring consistency from one moment to the next so that your user experience stays reliable.
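As one rough sketch of what a recurring check like that might look like, the snippet below compares this period's average evaluator scores to a stored baseline and flags any dimension that drifts beyond a tolerance. The score names, numbers, and tolerance are placeholder assumptions; you'd plug in your own evaluator outputs and run it on whatever cadence fits your team.

```python
# Average evaluator scores from your last accepted run; the dimensions,
# numbers, and tolerance below are illustrative assumptions.
BASELINE = {"vocabulary_grade": 5.1, "sentence_structure_grade": 4.9}
TOLERANCE = 0.5  # maximum acceptable drift per dimension

def detect_drift(current: dict[str, float],
                 baseline: dict[str, float] = BASELINE,
                 tolerance: float = TOLERANCE) -> list[str]:
    """Return the dimensions whose average score moved more than
    `tolerance` away from the baseline since the last run."""
    return [dim for dim, expected in baseline.items()
            if abs(current.get(dim, expected) - expected) > tolerance]

# Run on a schedule (weekly, daily) over a fixed set of prompts and alert
# whenever anything drifts.
this_period = {"vocabulary_grade": 6.0, "sentence_structure_grade": 4.8}
drifted = detect_drift(this_period)
if drifted:
    print(f"Regression: drift detected in {drifted}")
```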
It seems like there's a new model coming out every week. That's exciting, but it means companies need a consistent methodology for assessing models and determining which one best fits their needs.

That's where evaluators come in. By establishing a gold set of expected scores against a set of parameters (including product inputs like grade, topic, and so on), you can use evaluators as a consistent measuring stick to explore the tradeoffs in speed, quality, and cost that come with selecting a model. This lets you experiment and make updates confidently, because detecting drift from your baseline is no longer a manual process.
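For illustration, a sketch of that kind of comparison might look like the following. The gold-set entries, candidate model names, and scoring hooks are all hypothetical stand-ins; in practice each hook would generate text with a candidate model, run it through your evaluators, and return the resulting score.

```python
from typing import Callable

# Gold set: expected evaluator scores for a few representative product
# inputs (grade, topic). The entries here are illustrative placeholders.
GOLD_SET = {
    ("grade_4", "fractions"): 4.0,
    ("grade_6", "photosynthesis"): 6.0,
}

def mean_gap(score_fn: Callable[[str, str], float]) -> float:
    """Average absolute difference between a candidate model's evaluator
    scores and the gold set. `score_fn(grade, topic)` is a stand-in hook
    that would generate text with the model and return its evaluator score."""
    gaps = [abs(score_fn(grade, topic) - expected)
            for (grade, topic), expected in GOLD_SET.items()]
    return sum(gaps) / len(gaps)

# Toy stand-ins for two candidate models.
candidates = {
    "model_a": lambda grade, topic: 4.3 if grade == "grade_4" else 6.1,
    "model_b": lambda grade, topic: 5.0 if grade == "grade_4" else 6.8,
}
for name, score_fn in candidates.items():
    print(name, round(mean_gap(score_fn), 2))
```

The model with the smallest gap isn't automatically the winner; you'd weigh that quality gap against the latency and cost you measure alongside it.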
Districts and educators are increasingly interested in how edtech companies know that their products deliver consistent, high-quality AI output. Evaluators allow you to draw a direct line from learning science to the content your product generates. Telling customers about your approach to measuring and improving AI content is a meaningful way to highlight the quality and rigor of your offering.