What it does
A CLI and CI workflow for testing prompts, agents, and RAG systems across models, with evals and red-team-style checks.
Why it belongs here
A bad prompt can burn tokens on every call and still be wrong. Evals let you find the cheapest prompt that is still good enough before a worse one reaches production.
Best use case
Teams that want CI-style prompt, model, RAG, and agent checks before routing changes or prompt edits reach users.
How to use it
Create test cases for high-value workflows, compare models and prompts, and block changes that raise cost without preserving quality.
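The blocking step can be pictured as a small gate that compares a baseline eval run with a candidate one. The sketch below is illustrative only and is not this tool's API: the EvalSummary shape, the should_block function, and the 0.01 tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSummary:
    """Aggregate result of one eval run (hypothetical shape, not a real tool's output)."""
    pass_rate: float      # fraction of test cases that passed, 0.0 to 1.0
    cost_per_case: float  # average spend per test case, in any consistent unit


def should_block(baseline: EvalSummary, candidate: EvalSummary,
                 quality_tolerance: float = 0.01) -> bool:
    """Return True when the candidate raises cost without preserving quality.

    quality_tolerance is an assumed allowance for run-to-run noise in pass rate.
    """
    cost_rose = candidate.cost_per_case > baseline.cost_per_case
    quality_preserved = (
        candidate.pass_rate >= baseline.pass_rate - quality_tolerance
    )
    return cost_rose and not quality_preserved
```

In CI, a script could call should_block on the two summaries and exit non-zero when it returns True. This policy follows the sentence above literally; a stricter team might also block any quality drop, even when cost stays flat.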
Limits
Evals are only as useful as the examples and grading criteria behind them, and both need maintenance as product behavior changes.
