Why it matters
Evals are the bottleneck in every serious agent deployment, and this paper reports a startling result about which judging method actually works.
The tokenmaxxing angle
If pairwise judging correlates at 0.908 versus 0.150 for rubrics, every LLM-as-judge pipeline burning tokens on rubric scores should rethink its approach.
From the organizers
Russell Yang, AI Engineering Fellow at Stanford Law School, presents at 101 Second Street; doors at 3pm, talk at 3:30, boba provided.