Name: Reading Group (+🧋): JUDGEMENTBENCH: Comparing Rubric and Preference Evaluation for Quality Assessment
Start: 2026-06-17T22:00:00.000Z
End: 2026-06-18T00:30:00.000Z
Location: SoMa, San Francisco, CA

Why it matters

Evals are the bottleneck in every serious agent deployment, and this paper reports a startling gap between rubric-based and pairwise preference evaluation.

The tokenmaxxing angle

If pairwise judging correlates at 0.908 versus 0.150 for rubrics, every LLM-as-judge pipeline burning tokens on rubric scores deserves a rethink.

From the organizers

Russell Yang, AI Engineering Fellow at Stanford Law School, presents at 101 Second Street. Doors at 3 pm, talk at 3:30 pm, boba provided.