Reading Group (+🧋): JUDGEMENTBENCH: Comparing Rubric and Preference Evaluation for Quality Assessment

A Snorkel AI reading group on JudgmentBench, a study comparing rubric scoring against pairwise preferences for judging LLM output quality, boba included.

Wed, Jun 17, 3:00 PM · SoMa, San Francisco, CA

Why it matters

Evals are the bottleneck on every serious agent deployment, and this paper has a startling result about which method actually works.

The tokenmaxxing angle

If pairwise judging correlates at 0.908 versus 0.150 for rubrics, every LLM-as-judge pipeline burning tokens on rubric scores should rethink that approach.

From the organizers

Russell Yang, AI Engineering Fellow at Stanford Law School, presents at 101 Second Street; doors at 3pm, talk at 3:30, boba provided.