Why it matters
The build-vs-API decision just shifted. With llm-d, KAI Scheduler, and vendor-neutral GPU allocation under open governance, platform teams can run credible inference — and fractional GPU quotas finally make per-team utilization visible to finance.
Tokenmaxxing read
The post proposes tokens-per-watt-per-namespace as the 2026 efficiency metric — the self-hosting analog of tokens-per-successful-task. Fractional GPUs end the utilization lie the same way token attribution ends the usage-leaderboard lie: by tying consumption to an owner.
Source takeaway
The author's migration checklist is the practical core: be on v1.34+ for DRA, evaluate llm-d before writing custom serving code, add quota-aware scheduling, and instrument efficiency per namespace rather than trusting cluster-level averages.


