Kubernetes Becomes the AI Substrate: 66% of GenAI Inference, DRA GA, llm-d

A practitioner reading of June's CNCF news: 66% of organizations running GenAI inference do it on Kubernetes, DRA (Dynamic Resource Allocation) went GA, gang scheduling landed natively, and Nvidia and Google donated their DRA drivers. The author's conclusion is that the self-hosted inference stack is now complete.

Published 2026-06-09. Source: abhs.in (Abhishek Gautam)

Why it matters

The build-vs-API decision just shifted. With llm-d, KAI Scheduler, and vendor-neutral GPU allocation under open governance, platform teams can run credible inference — and fractional GPU quotas finally make per-team utilization visible to finance.

Tokenmaxxing read

The post proposes tokens-per-watt-per-namespace as the 2026 efficiency metric — the self-hosting analog of tokens-per-successful-task. Fractional GPUs end the utilization lie the same way token attribution ends the usage-leaderboard lie: by tying consumption to an owner.
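
One way to read the metric is as a simple ratio per owner. The sketch below is illustrative only: the source does not give a formula, so it assumes tokens served in a window divided by the average GPU power attributed to each namespace, with power attributed in proportion to the fractional GPU share. The names UsageSample, gpu_fraction, and gpu_avg_watts are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSample:
    """One workload's usage over a fixed measurement window (hypothetical shape)."""
    namespace: str
    tokens: int            # tokens served in the window
    gpu_fraction: float    # share of one GPU allocated to the workload (0..1)
    gpu_avg_watts: float   # average power draw of that GPU over the window


def tokens_per_watt_by_namespace(samples):
    """Aggregate tokens and attributed watts per namespace, then divide.

    Assumption: power is attributed linearly by GPU fraction. Namespaces
    with no attributed power are skipped to avoid division by zero.
    """
    tokens = defaultdict(int)
    watts = defaultdict(float)
    for s in samples:
        tokens[s.namespace] += s.tokens
        watts[s.namespace] += s.gpu_fraction * s.gpu_avg_watts
    return {ns: tokens[ns] / watts[ns] for ns in tokens if watts[ns] > 0}


samples = [
    UsageSample("team-a", tokens=1000, gpu_fraction=0.5, gpu_avg_watts=400.0),
    UsageSample("team-a", tokens=1000, gpu_fraction=0.5, gpu_avg_watts=400.0),
    UsageSample("team-b", tokens=300, gpu_fraction=0.25, gpu_avg_watts=400.0),
]
result = tokens_per_watt_by_namespace(samples)
print(result)
```

Grouping by namespace rather than by cluster is the point: a cluster-level average over these samples would hide the gap between the two teams.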

Source takeaway

The author's migration checklist is the practical core: be on v1.34+ for DRA, evaluate llm-d before writing custom serving code, add quota-aware scheduling, and instrument efficiency per namespace rather than trusting cluster-level averages.
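
The first checklist item can be checked mechanically. This is a minimal sketch: the v1.34 minimum comes from the checklist, while the version-string format (a gitVersion-style value such as "v1.34.1") and the helper names are assumptions for illustration.

```python
import re

# Minimum (major, minor) taken from the checklist's "v1.34+ for DRA".
MIN_DRA_VERSION = (1, 34)


def parse_major_minor(git_version):
    """Extract (major, minor) from strings like 'v1.34.1' or '1.35.0-gke.100'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def meets_dra_minimum(git_version, minimum=MIN_DRA_VERSION):
    """Return True if the cluster version is at or above the minimum.

    Tuples compare numerically, so 1.9 sorts below 1.34 (a plain string
    comparison would get this wrong).
    """
    return parse_major_minor(git_version) >= minimum


for version in ("v1.33.5", "v1.34.0", "1.35.1-gke.100"):
    print(version, meets_dra_minimum(version))
```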

Topic links

ai-spend, cost-control, cost-governance, finops, infrastructure, tokenmaxxing

Related projects

Tools that match this angle

#1 Direct

Routing

LiteLLM

BerriAI/litellm

An OpenAI-compatible gateway and SDK for calling many model providers with budgets, logging, load balancing, guardrails, and cost tracking.

55.2K · 10.2K · Source-available

gateway, cost-tracking, routing

#2 Direct

Observability

Langfuse

langfuse/langfuse

Open-source LLM engineering platform for observability, traces, metrics, evals, prompt management, datasets, and playground workflows.

32.2K · 3.5K · Source-available

traces, evals, costs

#10 Direct

Routing

Portkey Gateway

Portkey-AI/gateway

An AI gateway for routing across LLMs with guardrails, provider abstraction, and an OpenAI-compatible API surface.

12.6K · 1.2K · MIT

gateway, guardrails, routing

Related feed

More source-linked context

news 2026-07-20

The cost of intelligence: How CIOs can manage AI demand at scale - McKinsey & Company

McKinsey’s July 20 report finds 93% of enterprises are already blowing past their AI budgets, with spend jumping nearly 4x as pilots go company-wide. The fix it prescribes: run “FinOps for AI” and treat tokens like cloud cost.

tokenmaxxing, finops, ai-spend

news 2026-07-07

FinOps for AI: Snowflake's AI Cost Management and Governance Tools

Snowflake's product team makes the case for 'FinOps for AI' — governing model spend the way cloud bills got governed — and rolls out per-user token quotas, budgets, and org-level cost views to meter Cortex and agent usage.

tokenmaxxing, finops, ai-spend

news 2026-06-09

How Ramp is Fuelling AI Spend Management Expansion

Ramp closed a $750M round at a $44B valuation and is launching AI token spend management, procurement agents, and accounting agents on top of $1B+ annualized revenue and 70,000+ customers.

agents, ai-spend, cost-governance
