Kubernetes Becomes the AI Substrate: 66% of GenAI Inference, DRA GA, llm-d

A practitioner reading of June's CNCF news: 66% of orgs running GenAI inference do it on Kubernetes, Dynamic Resource Allocation (DRA) went GA, gang scheduling landed natively, and NVIDIA and Google donated their DRA drivers. The post's conclusion is that the open-source stack for self-hosted inference is now complete.

Published 2026-06-09
Source: abhs.in (Abhishek Gautam)

Why it matters

The build-vs-API decision just shifted. With llm-d, KAI Scheduler, and vendor-neutral GPU allocation under open governance, platform teams can run credible inference — and fractional GPU quotas finally make per-team utilization visible to finance.

Tokenmaxxing read

The post proposes tokens-per-watt-per-namespace as the 2026 efficiency metric — the self-hosting analog of tokens-per-successful-task. Fractional GPUs end the utilization lie the same way token attribution ends the usage-leaderboard lie: by tying consumption to an owner.
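One way to make the metric concrete is sketched below. The note does not give a formula, so this sketch assumes "tokens per watt" means token throughput per watt of attributed GPU power, which is the same quantity as tokens per joule. The `UsageSample` shape, its field names, and the rule of attributing power in proportion to the GPU fraction are all hypothetical choices for illustration, not something the source specifies.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageSample:
    """One measurement window for one namespace's share of one GPU (hypothetical shape)."""
    namespace: str
    tokens: int          # tokens generated during the window
    gpu_fraction: float  # share of the GPU allocated to the namespace, 0..1
    gpu_watts: float     # average whole-GPU power draw during the window
    seconds: float       # window length in seconds


def tokens_per_joule_by_namespace(samples):
    """Return tokens per joule (tokens/s per watt) for each namespace."""
    tokens = defaultdict(int)
    joules = defaultdict(float)
    for s in samples:
        tokens[s.namespace] += s.tokens
        # Assumption: energy is attributed in proportion to the GPU fraction.
        joules[s.namespace] += s.gpu_fraction * s.gpu_watts * s.seconds
    return {ns: tokens[ns] / joules[ns] for ns in tokens if joules[ns] > 0}


samples = [
    UsageSample("team-a", tokens=1000, gpu_fraction=0.5, gpu_watts=400.0, seconds=10.0),
    UsageSample("team-b", tokens=3000, gpu_fraction=0.25, gpu_watts=400.0, seconds=10.0),
]
per_namespace = tokens_per_joule_by_namespace(samples)

# The cluster-wide figure hides the spread between the two owners.
cluster_average = sum(s.tokens for s in samples) / sum(
    s.gpu_fraction * s.gpu_watts * s.seconds for s in samples
)
```

With these made-up numbers, team-b is six times as efficient as team-a, while the cluster-wide average sits between the two and describes neither.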

Source takeaway

The author's migration checklist is the practical core: run Kubernetes v1.34+ for DRA, evaluate llm-d before writing custom serving code, add quota-aware scheduling, and instrument efficiency per namespace rather than trusting cluster-level averages.
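The first checklist item can be turned into a small gate. The sketch below assumes the cluster's server version is already available as a string such as `v1.34.1`. The function name and the constant are hypothetical, and the v1.34 threshold is taken from the note rather than verified independently.

```python
import re

# Threshold taken from the note's checklist, not from release documentation.
DRA_MIN_VERSION = (1, 34)


def meets_dra_minimum(git_version: str) -> bool:
    """Return True if a version string like 'v1.34.1' is at or above the threshold."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {git_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Tuple comparison checks the major version first, then the minor.
    return (major, minor) >= DRA_MIN_VERSION
```

Vendor suffixes after the minor version (for example a managed-service build tag) are ignored, since only the major and minor numbers matter for this check.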

Related projects

Tools that match this angle

#1 | Direct
Routing

LiteLLM

BerriAI/litellm

An OpenAI-compatible gateway and SDK for calling many model providers with budgets, logging, load balancing, guardrails, and cost tracking.

50.7K | 8.9K | Source-available
gateway, cost-tracking, routing
#2 | Direct
Observability

Langfuse

langfuse/langfuse

Open-source LLM engineering platform for observability, traces, metrics, evals, prompt management, datasets, and playground workflows.

29.3K | 3K | Source-available
traces, evals, costs
#10 | Direct
Routing

Portkey Gateway

Portkey-AI/gateway

An AI gateway for routing across LLMs with guardrails, provider abstraction, and an OpenAI-compatible API surface.

12.1K | 1.1K | MIT
gateway, guardrails, routing