Read the full breakdown on the Google Cloud blog
cloud.google.com/blog/produc...
4/4 The impact?
> 35% faster responses (time to first token)
> Cache hit rate doubled (35% -> 70%)
3/4 The GKE Inference Gateway is smart. It inspects the request prefix. It sends the user to the exact Pod that already has their data loaded in VRAM. Result: Instant response, no re-compute.
2/4 Standard Load Balancers are naive. They just look for an open slot. But LLMs rely on the KV Cache (memory). If you send a request to a GPU with an empty cache, it wastes huge compute time rebuilding the context window.
1/4 Routing LLM traffic is harder than routing web traffic.
If you get it wrong, you force the GPU to "re-read" the whole prompt every time.
Vertex AI just fixed this with Content-Aware Routing. Here is the simple version of how they cut latency by over 35%.
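Not how the Gateway is actually implemented, just a minimal Python sketch of the prefix-affinity idea behind it: hash the shared prompt prefix and pin matching requests to the replica whose KV cache is already warm. Every name below is hypothetical.

```python
import hashlib

class PrefixAffinityRouter:
    """Toy content-aware router: requests that share a prompt prefix
    are pinned to the replica that already holds it in its KV cache."""

    def __init__(self, replicas):
        self.replicas = list(replicas)    # e.g. pod addresses
        self.prefix_to_replica = {}       # prefix hash -> replica
        self.next_rr = 0                  # round-robin fallback counter

    def route(self, prompt, prefix_len):
        # Hash the shared prefix (e.g. the system prompt).
        key = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
        replica = self.prefix_to_replica.get(key)
        if replica is None:
            # Cache miss: pick a replica round-robin and remember it,
            # so the next request with this prefix hits a warm cache.
            replica = self.replicas[self.next_rr % len(self.replicas)]
            self.next_rr += 1
            self.prefix_to_replica[key] = replica
        return replica

router = PrefixAffinityRouter(["pod-a", "pod-b", "pod-c"])
system = "You are a helpful support agent for ExampleCo."
print(router.route(system + " How do I reset my password?", len(system)))
print(router.route(system + " Where is my invoice?", len(system)))  # same pod
```

The real Gateway works on live cache state rather than a static mapping, but the routing decision it makes per request is the same shape: prefix match first, capacity second.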
7/7 More info about the project here: github.com/sgl-project...
6/7 The framework exposes an OpenAI-compatible API to use with LangChain, LlamaIndex, or raw requests
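Since the API is OpenAI-compatible, the stock openai client should just work; a sketch assuming a local server (the base URL, port, and model name are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local SGL-JAX server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",  # whatever model the server is hosting
    messages=[{"role": "user", "content": "Why are TPUs good at inference?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```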
5/7 SGL-JAX uses JAX's native Tensor Parallelism. It handles the sharding for you to serve models like Qwen 3 MoE across multiple TPU v5e chips
4/7 It uses a Radix Tree (SGLang's RadixAttention approach) for automatic Prefix Sharing. If you send 100 requests with the same system prompt, it only computes that prompt once.
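A toy sketch of the idea (not SGL-JAX's actual data structure): keep computed prefixes in a token-level trie, so a new request only computes the suffix that isn't cached yet.

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.has_kv = False  # KV entries exist for this prefix

class PrefixCache:
    """Toy token trie standing in for a radix tree of KV blocks."""
    def __init__(self):
        self.root = RadixNode()

    def match(self, tokens):
        """Return how many leading tokens already have cached KV state."""
        node, matched = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None or not nxt.has_kv:
                break
            node, matched = nxt, matched + 1
        return matched

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.has_kv = True

cache = PrefixCache()
system_prompt = [101, 7592, 2088]          # shared system-prompt tokens
cache.insert(system_prompt + [5, 6])       # first request computes everything
hit = cache.match(system_prompt + [9, 9])  # second request reuses the prefix
print(f"reuse {hit} cached tokens, compute only the rest")  # reuse 3
```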
3/7 SGL-JAX integrates a FlashAttention kernel directly into JAX for memory efficiency with long-context workloads
2/7 SGL-JAX implements Continuous Batching with a custom scheduler. It dynamically batches incoming requests, keeping TPU cores saturated and cutting tail latency significantly compared to naive serving.
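The core loop of continuous batching, as a toy sketch (not SGL-JAX's scheduler): finished sequences leave the batch immediately and queued requests join mid-flight, instead of waiting for the whole batch to drain.

```python
import collections

def continuous_batching_loop(requests, max_batch=4):
    """Toy scheduler: admit new requests every step, evict finished ones."""
    queue = collections.deque(requests)  # (request_id, tokens_to_generate)
    active = {}

    while queue or active:
        # Admit waiting requests into free slots (no waiting for a drain).
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:  # sequence done: slot frees immediately
                del active[rid]
        yield sorted(active)      # batch composition after this step

for step, batch in enumerate(continuous_batching_loop(
        [("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)])):
    print(step, batch)
```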
1/7 I've been looking at inference engines for TPUs, and I just noticed that @sgl_project released SGL-JAX.
It's a native JAX engine designed for serving on TPU, and the framework supports interesting features 🧵👇
The TPU Research Cloud (TRC) offers accepted researchers access to over 1,000 Cloud TPUs: v2 (45 TFLOPS per chip), v3 (123 TFLOPS), or v4 (275 TFLOPS), usable with TensorFlow, PyTorch, Julia, or JAX
Apply for the TRC program here: sites.research.google/trc/about/
Blog: www.anthropic.com/research/bloom
Tech paper: alignment.anthropic.com/2025/bloom-...
Github: github.com/safety-rese...
Bloom eval pipeline output example: claude.ai/redirect/we...
I missed Bloom, Anthropic's open-source framework for automating AI behavioral tests. It approaches LLM evaluation by:
> Using a 4-stage pipeline: "understanding," "ideation," "rollouts," "judgment."
> Generating fresh scenarios on each run to reduce the risk of training-set contamination.
More in the 🧵
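Not Bloom's actual code, just a sketch of what a 4-stage pipeline of that shape could look like wired together; every function name and body here is hypothetical.

```python
# Hypothetical shape of a 4-stage behavioral-eval pipeline like Bloom's.

def understanding(behavior: str) -> str:
    """Stage 1: build an operational description of the target behavior."""
    return f"Operational definition of: {behavior}"

def ideation(spec: str, n: int = 3) -> list[str]:
    """Stage 2: generate fresh test scenarios (new ones each run)."""
    return [f"Scenario {i} probing '{spec}'" for i in range(n)]

def rollouts(scenarios: list[str]) -> list[dict]:
    """Stage 3: run the target model through each scenario."""
    return [{"scenario": s, "transcript": f"<model output for {s}>"}
            for s in scenarios]

def judgment(results: list[dict]) -> list[dict]:
    """Stage 4: score each transcript for the behavior."""
    return [{**r, "score": 0.0} for r in results]

report = judgment(rollouts(ideation(understanding("sycophancy"))))
print(report)
```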
This week, Vertex AI introduced Agent Designer in Agent Builder.
It's a low-code interface that lets you design agents and export the logic to the Agent Development Kit (ADK).
The preview version limits MCP authentication and advanced ADK patterns, but overall looks promising!
Docs in 🧵
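For reference, the kind of ADK code such an export targets; a minimal sketch using the google-adk Python package (the model id and tool are placeholders of my own):

```python
from google.adk.agents import Agent

def get_order_status(order_id: str) -> dict:
    """Hypothetical tool: look up an order's status."""
    return {"order_id": order_id, "status": "shipped"}

# Minimal ADK agent of the kind a designer export would produce.
root_agent = Agent(
    name="support_agent",
    model="gemini-2.0-flash",
    instruction="Help users track their orders. Use tools when needed.",
    tools=[get_order_status],
)
```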
Currently, Vertex AI Agent Engine offers two paths based on your agent's lifecycle stage:
> Serialization (pickle): great for quick deployment tests in notebooks
> In-line source deployment: relevant for CI/CD and tools like Terraform.
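A minimal sketch of the first path with the vertexai SDK; project, region, bucket, and model id are placeholders, and the exact call surface may differ by SDK version:

```python
import vertexai
from vertexai import agent_engines
from vertexai.preview.reasoning_engines import AdkApp
from google.adk.agents import Agent

# Placeholders: swap in your own project, region, and staging bucket.
vertexai.init(project="my-project", location="us-central1",
              staging_bucket="gs://my-staging")

agent = Agent(name="demo_agent", model="gemini-2.0-flash",
              instruction="Answer questions briefly.")

# Path 1 (serialization): the SDK pickles the local object and deploys it.
remote = agent_engines.create(
    agent_engine=AdkApp(agent=agent),
    requirements=["google-cloud-aiplatform[agent_engines,adk]"],
)
print(remote.resource_name)
```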
If you could pick either path, which would it be?
Interesting blog on separating AI agents from chat interfaces. It explains using CopilotKit's AG-UI protocol to standardize interfaces by wrapping your Google ADK agent with ag_ui_adk, enabling seamless text, tool, and state streaming to frontends.
> Blog: discuss.google.dev/t/integrati...
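A sketch of the wrapping step based on the ag_ui_adk package; treat the exact class and helper names here as assumptions drawn from its docs, and the model id as a placeholder:

```python
from fastapi import FastAPI
from ag_ui_adk import ADKAgent, add_adk_fastapi_endpoint
from google.adk.agents import Agent

# A plain ADK agent.
adk_agent = Agent(name="assistant", model="gemini-2.0-flash",
                  instruction="Be concise.")

# Wrap it so it speaks the AG-UI event protocol
# (text, tool, and state streaming to the frontend).
agui_agent = ADKAgent(adk_agent=adk_agent, app_name="demo",
                      user_id="demo-user")

# Expose it over HTTP for an AG-UI/CopilotKit frontend.
app = FastAPI()
add_adk_fastapi_endpoint(app, agui_agent, path="/")
```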
Blog: discuss.google.dev/t/where-is-...
Tutorial notebook: github.com/GoogleCloud...
Google Cloud's Cloud API Registry for Vertex AI Agent Builder is a new centralized catalog. It helps developers find tools and deploy agents using MCP servers for services like BigQuery.
Tutorial and blog on deploying an agent via Cloud API Registry on Vertex AI Agent Engine in the 🧵
Vertex AI Agent Engine just dropped a massive update: a major global expansion, production-ready features, and a pricing reduction just in time for the holidays.
Release notes: docs.cloud.google.com/agent-build...
LiteLLM Agent Gateway now supports Vertex AI Agent Engine. You can invoke your deployed agents using the OpenAI format or the A2A Protocol, complete with centralized logging and access management.
> Docs: docs.litellm.ai/docs/provid...
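The usual LiteLLM gateway pattern applies; a sketch assuming a running LiteLLM proxy with a model alias mapped to a deployed Agent Engine agent (the alias, port, and key are placeholders):

```python
from openai import OpenAI

# Talk to the LiteLLM proxy exactly like any OpenAI endpoint.
# "agent-engine-prod" is a placeholder alias configured in the proxy
# to point at a deployed Vertex AI Agent Engine agent.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-litellm-key")

resp = client.chat.completions.create(
    model="agent-engine-prod",
    messages=[{"role": "user", "content": "Summarize yesterday's tickets."}],
)
print(resp.choices[0].message.content)
```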
Distributed training deadlocks are tough for ML teams.
Reddit tackled this with smarter Ray job controls and "all-or-nothing" scheduling. They modernized their "Gazette" platform with KubeRay, Kueue, and Google Cloud.
> Blog: www.reddit.com/r/RedditEng...
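Not Reddit's Gazette code (they built on KubeRay and Kueue), but Ray itself has a generic primitive for the same "all-or-nothing" idea: a placement group reserves every bundle atomically, so a job either gets all its workers or stays queued.

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Ask for 8 one-GPU bundles as a single atomic reservation.
# The group is granted all-or-nothing: no partial allocation that
# could leave workers deadlocked waiting for missing peers.
pg = placement_group([{"GPU": 1}] * 8, strategy="PACK")

try:
    ray.get(pg.ready(), timeout=300)  # block until the full gang fits
    print("all 8 slots granted; safe to launch training workers")
except ray.exceptions.GetTimeoutError:
    print("cluster can't fit the job yet; keep it queued")
```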
> Kimi-K2-Thinking: artificialanalysis.ai/models/kimi...
> MiniMax-M2: artificialanalysis.ai/models/mini...
Vertex AI is now the fastest provider for Kimi K2 Thinking and MiniMax M2 on Artificial Analysis, with per-token pricing on par with the rest of the industry. We are preparing a deep-dive engineering blog to explain the implementation.
Link in the 🧵
Blog: substack.com/home/post/p...
I've been reflecting on this past year, the lessons, and my time. It's led me to ask daily: What truly matters for me to do in AI? I think I've found the answer, and yours might differ. If you're trying to find your path, perhaps this will be useful to you too.
Blog in the π§΅
Vertex AI provides real capabilities for customizing Gemini for your production applications when prompt engineering isn't enough.
Aniket has published an excellent walkthrough on supervised fine-tuning of Gemini on Vertex AI, achieving 97.7% accuracy on a predictive-maintenance task.
Blog in the 🧵
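For context, kicking off a supervised tuning job with the Vertex AI SDK looks roughly like this; the project, dataset paths, and model version are placeholders, so check the SDK docs for the current surface:

```python
import vertexai
from vertexai.tuning import sft

vertexai.init(project="my-project", location="us-central1")

# Placeholder GCS paths to JSONL files of prompt/response pairs.
job = sft.train(
    source_model="gemini-2.0-flash-001",
    train_dataset="gs://my-bucket/maintenance_train.jsonl",
    validation_dataset="gs://my-bucket/maintenance_val.jsonl",
)

# The job runs asynchronously; poll it or watch the console for completion.
print(job.resource_name)
```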