Read the full breakdown on the Google Cloud blog
cloud.google.com/blog/produc...
4/4 The impact?
> 35% faster responses (time to first token)
> Cache hit rate doubled (35% -> 70%)
3/4 The GKE Inference Gateway is smart. It inspects the request prefix. It sends the user to the exact Pod that already has their data loaded in VRAM. Result: Instant response, no re-compute.
2/4 Standard Load Balancers are naive. They just look for an open slot. But LLMs rely on the KV Cache (memory). If you send a request to a GPU with an empty cache, it wastes huge compute time rebuilding the context window.
1/4 Routing LLM traffic is harder than routing web traffic.
If you get it wrong, you force the GPU to "re-read" the whole prompt every time.
Vertex AI just fixed this with Content-Aware Routing. Here is the simple version of how they cut latency by over 35%.
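Not how the Gateway is actually implemented, just a minimal Python sketch of the prefix-affinity idea behind it: hash the shared prompt prefix and pin matching requests to the replica whose KV cache is already warm. Every name below is hypothetical.

```python
import hashlib

class PrefixAffinityRouter:
    """Toy content-aware router: requests that share a prompt prefix
    are pinned to the replica that already holds it in its KV cache."""

    def __init__(self, replicas):
        self.replicas = list(replicas)    # e.g. pod addresses
        self.prefix_to_replica = {}       # prefix hash -> replica
        self.next_rr = 0                  # round-robin fallback counter

    def route(self, prompt, prefix_len):
        # Hash the shared prefix (e.g. the system prompt).
        key = hashlib.sha256(prompt[:prefix_len].encode()).hexdigest()
        replica = self.prefix_to_replica.get(key)
        if replica is None:
            # Cache miss: pick a replica round-robin and remember it,
            # so the next request with this prefix hits a warm cache.
            replica = self.replicas[self.next_rr % len(self.replicas)]
            self.next_rr += 1
            self.prefix_to_replica[key] = replica
        return replica

router = PrefixAffinityRouter(["pod-a", "pod-b", "pod-c"])
system = "You are a helpful support agent for ExampleCo."
print(router.route(system + " How do I reset my password?", len(system)))
print(router.route(system + " Where is my invoice?", len(system)))  # same pod
```

The real Gateway works on live cache state rather than a static mapping, but the routing decision it makes per request is the same shape: prefix match first, capacity second.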
7/7 More info about the project here: github.com/sgl-project...
6/7 The framework exposes an OpenAI-compatible API to use with LangChain, LlamaIndex, or raw requests
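Since the API is OpenAI-compatible, the stock openai client should just work; a sketch assuming a local server (the base URL, port, and model name are placeholders):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local SGL-JAX server.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B",  # whatever model the server is hosting
    messages=[{"role": "user", "content": "Why are TPUs good at inference?"}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```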
5/7 SGL-JAX uses JAX's native Tensor Parallelism. It handles the sharding for you to serve models like Qwen 3 MoE across multiple TPU v5e chips
4/7 It uses a Radix Tree (SGLang's RadixAttention approach) for automatic Prefix Sharing. If you send 100 requests with the same system prompt, it only computes that prompt once.
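A toy sketch of the idea (not SGL-JAX's actual data structure): keep computed prefixes in a token-level trie, so a new request only computes the suffix that isn't cached yet.

```python
class RadixNode:
    def __init__(self):
        self.children = {}   # token id -> RadixNode
        self.has_kv = False  # KV entries exist for this prefix

class PrefixCache:
    """Toy token trie standing in for a radix tree of KV blocks."""
    def __init__(self):
        self.root = RadixNode()

    def match(self, tokens):
        """Return how many leading tokens already have cached KV state."""
        node, matched = self.root, 0
        for t in tokens:
            nxt = node.children.get(t)
            if nxt is None or not nxt.has_kv:
                break
            node, matched = nxt, matched + 1
        return matched

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
            node.has_kv = True

cache = PrefixCache()
system_prompt = [101, 7592, 2088]          # shared system-prompt tokens
cache.insert(system_prompt + [5, 6])       # first request computes everything
hit = cache.match(system_prompt + [9, 9])  # second request reuses the prefix
print(f"reuse {hit} cached tokens, compute only the rest")  # reuse 3
```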
3/7 SGL-JAX integrates a FlashAttention kernel directly into JAX for memory efficiency with long-context workloads
2/7 SGL-JAX implements Continuous Batching with a custom scheduler. It dynamically batches incoming requests, keeping TPU cores saturated and cutting tail latency significantly compared to naive serving.
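The core loop of continuous batching, as a toy sketch (not SGL-JAX's scheduler): finished sequences leave the batch immediately and queued requests join mid-flight, instead of waiting for the whole batch to drain.

```python
import collections

def continuous_batching_loop(requests, max_batch=4):
    """Toy scheduler: admit new requests every step, evict finished ones."""
    queue = collections.deque(requests)  # (request_id, tokens_to_generate)
    active = {}

    while queue or active:
        # Admit waiting requests into free slots (no waiting for a drain).
        while queue and len(active) < max_batch:
            rid, n = queue.popleft()
            active[rid] = n
        # One decode step for every active sequence.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:  # sequence done: slot frees immediately
                del active[rid]
        yield sorted(active)      # batch composition after this step

for step, batch in enumerate(continuous_batching_loop(
        [("a", 2), ("b", 5), ("c", 1), ("d", 3), ("e", 2)])):
    print(step, batch)
```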
1/7 I've been looking at inference engines for TPUs, and I just noticed that @sgl_project released SGL-JAX.
It's a native JAX engine designed for serving on TPU, and the framework supports interesting features 🧵👇
The TPU Research Cloud (TRC) offers accepted researchers access to over 1,000 Cloud TPUs: v2 (45 TFLOPS per chip), v3 (123 TFLOPS), or v4 (275 TFLOPS), usable with TensorFlow, PyTorch, Julia, or JAX
Apply for the TRC program here: sites.research.google/trc/about/
Blog: www.anthropic.com/research/bloom
Tech paper: alignment.anthropic.com/2025/bloom-...
Github: github.com/safety-rese...
Bloom eval pipeline output example: claude.ai/redirect/we...
I missed Bloom, Anthropic's open-source framework for automating AI behavioral tests. It approaches LLM evaluation by:
> Using a 4-stage pipeline: "understanding," "ideation," "rollouts," "judgment."
> Generating fresh scenarios on each run to reduce the risk of training-set contamination.
More in the 🧵
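Not Bloom's actual code, just a sketch of what a 4-stage pipeline of that shape could look like wired together; every function name and body here is hypothetical.

```python
# Hypothetical shape of a 4-stage behavioral-eval pipeline like Bloom's.

def understanding(behavior: str) -> str:
    """Stage 1: build an operational description of the target behavior."""
    return f"Operational definition of: {behavior}"

def ideation(spec: str, n: int = 3) -> list[str]:
    """Stage 2: generate fresh test scenarios (new ones each run)."""
    return [f"Scenario {i} probing '{spec}'" for i in range(n)]

def rollouts(scenarios: list[str]) -> list[dict]:
    """Stage 3: run the target model through each scenario."""
    return [{"scenario": s, "transcript": f"<model output for {s}>"}
            for s in scenarios]

def judgment(results: list[dict]) -> list[dict]:
    """Stage 4: score each transcript for the behavior."""
    return [{**r, "score": 0.0} for r in results]

report = judgment(rollouts(ideation(understanding("sycophancy"))))
print(report)
```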
This week, Vertex AI introduced Agent Designer in Agent Builder.
It's a low-code interface that lets you design agents and export the logic to the Agent Development Kit (ADK).
The preview version limits MCP authentication and advanced ADK patterns, but overall looks promising!
Docs in 🧵
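For reference, the kind of ADK code such an export targets; a minimal sketch using the google-adk Python package (the model id and tool are placeholders of my own):

```python
from google.adk.agents import Agent

def get_order_status(order_id: str) -> dict:
    """Hypothetical tool: look up an order's status."""
    return {"order_id": order_id, "status": "shipped"}

# Minimal ADK agent of the kind a designer export would produce.
root_agent = Agent(
    name="support_agent",
    model="gemini-2.0-flash",
    instruction="Help users track their orders. Use tools when needed.",
    tools=[get_order_status],
)
```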
Currently, Vertex AI Agent Engine offers two paths based on your agent's lifecycle stage:
> Serialization (pickle): great for quick deployment tests in notebooks
> In-line source deployment: relevant for CI/CD and tools like Terraform.
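A minimal sketch of the first path with the vertexai SDK; project, region, bucket, and model id are placeholders, and the exact call surface may differ by SDK version:

```python
import vertexai
from vertexai import agent_engines
from vertexai.preview.reasoning_engines import AdkApp
from google.adk.agents import Agent

# Placeholders: swap in your own project, region, and staging bucket.
vertexai.init(project="my-project", location="us-central1",
              staging_bucket="gs://my-staging")

agent = Agent(name="demo_agent", model="gemini-2.0-flash",
              instruction="Answer questions briefly.")

# Path 1 (serialization): the SDK pickles the local object and deploys it.
remote = agent_engines.create(
    agent_engine=AdkApp(agent=agent),
    requirements=["google-cloud-aiplatform[agent_engines,adk]"],
)
print(remote.resource_name)
```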
If you could pick either path, which would it be?
Interesting blog on separating AI agents from chat interfaces. It explains using CopilotKit's AG-UI protocol to standardize interfaces by wrapping your Google ADK agent with ag_ui_adk, enabling seamless text, tool, and state streaming to frontends.
> Blog: discuss.google.dev/t/integrati...
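A sketch of the wrapping step based on the ag_ui_adk package; treat the exact class and helper names here as assumptions drawn from its docs, and the model id as a placeholder:

```python
from fastapi import FastAPI
from ag_ui_adk import ADKAgent, add_adk_fastapi_endpoint
from google.adk.agents import Agent

# A plain ADK agent.
adk_agent = Agent(name="assistant", model="gemini-2.0-flash",
                  instruction="Be concise.")

# Wrap it so it speaks the AG-UI event protocol
# (text, tool, and state streaming to the frontend).
agui_agent = ADKAgent(adk_agent=adk_agent, app_name="demo",
                      user_id="demo-user")

# Expose it over HTTP for an AG-UI/CopilotKit frontend.
app = FastAPI()
add_adk_fastapi_endpoint(app, agui_agent, path="/")
```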
Blog: discuss.google.dev/t/where-is-...
Tutorial notebook: github.com/GoogleCloud...
Google Cloud's Cloud API Registry for Vertex AI Agent Builder is a new centralized catalog. It helps developers find tools and deploy agents using MCP servers for services like BigQuery.
Tutorial and blog on deploying an agent via Cloud API Registry on Vertex AI Agent Engine in the 🧵
Vertex AI Agent Engine just dropped a massive update: a major global expansion, production-ready features, and a pricing reduction just in time for the holidays.
Release notes: docs.cloud.google.com/agent-build...
LiteLLM Agent Gateway now supports Vertex AI Agent Engine. You can invoke your deployed agents using the OpenAI format or the A2A Protocol, complete with centralized logging and access management.
> Docs: docs.litellm.ai/docs/provid...
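The usual LiteLLM gateway pattern applies; a sketch assuming a running LiteLLM proxy with a model alias mapped to a deployed Agent Engine agent (the alias, port, and key are placeholders):

```python
from openai import OpenAI

# Talk to the LiteLLM proxy exactly like any OpenAI endpoint.
# "agent-engine-prod" is a placeholder alias configured in the proxy
# to point at a deployed Vertex AI Agent Engine agent.
client = OpenAI(base_url="http://localhost:4000", api_key="sk-litellm-key")

resp = client.chat.completions.create(
    model="agent-engine-prod",
    messages=[{"role": "user", "content": "Summarize yesterday's tickets."}],
)
print(resp.choices[0].message.content)
```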
Distributed training deadlocks are tough for ML teams.
Reddit tackled this with smarter Ray job controls and "all-or-nothing" scheduling. They modernized their "Gazette" platform with KubeRay, Kueue, and Google Cloud.
> Blog: www.reddit.com/r/RedditEng...
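Not Reddit's Gazette code (they built on KubeRay and Kueue), but Ray itself has a generic primitive for the same "all-or-nothing" idea: a placement group reserves every bundle atomically, so a job either gets all its workers or stays queued.

```python
import ray
from ray.util.placement_group import placement_group

ray.init()

# Ask for 8 one-GPU bundles as a single atomic reservation.
# The group is granted all-or-nothing: no partial allocation that
# could leave workers deadlocked waiting for missing peers.
pg = placement_group([{"GPU": 1}] * 8, strategy="PACK")

try:
    ray.get(pg.ready(), timeout=300)  # block until the full gang fits
    print("all 8 slots granted; safe to launch training workers")
except ray.exceptions.GetTimeoutError:
    print("cluster can't fit the job yet; keep it queued")
```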
> Kimi-K2-Thinking: artificialanalysis.ai/models/kimi...
> MiniMax-M2: artificialanalysis.ai/models/mini...
Vertex AI is now the fastest provider for Kimi K2 Thinking and MiniMax M2 on Artificial Analysis, with per-token pricing on par with the rest of the industry. We are preparing a deep-dive engineering blog to explain the implementation.
Link in the 🧵
Blog: substack.com/home/post/p...
I've been reflecting on this past year, the lessons, and my time. It's led me to ask daily: What truly matters for me to do in AI? I think I've found the answer, and yours might differ. If you're trying to find your path, perhaps this will be useful to you too.
Blog in the π§΅
Vertex AI provides real capabilities for customizing Gemini for your production applications when prompt engineering isn't enough.
Aniket has published an excellent walkthrough on supervised fine-tuning of Gemini on Vertex AI, achieving 97.7% accuracy on a predictive-maintenance task.
Blog in the 🧵
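For context, kicking off a supervised tuning job with the Vertex AI SDK looks roughly like this; the project, dataset paths, and model version are placeholders, so check the SDK docs for the current surface:

```python
import vertexai
from vertexai.tuning import sft

vertexai.init(project="my-project", location="us-central1")

# Placeholder GCS paths to JSONL files of prompt/response pairs.
job = sft.train(
    source_model="gemini-2.0-flash-001",
    train_dataset="gs://my-bucket/maintenance_train.jsonl",
    validation_dataset="gs://my-bucket/maintenance_val.jsonl",
)

# The job runs asynchronously; poll it or watch the console for completion.
print(job.resource_name)
```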