The "Cold Start" problem is the silent killer of GenAI ROI. When a traffic spike hits your AI application, waiting 10+ minutes for a new GPU node to pull a 100GB+ model is the difference between a seamless user experience and a timeout error. It also means expensive GPUs sit idle, burning budget while waiting for data.

Fortunately, a powerful new infrastructure "trifecta" on Google #Kubernetes Engine (#GKE) is solving this, slashing load times from minutes to mere seconds. By combining NVIDIA's Run:ai streamer, Google Cloud's storage caching, and GKE's native image streaming, we can finally maximize GPU utilization and achieve true, fast autoscaling.

Here is the stack that is changing the game:

🚀 1. GKE Image Streaming
Before you load the model, you need to load the container. GKE Image Streaming lets pods enter the Running state almost instantly by fetching only the startup data they need on demand, rather than waiting for the full image download.

⚡ 2. Google Cloud "Anywhere Cache"
Even with fast streaming, latency matters. Anywhere Cache provides an SSD-backed read cache zonally co-located with your GKE nodes. This means sub-millisecond access to weights and massive throughput, ensuring new nodes don't bottleneck at the storage layer.

🧠 3. NVIDIA Run:ai Model Streamer
This is the biggest shift. Traditionally, you download a model to disk, load it into RAM, then transfer it to the GPU. The Run:ai streamer (integrated with vLLM) flips the script: it streams model weights directly from Cloud Storage into GPU memory, bypassing the local disk entirely.

The Result: Your GPUs spend their time computing, not waiting. For MLOps teams, this means Horizontal Pod Autoscaling (HPA) actually works in real time: scaling out for spikes and back down quickly, significantly reducing TCO.

Have you implemented model streaming in your infrastructure yet?

#GenAI #Kubernetes #GKE #NVIDIA #MLOps #CloudComputing #GPU #AIInfrastructure #GoogleCloud
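A minimal sketch of item 3: vLLM exposes a Run:ai Model Streamer load format that streams weights straight into GPU memory. The bucket path and concurrency value below are placeholders, and pointing the streamer at a GCS bucket via its S3-compatible endpoint is an assumption to verify against the vLLM "Run:ai Model Streamer" docs for your version:

```shell
# Serve a model with vLLM, streaming weights directly from object storage
# into GPU memory via the Run:ai Model Streamer (no local-disk staging).
# Bucket/model names are placeholders; the endpoint override for GCS is
# an assumption - check the vLLM docs before relying on it.
export AWS_ENDPOINT_URL=https://storage.googleapis.com

vllm serve s3://my-model-bucket/llama-3-70b \
  --load-format runai_streamer \
  --model-loader-extra-config '{"concurrency": 16}'
```

The `concurrency` knob controls how many parallel streams read the weight files; tune it against your bucket and NIC throughput rather than copying the value here.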
Latest Google Kubernetes Engine Feature Releases
Summary
Google Kubernetes Engine (GKE) is a managed service that makes it easier to deploy and run applications using containers, and recent feature releases are focused on improving performance, scaling, security, and AI integration. These updates help organizations train AI models, manage large clusters, boost reliability, and streamline access, all while reducing costs and complexity.
- Streamline cluster access: Enable DNS-based endpoints to securely connect to your GKE control plane from anywhere without the hassle of proxies or static IPs.
- Speed up AI training: Use multi-tier checkpointing to minimize downtime and rapidly recover ML workloads, allowing training jobs to resume quickly after failures.
- Customize compute scaling: Configure custom compute classes to match your workload needs, ensuring pods run on preferred VM types and autoscale efficiently for cost savings.
GKE just released custom compute classes, which I think is a killer and unique feature that only #GoogleCloud has for now 🎉

Custom compute classes are a mechanism in #GoogleKubernetesEngine that lets you configure the set of node configurations you want your workloads to run on, along with the order in which those configurations should be provisioned: a fallback order, scale-out configs, and defaults [1].

Let's take an example. Say I want my Pods to run on Spot virtual machines unless they are not available, in which case I want to fail over to standard VMs. But I also want to fall back to Spot when it becomes available again, and in case neither Spot nor the config I want is available, I want a default node config.

Typically in #Kubernetes you would have to define some sort of custom logic with labels, tolerations, selectors, balloon pods... With custom compute classes you can define a compute class (as a CRD) like the image below. This tells GKE to:
- Provision an N2 with a minimum of 64 cores as Spot VMs.
- If not available, fail over to an N2 with any number of cores as Spot VMs.
- If not available, fail over to standard (non-Spot) N2s.

You define all of these as a custom compute class object and label the namespace [2]. This works for both Standard and Autopilot clusters and is supported by the cluster autoscaler.

[1] https://lnkd.in/eQwrTn6t
[2] https://lnkd.in/ea7bqAsE
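Since the image may not render everywhere, here is a hedged sketch of that fallback chain as a ComputeClass manifest. Field names follow the GKE ComputeClass CRD as I understand it, and the metadata name is a made-up placeholder; verify against the current API reference before applying:

```yaml
apiVersion: cloud.google.com/v1
kind: ComputeClass
metadata:
  name: n2-spot-first   # hypothetical name
spec:
  # Tried in order: Spot N2 with >= 64 cores, then any Spot N2,
  # then on-demand N2 as the final fallback.
  priorities:
  - machineFamily: n2
    spot: true
    minCores: 64
  - machineFamily: n2
    spot: true
  - machineFamily: n2
    spot: false
  # Move workloads back onto higher-priority (Spot) capacity
  # when it becomes available again.
  activeMigration:
    optimizeRulePriority: true
```

Pods then opt in with a `cloud.google.com/compute-class: n2-spot-first` nodeSelector, or you label the namespace so it becomes the default, as described in [2].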
Today, at #GoogleCloudNext, we’re announcing significant improvements to Google Kubernetes Engine (GKE) to help platform teams succeed with AI:

* Cluster Director for GKE, now generally available, lets you deploy and manage large clusters of accelerated VMs with compute, storage, and networking all operating as a single unit.
* GKE Inference Quickstart, now in public preview, simplifies the selection of infrastructure and deployment of AI models, while delivering benchmarked performance characteristics.
* GKE Inference Gateway, now in public preview, provides intelligent routing and load balancing for AI inference on GKE.
* A new container-optimized compute platform is rolling out on GKE Autopilot today, and in Q3, Autopilot's compute platform will be made available to standard GKE clusters.
* Gemini Cloud Assist Investigations, now in private preview, helps with GKE troubleshooting, decreasing the time it takes to understand the root cause and resolve issues.
* Through a partnership with Anyscale, RayTurbo on GKE will launch later this year to deliver superior GPU/TPU performance, rapid cluster startup, and robust autoscaling.

More details in the blog post below...
https://lnkd.in/gxuzMJaJ
Google Kubernetes Engine's newly released multi-tier checkpointing can help you train large ML models much faster, potentially saving many millions of dollars for large training jobs. Here's how it works:

If you're training a large ML model across many GPUs or TPUs over many days or weeks, hardware failures are certain to occur. With GKE's multi-tier checkpointing, it's very easy for your PyTorch GPU training job (or JAX job on Cloud TPUs) to write checkpoints to several different tiers of storage with different durability and speed-of-recovery properties. Specifically, you'll write checkpoints to three places:

First, checkpoints are written to local storage like RAMDisk or local SSD on the VM, for very rapid recovery in the case of software bugs.

Second, checkpoints are replicated to peer nodes or to in-cluster distributed storage, to rapidly recover from single or multiple node failures.

Third, checkpoints are saved to truly durable storage like Google Cloud Storage, for recovery from catastrophic failures.

Whenever a failure happens, the checkpoint is restored in the fastest way possible, letting ML training resume quickly. This solution improves mean time to recovery (MTTR), and in our tests it significantly improved the ML Goodput of large training jobs, for instance by 6.59% in a 35K-chip workload on TPU v5p.

https://lnkd.in/eyyZWh4z
Check out our user guide: https://lnkd.in/ediF8Tc6

Huge kudos to Jason Messer, Dmitry Kakurin, Akshay Ram, Akash Verenkar, and team for this!
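The recovery order described above can be sketched in a few lines of plain Python. This is illustrative only: the tier names and loader functions are hypothetical stand-ins, not the GKE or checkpointing-library API; the point is the try-fastest-first fallback logic.

```python
# Illustrative sketch (not the GKE API): restore from the fastest
# available tier, falling back from local disk to a peer replica
# to durable object storage.
def restore_checkpoint(tiers):
    """tiers: list of (name, loader) pairs, ordered fastest to slowest."""
    for name, loader in tiers:
        try:
            state = loader()
        except FileNotFoundError:
            continue  # this tier has no usable checkpoint; try the next one
        return name, state
    raise RuntimeError("no checkpoint found in any tier")

def missing():
    # Simulates a tier whose checkpoint was lost (e.g. the VM died).
    raise FileNotFoundError

tiers = [
    ("ramdisk", missing),                     # fastest, lost with the VM
    ("peer-replica", lambda: {"step": 980}),  # survives single-node failure
    ("gcs", lambda: {"step": 900}),           # durable but slowest
]
```

Here the RAMDisk copy is gone, so recovery comes from the peer replica at step 980 rather than falling all the way back to the older GCS checkpoint, which is exactly the MTTR win the feature targets.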
So much exciting GKE news from KubeCon this week, but this launch is special to me: we announced a new flexible DNS-based endpoint for accessing the GKE control plane. Blog from Ninad Desai and Chris Gonterman: https://lnkd.in/gxzQGf6e

Customers were asking for it, and it is here now!

🔒 Secure Access from Anywhere: The DNS-based endpoint eliminates the need for proxies or bastion hosts. **Bastion hosts are notoriously hard to set up - no more toil and IP headaches!** Authorized users can connect from home, on-prem, or other clouds seamlessly.

🔧 Dynamic and Simplified Security: IAM policies handle user authentication without relying on static network IPs. To revoke access, just update permissions - no more reconfiguring firewalls.

🛡️ Multi-Layer Protection: Combine the ease of IAM policies with VPC Service Controls for robust, context-aware access from approved origins - now you have two layers of protection.

🚀 Effortless Setup: Enable DNS-based access for any cluster with just a `gcloud` command and upgrade it in minutes.

With this feature, we're making GKE cluster access more flexible, secure, and user-friendly.

#GKE #Kubernetes #CloudSecurity
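That one `gcloud` command looks roughly like the sketch below. The cluster name and location are placeholders, and the flag and `describe` format path reflect my reading of the docs, so confirm them with `gcloud container clusters update --help` before use:

```shell
# Enable the DNS-based control-plane endpoint on an existing cluster
# (cluster name and location are placeholders).
gcloud container clusters update my-cluster \
  --location=us-central1 \
  --enable-dns-access

# Print the DNS endpoint to point your kubeconfig at
gcloud container clusters describe my-cluster \
  --location=us-central1 \
  --format="value(controlPlaneEndpointsConfig.dnsEndpointConfig.endpoint)"
```

From there, access is governed by IAM (plus VPC Service Controls if you layer them on), not by which IP you happen to be connecting from.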