
Tutorial: GPU-Accelerated Serverless Inference With Google Cloud Run

In this tutorial, I will walk you through the steps of deploying Llama 3.1 LLM with 8B parameters on a GPU-based Cloud Run service.
Apr 18th, 2025 6:00am
Feature image via Unsplash.

Recently, Google Cloud launched GPU support for the Cloud Run serverless platform. This feature enables developers to accelerate serverless inference of models deployed on Cloud Run.

In this tutorial, I will walk you through the steps of deploying Llama 3.1 Large Language Model (LLM) with 8B parameters on a GPU-based Cloud Run service. We will use the Text Generation Inference (TGI) server from Hugging Face as the model server and inference engine.

This guide assumes you can access Google Cloud and have the gcloud CLI installed and configured on your machine.

Step 1 – Initializing the Environment

Let’s start by defining the environment variables needed to configure Cloud Run.

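The exact values will vary with your setup; here is a minimal sketch, where the variable names and the region and service name (LOCATION, SERVICE_NAME, CONTAINER_URI) are my own illustrative choices, and the container image is the TGI Deep Learning Container referenced below:

# Illustrative values; adjust the region and service name for your project
export PROJECT_ID=$(gcloud config get-value project)
export LOCATION=us-central1
export SERVICE_NAME=llama31-tgi
export CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310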

Since Cloud Run expects a container image as the deployment unit, we use the official Deep Learning Container from Hugging Face, already stored within Google Cloud Artifact Registry.

The image us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310 represents the official TGI container image that we will deploy on Cloud Run.

The next step is to configure the gcloud CLI to use the correct project and region. Run the following commands to initialize the environment.

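A sketch of the initialization, assuming the environment variables defined above:

gcloud config set project $PROJECT_ID
gcloud config set run/region $LOCATION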

Finally, let’s ensure that the Cloud Run API is enabled for your project.
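This can be done with a single gcloud command:

gcloud services enable run.googleapis.com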

Step 2 – Deploying TGI Model Server

We are ready to deploy the Text Generation Inference server on Google Cloud. Run the command below to deploy it.

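The deployment might look like the following sketch. The Hugging Face model ID (hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4) and the CPU, memory and instance settings are my assumptions for an int4-quantized Llama 3.1 8B; if the model repository is gated, you may also need to pass a Hugging Face access token with --set-env-vars=HF_TOKEN=<token>.

# Deploy the TGI container with one NVIDIA L4 GPU attached
gcloud run deploy $SERVICE_NAME \
  --image=$CONTAINER_URI \
  --args="--model-id=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,--max-concurrent-requests=64" \
  --port=8080 \
  --cpu=8 \
  --memory=32Gi \
  --no-cpu-throttling \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --concurrency=64 \
  --max-instances=1 \
  --no-allow-unauthenticated \
  --region=$LOCATION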

The --image parameter specifies the container image stored in Google Cloud Artifact Registry, which we stored in the environment variable earlier.

The --args switch passes the model name as it appears in the Hugging Face repository. For increased throughput, we are using an int4 quantized model that can handle up to 64 concurrent requests.

The most important arguments are --gpu=1 and --gpu-type=nvidia-l4, which force the service to attach an NVIDIA L4 GPU for acceleration.

We don’t want to allow anonymous access, so the --no-allow-unauthenticated switch requires clients to authenticate with Google Cloud credentials.

Wait for the command to finish, and you should be able to see the service in Google Cloud Console. The green tick mark indicates that the service is deployed and running.

Step 3 – Performing Inference

With the model deployed in Cloud Run, we can now perform inference against the endpoint. Before that, let’s run the proxy to expose the service on our local machine. This is a handy technique to test a service deployed in Cloud Run.

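One way to do this is with the Cloud Run proxy built into the gcloud CLI (in older gcloud releases it lives under gcloud beta run services proxy). The local port 8080 below matches the cURL call that follows:

gcloud run services proxy $SERVICE_NAME --region=$LOCATION --port=8080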

We can now use the cURL command to test the inference endpoint.


curl http://localhost:8080/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "tgi",
    "messages": [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "max_tokens": 128
  }'

Since Hugging Face TGI exposes an OpenAI-compatible endpoint, we can also use the standard OpenAI Python library to talk to the service.

Install the OpenAI Python module.

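A single pip command is enough:

pip install openai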

We can now run the code below to test the service.

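Here is a minimal sketch of such a client, assuming the Cloud Run proxy from the previous step is still listening on localhost:8080; the base_url and the placeholder API key are my own choices, since TGI itself does not validate the key.

# Minimal OpenAI-client sketch against the proxied TGI endpoint
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # TGI's OpenAI-compatible endpoint via the proxy
    api_key="-",  # not checked by TGI; authentication is handled by the gcloud proxy
)

response = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)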

The first request takes time as the model is downloaded. However, subsequent calls will be faster as the model is cached and becomes readily available.

Apart from TGI, it is also possible to deploy other model servers, such as vLLM, on Google Cloud Run.
