Tutorial: GPU-Accelerated Serverless Inference With Google Cloud Run

Recently, Google Cloud launched GPU support for the Cloud Run serverless platform. This feature enables developers to accelerate serverless inference of models deployed on Cloud Run.
In this tutorial, I will walk you through the steps of deploying the Llama 3.1 large language model (LLM) with 8B parameters on a GPU-based Cloud Run service. We will use the Text Generation Inference (TGI) server from Hugging Face as the model server and inference engine.
This guide assumes you have access to Google Cloud and have the gcloud CLI installed and configured on your machine.
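If you want to verify your setup before starting, the following commands print the installed gcloud version and the account and project currently configured; this is just a quick sanity check and not a required step.
gcloud --version
gcloud config list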
Step 1 – Initializing the Environment
Let’s start by defining the environment variables needed to configure Cloud Run.
export PROJECT_ID=YOUR_GCP_PROJECT
export LOCATION=us-central1
export CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
export SERVICE_NAME=text-generation-inference
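The deployment command in Step 2 passes a Hugging Face access token to the container. Rather than hard-coding the token, export it as an environment variable as well; the value below is a placeholder, so substitute a token generated from your Hugging Face account settings.
# Placeholder value - replace with your own Hugging Face access token.
export HF_TOKEN=YOUR_HF_TOKEN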
Since Cloud Run expects a container image as the deployment unit, we use the official Hugging Face Deep Learning Container, which is already hosted in Google Cloud Artifact Registry. The image us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310 is the TGI container image that we will deploy on Cloud Run.
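Optionally, if you have Docker installed locally, you can pull the image to inspect it before deploying. The image is large (several gigabytes), so treat this purely as a sanity check rather than a required step.
docker pull $CONTAINER_URI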
The next step is to configure the gcloud CLI to use the correct project and region. Run the following commands to authenticate and initialize the environment.
gcloud auth login
gcloud config set project $PROJECT_ID
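You can also set a default region for Cloud Run so that $LOCATION is picked up automatically. This is optional, since the commands in this tutorial pass --region explicitly.
gcloud config set run/region $LOCATION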
Finally, let’s ensure that the Cloud Run API is enabled for your project.
gcloud services enable run.googleapis.com
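To confirm that the API is now active, you can list the enabled services and look for the Cloud Run entry.
gcloud services list --enabled | grep run.googleapis.com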
Step 2 – Deploying TGI Model Server
We are now ready to deploy the Text Generation Inference server on Cloud Run. Run the command below to deploy the service.
gcloud beta run deploy $SERVICE_NAME \
  --image=$CONTAINER_URI \
  --args="--model-id=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,--quantize=awq,--max-concurrent-requests=64" \
  --set-env-vars="HF_HUB_ENABLE_HF_TRANSFER=1,HF_TOKEN=$HF_TOKEN" \
  --port=8080 \
  --cpu=4 \
  --memory=16Gi \
  --no-cpu-throttling \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --max-instances=1 \
  --concurrency=64 \
  --region=$LOCATION \
  --no-allow-unauthenticated
The --image parameter specifies the container image stored in Google Cloud Artifact Registry, which we placed in the CONTAINER_URI environment variable earlier.
The --args switch passes the model ID as it appears on the Hugging Face Hub, along with the TGI options. For increased throughput, we use an INT4 (AWQ) quantized model and configure TGI to handle up to 64 concurrent requests.
The most important arguments are --gpu=1 and --gpu-type=nvidia-l4, which attach a single NVIDIA L4 GPU to the service for accelerated inference.
We don’t want to allow anonymous access, so the --no-allow-unauthenticated switch requires clients to authenticate with Google Cloud.
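Because unauthenticated access is disabled, any principal calling the service needs the Cloud Run Invoker role. The account that deployed the service typically has sufficient permissions already; as a sketch, you could grant access to another user like this (the email address is a placeholder).
# Replace the placeholder email with the account that will call the service.
gcloud run services add-iam-policy-binding $SERVICE_NAME \
  --region=$LOCATION \
  --member="user:you@example.com" \
  --role="roles/run.invoker"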
Wait for the command to finish, and you should be able to see the service in Google Cloud Console. The green tick mark indicates that the service is deployed and running.
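You can also check the deployment status and the service URL from the terminal.
gcloud run services describe $SERVICE_NAME --region $LOCATION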
Step 3 – Performing Inference
With the model deployed in Cloud Run, we can now perform inference against the endpoint. Before that, let’s run the proxy to expose the service on our local machine. This is a handy technique to test a service deployed in Cloud Run.
gcloud run services proxy $SERVICE_NAME --region $LOCATION
We can now use the cURL command to test the inference endpoint.
curl http://localhost:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is the capital of France?"
}
],
"max_tokens": 128
}'
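As an alternative to the local proxy, you can call the Cloud Run URL directly by attaching an identity token to the request. This is a sketch that assumes your account holds the Cloud Run Invoker role on the service.
SERVICE_URL=$(gcloud run services describe $SERVICE_NAME --region $LOCATION --format 'value(status.url)')
curl "$SERVICE_URL/v1/chat/completions" \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H 'Content-Type: application/json' \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "What is the capital of France?"}], "max_tokens": 128}'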
Since Hugging Face TGI exposes an OpenAI-compatible endpoint, we can also use the standard OpenAI Python library to talk to the service.
Install the OpenAI Python module.
pip install --upgrade openai
We can now run the code below to test the service.
from openai import OpenAI

# The API key is a placeholder; authentication is handled by the gcloud proxy.
client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-",
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=128,
)

print(chat_completion.choices[0].message.content)
The first request takes some time because the service has to start up and download the model weights. Subsequent calls will be faster because the model is cached and readily available.
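If you want to avoid repeated cold starts while testing, you can keep at least one instance warm; this is a sketch of an optional tweak, and note that a warm GPU instance accrues charges continuously while it runs.
gcloud beta run services update $SERVICE_NAME --region $LOCATION --min-instances=1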
Apart from TGI, you can also deploy other model servers, such as vLLM, on Google Cloud Run.
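When you are done experimenting, you can delete the service so that it no longer incurs charges.
gcloud run services delete $SERVICE_NAME --region $LOCATION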