Tutorial: GPU-Accelerated Serverless Inference With Google Cloud Run

Recently, Google Cloud launched GPU support for the Cloud Run serverless platform. This feature enables developers to accelerate serverless inference of models deployed on Cloud Run.
In this tutorial, I will walk you through the steps of deploying the Llama 3.1 large language model (LLM) with 8B parameters on a GPU-based Cloud Run service. We will use the Text Generation Inference (TGI) server from Hugging Face as the model server and inference engine.
This guide assumes you have access to Google Cloud and have the gcloud CLI installed and configured on your machine.
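If you want to verify your setup before starting, the following commands print the installed gcloud version and the account and project currently configured; this is just a quick sanity check and not a required step.
gcloud --version
gcloud config list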
Step 1 – Initializing the Environment
Let’s start by defining the environment variables needed to configure Cloud Run.
export PROJECT_ID=YOUR_GCP_PROJECT
export LOCATION=us-central1
export CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310
export SERVICE_NAME=text-generation-inference
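The deployment command in Step 2 passes a Hugging Face access token to the container. Rather than hard-coding the token, export it as an environment variable as well; the value below is a placeholder, so substitute a token generated from your Hugging Face account settings.
# Placeholder value - replace with your own Hugging Face access token.
export HF_TOKEN=YOUR_HF_TOKEN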
Since Cloud Run expects a container image as the deployment unit, we use the official Hugging Face Deep Learning Container, which is already hosted in Google Cloud Artifact Registry. The image us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.2-2.ubuntu2204.py310 is the TGI container image that we will deploy on Cloud Run.
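Optionally, if you have Docker installed locally, you can pull the image to inspect it before deploying. The image is large (several gigabytes), so treat this purely as a sanity check rather than a required step.
docker pull $CONTAINER_URI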
The next step is to configure the gcloud CLI to use the correct project and region. Run the following commands to authenticate and initialize the environment.
gcloud auth login
gcloud config set project $PROJECT_ID
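You can also set a default region for Cloud Run so that $LOCATION is picked up automatically. This is optional, since the commands in this tutorial pass --region explicitly.
gcloud config set run/region $LOCATION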
Finally, let’s ensure that the Cloud Run API is enabled for your project.
gcloud services enable run.googleapis.com
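To confirm that the API is now active, you can list the enabled services and look for the Cloud Run entry.
gcloud services list --enabled | grep run.googleapis.com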
Step 2 – Deploying TGI Model Server
We are now ready to deploy the Text Generation Inference server on Cloud Run. Run the command below to deploy the service.
gcloud beta run deploy $SERVICE_NAME \
  --image=$CONTAINER_URI \
  --args="--model-id=hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4,--quantize=awq,--max-concurrent-requests=64" \
  --set-env-vars="HF_HUB_ENABLE_HF_TRANSFER=1,HF_TOKEN=$HF_TOKEN" \
  --port=8080 \
  --cpu=4 \
  --memory=16Gi \
  --no-cpu-throttling \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --max-instances=1 \
  --concurrency=64 \
  --region=$LOCATION \
  --no-allow-unauthenticated
The --image parameter specifies the container image stored in Google Cloud Artifact Registry, which we placed in the CONTAINER_URI environment variable earlier.
The --args switch passes the model ID as it appears on the Hugging Face Hub, along with the TGI options. For increased throughput, we use an INT4 (AWQ) quantized model and configure TGI to handle up to 64 concurrent requests.
The most important arguments are --gpu=1 and --gpu-type=nvidia-l4, which attach a single NVIDIA L4 GPU to the service for accelerated inference.
We don’t want to allow anonymous access, so the --no-allow-unauthenticated switch requires clients to authenticate with Google Cloud.
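Because unauthenticated access is disabled, any principal calling the service needs the Cloud Run Invoker role. The account that deployed the service typically has sufficient permissions already; as a sketch, you could grant access to another user like this (the email address is a placeholder).
# Replace the placeholder email with the account that will call the service.
gcloud run services add-iam-policy-binding $SERVICE_NAME \
  --region=$LOCATION \
  --member="user:you@example.com" \
  --role="roles/run.invoker"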
Wait for the command to finish, and you should be able to see the service in Google Cloud Console. The green tick mark indicates that the service is deployed and running.
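You can also check the deployment status and the service URL from the terminal.
gcloud run services describe $SERVICE_NAME --region $LOCATION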
Step 3 – Performing Inference
With the model deployed in Cloud Run, we can now perform inference against the endpoint. Before that, let’s run the proxy to expose the service on our local machine. This is a handy technique to test a service deployed in Cloud Run.
gcloud run services proxy $SERVICE_NAME --region $LOCATION
We can now use the cURL command to test the inference endpoint.
curl http://localhost:8080/v1/chat/completions \
-X POST \
-H 'Content-Type: application/json' \
-d '{
"model": "tgi",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is the capital of France?"
}
],
"max_tokens": 128
}'
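As an alternative to the local proxy, you can call the Cloud Run URL directly by attaching an identity token to the request. This is a sketch that assumes your account holds the Cloud Run Invoker role on the service.
SERVICE_URL=$(gcloud run services describe $SERVICE_NAME --region $LOCATION --format 'value(status.url)')
curl "$SERVICE_URL/v1/chat/completions" \
  -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -H 'Content-Type: application/json' \
  -d '{"model": "tgi", "messages": [{"role": "user", "content": "What is the capital of France?"}], "max_tokens": 128}'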
Since Hugging Face TGI exposes an OpenAI-compatible endpoint, we can also use the standard OpenAI Python library to talk to the service.
Install the OpenAI Python module.
pip install --upgrade openai
We can now run the code below to test the service.
from openai import OpenAI

# The API key is a placeholder; authentication is handled by the gcloud proxy.
client = OpenAI(
    base_url="http://localhost:8080/v1/",
    api_key="-",
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    max_tokens=128,
)

print(chat_completion.choices[0].message.content)
The first request takes some time because the service has to start up and download the model weights. Subsequent calls will be faster because the model is cached and readily available.
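If you want to avoid repeated cold starts while testing, you can keep at least one instance warm; this is a sketch of an optional tweak, and note that a warm GPU instance accrues charges continuously while it runs.
gcloud beta run services update $SERVICE_NAME --region $LOCATION --min-instances=1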
Apart from TGI, you can also deploy other model servers, such as vLLM, on Google Cloud Run.
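When you are done experimenting, you can delete the service so that it no longer incurs charges.
gcloud run services delete $SERVICE_NAME --region $LOCATION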