
Conversation

@SheldonTsen

This PR is a repeat of #58340 because the commit history there was quite messed up. I believe all comments were resolved as of the opening of this PR!

Description

I was investigating odd behaviour where requesting an exact number of workers via the Python SDK was not behaving as expected. I initially raised an issue here: #55736. I was then pointed to this: ray-project/kuberay#3794. However, even after that fix, I was not observing any different behaviour.

Then I tried having ArgoCD ignore the replicas field, and everything started working as expected.

I thought it would be best to convey this in an example, since I could not find any documentation on how to deploy using ArgoCD (which also has a couple of lines that one needs to be aware of). IIRC I pieced it together from some GitHub issues and debugging.

The important point is that when managing Ray via ArgoCD with the autoscaler enabled, the ignoreDifferences section must be configured properly to get the expected autoscaler behaviour.
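
For reference, the ignoreDifferences stanza I'm referring to looks roughly like the sketch below. This is a minimal illustration only, not the exact manifest from the docs added in this PR; the JSON pointer path and the worker group index are assumptions that depend on how the RayCluster is laid out.

  # Minimal sketch of an ArgoCD Application excerpt: tell ArgoCD not to treat
  # autoscaler-driven changes to worker replicas as drift.
  spec:
    ignoreDifferences:
      - group: ray.io
        kind: RayCluster
        jsonPointers:
          # Assumes a single worker group at index 0; with several groups,
          # jqPathExpressions such as '.spec.workerGroupSpecs[].replicas' may be easier.
          - /spec/workerGroupSpecs/0/replicas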

I would have attached screenshots, but from a PR review perspective, this doesn't prove anything. Essentially what I did was:

  • Introduced the ignoreDifferences section, requested X workers via ray.autoscaler.sdk.request_resources, and kept changing X. When increasing X, it worked as expected and quite speedily. When reducing X, it takes ~10 minutes (based on my idle setting in the ArgoCD app) before workers start spinning down. Eventually, requesting 1 worker brings it back to 1.
  • Removed the ignoreDifferences section and requested X workers. Then, requesting more than X, nothing happens. Requesting X=1, nothing happens. Essentially, it's as if ray.autoscaler.sdk.request_resources does nothing. It does print some logs (showing that it's trying to do something), but looking at the number of pods/workers, nothing changes. Deleting the RayCluster to start back from the original state and requesting Y workers sometimes yields Y workers and sometimes doesn't. Essentially, it's not the expected behaviour and looks very random.
  • Repeated this back and forth in my environment multiple times to confirm.
  • In my case, I was testing X = 40/80/100/200.
  • For small X, like <10, it appears to work, but you'll see that pods can shut down well within the idle limit and then get spun back up again.
@SheldonTsen requested review from a team as code owners December 1, 2025 09:52
@SheldonTsen
Author

@fscnick @Future-Outlier - new PR because the old one had a messed up commit history.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds valuable documentation for deploying Ray on Kubernetes with ArgoCD. The guide is comprehensive and covers important aspects like handling autoscaling with ignoreDifferences. I've found a few critical issues related to non-existent versions being used in the examples, which would prevent users from successfully following the guide. I've also pointed out some minor inconsistencies and best practices to improve the documentation further. Overall, this is a great addition!

namespace: ray-cluster
source:
  repoURL: https://github.com/ray-project/kuberay
  targetRevision: v1.5.1 # update this as necessary

critical

The targetRevision is set to v1.5.1, but this version does not seem to exist in the ray-project/kuberay repository. The latest stable version is v1.1.1. Using a non-existent version will cause the deployment to fail. This comment also applies to lines 83, 384, and 405.

Suggested change
targetRevision: v1.5.1 # update this as necessary
targetRevision: v1.1.1 # update this as necessary
source:
  repoURL: https://ray-project.github.io/kuberay-helm/
  chart: ray-cluster
  targetRevision: "1.5.1"

critical

The targetRevision for the ray-cluster Helm chart is 1.5.1, but this version does not seem to exist in the kuberay-helm repository. The latest stable version is 1.1.1. Using a non-existent version will cause the deployment to fail.

Suggested change
targetRevision: "1.5.1"
targetRevision: "1.1.1"
source:
  repoURL: https://ray-project.github.io/kuberay-helm/
  chart: ray-cluster
  targetRevision: "1.4.1"

critical

The targetRevision for the ray-cluster Helm chart is 1.4.1 here, which is inconsistent with 1.5.1 used earlier. Both versions seem to not exist. Please use a consistent and existing version, for example 1.1.1.

Suggested change
targetRevision: "1.4.1"
targetRevision: "1.1.1"
Comment on lines +13 to +14
* (Optional)[ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster.
* (Optional)[ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management. It might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment).

medium

The markdown syntax for these optional prerequisites is slightly incorrect. The (Optional) text should be outside the link definition for better rendering and clarity.

Suggested change
* (Optional)[ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster.
* (Optional)[ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management. It might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment).
* (Optional) [ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster.
* (Optional) [ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management. It might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment).
valuesObject:
  image:
    repository: docker.io/rayproject/ray
    tag: latest

medium

Using the latest tag for Docker images is generally discouraged in documentation and production environments as it can lead to unpredictable behavior when the image is updated. Pinning to a specific version ensures reproducibility. This comment applies to all occurrences of tag: latest in this file (lines 193, 208, 449, 477, 492). Based on the KubeRay version, a compatible Ray version would be 2.9.3.

Suggested change
tag: latest
tag: "2.9.3"
Comment on lines +480 to +481
replicas: 1
minReplicas: 1

medium

In the ray-argocd-all.yaml example, additional-worker-group1 is configured with replicas: 1 and minReplicas: 1. However, in the separate raycluster.yaml example (lines 196-197), it's configured with replicas: 0 and minReplicas: 0. This inconsistency might be confusing. For clarity, it would be best to keep the examples consistent.

Suggested change
replicas: 1
minReplicas: 1
replicas: 0
minReplicas: 0