
Conversation

@SheldonTsen

This PR is a repeat of #58340 because the commit history there was quite messed up. I believe all comments were resolved as of the opening of this PR!

Description

I was investigating odd behaviour where requesting an exact number of workers via the Python SDK was not behaving as expected. I initially raised an issue here: #55736. I was then pointed to this: ray-project/kuberay#3794. However, even after that fix, I was not observing any different behaviour.

Then I tried having ArgoCD ignore the replicas field, and everything started working as expected.

I thought it would be best to convey this in an example, since I could not find any documentation on how to deploy using ArgoCD (which also has a couple of lines that one needs to be aware of). IIRC I pieced it together from some GitHub issues and debugging.

The important point is that when managing Ray via ArgoCD with the autoscaler enabled, the ignoreDifferences section must be configured properly to get the expected autoscaler behaviour.
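
For reference, the ignoreDifferences stanza I'm referring to looks roughly like the sketch below. This is a minimal illustration only, not the exact manifest from the docs added in this PR; the JSON pointer path and the worker group index are assumptions that depend on how the RayCluster is laid out.

  # Minimal sketch of an ArgoCD Application excerpt: tell ArgoCD not to treat
  # autoscaler-driven changes to worker replicas as drift.
  spec:
    ignoreDifferences:
      - group: ray.io
        kind: RayCluster
        jsonPointers:
          # Assumes a single worker group at index 0; with several groups,
          # jqPathExpressions such as '.spec.workerGroupSpecs[].replicas' may be easier.
          - /spec/workerGroupSpecs/0/replicas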

I would have attached screenshots, but from a PR review perspective, this doesn't prove anything. Essentially what I did was:

  • Introduced the ignoreDifferences section, requested X workers via ray.autoscaler.sdk.request_resources, and kept changing X. When increasing X, it worked as expected and quite speedily. When reducing X, it takes ~10 minutes (based on my idle setting in the ArgoCD app) before workers start spinning down. Eventually, requesting 1 worker brings it back to 1.
  • Removed the ignoreDifferences section and requested X workers. Then, requesting more than X, nothing happens. Requesting X=1, nothing happens. Essentially, it's as if ray.autoscaler.sdk.request_resources does nothing. It does print some logs (showing that it's trying to do something), but looking at the number of pods/workers, nothing changes. Deleting the RayCluster to start back from the original state and requesting Y workers sometimes yields Y workers and sometimes doesn't. Essentially, it's not the expected behaviour and looks very random.
  • Repeated this back and forth in my environment multiple times to confirm.
  • In my case, I was testing X = 40/80/100/200.
  • For small X, like <10, it appears to work, but you'll see that pods can shut down well within the idle limit and then get spun back up again.
@SheldonTsen requested review from a team as code owners December 1, 2025 09:52
@SheldonTsen
Author

@fscnick @Future-Outlier - new PR because the old one had a messed up commit history.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR adds valuable documentation for deploying Ray on Kubernetes with ArgoCD. The guide is comprehensive and covers important aspects like handling autoscaling with ignoreDifferences. I've found a few critical issues related to non-existent versions being used in the examples, which would prevent users from successfully following the guide. I've also pointed out some minor inconsistencies and best practices to improve the documentation further. Overall, this is a great addition!

namespace: ray-cluster
source:
  repoURL: https://github.com/ray-project/kuberay
  targetRevision: v1.5.1 # update this as necessary

critical

The targetRevision is set to v1.5.1, but this version does not seem to exist in the ray-project/kuberay repository. The latest stable version is v1.1.1. Using a non-existent version will cause the deployment to fail. This comment also applies to lines 83, 384, and 405.

Suggested change
targetRevision: v1.5.1 # update this as necessary
targetRevision: v1.1.1 # update this as necessary
source:
  repoURL: https://ray-project.github.io/kuberay-helm/
  chart: ray-cluster
  targetRevision: "1.5.1"

critical

The targetRevision for the ray-cluster Helm chart is 1.5.1, but this version does not seem to exist in the kuberay-helm repository. The latest stable version is 1.1.1. Using a non-existent version will cause the deployment to fail.

Suggested change
targetRevision: "1.5.1"
targetRevision: "1.1.1"
source:
  repoURL: https://ray-project.github.io/kuberay-helm/
  chart: ray-cluster
  targetRevision: "1.4.1"

critical

The targetRevision for the ray-cluster Helm chart is 1.4.1 here, which is inconsistent with 1.5.1 used earlier. Both versions seem to not exist. Please use a consistent and existing version, for example 1.1.1.

Suggested change
targetRevision: "1.4.1"
targetRevision: "1.1.1"
Comment on lines +13 to +14
* (Optional)[ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster.
* (Optional)[ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management. It might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment).

medium

The markdown syntax for these optional prerequisites is slightly incorrect. The (Optional) text should be outside the link definition for better rendering and clarity.

Suggested change
* (Optional)[ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster.
* (Optional)[ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management. It might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment).
* (Optional) [ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster.
* (Optional) [ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management. It might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment).
valuesObject:
  image:
    repository: docker.io/rayproject/ray
    tag: latest

medium

Using the latest tag for Docker images is generally discouraged in documentation and production environments as it can lead to unpredictable behavior when the image is updated. Pinning to a specific version ensures reproducibility. This comment applies to all occurrences of tag: latest in this file (lines 193, 208, 449, 477, 492). Based on the KubeRay version, a compatible Ray version would be 2.9.3.

Suggested change
tag: latest
tag: "2.9.3"
Comment on lines +480 to +481
replicas: 1
minReplicas: 1

medium

In the ray-argocd-all.yaml example, additional-worker-group1 is configured with replicas: 1 and minReplicas: 1. However, in the separate raycluster.yaml example (lines 196-197), it's configured with replicas: 0 and minReplicas: 0. This inconsistency might be confusing. For clarity, it would be best to keep the examples consistent.

Suggested change
replicas: 1
minReplicas: 1
replicas: 0
minReplicas: 0