-
Notifications
You must be signed in to change notification settings - Fork 7k
Add documentation for ArgoCD #59085
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Add documentation for ArgoCD #59085
Conversation
|
@fscnick @Future-Outlier - new PR because the old one had a messed up commit history. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This PR adds valuable documentation for deploying Ray on Kubernetes with ArgoCD. The guide is comprehensive and covers important aspects like handling autoscaling with ignoreDifferences. I've found a few critical issues related to non-existent versions being used in the examples, which would prevent users from successfully following the guide. I've also pointed out some minor inconsistencies and best practices to improve the documentation further. Overall, this is a great addition!
| namespace: ray-cluster | ||
| source: | ||
| repoURL: https://github.com/ray-project/kuberay | ||
| targetRevision: v1.5.1 # update this as necessary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The targetRevision is set to v1.5.1, but this version does not seem to exist in the ray-project/kuberay repository. The latest stable version is v1.1.1. Using a non-existent version will cause the deployment to fail. This comment also applies to lines 83, 384, and 405.
| targetRevision: v1.5.1 # update this as necessary | |
| targetRevision: v1.1.1 # update this as necessary |
| source: | ||
| repoURL: https://ray-project.github.io/kuberay-helm/ | ||
| chart: ray-cluster | ||
| targetRevision: "1.5.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| source: | ||
| repoURL: https://ray-project.github.io/kuberay-helm/ | ||
| chart: ray-cluster | ||
| targetRevision: "1.4.1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * (Optional)[ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster. | ||
| * (Optional)[ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management. It might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The markdown syntax for these optional prerequisites is slightly incorrect. The (Optional) text should be outside the link definition for better rendering and clarity.
| * (Optional)[ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster. | |
| * (Optional)[ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management. It might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment). | |
| * (Optional) [ArgoCD installed](https://argo-cd.readthedocs.io/en/stable/getting_started/) on your Kubernetes cluster. | |
| * (Optional) [ArgoCD CLI](https://argo-cd.readthedocs.io/en/stable/cli_installation/) installed on your local machine (recommended for easier application management. It might need [port-forwarding and login](https://argo-cd.readthedocs.io/en/stable/getting_started/#port-forwarding) depending on your environment). |
| valuesObject: | ||
| image: | ||
| repository: docker.io/rayproject/ray | ||
| tag: latest |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the latest tag for Docker images is generally discouraged in documentation and production environments as it can lead to unpredictable behavior when the image is updated. Pinning to a specific version ensures reproducibility. This comment applies to all occurrences of tag: latest in this file (lines 193, 208, 449, 477, 492). Based on the KubeRay version, a compatible Ray version would be 2.9.3.
| tag: latest | |
| tag: "2.9.3" |
| replicas: 1 | ||
| minReplicas: 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the ray-argocd-all.yaml example, additional-worker-group1 is configured with replicas: 1 and minReplicas: 1. However, in the separate raycluster.yaml example (lines 196-197), it's configured with replicas: 0 and minReplicas: 0. This inconsistency might be confusing. For clarity, it would be best to keep the examples consistent.
| replicas: 1 | |
| minReplicas: 1 | |
| replicas: 0 | |
| minReplicas: 0 |
This PR is just a repeat of #58340 because the commit history was quite messed up. Comments were all resolved as off the opening of this PR I believe!
Description
I was investigating odd behaviour where requesting exact number of workers via the python sdk was not behaving as expected. I initially raised an issue here: #55736. I was then pointed tothis: ray-project/kuberay#3794. However, even after the fix, I was not observing any different behaviour.
Then I thought to try and have ArgoCD ignore the replicas field, and then, everything started working as expected.
I thought it be best to convey this in an example, and I could not find any documentation on how to deploy using ArgoCD (which also has a couple of lines that one needs to be aware about). IIRC I pieced it together based on some github issues and debugging.
The important point is that when managing Ray via ArgoCD with the Autoscaler enabled, the
ignoreDifferencesmust be managed properly to get the expected behaviour of the Autoscaler.I would have attached screenshots, but from a PR review perspective, this doesn't prove anything. Essentially what I did was:
ignoreDifferencessection, request X number of workers viaray.autoscaler.sdk.request_resources, kept changing it. When increasing X, it worked as expected and quite speedily. When reducing X, it takes ~10 mins (based on my idle setting in the ArgoCD app) then workers start spinning down. Eventually requesting 1 worker and it goes back to 1.ignoreDifferencessection, request X number of workers. Then, requesting more than X, nothing happens. Request X=1, nothing happens. Essentially its like theray.autoscaler.sdk.request_resourcesdoes nothing. It does print out some logs (showing that it is trying to do something), but when looking at the number of pods/workers, nothing happens. Delete the RayCluster, start back at original state, request Y, sometimes get Y, sometimes do not get Y workers. Essentially, it's not expected behaviour, looks very random.