Skip to content

CI: Add Windows GPU runner for tests #444

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 29 commits into from
Feb 20, 2025
Merged

CI: Add Windows GPU runner for tests #444

merged 29 commits into from
Feb 20, 2025

Conversation

leofang
Copy link
Member

@leofang leofang commented Feb 11, 2025

Part of #277.

Based on offline discussion with our admin @aflat and @aterrel @kkraus14, as a stop-gap solution until we get our own runners we are setting up a GH-hosted runners to cover Windows. Because the rate is high, we pick only 1 Python version x 2 latest CUDA major versions (11.8 & 12.8) x 2 different ways of getting CUDA to keep the cost within $100 a month.

In terms of the implementation, the new workflow is largely the same as its Linux counterpart, except for the following changes since the VM image of choice (Windows 11 23H2) has nothing pre-installed:

  • Windows has its own workflow file because using container or not cannot be matrix-parametrized (CI: Add Windows GPU runner for tests #444 (comment))
  • Bash shell commands are migrated to Powershell using Perplexity + some manual changes (due to lack of Git for Windows)
    • The collateral derange is our own Bash-based fetch_ctk action cannot be used; instead, we switch to use Jimver/cuda-toolkit which has been approved by the admins
  • CUDA driver is installed as part of the test CI (most time consuming part - 3m20s!)
    • Since the runner has a T4 GPU, the default mode is TCC
  • Some of the job steps came from our prior work (Add Github actions CI for tests cupy-ci-poc/cupy#1)
@leofang leofang added P0 High priority - Must do! CI/CD CI/CD infrastructure labels Feb 11, 2025
@leofang leofang added this to the cuda.core beta 3 milestone Feb 11, 2025
@leofang leofang self-assigned this Feb 11, 2025
Copy link
Contributor

copy-pr-bot bot commented Feb 11, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang
Copy link
Member Author

leofang commented Feb 11, 2025

/ok to test

This comment has been minimized.

@leofang
Copy link
Member Author

leofang commented Feb 11, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 11, 2025

@cryos @rwgk do you guys have clever ways about refactoring a workflow file? On the Windows runner, we cannot launch any container so this part must be removed:

# Our self-hosted runners require a container
# TODO: use a different (nvidia?) container
container:
options: -u root --security-opt seccomp=unconfined --shm-size 16g
image: ubuntu:22.04
env:
NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}

but otherwise the remaining logic should be the same and preserved... Are there smarter ways than duplicating the file for Windows?

@cryos
Copy link
Collaborator

cryos commented Feb 11, 2025

@cryos @rwgk do you guys have clever ways about refactoring a workflow file? On the Windows runner, we cannot launch any container so this part must be removed:

# Our self-hosted runners require a container
# TODO: use a different (nvidia?) container
container:
options: -u root --security-opt seccomp=unconfined --shm-size 16g
image: ubuntu:22.04
env:
NVIDIA_VISIBLE_DEVICES: ${{ env.NVIDIA_VISIBLE_DEVICES }}

but otherwise the remaining logic should be the same and preserved... Are there smarter ways than duplicating the file for Windows?

I feel like there should be, searching turned up the same disappointing you cannot make container conditional on the matrix and you have to split the workflow which I agree is disappointing.

@leofang
Copy link
Member Author

leofang commented Feb 11, 2025

/ok to test

1 similar comment
@leofang
Copy link
Member Author

leofang commented Feb 11, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 11, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 11, 2025

(spoke to George, the underlying image is changed now, let me restart the tests)

@leofang leofang closed this Feb 11, 2025
@leofang leofang reopened this Feb 11, 2025
@leofang
Copy link
Member Author

leofang commented Feb 11, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 11, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 12, 2025

Looks like we hit https://github.com/orgs/community/discussions/138813... which is weird... I haven't seen this error in the previous project...

@leofang
Copy link
Member Author

leofang commented Feb 14, 2025

Unfortunately there is no way to route this issue to the right GitHub support person. Closing this for now.

@leofang leofang closed this Feb 14, 2025
@leofang
Copy link
Member Author

leofang commented Feb 19, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 19, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 19, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 19, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 19, 2025

/ok to test

@leofang
Copy link
Member Author

leofang commented Feb 19, 2025

/ok to test

@leofang leofang marked this pull request as ready for review February 19, 2025 22:53
@leofang leofang requested review from cryos, tpn and vzhurba01 February 19, 2025 22:53
@leofang
Copy link
Member Author

leofang commented Feb 19, 2025

This is ready for review! The PR description is updated.

@leofang
Copy link
Member Author

leofang commented Feb 19, 2025

The doc build failure

AttributeError: module 'sphinx.util.typing' has no attribute 'ForwardRef'

is due to Sphinx 8.2.0, let's fix it in a separate PR:

@leofang leofang changed the title CI: Add windows runner Feb 19, 2025
Copy link
Collaborator

@cryos cryos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool, looks good to me.

@leofang leofang merged commit 1e1148b into NVIDIA:main Feb 20, 2025
71 of 72 checks passed
@leofang leofang deleted the win_ci branch February 20, 2025 01:19
Copy link

Doc Preview CI
Preview removed because the pull request was closed or merged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD CI/CD infrastructure P0 High priority - Must do!
4 participants