Add ObjectCode ptx constructor #470

Merged: 10 commits into NVIDIA:main on Mar 1, 2025

Conversation

brandon-b-miller (Contributor)

Adds the ability to link multiple PTX files with one linker instance.
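
A minimal sketch of the usage this enables, assuming the cuda.core.experimental names discussed in this thread; the exact ObjectCode constructor form and the LinkerOptions fields shown are assumptions, not the verbatim API:

    # Sketch only: ObjectCode(..., "ptx") and LinkerOptions(arch=...) are
    # assumed forms of the cuda.core.experimental API, not verbatim.
    from cuda.core.experimental import Linker, LinkerOptions, ObjectCode

    with open("a.ptx", "rb") as f:
        ptx_a = f.read()
    with open("b.ptx", "rb") as f:
        ptx_b = f.read()

    # Wrap each PTX blob in an ObjectCode, then link them all with a
    # single Linker instance into one cubin.
    linker = Linker(
        ObjectCode(ptx_a, "ptx"),
        ObjectCode(ptx_b, "ptx"),
        options=LinkerOptions(arch="sm_80"),
    )
    cubin = linker.link("cubin")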

@leofang added the P0 (High priority - Must do!), feature (New feature or request), and cuda.core (Everything related to the cuda.core module) labels on Feb 25, 2025
@leofang added this to the cuda.core beta 3 milestone on Feb 25, 2025

@ksimpson-work (Contributor) left a comment

I think the title of the PR should be changed to something like "Add ObjectCode constructors". Linking multiple PTX object codes is already supported.

I personally don't think we should allow people to construct PTX object codes, but if we are going to do so, I agree with Leo that we need to be very explicit in the constructor's documentation that it is not the main path and should only be used when necessary.

Thanks for contributing!

ksimpson-work previously approved these changes on Feb 26, 2025

@rwgk (Collaborator) left a comment

Drive-by comments; very minor; I was mostly curious.

@brandon-b-miller changed the title from "Support linking multiple ptx files" to "Add ObjectCode ptx constructor" on Feb 27, 2025
@brandon-b-miller (Contributor, Author)

On CUDA 12, in a configuration where the driver is behind the runtime, I'm encountering a minor version compatibility (MVC) error at test_object_code_load_ptx.

cuda.core.experimental._utils.CUDAError: CUDA_ERROR_UNSUPPORTED_PTX_VERSION: the provided PTX was compiled with an unsupported toolchain.

I'm guessing this has something to do with not invoking nvJitLink where we otherwise should be; I'm investigating.

@ksimpson-work (Contributor)

On CUDA 12, in a configuration where the driver is behind the runtime, I'm encountering a minor version compatibility (MVC) error at test_object_code_load_ptx.

cuda.core.experimental._utils.CUDAError: CUDA_ERROR_UNSUPPORTED_PTX_VERSION: the provided PTX was compiled with an unsupported toolchain.

I'm guessing this has something to do with not invoking nvJitLink where we otherwise should be; I'm investigating.

As far as we know, this is expected behaviour. This is an error raised by the driver. We raise a warning from Program.compile() if you compile PTX in an environment where the driver is behind the runtime. Did you get this warning?

@leofang (Member) commented on Feb 27, 2025

/ok to test

leofang previously approved these changes on Feb 27, 2025

@leofang (Member) commented on Feb 27, 2025

As far as we know, this is expected behaviour. This is an error raised by the driver. We raise a warning from Program.compile() if you compile PTX in an environment where the driver is behind the runtime. Did you get this warning?

Keenan is right and this error is reproduced in the CI (with CTK 12.8 & driver 12.4). @brandon-b-miller we should catch the warning

RuntimeWarning: The CUDA driver version is older than the backend version. The generated ptx will not be loadable by the current driver.

as noted by Keenan, and skip the test_object_code_load_ptx test when a warning is raised. In this case nvJitLink is not involved; in fact, if it were, then we'd get a valid PTX to load.
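
One hypothetical way to do that conditional skip (a sketch only; the helper name and scaffolding are illustrative, not the actual test-suite code, and the message substring comes from the warning quoted above):

    # Hypothetical helper, not the actual test code: compile to PTX and
    # skip the current test if the driver-older-than-backend warning fires.
    import warnings

    import pytest

    def compile_ptx_or_skip(program):
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            ptx = program.compile("ptx")
        if any("older than the backend version" in str(w.message) for w in caught):
            pytest.skip("driver is older than the toolkit; the PTX is not loadable")
        return ptx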

@brandon-b-miller (Contributor, Author)

What do you think of the changes in 9276d11? It seems we dodge the warning when constructing an ObjectCode directly, since we avoid the Program.compile code path.

@leofang (Member) commented on Feb 28, 2025

/ok to test

@kkraus14 (Collaborator)

Should we explore integrating and providing an interface to nvptxcompiler as opposed to using the driver APIs?

@leofang (Member) commented on Feb 28, 2025

It seems we dodge the warning when constructing an ObjectCode directly, since we avoid the Program.compile code path.

Right, the warning is raised in the fixture, which is passed as a test function argument. Off the top of my head I don't know if there's a way to capture such warnings and do a conditional skip. This change looks fine to me.

@leofang (Member) commented on Feb 28, 2025

Should we explore integrating and providing an interface to nvptxcompiler as opposed to using the driver APIs?

@kkraus14 we discussed that before and decided to avoid it because nvptxcompiler does not provide a shared library, only a static one, which would cause huge binary-size bloat and a deployment challenge as painful as pynvjitlink's, for a niche use case. nvJitLink can link multiple PTXs together into a cubin, which is what's already enabled in Linker and tested in this PR. From what I can tell this should meet all of numba-cuda's needs.

@kkraus14 (Collaborator)

Should we explore integrating and providing an interface to nvptxcompiler as opposed to using the driver APIs?

nvJitLink can link multiple PTXs together into a cubin, which is what's already enabled in Linker and tested in this PR. From what I can tell this should meet all of numba-cuda's needs.

IIRC the motivating use case for nvptxcompiler is that it solves the problem of using a newer CTK with an older driver with regard to PTX versioning. I believe it would address #470 (comment), for example.

@kkraus14 we discussed that before and decided to avoid it because nvptxcompiler does not provide a shared library, only a static one, which would cause huge binary-size bloat and a deployment challenge as painful as pynvjitlink's, for a niche use case.

I would argue this is becoming less and less of a niche use case as we package the CTK better and better for Python users. Regarding bloat, in the fullness of time we should probably create a Pythonic interface/bindings for it and package it as a separate library that could optionally be used by cuda.core, or something similar.

@leofang (Member) commented on Feb 28, 2025

IIRC the motivating use case for nvptxcompiler is that it solves the problem of using a newer CTK with an older driver with regard to PTX versioning. I believe it would address #470 (comment), for example.

I think that example is really a corner case that we want to discourage. If one has multiple PTX and/or LTO-IR to link together, and we want the final output to be loadable by the driver (not just for humans to inspect/debug the output), then the output should be a CUBIN, not PTX. That example does not follow the recommended practice per the CUDA Compatibility doc.

nvJitLink is designed for exactly this use case. An old driver cannot load a PTX generated by a newer toolchain, but it has no problem loading CUBIN generated by a newer toolchain, and nvJitLink can generate a CUBIN as the final output.

In fact, cuda.core uses nvJitLink for multiple purposes due to this exceptional capability:

  • JIT-compiling a single PTX to CUBIN (this happens in Program)
  • Linking either multiple PTX to CUBIN or multiple LTO-IR to CUBIN/PTX (this happens in Linker)

This allows us to have a full solution for CUDA minor version compatibility.
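
For instance (a sketch assuming current cuda.core.experimental names; the ProgramOptions fields and the kernel are illustrative), requesting a cubin instead of PTX keeps the output loadable by an older driver:

    # Sketch only: ProgramOptions(arch=...) and the kernel are illustrative.
    from cuda.core.experimental import Program, ProgramOptions

    source = r"""
    extern "C" __global__ void scale(float *x, float a) {
        x[threadIdx.x] *= a;
    }
    """

    prog = Program(source, code_type="c++", options=ProgramOptions(arch="sm_80"))
    # Compiling to "cubin" goes through nvJitLink, so the result loads even
    # when the driver is older than the toolkit that produced the PTX.
    module = prog.compile("cubin")
    kernel = module.get_kernel("scale")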

Regarding bloat, in the fullness of time we should probably create a Pythonic interface/bindings for it and package it as a separate library that could optionally be used by cuda.core, or something similar.

At some point I would like to cover bindings for all CTK components (except for the math libs, which now live in nvmath.bindings), so it sounds like a reasonable ask, and I created #478 to track it. However, I really do not see a use case for it now, given that nvJitLink solves all the problems and is much friendlier to maintain, package, and deploy.

@kkraus14 (Collaborator)

nvJitLink is designed for exactly this use case. An old driver cannot load a PTX generated by a newer toolchain, but it has no problem loading CUBIN generated by a newer toolchain, and nvJitLink can generate a CUBIN as the final output.

Based on the documentation, it looks like nvJitLink uses nvptxcompiler under the hood (you need to include nvptxcompiler when statically linking nvJitLink), so this makes sense. I wasn't aware of this capability. Thanks!

@leofang (Member) commented on Mar 1, 2025

Yeah I feel we should document some of these discussions. Perhaps as a Q&A or tips & tricks emphasizing CUDA minor version compat support.

@leofang merged commit 3d413ed into NVIDIA:main on Mar 1, 2025
74 checks passed
@leofang (Member) commented on Mar 1, 2025

Thanks @brandon-b-miller and @ksimpson-work for driving this PR across the finish line!

@leofang (Member) commented on Mar 3, 2025

Yeah I feel we should document some of these discussions.

See #480.
