
Improve perf of accessing dev.compute_capability #459


Merged: leofang merged 3 commits into NVIDIA:main from cache_cc on Feb 25, 2025

Conversation

leofang
Member

@leofang commented on Feb 21, 2025

Part of #439.

Before this PR:

In [4]: %timeit dev.compute_capability
1.87 μs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

With this PR:

In [7]: %timeit dev.compute_capability
97.2 ns ± 0.087 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

which I consider good enough for a pure Python implementation. For comparison, the CuPy counterpart (which is Cython-based and returns a string instead of a namedtuple):

In [12]: %timeit dev.compute_capability
41.6 ns ± 0.0415 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Note that improving the perf of retrieving a Device() instance is out of scope for this PR and pending investigation (#439 (comment)).

As part of this PR, I also removed a silly lock in the Device constructor. The data being protected is already placed in thread-local storage, so it makes no sense to add another lock.
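
For readers skimming this later, here is a minimal sketch of the two changes described above, using hypothetical helper names rather than the actual cuda.core internals (the real code may differ in the details):

import threading
from collections import namedtuple

ComputeCapability = namedtuple("ComputeCapability", ("major", "minor"))

_tls = threading.local()  # per-thread storage; no extra lock needed around it


def _query_compute_capability(device_id):
    # Stand-in for the real driver queries (cuDeviceGetAttribute), which are
    # the expensive part that the property caches.
    return ComputeCapability(9, 0)


class Device:
    def __new__(cls, device_id=0):
        # Lazily build a per-thread cache of Device instances. threading.local
        # already isolates _tls.devices per thread, so a lock here is redundant.
        if not hasattr(_tls, "devices"):
            _tls.devices = {}
        if device_id not in _tls.devices:
            dev = super().__new__(cls)
            dev._id = device_id
            dev._cc = None
            _tls.devices[device_id] = dev
        return _tls.devices[device_id]

    @property
    def compute_capability(self):
        # Query once, then serve the cached namedtuple; repeated access is a
        # plain attribute lookup, which is where the speedup comes from.
        if self._cc is None:
            self._cc = _query_compute_capability(self._id)
        return self._cc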

@leofang added labels on Feb 21, 2025: enhancement (Any code-related improvements), P0 (High priority - Must do!), cuda.core (Everything related to the cuda.core module)
@leofang added this to the cuda.core beta 3 milestone on Feb 21, 2025
@leofang requested a review from shwina on February 21, 2025 19:44
@leofang self-assigned this on Feb 21, 2025
Contributor

copy-pr-bot bot commented Feb 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@leofang
Member Author

leofang commented Feb 21, 2025

/ok to test


@leofang merged commit 440eabd into NVIDIA:main on Feb 25, 2025
73 checks passed
@leofang deleted the cache_cc branch on February 25, 2025 17:34
@rwgk
Collaborator

rwgk commented Mar 14, 2025

I looked at this while working on the release notes. (I had looked at the code a couple of days ago while working on #439, but didn't realize then that the code is this new.)

From the PR description:

As part of this PR, I also removed a silly lock in the Device constructor. The data being protected is already placed in thread-local storage, so it makes no sense to add another lock.

I'm almost certain that there can be a race now:

https://chatgpt.com/share/67d4bc2f-6c7c-8008-9d86-425fe77a3ed9

@leofang
Member Author

leofang commented Mar 15, 2025

I don't get it. This is thread-local storage, not a normal Python object; why do we need any lock?

@rwgk
Collaborator

rwgk commented Mar 15, 2025 via email

@leofang
Member Author

leofang commented Mar 15, 2025

Let's not rush into a fix; let's check the CPython threading.local internals first. I believe the implementation already has a lock, so Python-level access is guaranteed to be thread safe.
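
For what it's worth, a standalone check (plain Python, nothing cuda.core specific) is consistent with that: many threads can read and write their own attributes on the same threading.local object concurrently, without any user-level lock, and never observe each other's values:

import threading

_tls = threading.local()
interference = []  # records any case where a thread sees another thread's value


def worker(tid, iterations=100_000):
    _tls.value = tid
    for _ in range(iterations):
        if _tls.value != tid:      # would only trigger if writes leaked across threads
            interference.append(tid)
        _tls.value = tid           # keep writing to this thread's own slot


threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("cross-thread interference observed:", bool(interference))  # expected: False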

@rwgk
Collaborator

rwgk commented Mar 15, 2025

I only have a minute right now, hoping for a piece of information that will help me when I have a block of time later:

  • I have to read up on threading.local.
  • But if I'm guessing correctly, a difference before/after this PR is that _tls.devices is now recomputed for each thread, while it was computed only once before. Does that sound right?
  • Assuming my guess is correct, is that what you wanted, or just something you accepted?
@leofang
Member Author

leofang commented Mar 15, 2025

Yes, each thread has its own _tls; both by definition (thread-local storage) and by the CPython implementation, accessing _tls will not race. _tls should not be treated like a normal Python object.

@rwgk
Collaborator

rwgk commented Mar 15, 2025

See #520: no race, but it recomputes _tls.devices for each new thread.

@leofang
Member Author

leofang commented Mar 15, 2025

Right, this is the consequence of storing data in thread-local storage. It was already the case before this PR, and this is why having a lock makes no sense. Each thread always has its own copy of _tls.devices.
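
A tiny plain-Python illustration of that trade-off (again, nothing cuda.core specific): the initialization runs once per thread that touches the attribute, and never races.

import threading

_tls = threading.local()
init_count = 0  # how many times the per-thread "devices" cache gets (re)built


def get_devices():
    global init_count
    if not hasattr(_tls, "devices"):
        init_count += 1                  # stand-in for the real initialization cost
        _tls.devices = ["dev0", "dev1"]  # hypothetical per-thread cache
    return _tls.devices


get_devices()   # main thread: builds its copy
get_devices()   # main thread again: cache hit, nothing rebuilt

t = threading.Thread(target=get_devices)  # a new thread builds its own copy
t.start()
t.join()

print(init_count)  # 2: once for the main thread, once for the new thread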

@rwgk
Collaborator

rwgk commented Mar 15, 2025

Do we want that behavior? Or would it be better if devices were computed only once per process?

@leofang
Member Author

leofang commented Mar 15, 2025

Yes. For example, cudaGetDevice/cudaSetDevice already operate on per-thread state:
https://developer.nvidia.com/blog/cuda-pro-tip-always-set-current-device-avoid-multithreading-bugs/
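
A sketch of what that per-thread behavior looks like from Python, assuming the cuda-python runtime bindings (from cuda import cudart) and at least two visible GPUs; the exact import path and return-tuple shapes may vary by cuda-python version:

import threading

from cuda import cudart  # cuda-python runtime bindings (import path may vary)


def report(tag):
    err, dev = cudart.cudaGetDevice()
    print(f"{tag}: current device = {dev}")


def worker():
    # A freshly created thread starts on the default device (0),
    # regardless of what the main thread selected below.
    report("worker thread")


(err,) = cudart.cudaSetDevice(1)  # main thread switches its current device to 1
report("main thread")             # prints 1

t = threading.Thread(target=worker)
t.start()
t.join()                          # worker prints 0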

@rwgk
Collaborator

rwgk commented Mar 15, 2025

Wow, thanks! I need to read up. I completely misinterpreted what the loop is supposed to do.

@leofang
Member Author

leofang commented Mar 15, 2025

No worries! Always good to have extra pairs of eyes 🙂
