@yuhezhang-ai
Contributor

Description

Changed Makefile to use uv run python instead of system python3, ensuring the compiled extension matches the uv Python environment.

Also added -undefined dynamic_lookup linking flag for macOS to fix 'Undefined symbols' errors during compilation.
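A minimal sketch of the two ideas in this description, written in Python rather than Make so it can be run directly. It is not the project's Makefile: the helper names `extension_link_flags` and `interpreter_build_info` are hypothetical, and the only assumption taken from the description is that macOS needs `-undefined dynamic_lookup` while other platforms do not.

```python
import sys
import sysconfig


def extension_link_flags(platform: str = sys.platform) -> list:
    """Return the extra linker flags needed for a Python extension module.

    Hypothetical helper: only the macOS case is taken from the PR description.
    """
    if platform == "darwin":
        # The macOS linker rejects unresolved symbols by default. This flag
        # defers resolution of the Python C-API symbols until the module is
        # loaded by an interpreter.
        return ["-undefined", "dynamic_lookup"]
    return []


def interpreter_build_info() -> dict:
    """Report which interpreter a build launched from this process targets."""
    return {
        "executable": sys.executable,
        "include_dir": sysconfig.get_paths()["include"],
        "ext_suffix": sysconfig.get_config_var("EXT_SUFFIX"),
    }


if __name__ == "__main__":
    print(extension_link_flags())
    print(interpreter_build_info())
```

Running this script once with `python3` and once with `uv run python` should show whether the two interpreters differ in executable path, include directory, and extension suffix, which is the mismatch the Makefile change is meant to avoid.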

Testing

Verified with system Python 3.11 and uv Python 3.12 - the compiled .so file now correctly uses the uv Python version (3.12).
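One way to repeat this kind of check is to compare the compiled file's name against the extension suffix of the interpreter in use. The sketch below assumes the build names its output with the interpreter's `EXT_SUFFIX` (for example `.cpython-312-darwin.so`); the function name `built_for_this_interpreter` and the module name `helpers` are hypothetical.

```python
import sysconfig
from pathlib import Path


def built_for_this_interpreter(so_path: str, ext_suffix: str = "") -> bool:
    """Check whether a compiled extension's filename matches an ABI suffix.

    If ext_suffix is empty, the running interpreter's EXT_SUFFIX is used.
    """
    suffix = ext_suffix or sysconfig.get_config_var("EXT_SUFFIX") or ""
    if not suffix:
        # Without a known suffix there is nothing to compare against.
        return False
    return Path(so_path).name.endswith(suffix)


if __name__ == "__main__":
    # Hypothetical file name; substitute the path of the built extension.
    print(built_for_this_interpreter("helpers.cpython-312-darwin.so"))
```

Invoked with `uv run python`, the default suffix comes from the uv-managed interpreter, so a file built against the system Python 3.11 would be reported as a mismatch.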

Fixes #438


First-time contributor here. I'm a research engineer transitioning from edge models to LLM infrastructure and algorithms. Happy to help with more tasks in the future. Thanks for reviewing!

Changed Makefile to use 'uv run python' instead of system 'python3',
ensuring the compiled extension matches the uv Python environment.

Also added '-undefined dynamic_lookup' linking flag for macOS to fix
'Undefined symbols' errors during compilation.

Fixes NVIDIA-NeMo#438

Signed-off-by: Yuhe Zhang <yuhe@polarr.co>
@copy-pr-bot

copy-pr-bot bot commented Nov 16, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@adil-a
Collaborator

adil-a commented Nov 16, 2025

Thank you so much @yuhezhang-ai ! We really appreciate community contributions :)

@nvidia-nemo/automation @thomasdhc can you help verify the changes?

@akoumpa
Contributor

akoumpa commented Nov 17, 2025

/ok to test 8d95c2c

@yuhezhang-ai
Contributor Author

Hi, I updated the branch to include the Makefile package-data fix from main.

About the previous CI failures:

Looking at the logs, the failures were due to:

RuntimeError: PyTorch has CUDA Version=12.9 and torchvision has CUDA Version=13.0

This appears to be a dependency resolution issue in the CI environment, unrelated to the Makefile changes (the compilation test itself passed).
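For triaging logs like this, a small parser can pull both CUDA versions out of the error line and show where they differ. This is only a sketch built around the message text quoted above; the function name `parse_cuda_mismatch` is hypothetical and the pattern assumes the wording stays exactly as shown.

```python
import re

# Matches the two "CUDA Version=<major>.<minor>" fields in the quoted error.
_MISMATCH = re.compile(
    r"PyTorch has CUDA Version=(\d+)\.(\d+) "
    r"and torchvision has CUDA Version=(\d+)\.(\d+)"
)


def parse_cuda_mismatch(log_line: str):
    """Return ((torch_major, torch_minor), (tv_major, tv_minor)) or None."""
    match = _MISMATCH.search(log_line)
    if match is None:
        return None
    t_major, t_minor, tv_major, tv_minor = (int(g) for g in match.groups())
    return (t_major, t_minor), (tv_major, tv_minor)


if __name__ == "__main__":
    line = (
        "RuntimeError: PyTorch has CUDA Version=12.9 "
        "and torchvision has CUDA Version=13.0"
    )
    print(parse_cuda_mismatch(line))
```

For the line from these logs, the parsed major versions are 12 and 13, which suggests the two packages were resolved from builds targeting different CUDA toolkits.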

Should I wait to see whether this persists after the package-data fix, or would you like me to investigate a torchvision version constraint, such as adding a torchvision version pin to pyproject.toml?

Thanks!

@thomasdhc
Contributor

Hey @yuhezhang-ai, thanks for this update. The uv run python call is actually re-installing torch when it should not, and that is what causes this error. The root cause is that part of our testing setup incorrectly mounts another copy of Automodel. I'll need to make changes to the overall test workflow. When that PR is done, I will apply those changes to this PR.

No further action needs to be taken from your side.

Thanks!

@yuhezhang-ai
Contributor Author

Thanks for clarifying! I appreciate you taking the time to explain the root cause.

I'm interested in contributing more to the project as I learn about LLM infrastructure. Are there other issues that might be suitable for me to work on?

My Background:

  • Computer vision research engineer with algorithm experience (and actively learning LLM/VLM)
  • Some Triton kernel knowledge, but limited distributed training experience
  • No GPU cluster access, but can test single-GPU scenarios via Colab

I can probably help with algorithms, code quality, bug fixes, and kernel optimization: tasks that can be developed and verified on a single GPU.

For example, I noticed that #780 (sequence classification metrics bug) seems suitable for me. It's about correctness and can be tested on Colab, though it already has an assignee.

Happy to help with whatever you think would be suitable! 🙏

@adil-a
Collaborator

adil-a commented Nov 19, 2025

Hey @yuhezhang-ai, thank you so much for your enthusiasm! It'd be great to have more hands on board :) We usually file open issues on the GitHub Issues tab, so feel free to pick up anything that interests you. #780 might be a good and easy one to pick up.

@akoumpa
Contributor

akoumpa commented Nov 19, 2025

/ok to test ecde148
