yuhezhang-ai (Contributor) commented Dec 4, 2025

What does this PR do ?

This PR updates the biencoder training recipe to use all available positive documents for a given query. Previously, only the first positive document was used. Now, the training loop cycles through the list of positive documents across epochs using a modulo operation (epoch 0 uses doc 0, epoch 1 uses doc 1, and so on, wrapping around once the list is exhausted).
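
In pseudocode, the selection is just a modulo index into the positives list (a minimal sketch with illustrative names, not the exact code in this PR):

```python
def select_positive(positive_docs: list, epoch: int):
    # Epoch 0 -> doc 0, epoch 1 -> doc 1, ..., wrapping around via modulo.
    return positive_docs[epoch % len(positive_docs)]

# Example: a query with three positives cycles doc_a, doc_b, doc_c, doc_a, ...
docs = ["doc_a", "doc_b", "doc_c"]
assert [select_positive(docs, e) for e in range(4)] == ["doc_a", "doc_b", "doc_c", "doc_a"]
```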

Changelog

  • Modified retrieval_dataset.py so the transform function accepts an epoch argument.
  • Added an update_dataset_epoch helper that updates the dataset transform with the current epoch.
  • Updated train_biencoder.py to call update_dataset_epoch at the start of each epoch (see the sketch after this list).
  • Added unit tests to verify epoch-based positive document selection.
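
Roughly, the pieces fit together like this (a simplified sketch, not the exact code in this PR: the transform signature, the example schema, and the rebinding mechanics are assumptions):

```python
from functools import partial

def transform(example: dict, epoch: int = 0) -> dict:
    # Pick one positive per query for this epoch (assumed example schema).
    positives = example["positive_docs"]
    example["positive"] = positives[epoch % len(positives)]
    return example

def update_dataset_epoch(dataset, epoch: int) -> None:
    # Re-bind the dataset's transform with the current epoch; the dataset
    # structure itself is unchanged, only the transform becomes epoch-aware.
    dataset.transform = partial(transform, epoch=epoch)

# In the training loop (train_biencoder.py), once per epoch:
#     for epoch in range(num_epochs):
#         update_dataset_epoch(train_dataset, epoch)
#         ...train for one epoch...
```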

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

Additional Information

…s for a query by cycling through them based on the current epoch, instead of always using the first one

Signed-off-by: Yuhe Zhang <yuhe@polarr.co>

copy-pr-bot (bot) commented Dec 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuhezhang-ai changed the title to "feat: Enable cycling through all positive documents in biencoder training" on Dec 5, 2025

akoumpa (Contributor) commented Dec 10, 2025

/ok to test c4f83a9


copy-pr-bot (bot) commented Dec 10, 2025

/ok to test c4f83a9

@akoumpa, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/


akoumpa (Contributor) commented Dec 10, 2025

/ok to test f57eaa4

yuhezhang-ai (Contributor, Author) commented:

Hi @akoumpa, thanks for triggering the CI/CD tests on this PR.

I just noticed there is a newer draft PR (#937) that implements the same multi-positive cycling behavior for biencoder training, but with a different approach:

  • My PR: passes the epoch into the dataset transform function and updates the transform at the start of each epoch via update_dataset_epoch. This keeps the existing dataset structure and only makes the transform epoch-aware.

  • feat: [WIP] Train biencoder with multi pos docs #937: introduces a dataset wrapper that stores an internal _current_epoch field and exposes a set_epoch() API, which changes how the dataset itself is structured (sketched below).

I wanted to mention this in case it helps decide which approach fits Automodel better. I’m happy to update this PR or align with whichever direction the team prefers.
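
For reference, here is a minimal sketch of how I read the wrapper approach in #937 (only _current_epoch and set_epoch() come from that PR; the class name, example schema, and everything else are my own placeholders):

```python
class MultiPositiveDataset:
    """Wraps a retrieval dataset and tracks the epoch internally
    (sketch of the #937 approach; the real wrapper may differ)."""

    def __init__(self, base_dataset):
        self.base_dataset = base_dataset
        self._current_epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # The training loop calls this once per epoch, instead of
        # rebinding a transform as in this PR.
        self._current_epoch = epoch

    def __len__(self) -> int:
        return len(self.base_dataset)

    def __getitem__(self, idx: int) -> dict:
        example = self.base_dataset[idx]
        positives = example["positive_docs"]
        example["positive"] = positives[self._current_epoch % len(positives)]
        return example
```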
