yuhezhang-ai (Contributor) commented Dec 4, 2025

What does this PR do ?

This PR updates the biencoder training recipe to use all available positive documents for a given query. Previously, only the first positive document was used. Now, the training loop cycles through the list of positive documents across epochs using a modulo operation (epoch 0 uses doc 0, epoch 1 uses doc 1, and so on, wrapping around once the list is exhausted).
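
In pseudocode, the selection is just a modulo index into the positives list (a minimal sketch with illustrative names, not the exact code in this PR):

```python
def select_positive(positive_docs: list, epoch: int):
    # Epoch 0 -> doc 0, epoch 1 -> doc 1, ..., wrapping around via modulo.
    return positive_docs[epoch % len(positive_docs)]

# Example: a query with three positives cycles doc_a, doc_b, doc_c, doc_a, ...
docs = ["doc_a", "doc_b", "doc_c"]
assert [select_positive(docs, e) for e in range(4)] == ["doc_a", "doc_b", "doc_c", "doc_a"]
```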

Changelog

  • Modified retrieval_dataset.py so the transform function accepts an epoch argument.
  • Added an update_dataset_epoch helper that updates the dataset transform with the current epoch.
  • Updated train_biencoder.py to call update_dataset_epoch at the start of each epoch (see the sketch after this list).
  • Added unit tests to verify epoch-based positive document selection.
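
Roughly, the pieces fit together like this (a simplified sketch, not the exact code in this PR: the transform signature, the example schema, and the rebinding mechanics are assumptions):

```python
from functools import partial

def transform(example: dict, epoch: int = 0) -> dict:
    # Pick one positive per query for this epoch (assumed example schema).
    positives = example["positive_docs"]
    example["positive"] = positives[epoch % len(positives)]
    return example

def update_dataset_epoch(dataset, epoch: int) -> None:
    # Re-bind the dataset's transform with the current epoch; the dataset
    # structure itself is unchanged, only the transform becomes epoch-aware.
    dataset.transform = partial(transform, epoch=epoch)

# In the training loop (train_biencoder.py), once per epoch:
#     for epoch in range(num_epochs):
#         update_dataset_epoch(train_dataset, epoch)
#         ...train for one epoch...
```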

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

Additional Information

…s for a query by cycling through them based on the current epoch, instead of always using the first one

Signed-off-by: Yuhe Zhang <yuhe@polarr.co>

copy-pr-bot (bot) commented Dec 4, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuhezhang-ai changed the title to "feat: Enable cycling through all positive documents in biencoder training" on Dec 5, 2025

akoumpa (Contributor) commented Dec 10, 2025

/ok to test c4f83a9


copy-pr-bot (bot) commented Dec 10, 2025

/ok to test c4f83a9

@akoumpa, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/


akoumpa (Contributor) commented Dec 10, 2025

/ok to test f57eaa4

yuhezhang-ai (Contributor, Author) commented:

Hi @akoumpa, thanks for triggering the CI/CD tests on this PR.

I just noticed there is a newer draft PR (#937) that implements the same multi-positive cycling behavior for biencoder training, but with a different approach:

  • My PR: passes the epoch into the dataset transform function and updates the transform at the start of each epoch via update_dataset_epoch. This keeps the existing dataset structure and only makes the transform epoch-aware.

  • feat: [WIP] Train biencoder with multi pos docs #937: introduces a dataset wrapper that stores an internal _current_epoch field and exposes a set_epoch() API, which changes how the dataset itself is structured (sketched below).

I wanted to mention this in case it helps decide which approach fits Automodel better. I’m happy to update this PR or align with whichever direction the team prefers.
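
For reference, here is a minimal sketch of how I read the wrapper approach in #937 (only _current_epoch and set_epoch() come from that PR; the class name, example schema, and everything else are my own placeholders):

```python
class MultiPositiveDataset:
    """Wraps a retrieval dataset and tracks the epoch internally
    (sketch of the #937 approach; the real wrapper may differ)."""

    def __init__(self, base_dataset):
        self.base_dataset = base_dataset
        self._current_epoch = 0

    def set_epoch(self, epoch: int) -> None:
        # The training loop calls this once per epoch, instead of
        # rebinding a transform as in this PR.
        self._current_epoch = epoch

    def __len__(self) -> int:
        return len(self.base_dataset)

    def __getitem__(self, idx: int) -> dict:
        example = self.base_dataset[idx]
        positives = example["positive_docs"]
        example["positive"] = positives[self._current_epoch % len(positives)]
        return example
```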
