Thanks for your questions! You can flexibly adjust the inference setup based on your computational resources without affecting results.
If you only want to test a subset, you can filter the query file directly; for a small-scale experiment, we recommend sampling queries randomly rather than taking the first 100.
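A minimal sketch of the random-subsetting step described above, assuming the query file is JSONL (one query object per line); the file paths, `k`, and the `sample_queries` helper name are placeholders, not part of the repo:

```python
import json
import random

def sample_queries(in_path, out_path, k=100, seed=0):
    """Write a random subset of k queries from a JSONL query file.

    A fixed seed keeps the sampled subset reproducible across runs,
    which matters when comparing evaluation results.
    """
    with open(in_path, encoding="utf-8") as f:
        queries = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)
    subset = rng.sample(queries, min(k, len(queries)))
    with open(out_path, "w", encoding="utf-8") as f:
        for q in subset:
            f.write(json.dumps(q, ensure_ascii=False) + "\n")
    return subset
```

The filtered file can then be passed to the evaluation script in place of the full query set.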
In the repo, we provide two evaluation approaches: using a trained critic model, or using external SOTA models as judges. To stay consistent with the latest leaderboard, we recommend using its current judge setup (currently Claude-4.5).
Originally posted by @A-Quarter-Mile in #19
Is it possible to modify the code to support vLLM for acceleration during evaluation (when using the critic model trained in the paper)?