Thanks for your questions! You can flexibly adjust the inference setup based on your computational resources without affecting results.
If you only want to test a subset, you can filter the query file directly; for a small-scale experiment, we recommend sampling queries randomly rather than taking the first 100.
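A minimal sketch of the random-subsetting step described above, assuming the query file is JSONL (one query object per line); the file paths, `k`, and the `sample_queries` helper name are placeholders, not part of the repo:

```python
import json
import random

def sample_queries(in_path, out_path, k=100, seed=0):
    """Write a random subset of k queries from a JSONL query file.

    A fixed seed keeps the sampled subset reproducible across runs,
    which matters when comparing evaluation results.
    """
    with open(in_path, encoding="utf-8") as f:
        queries = [json.loads(line) for line in f if line.strip()]
    rng = random.Random(seed)
    subset = rng.sample(queries, min(k, len(queries)))
    with open(out_path, "w", encoding="utf-8") as f:
        for q in subset:
            f.write(json.dumps(q, ensure_ascii=False) + "\n")
    return subset
```

The filtered file can then be passed to the evaluation script in place of the full query set.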
In the repo, we provide two evaluation approaches: using a trained critic model, or using external SOTA models as judges. To stay consistent with the latest leaderboard, we recommend using its current judge setup (currently Claude-4.5).
Originally posted by @A-Quarter-Mile in #19
Is it possible to modify the code to support vLLM for acceleration during evaluation (when using the critic model trained in the paper)?