
Conversation

@recursix (Collaborator) commented Jul 4, 2025

Description by Korbit AI

What change is being made?

Add new agents, include Anthropic chat model support, enhance tool-use logic, and implement a study archiving mechanism in the agent framework.

Why are these changes being made?

These changes modernize agent support, expand compatibility with Anthropic chat models, enhance tool use by automatically converting tool calls to Python, and keep storage usage in check by archiving completed or ineffective experiments. Together they streamline agent performance by incorporating newer models and optimizing resource management.
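As a rough illustration of the "tool calls to Python" idea mentioned above: an LLM tool call (a name plus JSON arguments) can be rendered as an executable Python call string. This sketch is hypothetical; the function and field names are not the PR's actual API.

```python
# Hypothetical sketch: render an LLM tool call as a Python call expression.
# Names here are illustrative, not the PR's actual implementation.
import json

def tool_call_to_python(name: str, arguments: str) -> str:
    """Render a tool call (name + JSON arguments) as a Python function call."""
    kwargs = json.loads(arguments)
    rendered = ", ".join(f"{k}={v!r}" for k, v in kwargs.items())
    return f"{name}({rendered})"

# e.g. click(bid='a51', button='left')
print(tool_call_to_python("click", '{"bid": "a51", "button": "left"}'))
```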

Is this description stale? Ask me to generate a new description by commenting /korbit-generate-pr-description

recursix added 3 commits June 30, 2025 19:20
* on-the-fly error report
* left-right keys to navigate the steps
* multiple experiment selection
…nt script; add rate limit testing functionality

@korbit-ai korbit-ai bot left a comment


Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category | Issue | Status
Documentation | Insufficient VSCode debugging context | 🧠 Not in scope
Documentation | Unclear purpose of reproducibility_mode |
Security | Missing API Authentication Validation | ✅ Fix detected
Functionality | Unverified Ray Backend Dependency | ✅ Fix detected
Error Handling | Uncaught errors in reproducibility setup | ✅ Fix detected
Readability | Remove commented model configurations | ✅ Fix detected
Functionality | Incorrect Step Limit |
Performance | Suboptimal Parallel Backend Selection |
Files scanned:
* src/agentlab/agents/generic_agent/__init__.py
* main_exp_new_models.py
* src/agentlab/llm/llm_configs.py
* src/agentlab/analyze/agent_xray.py


Comment on lines 48 to 50
## Number of parallel jobs
n_jobs = 5 # Make sure to use 1 job when debugging in VSCode
# n_jobs = -1 # to use all available cores
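A hedged sketch of the n_jobs convention noted in the comments above, where -1 means "use all available cores" (as in joblib) and 1 is the safe choice when debugging in VSCode. The helper name is hypothetical, not part of the script.

```python
# Hypothetical helper illustrating the n_jobs convention: -1 resolves to
# all available cores (joblib-style); values below 1 fall back to 1 job,
# which is the recommended setting when stepping through in a debugger.
import os

def resolve_n_jobs(n_jobs: int) -> int:
    if n_jobs == -1:
        return os.cpu_count() or 1
    return max(1, n_jobs)
```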

This comment was marked as resolved.

if __name__ == "__main__": # necessary for dask backend

if reproducibility_mode:
[a.set_reproducibility_mode() for a in agent_args]

This comment was marked as resolved.


# chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-mini-2025-04-14"]
# chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-2025-04-14"]
chat_model_args = CHAT_MODEL_ARGS_DICT["openrouter/anthropic/claude-3.7-sonnet"]

This comment was marked as resolved.

Comment on lines 39 to 42
# Set reproducibility_mode = True for reproducibility
# this will "ask" agents to be deterministic. Also, it will prevent you from launching if you have
# local changes. For your custom agents you need to implement set_reproducibility_mode
reproducibility_mode = False

Unclear purpose of reproducibility_mode (category: Documentation)

What is the issue?

The comment explains what reproducibility_mode does but not why someone would want to use it or its implications for experimental results.

Why this matters

Without understanding the purpose and implications of reproducibility mode, users might not make informed decisions about when to enable it, potentially affecting the validity of their experiments.

Suggested change:

# Set reproducibility_mode = True when you need consistent, repeatable experimental results.
# This ensures deterministic agent behavior and prevents accidental local code modifications
# that could affect results. Required for publishing or comparing benchmark results.
# Note: custom agents must implement set_reproducibility_mode to support this feature.
reproducibility_mode = False
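For a custom agent, set_reproducibility_mode might look like the following minimal sketch, assuming the agent exposes a sampling temperature through its chat model args. The class and field names here are illustrative, not AgentLab's actual API.

```python
# Minimal sketch of set_reproducibility_mode for a custom agent.
# Class and field names are illustrative, not AgentLab's actual API.
from dataclasses import dataclass, field

@dataclass
class ChatModelArgs:
    temperature: float = 0.7

@dataclass
class MyAgentArgs:
    chat_model_args: ChatModelArgs = field(default_factory=ChatModelArgs)

    def set_reproducibility_mode(self):
        # Ask the LLM to be deterministic by zeroing the sampling temperature.
        self.chat_model_args.temperature = 0.0

agent_args = [MyAgentArgs(), MyAgentArgs()]
reproducibility_mode = True
if reproducibility_mode:
    for a in agent_args:  # plain loop instead of a list-comprehension side effect
        a.set_reproducibility_mode()
```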



study.run(
n_jobs=n_jobs,
parallel_backend="ray", # "ray", "joblib" or "sequential"

This comment was marked as resolved.

Comment on lines 66 to 71
study.run(
n_jobs=n_jobs,
parallel_backend="ray", # "ray", "joblib" or "sequential"
strict_reproducibility=reproducibility_mode,
n_relaunch=3,
)

Suboptimal Parallel Backend Selection (category: Performance)

What is the issue?

The parallel backend is hardcoded to 'ray' without considering the system's capabilities or the specific benchmark's characteristics.

Why this matters

Different parallel backends have varying performance characteristics. Ray has higher startup overhead but better scaling for longer tasks, while joblib is more efficient for shorter tasks with lower overhead. Sequential might be better for debugging or small workloads.

Suggested change:

Add logic to select the optimal parallel backend based on the benchmark type and workload size. For example:

def get_optimal_backend(benchmark, n_jobs):
    if n_jobs == 1:
        return "sequential"
    if benchmark in ["miniwob_tiny_test", "miniwob"]:
        return "joblib"  # Better for shorter tasks
    return "ray"  # Better for longer tasks
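If applied, the heuristic could be wired in just before calling study.run. A runnable sketch of the suggested helper follows; the benchmark names and backend trade-offs are taken from the review comment, and the integration line is illustrative.

```python
# Sketch of the reviewer-suggested heuristic: pick a parallel backend
# from the benchmark type and job count instead of hardcoding "ray".
def get_optimal_backend(benchmark: str, n_jobs: int) -> str:
    if n_jobs == 1:
        return "sequential"  # simplest to debug, no pool overhead
    if benchmark in ("miniwob_tiny_test", "miniwob"):
        return "joblib"      # lower startup overhead for short tasks
    return "ray"             # scales better for long-running tasks

backend = get_optimal_backend("miniwob", n_jobs=5)
# study.run(n_jobs=n_jobs, parallel_backend=backend, ...)  # as in the snippet above
```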

Comment on lines 20 to 22
# chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-mini-2025-04-14"]
# chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-2025-04-14"]
chat_model_args = CHAT_MODEL_ARGS_DICT["openrouter/anthropic/claude-3.7-sonnet"]

This comment was marked as resolved.

if key_event.startswith("Cmd+Left"):
    step = max(0, step - 1)
elif key_event.startswith("Cmd+Right"):
    step = min(len(info.exp_result.steps_info) - 2, step + 1)

Incorrect Step Limit (category: Functionality)

What is the issue?

The step limit calculation in handle_key_event incorrectly uses -2 instead of -1, preventing access to the last step.

Why this matters

Users cannot navigate to the final step of the experiment using keyboard navigation, missing potentially important information.

Suggested change:
step = min(len(info.exp_result.steps_info) - 1, step + 1)
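The fix clamps to the last valid index (len - 1) rather than len - 2. A tiny standalone sketch of the corrected navigation bounds; the helper names are illustrative, not from agent_xray.py.

```python
# Corrected step-navigation bounds: Cmd+Right clamps to the last valid
# index (n_steps - 1), so the final step is reachable; Cmd+Left clamps at 0.
def next_step(step: int, n_steps: int) -> int:
    return min(n_steps - 1, step + 1)

def prev_step(step: int) -> int:
    return max(0, step - 1)
```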

@amanjaiswal73892 amanjaiswal73892 merged commit c32400f into main Jul 9, 2025
5 of 6 checks passed
@amanjaiswal73892 amanjaiswal73892 deleted the new_experiments branch July 9, 2025 22:56

Labels: none yet
3 participants