New experiments #258
Conversation
- on-the-fly error report
- left-right keys to navigate the steps
- multiple experiment selection
…nt script; add rate limit testing functionality
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
| Issue | Status |
|---|---|
| Insufficient VSCode debugging context | 🧠 Not in scope |
| Unclear purpose of reproducibility_mode | |
| Missing API Authentication Validation | ✅ Fix detected |
| Unverified Ray Backend Dependency | ✅ Fix detected |
| Uncaught errors in reproducibility setup | ✅ Fix detected |
| Remove commented model configurations | ✅ Fix detected |
| Incorrect Step Limit | |
| Suboptimal Parallel Backend Selection | |
Files scanned
| File Path | Reviewed |
|---|---|
| src/agentlab/agents/generic_agent/__init__.py | ✅ |
| main_exp_new_models.py | ✅ |
| src/agentlab/llm/llm_configs.py | ✅ |
| src/agentlab/analyze/agent_xray.py | ✅ |
main_exp_new_models.py
Outdated
```python
## Number of parallel jobs
n_jobs = 5  # Make sure to use 1 job when debugging in VSCode
# n_jobs = -1  # to use all available cores
```
main_exp_new_models.py
Outdated
```python
if __name__ == "__main__":  # necessary for dask backend

    if reproducibility_mode:
        [a.set_reproducibility_mode() for a in agent_args]
```
main_exp_new_models.py
Outdated
```python
# chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-mini-2025-04-14"]
# chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-2025-04-14"]
chat_model_args = CHAT_MODEL_ARGS_DICT["openrouter/anthropic/claude-3.7-sonnet"]
```
main_exp_new_models.py
Outdated
```python
# Set reproducibility_mode = True for reproducibility
# this will "ask" agents to be deterministic. Also, it will prevent you from launching if you have
# local changes. For your custom agents you need to implement set_reproducibility_mode
reproducibility_mode = False
```
Unclear purpose of reproducibility_mode 
What is the issue?
The comment explains what reproducibility_mode does but not why someone would want to use it or its implications for experimental results.
Why this matters
Without understanding the purpose and implications of reproducibility mode, users might not make informed decisions about when to enable it, potentially affecting the validity of their experiments.
Suggested change:

```python
# Set reproducibility_mode = True when you need consistent, repeatable experimental results.
# This ensures deterministic agent behavior and prevents accidental local code modifications
# that could affect results. Required for publishing or comparing benchmark results.
# Note: Custom agents must implement set_reproducibility_mode to support this feature.
reproducibility_mode = False
```
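To illustrate what a custom agent might do here, the following is a minimal sketch of an agent-args class implementing `set_reproducibility_mode`. The class and attribute names (`MyCustomAgentArgs`, `temperature`, `use_sampling`) are hypothetical, not AgentLab's actual API; the idea is simply to switch the underlying LLM to deterministic settings.

```python
from dataclasses import dataclass


@dataclass
class MyCustomAgentArgs:
    """Hypothetical agent args; field names are illustrative only."""

    temperature: float = 0.7
    use_sampling: bool = True

    def set_reproducibility_mode(self):
        # "Ask" the agent to be as deterministic as possible:
        # greedy decoding, no sampling.
        self.temperature = 0.0
        self.use_sampling = False
```

With something like this in place, the `[a.set_reproducibility_mode() for a in agent_args]` call in the main script would work for custom agents too.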
main_exp_new_models.py
Outdated
```python
study.run(
    n_jobs=n_jobs,
    parallel_backend="ray",  # "ray", "joblib" or "sequential"
    strict_reproducibility=reproducibility_mode,
    n_relaunch=3,
)
```
Suboptimal Parallel Backend Selection 
What is the issue?
The parallel backend is hardcoded to 'ray' without considering the system's capabilities or the specific benchmark's characteristics.
Why this matters
Different parallel backends have varying performance characteristics. Ray has higher startup overhead but better scaling for longer tasks, while joblib is more efficient for shorter tasks with lower overhead. Sequential might be better for debugging or small workloads.
Suggested change: add logic to select the optimal parallel backend based on the benchmark type and workload size. For example:

```python
def get_optimal_backend(benchmark, n_jobs):
    if n_jobs == 1:
        return "sequential"
    if benchmark in ["miniwob_tiny_test", "miniwob"]:
        return "joblib"  # Better for shorter tasks
    return "ray"  # Better for longer tasks
```
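A self-contained version of that heuristic, with the benchmark names taken from Korbit's suggestion (any others are assumptions), could be wired into the script like this; the `study.run` line is only a sketch of the call site, assuming a `benchmark_name` variable exists in the surrounding script:

```python
def get_optimal_backend(benchmark: str, n_jobs: int) -> str:
    """Heuristic backend choice: sequential for debugging,
    joblib for short miniwob-style tasks, ray otherwise."""
    if n_jobs == 1:
        return "sequential"  # easiest to debug, no parallel overhead
    if benchmark in ("miniwob_tiny_test", "miniwob"):
        return "joblib"  # lower startup overhead suits short tasks
    return "ray"  # better scaling for long-running tasks


# Hypothetical call site, replacing the hardcoded "ray":
# study.run(
#     n_jobs=n_jobs,
#     parallel_backend=get_optimal_backend(benchmark_name, n_jobs),
#     strict_reproducibility=reproducibility_mode,
#     n_relaunch=3,
# )
```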
main_exp_new_models.py
Outdated
| # chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-mini-2025-04-14"] | ||
| # chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-2025-04-14"] | ||
| chat_model_args = CHAT_MODEL_ARGS_DICT["openrouter/anthropic/claude-3.7-sonnet"] |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
```python
if key_event.startswith("Cmd+Left"):
    step = max(0, step - 1)
elif key_event.startswith("Cmd+Right"):
    step = min(len(info.exp_result.steps_info) - 2, step + 1)
```
Incorrect Step Limit 
What is the issue?
The step limit calculation in handle_key_event incorrectly uses -2 instead of -1, preventing access to the last step.
Why this matters
Users cannot navigate to the final step of the experiment using keyboard navigation, missing potentially important information.
Suggested change:

```python
step = min(len(info.exp_result.steps_info) - 1, step + 1)
```
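The off-by-one is easy to verify in isolation. Below is a stand-alone sketch of the navigation clamp with the corrected upper bound; the `navigate` function name and `n_steps` parameter are illustrative, standing in for `len(info.exp_result.steps_info)` in the real handler:

```python
def navigate(step: int, n_steps: int, key_event: str) -> int:
    """Clamp the step index to [0, n_steps - 1].

    Using n_steps - 2, as in the original, makes the final
    step unreachable via keyboard navigation.
    """
    if key_event.startswith("Cmd+Left"):
        return max(0, step - 1)
    if key_event.startswith("Cmd+Right"):
        return min(n_steps - 1, step + 1)
    return step
```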
…dge cases highlighted in the tests
Description by Korbit AI
What change is being made?
Add new agents, include Anthropic chat model support, enhance tool-use logic, and implement a study archiving mechanism in the agent framework.
Why are these changes being made?
These changes modernize agent support, expand compatibility with Anthropic models, enhance tool use by automatically converting tool calls to Python, and keep storage lean by archiving completed or ineffective experiments. Incorporating newer models and optimizing resource management improves agent performance overall.