New experiments #258
Conversation
- on-the-fly error report
- left-right keys to navigate the steps
- multiple experiment selection
…nt script; add rate limit testing functionality
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
| Issue | Status |
|---|---|
| Insufficient VSCode debugging context | 🧠 Not in scope |
| Unclear purpose of reproducibility_mode | |
| Missing API Authentication Validation | ✅ Fix detected |
| Unverified Ray Backend Dependency | ✅ Fix detected |
| Uncaught errors in reproducibility setup | ✅ Fix detected |
| Remove commented model configurations | ✅ Fix detected |
| Incorrect Step Limit | |
| Suboptimal Parallel Backend Selection | |
Files scanned
| File Path | Reviewed |
|---|---|
| src/agentlab/agents/generic_agent/__init__.py | ✅ |
| main_exp_new_models.py | ✅ |
| src/agentlab/llm/llm_configs.py | ✅ |
| src/agentlab/analyze/agent_xray.py | ✅ |
main_exp_new_models.py
Outdated
```python
## Number of parallel jobs
n_jobs = 5  # Make sure to use 1 job when debugging in VSCode
# n_jobs = -1  # to use all available cores
```
main_exp_new_models.py
Outdated
```python
if __name__ == "__main__":  # necessary for dask backend

    if reproducibility_mode:
        [a.set_reproducibility_mode() for a in agent_args]
```
main_exp_new_models.py
Outdated
```python
# chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-mini-2025-04-14"]
# chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-2025-04-14"]
chat_model_args = CHAT_MODEL_ARGS_DICT["openrouter/anthropic/claude-3.7-sonnet"]
```
main_exp_new_models.py
Outdated
```python
# Set reproducibility_mode = True for reproducibility
# this will "ask" agents to be deterministic. Also, it will prevent you from launching if you have
# local changes. For your custom agents you need to implement set_reproducibility_mode
reproducibility_mode = False
```
Unclear purpose of reproducibility_mode 
What is the issue?
The comment explains what reproducibility_mode does but not why someone would want to use it or its implications for experimental results.
Why this matters
Without understanding the purpose and implications of reproducibility mode, users might not make informed decisions about when to enable it, potentially affecting the validity of their experiments.
Suggested change:

```python
# Set reproducibility_mode = True when you need consistent, repeatable experimental results.
# This ensures deterministic agent behavior and prevents accidental local code modifications
# that could affect results. Required for publishing or comparing benchmark results.
# Note: Custom agents must implement set_reproducibility_mode to support this feature.
reproducibility_mode = False
```
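To illustrate what a custom agent might do here, the following is a minimal sketch of an agent-args class implementing `set_reproducibility_mode`. The class and attribute names (`MyCustomAgentArgs`, `temperature`, `use_sampling`) are hypothetical, not AgentLab's actual API; the idea is simply to switch the underlying LLM to deterministic settings.

```python
from dataclasses import dataclass


@dataclass
class MyCustomAgentArgs:
    """Hypothetical agent args; field names are illustrative only."""

    temperature: float = 0.7
    use_sampling: bool = True

    def set_reproducibility_mode(self):
        # "Ask" the agent to be as deterministic as possible:
        # greedy decoding, no sampling.
        self.temperature = 0.0
        self.use_sampling = False
```

With something like this in place, the `[a.set_reproducibility_mode() for a in agent_args]` call in the main script would work for custom agents too.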
main_exp_new_models.py
Outdated
```python
study.run(
    n_jobs=n_jobs,
    parallel_backend="ray",  # "ray", "joblib" or "sequential"
    strict_reproducibility=reproducibility_mode,
    n_relaunch=3,
)
```
Suboptimal Parallel Backend Selection 
What is the issue?
The parallel backend is hardcoded to 'ray' without considering the system's capabilities or the specific benchmark's characteristics.
Why this matters
Different parallel backends have varying performance characteristics. Ray has higher startup overhead but better scaling for longer tasks, while joblib is more efficient for shorter tasks with lower overhead. Sequential might be better for debugging or small workloads.
Suggested change: add logic to select the optimal parallel backend based on the benchmark type and workload size. For example:

```python
def get_optimal_backend(benchmark, n_jobs):
    if n_jobs == 1:
        return "sequential"
    if benchmark in ["miniwob_tiny_test", "miniwob"]:
        return "joblib"  # Better for shorter tasks
    return "ray"  # Better for longer tasks
```
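A self-contained version of that heuristic, with the benchmark names taken from Korbit's suggestion (any others are assumptions), could be wired into the script like this; the `study.run` line is only a sketch of the call site, assuming a `benchmark_name` variable exists in the surrounding script:

```python
def get_optimal_backend(benchmark: str, n_jobs: int) -> str:
    """Heuristic backend choice: sequential for debugging,
    joblib for short miniwob-style tasks, ray otherwise."""
    if n_jobs == 1:
        return "sequential"  # easiest to debug, no parallel overhead
    if benchmark in ("miniwob_tiny_test", "miniwob"):
        return "joblib"  # lower startup overhead suits short tasks
    return "ray"  # better scaling for long-running tasks


# Hypothetical call site, replacing the hardcoded "ray":
# study.run(
#     n_jobs=n_jobs,
#     parallel_backend=get_optimal_backend(benchmark_name, n_jobs),
#     strict_reproducibility=reproducibility_mode,
#     n_relaunch=3,
# )
```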
main_exp_new_models.py
Outdated
| # chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-mini-2025-04-14"] | ||
| # chat_model_args = CHAT_MODEL_ARGS_DICT["openai/gpt-4.1-2025-04-14"] | ||
| chat_model_args = CHAT_MODEL_ARGS_DICT["openrouter/anthropic/claude-3.7-sonnet"] |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
```python
if key_event.startswith("Cmd+Left"):
    step = max(0, step - 1)
elif key_event.startswith("Cmd+Right"):
    step = min(len(info.exp_result.steps_info) - 2, step + 1)
```
Incorrect Step Limit 
What is the issue?
The step limit calculation in handle_key_event incorrectly uses -2 instead of -1, preventing access to the last step.
Why this matters
Users cannot navigate to the final step of the experiment using keyboard navigation, missing potentially important information.
Suggested change:

```python
step = min(len(info.exp_result.steps_info) - 1, step + 1)
```
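The off-by-one is easy to verify in isolation. Below is a stand-alone sketch of the navigation clamp with the corrected upper bound; the `navigate` function name and `n_steps` parameter are illustrative, standing in for `len(info.exp_result.steps_info)` in the real handler:

```python
def navigate(step: int, n_steps: int, key_event: str) -> int:
    """Clamp the step index to [0, n_steps - 1].

    Using n_steps - 2, as in the original, makes the final
    step unreachable via keyboard navigation.
    """
    if key_event.startswith("Cmd+Left"):
        return max(0, step - 1)
    if key_event.startswith("Cmd+Right"):
        return min(n_steps - 1, step + 1)
    return step
```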
…dge cases highlighted in the tests
Description by Korbit AI
What change is being made?
Add new agents, include Anthropic chat model support, enhance tool-use logic, and implement a study archiving mechanism in the agent framework.
Why are these changes being made?
These changes modernize agent support, expand compatibility with Anthropic models, enhance tool use by automatically converting tool calls to Python, and keep storage lean by archiving completed or ineffective experiments. Incorporating newer models and optimizing resource management improves agent performance overall.