Deep debug #262
Conversation
Refactor OpenAIResponseModel and related classes to remove tool_choice attribute and streamline API calls
Review by Korbit AI
Korbit automatically attempts to detect when you fix issues in new commits.
| Issue |
|---|
| Basic root logger configuration |
| Agent Configuration Should Be Separated |
| Misleading error handling comment |
| Static Parallel Job Configuration |
| Debug Print Statement in Production |
| Incorrect Tool Choice Parameter Handling |
| Hardcoded System Message Strings |
| Missing explanation for removed NaN check |
| Invalid GPT Model Name |
Files scanned
| File Path | Reviewed |
|---|---|
| main_workarena_debug.py | ✅ |
| src/agentlab/analyze/overlay_utils.py | ✅ |
| src/agentlab/agents/tool_use_agent/tool_use_agent.py | ✅ |
| src/agentlab/llm/response_api.py | ✅ |
| src/agentlab/analyze/agent_xray.py | ✅ |
)
from agentlab.experiments.study import Study

logging.getLogger().setLevel(logging.INFO)
Basic root logger configuration 
What is the issue?
Using root logger without a proper logger configuration (no handler, formatter or name)
Why this matters
Without proper logger configuration, logs may lack contextual information like timestamps and source, making debugging production issues difficult.
Suggested change
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S',
)
logger = logging.getLogger(__name__)
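For illustration, once basicConfig has run, any module-level logger emits timestamped, named records; a minimal sketch (the module name and message below are illustrative, not part of this PR):

# In a hypothetical module named example_module.py:
import logging

logger = logging.getLogger(__name__)

# Emits e.g. "2024-01-01 12:00:00 - example_module - INFO - starting debug run"
logger.info("starting debug run")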
agent_configs = [
    ToolUseAgentArgs(
        model_args=GPT_4_1,
        config=config,
    ),
    # ToolUseAgentArgs(
    #     model_args=GPT_4_1,
    #     config=config,
    # ),
]

for agent_config in agent_configs:
    agent_config.config.action_subsets = ("workarena",)  # use the workarena action set
Agent Configuration Should Be Separated 
What is the issue?
Agent configuration is mixed with experiment setup code, violating the Single Responsibility Principle.
Why this matters
Mixing configuration with setup code makes it harder to maintain different agent configurations and reduces code reusability across different experiments.
Suggested change
Extract agent configuration into a separate factory or builder class:
from copy import deepcopy

class AgentConfigFactory:
    @staticmethod
    def create_workarena_config(base_config):
        config = deepcopy(base_config)
        config.action_subsets = ("workarena",)
        return ToolUseAgentArgs(model_args=GPT_4_1, config=config)

agent_configs = [AgentConfigFactory.create_workarena_config(DEFAULT_PROMPT_CONFIG)]
n_jobs = 10  # Make sure to use 1 job when debugging in VSCode
parallel_backend = "ray"
Static Parallel Job Configuration 
What is the issue?
Hard-coded number of parallel jobs without considering system resources (CPU cores, memory) could lead to suboptimal performance.
Why this matters
Setting a fixed number of jobs might either underutilize available resources or overload the system, causing overhead from context switching and memory pressure.
Suggested change
Use system information to determine optimal job count:
from multiprocessing import cpu_count

n_jobs = min(cpu_count(), 10)  # Cap at 10 but adjust based on available cores
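A related convenience, sketched below, is to let the job count be overridden from the environment when debugging in VSCode; AGENTLAB_N_JOBS is a hypothetical variable name used only for this sketch, not an existing agentlab setting:

import os
from multiprocessing import cpu_count

# AGENTLAB_N_JOBS is hypothetical: set it to 1 when debugging in VSCode,
# otherwise fall back to the core count capped at 10.
n_jobs = int(os.environ.get("AGENTLAB_N_JOBS", min(cpu_count(), 10)))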
if dashed:
    # Draw dashed rectangle
    print("Drawing dashed rectangle")
    linedashed(draw, x, y, x + w, y, color, width)
    linedashed(draw, x + w, y, x + w, y + h, color, width)
    linedashed(draw, x + w, y + h, x, y + h, color, width)
    linedashed(draw, x, y + h, x, y, color, width)
Debug Print Statement in Production 
What is the issue?
A debug print statement is left in production code for the dashed rectangle drawing.
Why this matters
Debug print statements in production code can clutter logs and make it harder to identify actual issues when they occur.
Suggested change
Remove the print statement:
if dashed:
    linedashed(draw, x, y, x + w, y, color, width)
    linedashed(draw, x + w, y, x + w, y + h, color, width)
    linedashed(draw, x + w, y + h, x, y + h, color, width)
    linedashed(draw, x, y + h, x, y, color, width)
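For context, linedashed itself is not shown in this diff; a minimal sketch of such a helper on top of a PIL ImageDraw.Draw object might look like the following (the signature, dash length, and gap are assumptions, not the actual overlay_utils implementation):

import math

def linedashed(draw, x0, y0, x1, y1, color, width, dash=6, gap=4):
    """Draw a dashed line from (x0, y0) to (x1, y1) as alternating segments."""
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        return
    dx, dy = (x1 - x0) / length, (y1 - y0) / length
    pos = 0.0
    while pos < length:
        end = min(pos + dash, length)
        # Each visible segment spans [pos, end] along the line, followed by a gap.
        draw.line(
            [(x0 + dx * pos, y0 + dy * pos), (x0 + dx * end, y0 + dy * end)],
            fill=color,
            width=width,
        )
        pos = end + gap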
if tool_choice in ("any", "required"):
    tool_choice = "required"

api_params["tool_choice"] = tool_choice
Incorrect Tool Choice Parameter Handling 
What is the issue?
In OpenAIChatCompletionModel._call_api, tool_choice is set to 'required' when it's 'any', which conflicts with intended behavior and documentation for tool choice parameter.
Why this matters
OpenAI's API expects tool_choice to be 'none', 'auto', 'required', or an object naming a specific function. The current implementation may lead to unexpected API behavior.
Suggested change
Replace the tool_choice handling with the format the OpenAI API expects:
if tool_choice == "any":
    api_params["tool_choice"] = {"type": "function"}
elif tool_choice == "required":
    api_params["tool_choice"] = "required"
else:
    api_params["tool_choice"] = tool_choice
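For reference, forcing a specific function in the Chat Completions API requires naming it inside a nested function object; a minimal sketch, with an illustrative tool name:

# "get_weather" is an illustrative function name, not one defined in this repository.
api_params["tool_choice"] = {
    "type": "function",
    "function": {"name": "get_weather"},
}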
    sys_msg = SYS_MSG + "\nYou can take multiple actions in a single step, if needed."
else:
    sys_msg = SYS_MSG + "\nYou can only take one action at a time."
Hardcoded System Message Strings 
What is the issue?
The system message construction contains duplicated code and hardcoded strings that should be constants.
Why this matters
String concatenation in multiple places makes the code harder to maintain and increases the risk of inconsistencies.
Suggested change
MULTI_ACTION_MSG = "\nYou can take multiple actions in a single step, if needed."
SINGLE_ACTION_MSG = "\nYou can only take one action at a time."

sys_msg = SYS_MSG + (MULTI_ACTION_MSG if self.config.multiaction else SINGLE_ACTION_MSG)
summary_df = pd.read_csv(most_recent_summary)

-if len(summary_df) == 0 or summary_df["avg_reward"].isna().all():
+if len(summary_df) == 0:
Missing explanation for removed NaN check 
What is the issue?
The condition was simplified from 'len(summary_df) == 0 or summary_df["avg_reward"].isna().all()' to just 'len(summary_df) == 0' without documenting why the NaN check was removed.
Why this matters
Future maintainers won't understand why a seemingly important NaN check was removed, which could lead to reintroducing it unnecessarily.
Suggested change
if len(summary_df) == 0:  # NaN check removed since empty dataframes are already filtered out
if len(summary_df) == 0:
    continue  # skip if all avg_reward are NaN
Misleading error handling comment 
What is the issue?
The comment is misleading as it states 'skip if all avg_reward are NaN' but the code actually skips if the dataframe is empty.
Why this matters
Having misleading comments about error handling conditions can lead to confusion and make debugging more difficult when actual NaN values are present.
Suggested change
if len(summary_df) == 0:
    continue  # skip if dataframe is empty
GPT_4_1 = OpenAIResponseModelArgs(
    model_name="gpt-4.1",
Invalid GPT Model Name 
What is the issue?
The code references the model name 'gpt-4.1', which may not be available on the configured endpoint and would then cause runtime errors when the agent attempts to use the model.
Why this matters
Using an invalid model name will cause API calls to fail and prevent the agent from functioning.
Suggested change
Update to use a valid OpenAI model name like 'gpt-4-turbo-preview' or 'gpt-4':
GPT_4_1 = OpenAIResponseModelArgs(
    model_name="gpt-4-turbo-preview",
Description by Korbit AI
What change is being made?
Introduce a debug convenience script to launch experiments, update the goal.apply method to support custom system messages, enhance drawing capabilities with a dashed rectangle feature, and modify the agent tool selection logic in the response API.
Why are these changes being made?
These changes streamline the debugging and testing process by adding a script for experiment execution, improve model guidance flexibility with customizable messages, enhance visualization with dashed rectangles for better action overlays, and refine LLM response handling to facilitate tool choice adjustments based on API requirements.
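To make the debug convenience script concrete, here is a minimal sketch of how the pieces shown in this diff could be wired together; the Study constructor and run arguments are assumptions inferred from the imports above, not the exact agentlab.experiments.study API:

# Sketch only: Study's real signature may differ from what is assumed here.
from agentlab.experiments.study import Study

if __name__ == "__main__":
    study = Study(agent_configs)  # agent_configs as defined earlier in the script
    study.run(
        n_jobs=n_jobs,  # 10 by default; use 1 when debugging in VSCode
        parallel_backend=parallel_backend,  # "ray"
    )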