
Conversation

@recursix (Collaborator) commented on Jul 11, 2025

Description by Korbit AI

What change is being made?

Introduce a debug convenience script to launch experiments, update the goal.apply method to support custom system messages, enhance drawing capabilities with a dashed rectangle feature, and modify the agent tool selection logic in the response API.

Why are these changes being made?

These changes streamline the debugging and testing process by adding a script for experiment execution, improve model guidance flexibility with customizable messages, enhance visualization with dashed rectangles for better action overlays, and refine LLM response handling to facilitate tool choice adjustments based on API requirements.
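
The goal.apply change itself is not shown in this conversation. As a rough sketch, an optional system-message override could look like the following; the signature, class name, and discussion structure here are assumptions rather than the actual implementation:

class Goal:
    def apply(self, discussion, system_message: str | None = None):
        """Apply the goal, optionally overriding the default system message."""
        # SYS_MSG is assumed to be the module's existing default prompt constant.
        sys_msg = system_message if system_message is not None else SYS_MSG
        discussion.append({"role": "system", "content": sys_msg})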



@korbit-ai (bot) left a comment


Review by Korbit AI

Korbit automatically attempts to detect when you fix issues in new commits.
Category         Issue
Logging          Basic root logger configuration
Design           Agent Configuration Should Be Separated
Error Handling   Misleading error handling comment
Performance      Static Parallel Job Configuration
Logging          Debug Print Statement in Production
Functionality    Incorrect Tool Choice Parameter Handling
Readability      Hardcoded System Message Strings
Documentation    Missing explanation for removed NaN check
Functionality    Invalid GPT Model Name
Files scanned:
main_workarena_debug.py
src/agentlab/analyze/overlay_utils.py
src/agentlab/agents/tool_use_agent/tool_use_agent.py
src/agentlab/llm/response_api.py
src/agentlab/analyze/agent_xray.py


)
from agentlab.experiments.study import Study

logging.getLogger().setLevel(logging.INFO)

Basic root logger configuration (category: Logging)

What is the issue?

The root logger is used without proper configuration: no handler, formatter, or named logger.

Why this matters

Without proper logger configuration, logs may lack contextual information like timestamps and source, making debugging production issues difficult.

Suggested change
import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)
logger = logging.getLogger(__name__)
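
With this configuration in place, module-level log calls automatically carry a timestamp, logger name, and level; for example (the message text is illustrative):

logger.info("Launching the debug study")
# -> 2025-07-11 13:05:42 - __main__ - INFO - Launching the debug study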

Comment on lines +28 to +40
agent_configs = [
    ToolUseAgentArgs(
        model_args=GPT_4_1,
        config=config,
    ),
    # ToolUseAgentArgs(
    #     model_args=GPT_4_1,
    #     config=config,
    # ),
]

for agent_config in agent_configs:
    agent_config.config.action_subsets = ("workarena",)  # use the workarena action set

Agent Configuration Should Be Separated (category: Design)

What is the issue?

Agent configuration is mixed with experiment setup code, violating the Single Responsibility Principle.

Why this matters

Mixing configuration with setup code makes it harder to maintain different agent configurations and reduces code reusability across different experiments.

Suggested change

Extract agent configuration into a separate factory or builder class:

from copy import deepcopy

class AgentConfigFactory:
    @staticmethod
    def create_workarena_config(base_config):
        config = deepcopy(base_config)
        config.action_subsets = ("workarena",)
        return ToolUseAgentArgs(model_args=GPT_4_1, config=config)

agent_configs = [AgentConfigFactory.create_workarena_config(DEFAULT_PROMPT_CONFIG)]

Comment on lines +58 to +59
n_jobs = 10 # Make sure to use 1 job when debugging in VSCode
parallel_backend = "ray"

Static Parallel Job Configuration (category: Performance)

What is the issue?

Hard-coded number of parallel jobs without considering system resources (CPU cores, memory) could lead to suboptimal performance.

Why this matters

Setting a fixed number of jobs might either underutilize available resources or overload the system, causing overhead from context switching and memory pressure.

Suggested change

Use system information to determine optimal job count:

from multiprocessing import cpu_count
n_jobs = min(cpu_count(), 10)  # Cap at 10 but adjust based on available cores
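
Because the study above uses the "ray" backend, the local core count is only a rough proxy for available capacity. One hedged variant is to allow an environment override; AGENTLAB_N_JOBS below is a hypothetical variable name, not an existing setting:

import os
from multiprocessing import cpu_count

# Respect a hypothetical AGENTLAB_N_JOBS override if set; otherwise cap at 10.
n_jobs = int(os.environ.get("AGENTLAB_N_JOBS", min(cpu_count(), 10)))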

Comment on lines +300 to +306
if dashed:
    # Draw dashed rectangle
    print("Drawing dashed rectangle")
    linedashed(draw, x, y, x + w, y, color, width)
    linedashed(draw, x + w, y, x + w, y + h, color, width)
    linedashed(draw, x + w, y + h, x, y + h, color, width)
    linedashed(draw, x, y + h, x, y, color, width)

Debug Print Statement in Production (category: Logging)

What is the issue?

A debug print statement is left in production code for the dashed rectangle drawing.

Why this matters

Debug print statements in production code can clutter logs and make it harder to identify actual issues when they occur.

Suggested change

Remove the print statement:

if dashed:
    linedashed(draw, x, y, x + w, y, color, width)
    linedashed(draw, x + w, y, x + w, y + h, color, width)
    linedashed(draw, x + w, y + h, x, y + h, color, width)
    linedashed(draw, x, y + h, x, y, color, width)
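
The linedashed helper itself is not part of this diff; a minimal PIL-based sketch of what such a helper might look like (the dash and gap lengths are assumed parameters):

import math
from PIL import ImageDraw

def linedashed(draw: ImageDraw.ImageDraw, x0, y0, x1, y1, color, width, dash=6, gap=4):
    """Draw a dashed line from (x0, y0) to (x1, y1) as alternating segments."""
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        return
    ux, uy = (x1 - x0) / length, (y1 - y0) / length  # unit direction vector
    pos = 0.0
    while pos < length:
        end = min(pos + dash, length)
        # Draw one dash segment, then skip a gap before the next one.
        draw.line(
            [(x0 + ux * pos, y0 + uy * pos), (x0 + ux * end, y0 + uy * end)],
            fill=color,
            width=width,
        )
        pos = end + gap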

Comment on lines +343 to +346
if tool_choice in ("any", "required"):
    tool_choice = "required"

api_params["tool_choice"] = tool_choice

Incorrect Tool Choice Parameter Handling (category: Functionality)

What is the issue?

In OpenAIChatCompletionModel._call_api, tool_choice is coerced to 'required' when it is 'any', which conflicts with the intended behavior and documented values of the tool_choice parameter.

Why this matters

OpenAI's API expects tool_choice to be either 'none', 'auto', or an object specifying 'type': 'function' with a required name. The current implementation may lead to unexpected API behavior.

Suggested change

Replace the tool_choice handling with proper OpenAI API expected format:

if tool_choice == "any":
    api_params["tool_choice"] = {"type": "function"}
elif tool_choice == "required":
    api_params["tool_choice"] = "required"
else:
    api_params["tool_choice"] = tool_choice

Comment on lines +448 to +450
    sys_msg = SYS_MSG + "\nYou can take multiple actions in a single step, if needed."
else:
    sys_msg = SYS_MSG + "\nYou can only take one action at a time."

Hardcoded System Message Strings (category: Readability)

What is the issue?

The system message construction contains duplicated code and hardcoded strings that should be constants.

Why this matters

String concatenation in multiple places makes the code harder to maintain and increases the risk of inconsistencies.

Suggested change
MULTI_ACTION_MSG = "\nYou can take multiple actions in a single step, if needed."
SINGLE_ACTION_MSG = "\nYou can only take one action at a time."

sys_msg = SYS_MSG + (MULTI_ACTION_MSG if self.config.multiaction else SINGLE_ACTION_MSG)

summary_df = pd.read_csv(most_recent_summary)

- if len(summary_df) == 0 or summary_df["avg_reward"].isna().all():
+ if len(summary_df) == 0:

Missing explanation for removed NaN check (category: Documentation)

What is the issue?

The condition was simplified from 'len(summary_df) == 0 or summary_df["avg_reward"].isna().all()' to just 'len(summary_df) == 0' without documenting why the NaN check was removed.

Why this matters

Future maintainers won't understand why a seemingly important NaN check was removed, which could lead to reintroducing it unnecessarily.

Suggested change

if len(summary_df) == 0: # NaN check removed since empty dataframes are already filtered out


Comment on lines +1167 to 1168
if len(summary_df) == 0:
    continue  # skip if all avg_reward are NaN

Misleading error handling comment (category: Error Handling)

What is the issue?

The comment is misleading as it states 'skip if all avg_reward are NaN' but the code actually skips if the dataframe is empty.

Why this matters

Having misleading comments about error handling conditions can lead to confusion and make debugging more difficult when actual NaN values are present.

Suggested change
if len(summary_df) == 0:
    continue  # skip if dataframe is empty

Comment on lines +500 to 501
GPT_4_1 = OpenAIResponseModelArgs(
    model_name="gpt-4.1",

Invalid GPT Model Name (category: Functionality)

What is the issue?

The code references a non-existent GPT model 'gpt-4.1' which will cause runtime errors when attempting to use the model.

Why this matters

Using an invalid model name will cause API calls to fail and prevent the agent from functioning.

Suggested change

Update to use a valid OpenAI model name like 'gpt-4-turbo-preview' or 'gpt-4':

GPT_4_1 = OpenAIResponseModelArgs(
    model_name="gpt-4-turbo-preview",

@amanjaiswal73892 merged commit d92e0bf into main on Jul 11, 2025 (6 checks passed).
@amanjaiswal73892 deleted the deep_debug branch on July 11, 2025 at 20:58.
@amanjaiswal73892 restored the deep_debug branch on July 12, 2025 at 00:11.