Skip to content

Benchmark issues - baseline scores are ~7 times better #126

Description

@ColinEberhardt

Hi, firstly, thanks for including a benchmark - most projects of this type don't have anything of the sort.

When running the benchmarks I spotted a couple of issues:

  1. With the baseline test case, the model often responds in a verbose fashion with multiple options. This is because the test case lacks a system prompt that is consistent with the types of prompt used by coding harnesses
  2. The debounce tests fails often for all the test targets (baseline, caveman, ponytail) due to an ambiguous test case. The response often expects the presence of a DOM (i.e. referencing document), which is entirely reasonable based on the test case.

Fixing (1), by adding the simple system prompt Provide just one example for any given task, and no commentary or usage examples., makes quite an impact on the results:

Average LOC, running all tests (minus debounce) on haiku only:

  • Baseline (108)
  • Baseline – "one example" system prompt (16)
  • Ponytail (8.25)

As a result, the central claims of this project should be a bit more modest.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions