Benchmark issues - baseline scores are ~7 times better

Hi, firstly, thanks for including a benchmark - most projects of this type don't have anything of the sort.

When running the benchmarks I spotted a couple of issues:

With the baseline test case, the model often responds in a verbose fashion with multiple options. This is because the test case lacks a system prompt that is consistent with the types of prompt used by coding harnesses
The debounce tests fails often for all the test targets (baseline, caveman, ponytail) due to an ambiguous test case. The response often expects the presence of a DOM (i.e. referencing document), which is entirely reasonable based on the test case.

Fixing (1), by adding the simple system prompt Provide just one example for any given task, and no commentary or usage examples., makes quite an impact on the results:

Average LOC, running all tests (minus debounce) on haiku only:

Baseline (108)
Baseline – "one example" system prompt (16)
Ponytail (8.25)

As a result, the central claims of this project should be a bit more modest.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Benchmark issues - baseline scores are ~7 times better #126

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Benchmark issues - baseline scores are ~7 times better #126

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions