Hi, firstly, thanks for including a benchmark - most projects of this type don't have anything of the sort.
When running the benchmarks I spotted a couple of issues:
- With the baseline test case, the model often responds in a verbose fashion with multiple options. This is because the test case lacks a system prompt that is consistent with the types of prompt used by coding harnesses
- The debounce tests fails often for all the test targets (baseline, caveman, ponytail) due to an ambiguous test case. The response often expects the presence of a DOM (i.e. referencing
document), which is entirely reasonable based on the test case.
Fixing (1), by adding the simple system prompt Provide just one example for any given task, and no commentary or usage examples., makes quite an impact on the results:
Average LOC, running all tests (minus debounce) on haiku only:
- Baseline (108)
- Baseline – "one example" system prompt (16)
- Ponytail (8.25)
As a result, the central claims of this project should be a bit more modest.
Hi, firstly, thanks for including a benchmark - most projects of this type don't have anything of the sort.
When running the benchmarks I spotted a couple of issues:
document), which is entirely reasonable based on the test case.Fixing (1), by adding the simple system prompt
Provide just one example for any given task, and no commentary or usage examples., makes quite an impact on the results:Average LOC, running all tests (minus debounce) on haiku only:
As a result, the central claims of this project should be a bit more modest.