Build Intuition, Not a Tool
As we work with AI models and tooling, we build intuition about what they're good and bad at. I used to think the role of a software developer was to smooth rough edges and to Automate the Boring Stuff with Python. In the age of AI, though, I'm finding a new perspective. In a previous post, Riding the AI Wave, I talked about a number of tools I built to help cope with these edges:
- AI couldn't read files before (I built a file reader)
- When AI got good at reading files, it still didn't produce high-quality code (I built a code quality rater)
- AI wouldn't iterate on project management tooling (I built an agentic project management workflow)
Models develop and then deprecate the need for these tools. I'm still running into limitations, though. For example, I've built a ton of prototypes to demonstrate concepts, and whenever I build UIs through AI I inevitably get light text on a light background. I just figured out the issue today: these AI tools default to making UIs theme-able. They're prepping the UI for a light/dark theme toggle feature (even if I don't ask for it). The catch is that every element needs to register with the theme-able stylings, and that doesn't always happen.
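Here's a minimal sketch of the failure mode as I understand it, assuming Tailwind-style markup (the specific class names are my own illustration, not pulled from any of my actual prototypes):

```html
<div class="bg-white dark:bg-gray-900 p-4">
  <!-- This heading registered with the theme: readable in both modes. -->
  <h1 class="text-gray-900 dark:text-gray-100">Dashboard</h1>
  <!-- This paragraph only got the dark-mode text color. In light mode
       it's light gray text on a white background: nearly invisible. -->
  <p class="text-gray-100">You can barely read this until the theme toggles.</p>
</div>
```

One element missing its light-mode pairing is all it takes, and in a generated UI with hundreds of elements that's easy to miss.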
Now, by habit, I think to myself: maybe I could update my GitHub Copilot prompt files, CLAUDE.md, or Cursor rules. I'd add something like "When adding Tailwind classes, make sure the background will switch with light/dark mode theme toggles," and give it a few-shot example. With this solution, I'd expect the next generation of models to deprecate that system prompt.
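A hedged sketch of what such a rules-file entry might look like (the heading and wording are my own; adapt to whichever tool's rules format you use):

```markdown
## Tailwind theming

When adding Tailwind classes, make sure text and background colors
switch together with light/dark mode theme toggles. Every element that
sets a text color must set both variants.

Bad:  <p class="text-gray-100">status</p>
Good: <p class="text-gray-900 dark:text-gray-100">status</p>
```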
Maintaining such a rudimentary band-aid is an expensive piece of technical debt. The "proper" way I know to maintain a fix like this would be an evaluation framework that checks whether the clause is actually helping. In my head I'm still going back and forth on whether I should bite the bullet and start working on that evaluation framework.
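For the theming clause, even a crude version of that framework could be small. This is a sketch under my own assumptions, not a real harness: `generate` stands in for whatever model call you'd plug in, and the "pass" check is a deliberately naive regex heuristic.

```python
import re

THEME_CLAUSE = (
    "When adding Tailwind classes, make sure text and background colors "
    "switch together with light/dark mode theme toggles."
)

def has_paired_theme_classes(markup: str) -> bool:
    """Crude heuristic: any element that sets a text color should also
    set a dark-mode text color, and vice versa."""
    for classes in re.findall(r'class="([^"]*)"', markup):
        tokens = classes.split()
        has_text = any(t.startswith("text-") for t in tokens)
        has_dark_text = any(t.startswith("dark:text-") for t in tokens)
        if has_text != has_dark_text:
            return False
    return True

def evaluate(generate, tasks, clause=None):
    """Return the fraction of generated UIs that pass the theme check.
    Run once with clause=None and once with clause=THEME_CLAUSE to see
    whether the clause is earning its keep."""
    passed = 0
    for task in tasks:
        prompt = task if clause is None else f"{clause}\n\n{task}"
        if has_paired_theme_classes(generate(prompt)):
            passed += 1
    return passed / len(tasks)
```

Comparing the pass rate with and without the clause, across model versions, is exactly the signal that would tell me when the band-aid can come off.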
There are other consistent problems out there like:
- Dependency Assumption: AI was trained on one version of a library and assumes that's the version you're using. Band-aid: add a call-out in the system prompt for particular libraries that keep messing up. (Claude used to have a heck of a time setting up the Azure OpenAI SDK.)
- Component Assumption: AI reads a component name and assumes it knows its inner workings. Band-aid: build the component library's documentation into the system prompt.
- Documentation Pollution: AI puts more value in a code comment than in reading the actual functionality of the code. Band-aid: add to the system prompt that every AI action should keep code comments up to date.
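Each of those band-aids also ends up as a rules-file entry. A hedged sketch of what they might look like (the version numbers and file paths are hypothetical placeholders, not from my actual setup):

```markdown
## Known failure modes

- Dependencies: this project uses openai >= 1.0; do not emit the
  pre-1.0 `openai.ChatCompletion` API.
- Components: read docs/components.md before modifying anything in
  src/components/; do not guess a component's props from its name.
- Comments: whenever you change a function, update its comment in the
  same edit so comments never drift from the code.
```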
A big lesson I've learned recently is that without proper resourcing, your tool isn't going to be widely valuable. Looking at the industry landscape, if these prompt snippets were valuable I'd expect to see semver-managed prompt packages, with a team updating them and evaluation metrics on the package pages. But these moving targets are hard to plan for and hard to solve. And this work sits in a known problem space that I'm pretty sure the model makers are actively working on.
So, I'll join the chorus of folks out there, keep my message to "get good at prompting," and keep building. That's what I'm doing for the time being, at least. As long as folks coding with AI are checking the answers and understand the system, you'll notice where AI is commonly failing. Don't get discouraged that you sometimes have to set the AI aside and manually code some stuff, or really dig deep into some code.
I would like to hear about folks' experiences supporting these little band-aids. What challenges are you running into? Have you seen consistent issues between models? Do you think they'll be fixed in future models? Or do you think our workflows will just eventually take these limitations into account?
Can't believe this article feels old after a single day. I'm just now seeing an eval framework for GitHub Copilot: https://github.com/bepuca/copilot-chat-eval