2025 was the year LLMs went from being good at simple chat interactions (ChatGPT) to being good at applications that involve calling tools and reading documents (coding with Claude Code). Yet thus far, there hasn't really been a good local alternative to the frontier models for terminal coding or web research applications. Why is that?

Well, the shape of the data in these two applications couldn't be more different. A chat looks like short text snippets that ping-pong back and forth, while agentic tool use reads in large blocks of text, potentially over many turns interleaved with human messages. If you've played at all with LLMs on consumer hardware, you'll have noticed that chit-chatting with local LLMs works fine. But as soon as you enable tool use or try to use your local LLM as a backend to Claude Code, your session grinds to a halt right after the first tool call returns, no matter how much VRAM you have. The reason? Standard transformer dot-product attention is a fundamentally quadratic computation: every token is compared against every other token, so doubling the context roughly quadruples the attention work.

What's interesting is that this has all changed with the latest wave of open-source models from the major players. NVIDIA (Nemotron 3), Qwen (qwen3-coder-next/qwen3.5), and GLM (4.7-flash/5) all introduced some form of hybrid, sub-quadratic attention mechanism in their latest models. Even DeepSeek, maker of the first breakout open-source model, updated their architecture with their own DeepSeek Sparse Attention (DSA) in V3.2.

What these next-gen models have in common is that they replace the standard quadratic dot-product attention with "linear" variants of attention. (In practice they aren't fully linear, but stack full and linear layers in some ratio, hence "hybrid".) For example, Qwen 3.5, probably the leading small open-source model right now, uses a linear attention variant called GatedDeltaNet.
Instead of a quadratic dot-product of matrices, GatedDeltaNet is a simple for-loop through the token sequence (see code in first image) that carries forward a fixed-size state, last_recurrent_state, which is decayed by g and updated with the new data, scaled by beta. Think of full attention as taking an open-book exam with all the pages of the book laid out on your desk so that you can access any of them at random, and of the linear recurrent state as a notecard you keep with you, writing on it and erasing from it as you read the pages. Full attention works, but you need a really large desk in order to answer questions that way! Nevertheless, full attention is essentially how DeepSeek V3, Kimi K2.5, and Llama 4 work, and there's a clear limit to how far that approach goes.

Check out our latest model, which fine-tunes qwen3-coder-next 80B A3B for GraphRAG. It can fit on a single consumer GPU or on your MacBook with Q4 GGUFs. I've been using it as the backend model for our LLM demo (try it at https://diffy.chat) and as a fully local alternative to Claude Code for the increasingly frequent times it is down. Link to our model in the comments.
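
For anyone who can't see the image, here is a rough sketch of that for-loop in plain Python. It is a simplified, single-head version of the gated delta rule, not the actual Qwen code. The function name gated_delta_scan and the toy inputs are made up for illustration, g is assumed to be a log-decay (so the multiplier is exp(g)), and batching, normalization, and the chunked kernels that real implementations rely on are all left out. Only last_recurrent_state, g, and beta follow the names used above.

```python
import math


def gated_delta_scan(queries, keys, values, g, beta):
    """Single-head gated delta rule recurrence (illustrative sketch only).

    queries, keys: one list of length d_k per token
    values:        one list of length d_v per token
    g:             per-token log-decay (assumed <= 0, so exp(g) is in (0, 1])
    beta:          per-token write strength (assumed in [0, 1])
    """
    d_k, d_v = len(keys[0]), len(values[0])
    # The "notecard": a d_k x d_v matrix whose size never grows with the sequence.
    last_recurrent_state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t, g_t, beta_t in zip(queries, keys, values, g, beta):
        decay = math.exp(g_t)
        # 1. Fade what is already written on the notecard.
        last_recurrent_state = [[s * decay for s in row] for row in last_recurrent_state]
        # 2. Read what the state currently stores for this key.
        kv_mem = [
            sum(last_recurrent_state[i][j] * k_t[i] for i in range(d_k))
            for j in range(d_v)
        ]
        # 3. Write only the correction (new value minus stored value), scaled by beta.
        delta = [(v_t[j] - kv_mem[j]) * beta_t for j in range(d_v)]
        for i in range(d_k):
            for j in range(d_v):
                last_recurrent_state[i][j] += k_t[i] * delta[j]
        # 4. Answer this token's query from the notecard alone.
        outputs.append([
            sum(last_recurrent_state[i][j] * q_t[i] for i in range(d_k))
            for j in range(d_v)
        ])
    return outputs, last_recurrent_state


# Toy example: the second token reuses the first key, so it overwrites the stored value.
outputs, state = gated_delta_scan(
    queries=[[1.0, 0.0], [1.0, 0.0]],
    keys=[[1.0, 0.0], [1.0, 0.0]],
    values=[[2.0, 3.0], [5.0, 5.0]],
    g=[0.0, 0.0],      # no decay
    beta=[1.0, 1.0],   # full-strength writes
)
print(outputs)  # [[2.0, 3.0], [5.0, 5.0]]
```

The point to notice is that each step touches only the d_k x d_v state, so the cost per token stays constant however long the context gets, whereas full attention has to revisit every earlier token.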