There were lots of interesting LLM releases last week. My favorite was the Olmo 3 release. Olmo models are always a highlight since they are fully transparent (including training methods and datasets) and come with very detailed technical reports. I am sure I'll talk more about the interesting training-related aspects of that 100-pager in the coming days and weeks. In the meantime, here's the side-by-side architecture comparison with Qwen3:

1) As we can see, the Olmo 3 architecture is relatively similar to Qwen3. However, it's worth noting that this design is most likely inherited from its Olmo 2 predecessor, not borrowed from Qwen3.

2) Similar to Olmo 2, Olmo 3 still uses a post-norm flavor instead of pre-norm, as the Olmo 2 paper found that it stabilizes training.

3) Interestingly, the 7B model still uses multi-head attention, similar to Olmo 2. However, to make things more efficient and reduce the KV cache size, it now uses sliding-window attention (similar to Gemma 3); see the sketch at the end of this post.

Next, the 32B model (the figure is not shown here due to space reasons, but you can find it in my "The Big LLM Architecture Comparison" article or my Olmo 3 from-scratch notebook):

4) Overall, it's the same architecture, just scaled up. The proportions (e.g., going from the input size to the intermediate size in the feed-forward layer, and so on) roughly match those in Qwen3.

5) My guess is that the architecture started out somewhat smaller than Qwen3 due to the smaller vocabulary, and they then scaled up the feed-forward expansion from 5x in Qwen3 to 5.4x in Olmo 3 to arrive at a 32B model for a direct comparison.

6) Also note that the 32B model (finally!) uses grouped-query attention.

And yes, I also did a from-scratch implementation. It was still a lot of work, but since I had already implemented Qwen3 from scratch, as well as Gemma 3 (for the sliding-window attention component), it wasn't too bad! If you are a coder, looking at the from-scratch implementation is probably the easiest way to understand the architecture: https://lnkd.in/gQgxtVUu
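To make points 2), 3), and 6) concrete, here is a minimal PyTorch sketch, not the actual Olmo 3 code: the class names, dimensions, window size, and the simplified (non-gated) SiLU feed-forward are my own illustrative choices, and it assumes a recent PyTorch with nn.RMSNorm. It shows a post-norm block whose attention combines a causal sliding-window mask with optional grouped-query attention.

```python
import torch
import torch.nn as nn


class SlidingWindowGQA(nn.Module):
    """Causal self-attention with a sliding window and optional grouped-query
    attention (n_kv_heads < n_heads). Setting n_kv_heads == n_heads gives
    plain multi-head attention, as described for the 7B model above."""

    def __init__(self, emb_dim, n_heads, n_kv_heads, window):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = emb_dim // n_heads
        self.window = window
        self.q_proj = nn.Linear(emb_dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(emb_dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(emb_dim, n_kv_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(n_heads * self.head_dim, emb_dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)

        # Grouped-query attention: each key/value head serves a group of query heads
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)

        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5

        # Causal + sliding-window mask: each token attends to itself and
        # at most the previous (window - 1) tokens
        idx = torch.arange(t, device=x.device)
        future = idx[None, :] > idx[:, None]
        too_far = (idx[:, None] - idx[None, :]) >= self.window
        scores = scores.masked_fill(future | too_far, float("-inf"))

        out = torch.softmax(scores, dim=-1) @ v
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))


class PostNormBlock(nn.Module):
    """Transformer block with the post-norm flavor described above: RMSNorm is
    applied to each sublayer's output before the residual add, x + Norm(Attn(x)),
    rather than to its input as in pre-norm blocks."""

    def __init__(self, emb_dim=256, n_heads=8, n_kv_heads=8, window=64, ffn_mult=5.4):
        super().__init__()
        self.attn = SlidingWindowGQA(emb_dim, n_heads, n_kv_heads, window)
        hidden = int(emb_dim * ffn_mult)
        # Simplified feed-forward (plain SiLU MLP) just to show the expansion ratio
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim, hidden, bias=False),
            nn.SiLU(),
            nn.Linear(hidden, emb_dim, bias=False),
        )
        self.attn_norm = nn.RMSNorm(emb_dim)
        self.ffn_norm = nn.RMSNorm(emb_dim)

    def forward(self, x):
        x = x + self.attn_norm(self.attn(x))  # post-norm residual, attention
        x = x + self.ffn_norm(self.ffn(x))    # post-norm residual, feed-forward
        return x


if __name__ == "__main__":
    block = PostNormBlock(n_kv_heads=2)  # n_kv_heads < n_heads -> grouped-query attention
    x = torch.randn(2, 128, 256)         # (batch, seq_len, emb_dim)
    print(block(x).shape)                # torch.Size([2, 128, 256])
```

Setting n_kv_heads equal to n_heads mirrors the 7B's multi-head setup, while a smaller n_kv_heads mirrors what the 32B does with grouped-query attention to shrink the KV cache.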
Sebastian Raschka, PhD — have you had a chance to look at the Samsung TRM? Or did I first hear about it from you? 🧐🤔🤣 https://arxiv.org/html/2510.04871v1
The pace is accelerating beyond anything I’ve seen. Do you think we’ve ever had a release cycle comparable to this, even during the OS wars or the browser years?
Can you make a video on their training pipeline, post-training, etc.?
My friends at NotebookLM said: "This report details a meticulously engineered pipeline. It really sets a new standard for transparency." The amount of work they put into this release is crazy. They generated a lot of datasets just because the existing ones did not have permissive licenses. I like how they were able to build something good while throwing away 84.6% of the initial data. In general, it is always fun and refreshing to read Olmo's technical reports.
Why is no one using DeepSeek's MLA for KV cache optimization?
Super interesting breakdown. I went through your Olmo 3 vs. Qwen3 notes and then re-watched your “Build an LLM” attention talk on YouTube to refresh the mental model. It’s amazing how much clearer the architecture comparisons feel once you’ve actually seen the components implemented from scratch. A few things stood out to me from your post:
• The post-norm choice in Olmo 3 feels like a very “earned through experiments” decision; the OLMo 2 paper’s comments about stability really come through here.
• The sliding-window attention in the 7B is a smart optimization. After seeing you code the attention loop manually, it’s interesting how much of transformer engineering is just “make this core loop cheaper without breaking it.”
• And the proportion matching with Qwen3 in the 32B variant makes a lot of sense, especially the 5× → 5.4× MLP expansion tweak. Easy to miss unless you’ve done the from-scratch implementation work yourself.
Since you’ve now built Qwen3, Gemma-style sliding-window attention, and Olmo 3 from scratch, which architectural pattern felt the cleanest to implement directly in code?
Always appreciate these deep dives! They help me stay up to date with the latest models.
Sebastian Raschka, PhD, thank you for sharing. As always, looking forward to the complete breakdown of Olmo 3. Once again, a big thank you for democratising this knowledge.
Olmo 3 is a win for transparency.
Sebastian Raschka, PhD, do you have any plans to include these architectures in your book? It would be interesting to see them all in an updated edition.