There were lots of interesting LLM releases last week. My favorite was the Olmo 3 release. Olmo models are always a highlight since they are fully transparent (including training methods and datasets) and come with very detailed technical reports. I am sure I'll talk more about the interesting training-related aspects of that 100-pager in the coming days and weeks. In the meantime, here's the side-by-side architecture comparison with Qwen3:

1) As we can see, the Olmo 3 architecture is relatively similar to Qwen3. However, it's worth noting that this design is most likely inherited from its Olmo 2 predecessor, not borrowed from Qwen3.

2) Similar to Olmo 2, Olmo 3 still uses a post-norm flavor instead of pre-norm, as the Olmo 2 paper found that it stabilizes training.

3) Interestingly, the 7B model still uses multi-head attention, similar to Olmo 2. However, to make things more efficient and reduce the KV cache size, it now uses sliding-window attention (similar to Gemma 3); see the sketch at the end of this post.

Next, the 32B model (the figure is not shown here due to space reasons, but you can find it in my "The Big LLM Architecture Comparison" article or my Olmo 3 from-scratch notebook):

4) Overall, it's the same architecture, just scaled up. The proportions (e.g., going from the input size to the intermediate size in the feed-forward layer, and so on) roughly match those in Qwen3.

5) My guess is that the architecture started out somewhat smaller than Qwen3 due to the smaller vocabulary, and they then scaled up the feed-forward expansion from 5x in Qwen3 to 5.4x in Olmo 3 to arrive at a 32B model for a direct comparison.

6) Also note that the 32B model (finally!) uses grouped-query attention.

And yes, I also did a from-scratch implementation. It was still a lot of work, but since I had already implemented Qwen3 from scratch, as well as Gemma 3 (for the sliding-window attention component), it wasn't too bad! If you are a coder, looking at the from-scratch implementation is probably the easiest way to understand the architecture: https://lnkd.in/gQgxtVUu
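To make points 2), 3), and 6) concrete, here is a minimal PyTorch sketch, not the actual Olmo 3 code: the class names, dimensions, window size, and the simplified (non-gated) SiLU feed-forward are my own illustrative choices, and it assumes a recent PyTorch with nn.RMSNorm. It shows a post-norm block whose attention combines a causal sliding-window mask with optional grouped-query attention.

```python
import torch
import torch.nn as nn


class SlidingWindowGQA(nn.Module):
    """Causal self-attention with a sliding window and optional grouped-query
    attention (n_kv_heads < n_heads). Setting n_kv_heads == n_heads gives
    plain multi-head attention, as described for the 7B model above."""

    def __init__(self, emb_dim, n_heads, n_kv_heads, window):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = emb_dim // n_heads
        self.window = window
        self.q_proj = nn.Linear(emb_dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(emb_dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(emb_dim, n_kv_heads * self.head_dim, bias=False)
        self.out_proj = nn.Linear(n_heads * self.head_dim, emb_dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)

        # Grouped-query attention: each key/value head serves a group of query heads
        group = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group, dim=1)
        v = v.repeat_interleave(group, dim=1)

        scores = q @ k.transpose(-2, -1) / self.head_dim**0.5

        # Causal + sliding-window mask: each token attends to itself and
        # at most the previous (window - 1) tokens
        idx = torch.arange(t, device=x.device)
        future = idx[None, :] > idx[:, None]
        too_far = (idx[:, None] - idx[None, :]) >= self.window
        scores = scores.masked_fill(future | too_far, float("-inf"))

        out = torch.softmax(scores, dim=-1) @ v
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))


class PostNormBlock(nn.Module):
    """Transformer block with the post-norm flavor described above: RMSNorm is
    applied to each sublayer's output before the residual add, x + Norm(Attn(x)),
    rather than to its input as in pre-norm blocks."""

    def __init__(self, emb_dim=256, n_heads=8, n_kv_heads=8, window=64, ffn_mult=5.4):
        super().__init__()
        self.attn = SlidingWindowGQA(emb_dim, n_heads, n_kv_heads, window)
        hidden = int(emb_dim * ffn_mult)
        # Simplified feed-forward (plain SiLU MLP) just to show the expansion ratio
        self.ffn = nn.Sequential(
            nn.Linear(emb_dim, hidden, bias=False),
            nn.SiLU(),
            nn.Linear(hidden, emb_dim, bias=False),
        )
        self.attn_norm = nn.RMSNorm(emb_dim)
        self.ffn_norm = nn.RMSNorm(emb_dim)

    def forward(self, x):
        x = x + self.attn_norm(self.attn(x))  # post-norm residual, attention
        x = x + self.ffn_norm(self.ffn(x))    # post-norm residual, feed-forward
        return x


if __name__ == "__main__":
    block = PostNormBlock(n_kv_heads=2)  # n_kv_heads < n_heads -> grouped-query attention
    x = torch.randn(2, 128, 256)         # (batch, seq_len, emb_dim)
    print(block(x).shape)                # torch.Size([2, 128, 256])
```

Setting n_kv_heads equal to n_heads mirrors the 7B's multi-head setup, while a smaller n_kv_heads mirrors what the 32B does with grouped-query attention to shrink the KV cache.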
Sebastian Raschka, PhD — have you had a chance to look at the Samsung TRM? Or did I first hear about it from you? 🧐🤔🤣 https://arxiv.org/html/2510.04871v1
The pace is accelerating beyond anything I’ve seen. Do you think we’ve ever had a release cycle comparable to this, even during the OS wars or the browser years?
Can you make a video on their training pipeline, post-training, etc.?
My friends at NotebookLM said: "This report details a meticulously engineered pipeline. It really sets a new standard for transparency." The amount of work they put into this release is crazy. They generated a lot of datasets just because the existing ones did not have permissive licenses. I like how they were able to build something good while throwing away 84.6% of the initial data. In general, it is always fun and refreshing to read Olmo's technical reports.
Why is no one using DeepSeek's MLA for KV cache optimization?
Super interesting breakdown. I went through your Olmo 3 vs. Qwen3 notes and then re-watched your “Build an LLM” attention talk on YouTube to refresh the mental model. It’s amazing how much clearer the architecture comparisons feel once you’ve actually seen the components implemented from scratch. A few things stood out to me from your post:
• The post-norm choice in Olmo 3 feels like a very “earned through experiments” decision; the OLMo 2 paper’s comments about stability really come through here.
• The sliding-window attention in the 7B is a smart optimization. After seeing you code the attention loop manually, it’s interesting how much of transformer engineering is just “make this core loop cheaper without breaking it.”
• And the proportion matching with Qwen3 in the 32B variant makes a lot of sense, especially the 5× → 5.4× MLP expansion tweak. Easy to miss unless you’ve done the from-scratch implementation work yourself.
Since you’ve now built Qwen3, Gemma-style sliding-window attention, and Olmo 3 from scratch, which architectural pattern felt the cleanest to implement directly in code?
Always appreciate these deep dives! They help me stay up to date with the latest models.
Sebastian Raschka, PhD, thank you for sharing. As always, looking forward to the complete breakdown of Olmo 3. Once again, a big thank you for democratising this knowledge.
Olmo 3 is a win for transparency.
Sebastian Raschka, PhD, do you have any plans to include these architectures in your book? It would be interesting to see them all in an updated edition.