Modular Manifolds: A New Approach to Neural Network Training

Thinking Machines Lab

134,197 followers

8mo Edited

Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices. https://lnkd.in/gDyaXr-f We explore a fundamental understanding of the geometry of neural network optimization.

Modular Manifolds thinkingmachines.ai

12 Comments

Simona Vargiu 8mo

Ciao, I found this work really fascinating! But I’m left with two key questions:
- How many degrees of freedom are actually lost when the system is constrained?
- How can it readapt and evolve if it remains constrained?
Reading through it, I wondered whether more flexibility could be introduced. For example, instead of always projecting the weights W onto the manifold M, one could define an “effective” state as a dynamic interpolation:
W_eff = (1 - α(W)) W + α(W) Π_M(W), with α(W) = 1 / (1 + κ · dist(W, M)^2), 0 < α < 1
This way, the degrees of freedom are never completely lost, but are dynamically renormalized depending on the distance from the manifold. Of course, there are practical challenges:
- Computing the distance to M is not trivial (geodesic distance requires SVD).
- The interpolation can introduce oscillations.
- The stability guarantees are weaker compared to Muon.
But perhaps this is precisely where there’s room to balance stability and adaptability.
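The interpolation proposed in the comment above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not something taken from the Modular Manifolds post: M is assumed to be the Stiefel manifold of tall matrices with orthonormal columns, Π_M is assumed to be the polar-factor projection computed by SVD, and dist(W, M) is assumed to be the Frobenius distance from W to its projection rather than a geodesic distance. The function names and the default κ are hypothetical.

```python
import numpy as np

def project_to_stiefel(W):
    """Polar factor U V^T of a tall, full-rank W: the closest matrix
    with orthonormal columns in Frobenius norm (assumed choice of Pi_M)."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def effective_weights(W, kappa=1.0):
    """W_eff = (1 - alpha) W + alpha Pi_M(W), alpha = 1 / (1 + kappa * dist^2).

    dist is the Frobenius distance from W to its projection, which is an
    assumption made for this sketch (the comment does not fix a distance).
    """
    P = project_to_stiefel(W)
    dist = np.linalg.norm(W - P)            # Frobenius norm by default
    alpha = 1.0 / (1.0 + kappa * dist**2)
    W_eff = (1.0 - alpha) * W + alpha * P
    return W_eff, alpha, dist
```

With this choice of α, the pull toward the manifold is strongest close to M (α approaches 1 as the distance goes to 0) and fades far from it (α approaches 0), so a matrix that has drifted far from M is barely corrected. Whether that behaviour is desirable is one of the open questions the comment raises.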

5 Reactions

Shiqian Ma 8mo

The ideas are about normalization. Imposing a manifold constraint on the NN normalizes the weights, and using the spectral norm normalizes the gradients. This helps stabilize the training: nothing gets too big or too small. Clipping is in a similar spirit. But the algorithm is still not well understood. It is closely related to Riemannian spectral GD (which hasn't received much attention) but is not exactly the same. It is more like a combination of Riemannian Frank-Wolfe (FW) and Riemannian trust-region, with the trust region defined by the spectral norm instead of the traditional Euclidean norm.

2 Reactions

amber eltaieb 8mo

I would offer the following thought: "From Neural Roads to Modular Manifolds: What Machine Learning Can Learn from Human Repair" Whether it’s neural nets or trauma recovery, structure shapes behavior. Modularity, boundaries, and intentional design reduce chaos and improve learning. AI has a lot to learn from healing systems.

3 Reactions

Bibi Brahim, AI Architect - Gen AI Full-stack developer 8mo

Thinking Machines Lab It doesn't have to be difficult to be efficient. I found an economical solution, based on emergent algebra, discussed in this paper: https://zenodo.org/records/17239568

Cyrus Azamfar, PhD 8mo

Interesting take on weight normalization—curious to see how it evolves!

1 Reaction

Subho Majumdar, PhD 8mo

Finally something that looks less like a bag of magic tricks. Thank you.

1 Reaction

Marc Delacroix

VP of Operations at Make.com | 2 Startup Exits | Building Products People Love

8mo

You guys are rockstars! Thank you for sharing these!

1 Reaction

artinet 8mo

🥳🎉

George Groves 4mo

VyTek: Intelligent Technology

More Relevant Posts

Dan Furman
8mo Edited
"Efficient training of neural networks is difficult" is an understatement. It's very difficult, maybe even "extremely" I'd say. So kudos to Thinking Machines for introducing a fresh new approach here, an interesting framework for solving this seriously hard problem! Imho (1) attention heads should live on real-time cognitive response data manifolds, that is a clear next step for testing and (2) treating human emotion data as its own submanifold in a "non-Riemannian world" may be very fertile ground for further boosting efficiency. That was also part of our thought process behind our RLBF system https://lnkd.in/gk9cwvVh #ThinkingMachines #AI #Arctop #WeekendReading

Hammad Armghan, PhD
8mo Edited
I read this interesting post about manifold optimization that made me think about something important: even though we've come a long way, training large language models is still surprisingly fragile. We've made considerable progress with adaptive optimizers like Adam, better learning rate schedules, gradient clipping, and scaling rules that work for a wide range of sizes. But we still have a lot of problems that make training seem more like an art than a science.
The main difficulties still exist: tweaking hyperparameters takes a lot of trial and error, training can go horribly wrong if the initialization or learning rates are wrong, and we don't have strong theoretical assurances for the non-convex landscapes we're optimizing. Scaling rules help us understand loss curves better, but they don't tell us anything about how training works or why some configurations work better than others.
When you compile code, you always get the same outputs, which is what I thought of when I heard the analogy to software compilation. Training a neural network means finding the best way to solve complex, non-convex problems where new behaviors can happen at any time. We aren't quite as reliable as we could be at the compilation level, but we could be considerably closer than we are now.
We need algorithms that can automatically change learning rates for different parts of the network, stronger theoretical bases for training stability, and better ways to guess how changes to the architecture or data will affect the way training works. The manifold optimization approach looks promising because it uses geometric ideas, but like many other specialized methods, it needs more real-world testing before we can be sure of its effects.
It's important to note that a lot of the ongoing problems with training come from bad data and limited computing power, not just the optimization techniques themselves.
But for researchers new to AI, optimization is still one of the most important areas where basic contributions could change the whole field. There is still a big gap between what we know in theory and what we need in practice. Closing that gap could make training at scale more dependable and efficient.

Shiqian Ma
8mo Edited
This blog article from Thinking Machines Lab is attracting lots of attention now. It is very exciting to see that Riemannian/manifold optimization is playing a pivotal role here. Since this is related to things I have been actively working on in the past few years, I would like to take this opportunity to share some insights.
First, what is Muon? We all know gradient descent, which can be interpreted as minimizing the first-order approximation to the loss function with the distance measured by the Euclidean norm. Muon is gradient descent but with distance measured by the matrix spectral norm. So, essentially, it is spectral gradient descent.
This blog article proposes manifold Muon. There are two main ideas: (i) Imposing a manifold constraint on the parameter matrix. Then the problem becomes minimizing the loss over a manifold, which is a Riemannian optimization problem. Imposing manifold constraints on neural network weights is not a new idea; it has been studied in the literature. (ii) Solving this Riemannian optimization problem by Riemannian spectral gradient descent. So essentially this is extending the Muon idea to manifolds.
The (*) equation in the blog article is actually the classical Riemannian gradient descent. Maybe a better way to write it is to move the size constraint to the objective, which gives
min_a a^T g + 1/(2\eta) ||a||_2^2, s.t. a^T w = 0.
This is precisely Riemannian gradient descent with \eta being the learning rate and a^T w = 0 defining the tangent space. Manifold Muon replaces the Euclidean norm ||a||_2 by the matrix spectral norm when a is a matrix, which gives exactly the manifold Muon equation in the blog article. So essentially this is Riemannian spectral gradient descent.
More thoughts about normalization. Imposing a manifold constraint on the NN normalizes the weights, and using the spectral norm normalizes the gradients. This helps stabilize the training: nothing gets too big or too small. Clipping is in a similar spirit.
Taking this opportunity, I would like to advertise two recent works from my lab and collaborators that are closely related. The first one is the ASGO algorithm (NeurIPS 2025), which is a close variant of Muon and Shampoo and has demonstrated promising practical performance. The second one (will be on arXiv soon) is a tuning-free Riemannian gradient descent algorithm, which is learning-rate-free and hyperparameter-free. This can potentially be applied to manifold Muon to solve the issue of tuning the learning rate and hyperparameters.
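The two update rules contrasted in this post can be checked numerically. The sketch below is an illustration under stated assumptions, not code from the blog article or from the author of the post. For the penalized problem min_a a^T g + 1/(2\eta) ||a||_2^2 s.t. a^T w = 0, setting the gradient of the Lagrangian to zero gives the closed form a = -\eta (g - (w^T g / w^T w) w), i.e. a gradient step projected onto the tangent space. For the spectral-norm version without a manifold constraint, min_A <G, A> s.t. ||A||_spectral <= \eta, a minimizer is A = -\eta U V^T, where G = U S V^T is the reduced SVD. The function names are hypothetical, and an exact SVD is used here for clarity; practical Muon implementations typically add momentum and approximate U V^T iteratively.

```python
import numpy as np

def riemannian_gd_step(w, g, eta):
    """Closed-form minimizer of a^T g + ||a||_2^2 / (2 * eta) s.t. a^T w = 0.

    The gradient g is projected onto the tangent space {a : a^T w = 0}
    and scaled by -eta.
    """
    g_tangent = g - (w @ g) / (w @ w) * w
    return -eta * g_tangent

def spectral_steepest_step(G, eta):
    """A minimizer of <G, A> subject to ||A||_spectral <= eta.

    With the reduced SVD G = U S V^T, the step is -eta * U V^T
    (unconstrained case; no manifold constraint is imposed here).
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return -eta * (U @ Vt)
```

The first function returns a vector orthogonal to w; the second returns a matrix whose nonzero singular values all equal \eta, which is the sense in which the spectral norm "normalizes the gradients".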

Xianxin Guo
8mo
Very interesting to see this progress from Thinking Machines Lab. We explored this exact approach at Lumai as well. More interestingly, we considered it from the optical computing perspective, as it solves some critical hardware problems.

Stephen Pimentel
8mo
Normalization keeps neural networks healthy by stabilizing the scale of activations, gradients, and especially weights, which improves learning speed, predictability, and robustness. Common practice normalizes activations with layer norm and gradients with methods like Muon, but extending normalization to weight matrices offers further benefits such as preventing norm explosions, simplifying hyperparameter tuning, improving conditioning, and enabling Lipschitz guarantees. One approach constrains weights to live on specific manifolds and optimizes directly in each layer’s tangent space so the learning rate matches the true step length, followed by a small retraction back to the manifold. Specializing to matrices, constraining weights to the Stiefel manifold controls singular values, while measuring step size with the spectral norm yields a manifold version of Muon solved via dual ascent and the matrix sign function for efficient retractions. https://lnkd.in/gAAagur2
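The tangent-step-then-retract pattern described above can be illustrated with a small NumPy sketch. Note the assumptions: this is plain Riemannian gradient descent on the Stiefel manifold with a Euclidean tangent projection, not the manifold Muon algorithm itself (which measures the step in the spectral norm and is solved via dual ascent). The matrix sign function is computed here by an exact SVD for clarity, and all function names are hypothetical.

```python
import numpy as np

def msign(X):
    """Matrix sign / polar factor: sets every singular value of X to 1."""
    U, _, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ Vt

def tangent_project(W, G):
    """Euclidean projection of G onto the tangent space of the Stiefel
    manifold at W, i.e. onto {A : A^T W + W^T A = 0}, assuming W^T W = I."""
    WtG = W.T @ G
    return G - W @ (WtG + WtG.T) / 2

def stiefel_gd_step(W, G, eta):
    """One step: move along the projected negative gradient, then retract
    back onto the manifold with the matrix sign function."""
    A = -eta * tangent_project(W, G)
    return msign(W + A)
```

After each step the weight matrix again has all singular values equal to 1, which is the property the post refers to when it says the Stiefel constraint controls singular values.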

Virendra Kumar
8mo
Revolutionizing Neural Network Training with Modular Manifolds
Jeremy Bernstein's recent article, "Modular Manifolds," introduces a game-changing approach to training large neural networks by constraining weight matrices to manifolds, like the Stiefel manifold, using a novel manifold Muon optimizer. This ensures stable tensor scaling, preventing issues like exploding weights and enabling more predictable training. The concept of modular manifolds extends this to entire networks, intelligently budgeting learning rates across layers based on Lipschitz sensitivity. This framework promises more robust, scalable training for complex models like transformers, with early experiments showing improved accuracy over AdamW. This opens doors to co-designing architectures and optimizers for better performance. Check out the full article for a deep dive into manifold optimization and its potential to reshape how we train large-scale models! 🚀 https://lnkd.in/gCXM6rpB

Ali Alauoubiy
8mo
Keeping Neural Networks 'Healthy'
Training large neural networks is all about stability. If tensors (whether weights, activations, or gradients) grow too big or shrink too small, learning slows down, optimization gets messy, and results can fall apart. That’s why normalization has become standard practice:
- Activations are kept in check with methods like LayerNorm
- Gradients are scaled with optimizers like Muon
- Weights are often left alone, even though they matter just as much
Normalizing weights can make a big difference:
- More stable training without exploding weights
- Updates that are easier to tune and interpret
- Stronger robustness against perturbations
In Thinking Machines Lab's latest work, they looked at a different approach: constraining weight matrices to mathematical submanifolds. This lets us rethink optimization and design algorithms that naturally fit within those constraints.
=> One example is a manifold-based version of the #Muon optimizer, where weights live on the Stiefel manifold (matrices with unit condition number).
They're also introducing the idea of modular manifolds: building blocks that can be combined to help train and scale larger, more reliable networks. https://lnkd.in/dt4aFTub

Amin Iranpour
7mo
Current North American design standards can overestimate the lateral-torsional buckling capacity of T-section beams by more than 40%. Our latest research, published in Thin-Walled Structures, explains why these deviations occur and introduces improved solutions using an energy-based method combined with artificial neural networks. 📄 Read the full article: https://lnkd.in/geVCTKSh #StructuralEngineering #SteelDesign #LateralTorsionalBuckling #FiniteElementAnalysis #ResearchToPractice

Design expressions for distortional lateral buckling of beams with T-sections sciencedirect.com
John Blair
8mo
Is AI coming for our jobs? The short answer is 'yes'. Only parts of it for the moment, but the rest is a matter of time. While the test "spans 44 occupations and hundreds of knowledge work tasks, it is limited to one-shot evaluations, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts." https://lnkd.in/gMTvqmtW

Measuring the performance of our models on real-world tasks openai.com
Jonas Porcar Ferrer
7mo
How good is AI at real jobs? Will my job be completely or partially replaced by AI in the near future? 🤔 Well, OpenAI just gave us a major clue. They've released a new evaluation called #GDPval, and this is not another academic test. This benchmark measures AI performance on 1,300+ real-world tasks pulled directly from the daily work of professionals across 44 different real occupations in the most relevant GDP sectors. Think lawyers, software developers, nurses, or engineers. I think the early results are promising:
✅ AI vs. The Pros: In blind tests, the AI model that scored the best (Claude Opus 4.1) produced work that experts rated as good as or better than a human professional's in nearly half of the tasks (~47%).
📈 Rapid Improvement: Performance has been skyrocketing, more than doubling in just a year from GPT-4o to GPT-5 on this GDPval benchmark.
⚡ A Boost for Efficiency: While human oversight is still key, the raw data shows models can complete these tasks 100x faster and cheaper than unaided experts.
📊 Productivity upside: With human-in-the-loop review, AI assistance showed 1.3–1.6× gains in speed and cost savings, especially when paired with structured scaffolding.
🚧 Clear gaps remain: Instruction-following and formatting are still major failure points, and models struggle more when context is limited.
I think it is clear what we all are starting to feel in some way or another at our daily tasks: AI is becoming a powerful assistant to handle most of our routine work so we can focus on the creative, strategic and ambiguous parts of our jobs.
Link: https://lnkd.in/dnpB5NGB #AI #FutureOfWork #GDPval #OpenAI #Innovation #Productivity #ArtificialIntelligence

Measuring the performance of our models on real-world tasks openai.com
