Efficient training of neural networks is difficult. Our second Connectionism post introduces Modular Manifolds, a theoretical step toward more stable and performant training by co-designing neural net optimizers with manifold constraints on weight matrices. https://lnkd.in/gDyaXr-f We explore a fundamental understanding of the geometry of neural network optimization.
The ideas are about normalization. Imposing manifold constraints on the NN normalizes the weights, and using the spectral norm normalizes the gradients. This helps stabilize training: nothing gets too big or too small. Gradient clipping is in a similar spirit. But the algorithm is still not well understood. It is closely related to Riemannian spectral GD (which hasn't received much attention) but not exactly the same. It is more like a combination of Riemannian Frank-Wolfe (FW) and a Riemannian trust-region method, with the trust region defined by the spectral norm instead of the traditional Euclidean norm.
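To make the two normalizations in the comment above concrete, here is a minimal NumPy sketch. It is not the algorithm from the post: it assumes the manifold is the set of matrices with orthonormal columns (the Stiefel manifold), normalizes the gradient by setting all its singular values to 1, and then projects the result back onto the manifold. The function names, the learning rate, and the choice of projection as the retraction are illustrative assumptions.

```python
import numpy as np

def spectral_unit_direction(G):
    """Set every singular value of G to 1 (assumed gradient normalization).

    For G = U S V^T, the matrix U V^T has spectral norm 1 and maximizes
    the inner product <G, A> over all A with spectral norm <= 1.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def project_to_stiefel(W):
    """Nearest matrix with orthonormal columns in Frobenius norm (polar factor)."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def constrained_step(W, G, lr=0.1):
    """One illustrative step: normalized update, then back onto the manifold.

    Assumption: a plain step followed by projection stands in for the
    tangent-space machinery a proper Riemannian method would use.
    """
    update = spectral_unit_direction(G)       # gradient is normalized
    return project_to_stiefel(W - lr * update)  # weights are normalized

# Hypothetical usage with random data (tall 6x4 weight matrix).
rng = np.random.default_rng(0)
W0, _ = np.linalg.qr(rng.standard_normal((6, 4)))  # start on the manifold
G = rng.standard_normal((6, 4))                    # stand-in gradient
W1 = constrained_step(W0, G, lr=0.1)
```

In this sketch the update size is fixed by the spectral norm regardless of the raw gradient scale, and the weights always return to the manifold, which is the "nothing too big or too small" behavior the comment describes.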
I would offer the following thought: “From Neural Roads to Modular Manifolds: What Machine Learning Can Learn from Human Repair.” Whether it’s neural nets or trauma recovery, structure shapes behavior. Modularity, boundaries, and intentional design reduce chaos and improve learning. AI has a lot to learn from healing systems.
Thinking Machines Lab It doesn't have to be difficult to be efficient. I found an economical solution, based on emergent algebra, discussed in this paper: https://zenodo.org/records/17239568
Interesting take on weight normalization—curious to see how it evolves!
Finally something that looks less like a bag of magic tricks. Thank you.
You guys are rockstars! Thank you for sharing these!
🥳🎉
Ciao, I found this work really fascinating! But I’m left with two key questions:
- How many degrees of freedom are actually lost when the system is constrained?
- How can it readapt and evolve if it remains constrained?
Reading through it, I wondered whether more flexibility could be introduced. For example, instead of always projecting the weights W onto the manifold M, one could define an “effective” state as a dynamic interpolation:
W_eff = (1 - α(W)) W + α(W) Π_M(W), with α(W) = 1 / (1 + κ · dist(W, M)^2), so 0 < α ≤ 1 (α = 1 exactly on the manifold).
This way, the degrees of freedom are never completely lost, but are dynamically renormalized depending on the distance from the manifold. Of course, there are practical challenges:
- Computing the distance to M is not trivial (geodesic distance requires SVD).
- The interpolation can introduce oscillations.
- The stability guarantees are weaker compared to Muon.
But perhaps this is precisely where there’s room to balance stability and adaptability.
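The interpolation proposed in the comment above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: M is taken to be the set of matrices with orthonormal columns (the comment does not say which manifold), Π_M is the Frobenius-nearest projection computed from an SVD, and dist(W, M) is the Frobenius distance to that projection, not a geodesic distance. The names `project_stiefel`, `effective_weights`, and the default `kappa` are hypothetical.

```python
import numpy as np

def project_stiefel(W):
    """Assumed projection onto M: nearest matrix with orthonormal columns."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def effective_weights(W, kappa=1.0):
    """W_eff = (1 - alpha) * W + alpha * Pi_M(W), alpha = 1 / (1 + kappa * dist^2).

    dist is the Frobenius distance from W to its projection (an assumption;
    the comment leaves the distance unspecified).
    """
    P = project_stiefel(W)
    dist = np.linalg.norm(W - P)          # Frobenius norm by default
    alpha = 1.0 / (1.0 + kappa * dist**2)
    W_eff = (1.0 - alpha) * W + alpha * P
    return W_eff, alpha

# Hypothetical usage: a point on M and a scaled copy that is off M.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 3)))  # orthonormal columns
W_on, alpha_on = effective_weights(Q)             # already on the manifold
W_off, alpha_off = effective_weights(3.0 * Q)     # far from the manifold
```

With this formula α equals 1 on the manifold and shrinks as W moves away, so distant weights are pulled back only weakly. That is the trade-off the comment raises: more freedom far from M, at the cost of a weaker constraint.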