
Thibaut Boissin

@thib-s.bsky.social

10 Followers  |  29 Following  |  25 Posts  |  Joined: 09.01.2025

Latest posts by thib-s.bsky.social on Bluesky

So in short:

AOL preconditioning (fused + re-tuned) -> 1 iter saved

Better convergence, singular values closer to 1

Kernel tweak removes extra memory load

This gives ~1.6x speedup, ~3x vs plain torch. 🚀

21.09.2025 20:06 – 👍 0    🔁 0    💬 0    📌 0

Bonus: I spotted redundant memory loads in the 3rd line of the NS iteration.
Wrote a small kernel to optimize bandwidth -> more free speed.
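
For intuition, here is a minimal PyTorch-level sketch of the idea, not the repo's Triton kernel: torch.addmm lets the GEMM's alpha/beta epilogue absorb the scaled additions of the quintic NS step, instead of separate elementwise passes that re-read A and X. The coefficients a, b, c are placeholders for whichever tuned set is used.

```python
import torch

def ns_step_fused(X, a, b, c):
    """One quintic Newton-Schulz step: X <- a*X + (b*A + c*A@A) @ X, with A = X X^T.

    Sketch only: torch.addmm folds the scaled additions into the GEMM's
    alpha/beta epilogue, avoiding extra elementwise kernels that would
    re-read A and X from memory. The actual flash-newton-schulz kernel
    does its bandwidth optimization directly in Triton.
    """
    A = X @ X.mT                                    # Gram matrix
    B = torch.addmm(A, A, A, beta=b, alpha=c)       # b*A + c*(A @ A)
    return torch.addmm(X, B, X, beta=a, alpha=1.0)  # a*X + B @ X
```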

21.09.2025 20:06 – 👍 0    🔁 0    💬 1    📌 0

Problem 1: AOL adds extra cost.
Fix: fuse AOL's operation with an existing NS step -> essentially free.

Problem 2: NS isn't tuned for "almost orthogonal" inputs.
Fix: re-tune parameters with a genetic algorithm that is aware of the preconditioning.
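
The thread does not spell out the genetic algorithm, so this is only a toy sketch of the second fix, under one assumption: because the input is preconditioned, the coefficients only have to handle singular values in a band like [lo, 1] rather than arbitrarily small ones. A simple evolutionary random search in NumPy, purely illustrative:

```python
import numpy as np

def ns_poly(sigma, coeffs):
    """Apply k quintic NS steps to singular values: s -> a*s + b*s**3 + c*s**5."""
    s = sigma
    for a, b, c in coeffs:
        s = a * s + b * s**3 + c * s**5
    return s

def tune_coeffs(n_iters=4, lo=0.3, pop=256, gens=200, sigma0=0.3,
                rng=np.random.default_rng(0)):
    """Toy evolutionary search for NS coefficients, 'aware' of preconditioning:
    the fitness only looks at inputs in [lo, 1], the band where a preconditioned
    matrix is assumed to put its singular values. (Illustrative only; the post's
    genetic algorithm and search space are not public in this thread.)
    """
    grid = np.linspace(lo, 1.0, 512)
    # start from the plain cubic NS step (1.5, -0.5, 0) repeated n_iters times
    best = np.tile(np.array([1.5, -0.5, 0.0]), (n_iters, 1))
    best_err = np.abs(ns_poly(grid, best) - 1.0).max()
    step = sigma0
    for _ in range(gens):
        cand = best + step * rng.standard_normal((pop, n_iters, 3))
        errs = np.array([np.abs(ns_poly(grid, c) - 1.0).max() for c in cand])
        i = errs.argmin()
        if errs[i] < best_err:
            best, best_err = cand[i], errs[i]
        step *= 0.99                      # slowly anneal the mutation size
    return best, best_err
```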

21.09.2025 20:06 – 👍 0    🔁 0    💬 1    📌 0

The inspiration comes from Bernd Prach's Almost Orthogonal Layer (AOL).
It gives a cheap way to make a matrix "almost orthogonal."

Not great for full orthogonalization, but much better than rescaling -> perfect as a preconditioner for NS.
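
Concretely, the AOL rescaling from Prach & Lampert is just a diagonal scale built from the Gram matrix. A minimal PyTorch sketch (the fused variant in flash-newton-schulz may differ):

```python
import torch

def aol_rescale(W, eps=1e-12):
    """Almost-Orthogonal-Layer rescaling: right-multiply W by
    D = diag( sum_i |W^T W|_{ij} )^{-1/2}.  The result W @ D is guaranteed to
    have spectral norm <= 1, and if W is already close to orthogonal its
    singular values stay close to 1, which is exactly what you want before
    running Newton-Schulz.
    """
    G = W.mT @ W                                   # Gram matrix, n x n
    d = G.abs().sum(dim=0).clamp_min(eps).rsqrt()
    return W * d                                   # broadcasts d over columns

# quick check of the norm bound on a random matrix:
W = torch.randn(256, 128)
print(torch.linalg.matrix_norm(aol_rescale(W), ord=2))   # <= 1
```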

21.09.2025 20:06 – 👍 0    🔁 0    💬 1    📌 0

The key idea: reduce the number of NS iterations.
How? By pre-conditioning the input matrix.

This makes the algorithm converge faster without losing precision.
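
For reference, this is the baseline being improved on: the quintic Newton-Schulz loop as popularized by Muon, whose only preconditioning is division by the Frobenius norm. A sketch using the coefficients from the public Muon reference implementation, not the re-tuned ones from this thread:

```python
import torch

def newton_schulz(G, steps=5, coeffs=(3.4445, -4.7750, 2.0315), eps=1e-7):
    """Quintic Newton-Schulz orthogonalization, in the style used by Muon/Dion.

    The usual preconditioning is scalar: dividing by the Frobenius norm puts
    every singular value in (0, 1].  The thread's point is that a better
    preconditioner (AOL) squeezes the singular values into a tighter band
    near 1, so one fewer iteration is needed for the same accuracy.
    """
    a, b, c = coeffs
    X = G / (G.norm() + eps)          # scalar preconditioning
    transposed = G.size(0) > G.size(1)
    if transposed:                    # keep the Gram matrix on the small side
        X = X.mT
    for _ in range(steps):
        A = X @ X.mT
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.mT if transposed else X
```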

21.09.2025 20:06 – 👍 0    🔁 0    💬 1    📌 0
GitHub - thib-s/flash-newton-schulz: My attempt to improve the speed of the newton schulz algorithm, starting from the dion implementation.

here's the code: github.com/thib-s/flash... (I'll do a PR soon in Dion/Muon)

And here's how I squeezed out the extra gain

21.09.2025 20:06 – 👍 0    🔁 0    💬 1    📌 0
Post image

I used a mathematical trick to pre-condition the matrix, allowing me to shave one iteration off the algorithm. This is not only faster, but also unlocks better convergence, with singular values closer to 1.
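
A quick way to see "singular values closer to 1" on a toy example, assuming the preconditioner is the AOL-style rescaling discussed elsewhere in the thread (sketch, not the repo's code):

```python
import torch

# Rough numeric check of the claim, assuming an AOL-style diagonal rescaling as
# the preconditioner (the exact trick used in the repo may differ).
torch.manual_seed(0)
W = torch.randn(1024, 1024)

fro = W / W.norm()                                # usual scalar preconditioning
aol = W * (W.mT @ W).abs().sum(dim=0).rsqrt()     # AOL-style preconditioning

for name, X in [("frobenius", fro), ("aol", aol)]:
    s = torch.linalg.svdvals(X)
    print(f"{name:9s} sigma_max={s.max():.3f}  sigma_min={s.min():.2e}")
# The AOL-scaled matrix starts with its largest singular values much closer to 1,
# so Newton-Schulz has less work to do to push the whole spectrum toward 1.
```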

21.09.2025 20:06 – 👍 0    🔁 0    💬 1    📌 0
Post image

Good news: I managed to get an extra 1.6x speedup of the Newton-Schulz algorithm (which is at the core of Dion/Muon). It reaches nearly a 3x speedup over the plain torch implementation!

21.09.2025 20:06 – 👍 0    🔁 0    💬 1    📌 0

What is the S_n^++?

10.08.2025 10:17 – 👍 0    🔁 0    💬 1    📌 0

It's crazy to think that I spent years using the Björck & Bowie algorithm with 25 iters, and within a year we got the NS algorithm, an optimized set of parameters to run it in 5 iters, and Triton kernels.
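
For context, a sketch of the first-order Björck & Bowie update (long a workhorse for orthogonalizing Lipschitz layers): with beta = 0.5 it coincides with the cubic Newton-Schulz step, and its slow contraction toward sigma = 1 is what costs the ~25 iterations.

```python
import torch

def bjorck_bowie(W, iters=25, beta=0.5):
    """First-order Bjorck & Bowie orthogonalization: X <- X + beta * X (I - X^T X).

    With beta = 0.5 this is exactly the cubic Newton-Schulz step; it just needs
    many more iterations than a re-tuned quintic.  Sketch for illustration;
    the Frobenius normalization below is a crude way to keep the singular
    values below sqrt(3) so the iteration converges.
    """
    X = W / W.norm()
    I = torch.eye(X.size(1), dtype=X.dtype, device=X.device)
    for _ in range(iters):
        X = X + beta * X @ (I - X.mT @ X)
    return X
```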

10.08.2025 10:15 – 👍 1    🔁 0    💬 0    📌 0

Large matrices are already compute-bound, so the gain is small for those; I will work on adding fp8 support (once the current code is consolidated).
I'll do a PR into the Dion repo when ready!

10.08.2025 10:15 – 👍 0    🔁 0    💬 1    📌 0
Post image

Sharing my journey learning Triton: still WIP, but IO optimization yields a decent runtime improvement (around 25% on 512x512) on Newton-Schulz (as used in Dion/Muon).

10.08.2025 10:15 – 👍 2    🔁 0    💬 1    📌 0
A meme showing on the first line "how it started" with a screen capture of a nice Triton tutorial, followed by "how it's going" with complex code about fp4 quantization for microscaling in some linear algebra algorithm.

My journey with Triton

07.08.2025 10:00 – 👍 0    🔁 0    💬 0    📌 0

Open Question: Does FP4 make fine-tuning easier or harder? On one side, FP4 weights might demand high-precision gradients; on the other, they might play very well with QLoRA. What do you think?

03.08.2025 11:01 – 👍 0    🔁 0    💬 0    📌 0

Robustness Check: Training in FP4 stress-tests hyperparameters and initialization quality.
If your model converges, you have robust, well-conditioned weights and gradients.
The model will likely be more resistant to input noise.

03.08.2025 11:01 – 👍 0    🔁 0    💬 1    📌 0

Not "Pure" FP4: FP4 rarely stands alone. It's usually accompanied by per-row or per-column scaling factors (FP8/FP16). Gradients are often accumulated at higher precision (FP16/FP32), making ultra-low precision practical.
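
A small simulation of that recipe, assuming an e2m1 value grid and a per-row scale kept in higher precision (illustrative only; real MX/NVFP4 kernels typically use block-wise scales and do the matmul accumulation in hardware):

```python
import torch

# The positive values representable by a 4-bit e2m1 float.
E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_fp4_per_row(x, eps=1e-12):
    """Simulated FP4 quantization with a per-row higher-precision scale.

    Each row is scaled so its max magnitude maps to 6 (the largest e2m1 value),
    rounded to the nearest grid point, then rescaled.  Real kernels keep the
    scale in FP8/FP16 and accumulate matmuls at higher precision; this is just
    a numerical illustration of the 'FP4 + scaling factors' recipe, not any
    particular hardware format.
    """
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps) / 6.0
    y = (x / scale).unsqueeze(-1)                # ..., n, 1
    idx = (y.abs() - E2M1).abs().argmin(dim=-1)  # nearest e2m1 grid point
    q = E2M1[idx] * y.squeeze(-1).sign()
    return q * scale                             # dequantized values

x = torch.randn(4, 16)
print((x - fake_fp4_per_row(x)).abs().max())     # worst-case quantization error
```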

03.08.2025 11:01 – 👍 0    🔁 0    💬 1    📌 0

Efficiency Boost: Halving precision (FP8 → FP4) allows doubling the parameter count at roughly the same compute budget. But the benefits can be even bigger because:
- Larger feature vectors improve activation utilization.
- Lower-precision floating-point math itself adds beneficial non-linearities.

03.08.2025 11:01 – 👍 0    🔁 0    💬 1    📌 0

It's likely better to have a larger model in FP4 than a smaller one in FP8 (if you can train it):
- Improved non-linearity utilization with larger feature vectors
- Enhanced hardware utilization on Blackwell architectures
- Stress-tests your training, yielding models robust to input noise

more below

03.08.2025 11:01 – 👍 0    🔁 0    💬 1    📌 0
Pay attention to your loss: understanding misconceptions about 1-Lipschitz neural networks Lipschitz constrained networks have gathered considerable attention in the deep learning community, with usages ranging from Wasserstein distance estimation to the training of certifiably robust class...

Funny enough, this is the content of Appendix N (page 34) of arxiv.org/abs/2104.05097 😂

25.07.2025 19:44 – 👍 0    🔁 0    💬 0    📌 0

This makes me wonder what happens in standard training: when your training loss increases, does it mean that optimization failed? Or that, thanks to weight decay, the network's (unknown) Lipschitz constant got lower and the network is just getting more robust? 🤷

25.07.2025 19:44 – 👍 0    🔁 0    💬 1    📌 0
Post image

This has deeper implications: two networks with different initialization, batch order, or data augmentation end up learning the same function (same answers, same errors, both in train and val), even though the weights are completely different!

25.07.2025 19:44 – 👍 0    🔁 0    💬 1    📌 0
Post image

The change in the Lipschitz constant makes the network more accurate (when increased) or more robust (when decreased). Unlike traditional classification, robust classification with a Lipschitz net has a unique minimizer once the Lipschitz constant is set.

25.07.2025 19:44 – 👍 0    🔁 0    💬 1    📌 0
Post image

The Lipschitz constant of a network impacts its robustness, but what happens when you change it during training? Here, we train 16 networks with a fixed Lipschitz constant at first, then increase or decrease it by a factor of two mid-training.

25.07.2025 19:44 – 👍 0    🔁 0    💬 1    📌 0
Post image

Beyond robustness: Lipschitz networks = stability.
Different inits, different seeds, different weights, same function.
A thread 🧵

25.07.2025 19:44 – 👍 1    🔁 0    💬 1    📌 0
Post image

Some bad, but creative, training losses 👌

10.06.2025 21:55 – 👍 1    🔁 0    💬 0    📌 0
