sageev @sageev - Bluesky Profile

Sharpn🔪ss predicts ✅generalz'n xcept when it doesn't,eg ✖️formers

But what's🔪?Which point is🔪er?
Find out
*why it's a tricky Q(hint: #symmetry)
*why our answer does let🔪predict generalz'n, even in ✅formers!
@ our #ICML2025 #spotlight E-2001 on Wed 11AM
by MF da Silva and F Dangel
@vectorinstitute.ai

15.07.2025 18:04 — 👍 0 🔁 0 💬 0 📌 0

I've been finding a way to get chatGPT to talk in a tone I haven't seen/read it take on before... interesting...
Just a few examples here. Note that the use of 'bold' was chatGPT's own choice, and I had not mentioned anything about safety nets in my interactions in this session.

22.06.2025 20:48 — 👍 0 🔁 0 💬 0 📌 0

YouTube video by Sageev Oore Smart Looper DK demo 2025-05-15 T4 montuno - Part 2 teaser

i am so superduper excited to show a little glimpse of this system we're developing for an improvising #music system with #AI! i can't explain how *actually* fun it is to play with this.

15sec teaser vid: youtu.be/onPetq4gJ18

blog: osageev.github.io/introducing-...

17.06.2025 15:25 — 👍 0 🔁 0 💬 0 📌 0

Yeah! (at least for most of it)

21.05.2025 09:16 — 👍 2 🔁 0 💬 0 📌 0

are you planning to go?

20.05.2025 12:19 — 👍 0 🔁 0 💬 1 📌 0

Is the website info about it incorrect? (Says wkshops on 18 & 19)

19.05.2025 18:07 — 👍 1 🔁 0 💬 1 📌 0

Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It
The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work reported weak correlation...

[8/8] 📖
"Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It"
🎓 Marvin F. da Silva, Felix Dangel, and Sageev Oore
🔗 arxiv.org/abs/2505.05409

Questions/comments? 👇we'd love to hear from you!
#ICML2025 #ML #Transformers #Riemannian #Geometry #DeepLearning #Symmetry

09.05.2025 12:46 — 👍 0 🔁 0 💬 0 📌 0

[7b/8] We see this even for traditional sharpness measures in the diagonal networks we study, but it is particularly striking in our vision transformers, where geodesically sharper minima actually generalize better!
Maybe flatness isn’t universal after all: context matters.

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[7a/8] 🔥 Surprising twist:
Interestingly, flatter is not always better!

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[6b/8]
🖼️ Vision Transformers (ImageNet): -0.41 (adaptive sharpness) → -0.71 (geodesic sharpness)
💬 BERT fine-tuned on MNLI: 0.06 (adaptive sharpness) → 0.38 (geodesic sharpness)

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[6a/8] 🔦 Why does this matter?
Perturbations are no longer arbitrary—they respect functional equivalences in transformer weights.
Result: Geodesic sharpness shows clearer, stronger correlations (as measured by the tau correlation coefficient) with generalization. 📈
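
As a rough illustration of how a rank correlation between sharpness and generalization can be computed, here is a minimal sketch. It assumes that "tau" refers to Kendall's tau (the tau-a variant, without tie correction). The function name `kendall_tau` and all numbers are hypothetical toy values, not results from the paper.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs.

    Assumes no ties; a tie-aware variant (tau-b) would need extra handling.
    """
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical toy values: one sharpness score and one generalization gap
# per trained model. These are NOT numbers from the paper.
sharpness = [0.1, 0.4, 0.2, 0.8, 0.5]
gen_gap = [1.0, 2.5, 1.2, 3.0, 2.0]

tau = kendall_tau(sharpness, gen_gap)
print(f"Kendall tau = {tau:.2f}")
```

A tau near +1 or -1 means the sharpness ranking of the models tracks the generalization ranking closely (in the same or the opposite order), while a tau near 0 means sharpness says little about generalization.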

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[5c/8] Geodesic Sharpness

Whereas traditional sharpness measures are evaluated inside an L^2 ball, we look at a geodesic ball in this symmetry-corrected space, using tools from Riemannian geometry.

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[5b/8] Geodesic Sharpness

Instead of considering the usual Euclidean metric, we look at metrics invariant both to symmetries of the attention mechanism and to previously studied re-scaling symmetries.
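
To make "symmetries of the attention mechanism" concrete, here is a minimal sketch of one such symmetry, assumed here for illustration: multiplying the query weights by an invertible matrix A and the key weights by the inverse transpose of A leaves the product W_Q W_K^T, and hence the attention logits, unchanged. The matrices and helper names below are hypothetical toy choices.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def inverse_2x2(m):
    (p, q), (r, s) = m
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

# Toy query/key weights (model dimension 3, head dimension 2); illustrative only.
W_Q = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
W_K = [[0.5, 1.0], [2.0, 0.0], [-1.0, 1.0]]

# Any invertible head-dimension matrix works; this one has determinant 1.
A = [[2.0, 1.0], [1.0, 1.0]]

# Transformed weights: W_Q -> W_Q A, W_K -> W_K A^{-T}.
W_Q_new = matmul(W_Q, A)
W_K_new = matmul(W_K, transpose(inverse_2x2(A)))

# The attention logits depend on the weights only through W_Q W_K^T.
qk = matmul(W_Q, transpose(W_K))
qk_new = matmul(W_Q_new, transpose(W_K_new))

max_diff = max(abs(u - v) for r1, r2 in zip(qk, qk_new) for u, v in zip(r1, r2))
weights_changed = W_Q_new != W_Q
print("weights changed:", weights_changed, "| max change in W_Q W_K^T:", max_diff)
```

The parameters move, but the function computed by the attention head does not, which is why a metric that is blind to this transformation can report different sharpness for the same model.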

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[5a/8] Our solution: ✨Ge🎯desic Sharpness✨

We need to redefine sharpness and measure it not directly in parameter space, but on a symmetry-aware quotient manifold.

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[4/8] The disconnect between the geometry of the loss landscape and the geometry of differential objects (e.g. the loss gradient norm) is already present for extremely simple models, i.e., for linear 2-layer networks with scalar weights.
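
A minimal sketch of this effect, assuming a toy model f(x) = w2 * w1 * x with squared-error loss: rescaling (w1, w2) to (a * w1, w2 / a) leaves the function and the loss unchanged, yet the gradient norm changes. The data, weights, and scale factor are illustrative assumptions.

```python
def loss(w1, w2, xs, ys):
    """Mean squared error of the scalar two-layer linear model f(x) = w2 * w1 * x."""
    return sum((w2 * w1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def grad_norm(w1, w2, xs, ys):
    """Euclidean norm of the loss gradient with respect to (w1, w2)."""
    # Derivative of the loss with respect to the product p = w1 * w2.
    g = sum(2 * (w2 * w1 * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Chain rule: dL/dw1 = g * w2 and dL/dw2 = g * w1.
    return ((g * w2) ** 2 + (g * w1) ** 2) ** 0.5

# Toy data and weights (illustrative assumptions).
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w1, w2 = 1.0, 1.5

# Rescaling symmetry: the product w1 * w2 is unchanged.
a = 4.0
w1_s, w2_s = a * w1, w2 / a

loss_orig = loss(w1, w2, xs, ys)
loss_scaled = loss(w1_s, w2_s, xs, ys)
gn_orig = grad_norm(w1, w2, xs, ys)
gn_scaled = grad_norm(w1_s, w2_s, xs, ys)

print(f"loss:      {loss_orig:.4f} vs {loss_scaled:.4f}")
print(f"grad norm: {gn_orig:.4f} vs {gn_scaled:.4f}")
```

Both parameter settings describe the same function, so any measure that distinguishes them is picking up the parametrization rather than the model.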

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[3b/8] This obscures the link between sharpness and generalization. Works such as Kwon et al. (2021) introduce notions of adaptive sharpness that are invariant to re-scaling.
Transformers, however, have a richer set of symmetries that aren’t accounted for by traditional adaptive sharpness measures.

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[3a/8] 🔍 But why?

Traditional sharpness metrics (like Hessian-based measures or gradient norms) don't account for symmetries: directions in parameter space along which the model's output doesn't change. They measure sharpness along directions that may be irrelevant, making results noisy or meaningless.

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

[2/8] 📌 Sharpness falls flat for transformers.
Sharpness (flat minima) often predicts how well models generalize. But Andriushchenko et al. (2023) found that transformers consistently break this intuition. Why?

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 0

🧵 ✨ Hide & Seek: Transformer Symmetries Obscure Sharpness & Riemannian Geometry Finds It ✨

Excited to announce our paper on factoring out param symmetries to better predict generalization in transformers ( #ICML25 spotlight! 🎉)

Amazing work by @marvinfsilva.bsky.social and Felix Dangel.
👇

09.05.2025 12:46 — 👍 0 🔁 0 💬 1 📌 1

www.instagram.com/reel/DIVINrS...

12.04.2025 02:54 — 👍 0 🔁 0 💬 0 📌 0


What happens if I put an insta link here www.instagram.com/reel/DINNlMc...

09.04.2025 01:11 — 👍 0 🔁 0 💬 0 📌 0

i love when in the middle of a familiar tune suddenly a groove presents itself... with a slow, natural build that takes its time

28.03.2025 01:03 — 👍 2 🔁 0 💬 0 📌 0

Friday night is tango night… I love how mysterious that e minor chord at the end of the video sounds… it always has, and I don’t have a lot of things that sound that way to me. I love that song.

22.03.2025 00:24 — 👍 1 🔁 0 💬 0 📌 0

Practising Rachmaninoff, which turned into a kind of focused improv exercise: a bit of noodling, but with a specific technical intention…

17.03.2025 00:52 — 👍 0 🔁 0 💬 0 📌 0

at one point i was using sibelius a lot, so i preferred it because i got used to it. but i never spent enough time with any of the other options to make a fair comparison. but sibelius could do everything i wanted, as far as i recall.

24.02.2025 06:34 — 👍 1 🔁 0 💬 0 📌 0

a little bit of fiddling around making fun twinkly shit up on a sat night

16.02.2025 01:18 — 👍 1 🔁 0 💬 0 📌 0


We had an ice storm and people can literally skate on the streets here

www.facebook.com/share/r/12GZ...

15.02.2025 17:50 — 👍 1 🔁 0 💬 0 📌 0

today i came across this book that a cousin of mine (3x removed) wrote. i've seen some of his other stuff, but this little elementary introduction to variational problems is new to me🙂. my grandmother knew him in her childhood and he used to show her math stuff for fun.

archive.org/details/lyus...

14.02.2025 04:35 — 👍 1 🔁 0 💬 0 📌 0

Will keep updating this. Might be interesting to compare what I’m posting in about a month or so from now to what I’m posting these days. exciting and fun for me to do this #human #learning

09.02.2025 19:26 — 👍 0 🔁 0 💬 0 📌 0

Next day. Another section. Played through this ~50x, in different rhythms, hands separate, together, different articulations. All at 68 beats/min. Tmrw: same at 72 bpm. Can feel myself finally learning parts of this I never properly learned before. Would already be able to play much faster if I tried.

09.02.2025 19:21 — 👍 0 🔁 0 💬 0 📌 1
