The practical upper bound is an interesting concept. What other kinds of practical upper bounds would be interesting besides this one?
arxiv.org/abs/2510.09378
@satoki-ishikawa.bsky.social
Institute of Science Tokyo / R. Yokota lab / Neural Network / Optimization https://riverstone496.github.io/
With Muon, if I'm going to turn it into a paper I need to write it up soon, or it may end up overlapping with someone else's work again; on the other hand, I keep wondering whether I can squeeze out one more push of originality...
13.10.2025 11:52
While the focus for generalization and implicit bias has been on robustness to sample-wise noise, the rise of large-scale models suggests that robustness to parameter-wise noise (e.g., from quantization) might now be just as important?
x.com/deepcohen/st...
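As a rough illustration of what robustness to parameter-wise noise could look like in practice, here is a minimal sketch (my own toy example, not taken from the linked thread): perturb every weight with Gaussian noise of a given scale and see how much the loss moves, a crude proxy for flatness and for sensitivity to quantization error.

```python
# Illustrative toy only: a tiny MLP and random data, not any model from the thread.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))
x, y = torch.randn(256, 8), torch.randn(256, 1)
loss_fn = nn.MSELoss()

@torch.no_grad()
def loss_under_weight_noise(model, sigma, trials=10):
    """Average loss after adding N(0, sigma^2) noise to every parameter."""
    base = [p.clone() for p in model.parameters()]
    losses = []
    for _ in range(trials):
        for p, b in zip(model.parameters(), base):
            p.copy_(b + sigma * torch.randn_like(b))
        losses.append(loss_fn(model(x), y).item())
    for p, b in zip(model.parameters(), base):  # restore the original weights
        p.copy_(b)
    return sum(losses) / len(losses)

clean = loss_fn(model(x), y).item()
for sigma in (0.0, 0.01, 0.05):
    print(f"sigma={sigma}: loss increase {loss_under_weight_noise(model, sigma) - clean:.4f}")
```

Quantization error is often modeled as roughly this kind of additive parameter noise, with the scale set by the quantization step, which is why flatness in parameter space doubles as a quantization-friendliness proxy.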
So many papers… it's a bit overwhelming. Wish there were a field with fewer of them...
10.10.2025 13:12
I've been challenging myself to read a lot of NeurIPS 2025 papers, but maybe I should switch soon to reading ICLR 2025 submissions instead.
10.10.2025 13:11
This might be one of the advantages of methods that skip curvature EMA (like Muon) or use the function gradient (like NGD).
10.10.2025 12:46
This paper is really interesting.
NGD builds curvature from the function gradient df/dw, while optimizers like Adam and Shampoo use the loss gradient dL/dw.
I've always wondered which is better, since using the loss gradient with EMA might cause loss spikes later in training.
This paper studies why Adam occasionally causes loss spikes, attributing them to the edge-of-stability phenomenon. As the figure shows, once training hits the edge of stability (panel b), a loss spike is triggered. An interesting experimental report!
arxiv.org/abs/2506.04805
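To make the df/dw vs. dL/dw distinction above concrete, here is a minimal sketch on a toy linear regression (my own illustration, not code from the paper): a Gauss-Newton/Fisher-style curvature built from the Jacobian of the network outputs, next to an Adam-style second moment built from the loss gradient and accumulated with an EMA.

```python
# Illustrative toy only: linear model f(w) = X @ w with a squared loss.
import torch

torch.manual_seed(0)
n, d = 32, 4
X = torch.randn(n, d)
y = torch.randn(n)
w = torch.randn(d, requires_grad=True)

def f(w):      # network outputs, shape (n,)
    return X @ w

def loss(w):   # scalar training loss
    return 0.5 * ((f(w) - y) ** 2).mean()

# Function-gradient curvature (NGD / Gauss-Newton flavour):
# built from df/dw, the Jacobian of the outputs.
J = torch.autograd.functional.jacobian(f, w)   # shape (n, d)
gauss_newton = J.T @ J / n                     # (d, d) curvature estimate

# Loss-gradient statistic (Adam / Shampoo flavour):
# built from dL/dw and, in practice, smoothed over steps with an EMA.
beta2 = 0.999
v = torch.zeros(d)
g = torch.autograd.grad(loss(w), w)[0]         # dL/dw at the current weights
v = beta2 * v + (1 - beta2) * g ** 2           # one Adam-style second-moment update

print(gauss_newton.shape, v.shape)
```

In this toy case the Gauss-Newton factor J^T J does not involve the residuals at all, while the Adam-style statistic does, which is one way to see how an EMA of loss-gradient statistics can drift from the true curvature late in training.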
I'm looking at ICLR submissions and I've noticed a significant number of papers related to Muon.
10.10.2025 03:42
If there were a cross-disciplinary workshop spanning the fields where the word "sparse" comes up, I'd be really curious what kind of discussion they would end up having with each other!!!
08.10.2025 07:35
How much overlap is there between the sparse structures that are desirable from the viewpoint of learning dynamics and implicit bias, the sparse structures that are convenient for matrix multiplication on GPUs, and the sparse structures the brain actually has? Several patterns of sparse matrix structure that suit GPU matrix multiplication are known, but their connection to learning theory or neuroscience is rarely considered; HPC people are interested in this field right now, so they scatter citations to literature that looks related, yet my impression is that things get stuck one step short of a real connection?
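On the GPU side of that question, one concrete pattern that recent hardware supports is 2:4 semi-structured sparsity: in every group of four consecutive weights, at most two are nonzero. A minimal magnitude-pruning sketch of that pattern (my own illustration, not from the post):

```python
# Illustrative toy only: prune a random dense matrix to the 2:4 pattern.
import torch

def prune_2_to_4(w: torch.Tensor) -> torch.Tensor:
    """Keep the two largest-magnitude entries in every group of four columns (2:4 sparsity)."""
    rows, cols = w.shape
    assert cols % 4 == 0, "number of columns must be a multiple of 4"
    groups = w.reshape(rows, cols // 4, 4)
    # indices of the two smallest-|w| entries in each group -> zero them out
    _, drop = groups.abs().topk(2, dim=-1, largest=False)
    mask = torch.ones_like(groups)
    mask.scatter_(-1, drop, 0.0)
    return (groups * mask).reshape(rows, cols)

w = torch.randn(8, 16)
w_sparse = prune_2_to_4(w)
print((w_sparse == 0).float().mean())  # ~0.5: half of the weights are zero
```

Whether this hardware-dictated pattern lines up with the sparsity that learning dynamics or the brain would prefer is exactly the open question in the post above.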
08.10.2025 07:29
And he is a genius.
www.youtube.com/watch?v=iZAp...
Everyone is giving a wonderful performance, but among the Japanese pianists, I'm particularly fond of Ushida-kun and Kuwahara-san.
www.youtube.com/watch?v=SPS4...
www.youtube.com/watch?v=DaY6...
I'm watching the Chopin Competition.
I'm deeply impressed by his performance, and so surprised that he's using an office chair.
m.youtube.com/watch?v=fDsg...
When research comes out that overlaps heavily with both of two themes I arrived at from completely different directions, the psychological damage is considerable; maybe my choice of themes is simply too easy.
04.10.2025 01:21
I'm heading off on a wandering trip 🧳
02.10.2025 18:40
Two papers extremely close to the two themes I'd been thinking about for the past two or three months came out yesterday, both at the same time, and my research themes have been wiped out!!! I need to switch gears quickly and rethink my next research theme from zero!!!
02.10.2025 18:06
PINNs: I didn't know about them at all, but they look interesting as a target for optimization & scaling!
28.09.2025 05:59
Not all scaling laws are nice power laws. This month's blog post: Zipf's law in next-token prediction and why Adam (ok, sign descent) scales better to large vocab sizes than gradient descent: francisbach.com/scaling-laws...
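Here is a minimal sketch of the contrast the post points to, on a toy problem of my own (a single logit vector over a Zipf-distributed vocabulary, not the blog's actual setting): gradient descent scales each coordinate's step by its gradient, so rare-token coordinates barely move, while sign descent takes the same-size step on every coordinate.

```python
# Illustrative toy only: one shared logit vector over a Zipf-like vocabulary.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 1000
freqs = 1.0 / torch.arange(1, vocab + 1)            # Zipf-like token frequencies
targets = torch.multinomial(freqs / freqs.sum(), 4096, replacement=True)

logits = torch.zeros(vocab, requires_grad=True)
loss = F.cross_entropy(logits.expand(len(targets), -1), targets)
(g,) = torch.autograd.grad(loss, logits)

lr = 0.1
gd_step = lr * g                                     # gradient descent: step proportional to gradient
sign_step = lr * g.sign()                            # sign descent: fixed-size step per coordinate

# Rare tokens get tiny gradients, so GD hardly updates them; sign descent still does.
print("rare-token |GD step|  :", gd_step[-10:].abs().mean().item())
print("rare-token |sign step|:", sign_step[-10:].abs().mean().item())
```

With a Zipf-distributed vocabulary most entries are rare, so this per-coordinate behaviour is one intuition for the blog's point about why Adam (effectively sign descent here) scales better with vocabulary size than gradient descent.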
27.09.2025 14:57
I've made some small updates to the 'awesome list' for second-order optimization I made two years ago. It looks like Muon-related works and the applications to PINNs have really taken off in the last couple of years.
github.com/riverstone49...
Fluid dynamics might serve as an interesting new benchmark for second-order optimization
23.09.2025 02:14
I don't know anything about fluid dynamics, but I came across a paper that seemed to say that second-order optimization is key when using the power of neural networks to solve the Navier–Stokes equations. If so, there's something romantic about that.
arxiv.org/abs/2509.14185
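For a feel of what that combination looks like, here is a minimal PINN-style sketch (my own toy ODE, not the linked paper's Navier–Stokes setup): fit a small network so that u'(x) = cos(x) with u(0) = 0, trained with L-BFGS, the quasi-Newton optimizer that PINN papers frequently report working far better than first-order methods on residual losses like this.

```python
# Illustrative toy only: a 1-D ODE residual, not the paper's Navier-Stokes problem.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
x = torch.linspace(0, 2 * torch.pi, 256).reshape(-1, 1).requires_grad_(True)

def pinn_loss():
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]  # u'(x) via autograd
    residual = (du - torch.cos(x)).pow(2).mean()                # ODE residual u' - cos(x)
    boundary = net(torch.zeros(1, 1)).pow(2).mean()             # boundary condition u(0) = 0
    return residual + boundary

opt = torch.optim.LBFGS(net.parameters(), max_iter=200)

def closure():
    opt.zero_grad()
    loss = pinn_loss()
    loss.backward()
    return loss

opt.step(closure)
print("final PINN loss:", pinn_loss().item())
```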
This is not OK.
I don't submit often to NeurIPS, but I have reviewed papers for this conference almost every year. As a reviewer, why would I spend time trying to give a fair opinion on papers if this is what happens in the end???
Thank you!
18.09.2025 10:50
Thank you! I'd be happy if we could discuss this again.
As for keywords, I submitted it around "natural gradient methods, Muon, distributed learning / continual learning, and GPU." First, on the optimization side, I'll do my best to get out one paper that feels like it combines the books I've been reading recently!
18.09.2025 05:50
I've been accepted to ACT-X! I'll keep working toward a deeper understanding of the optimization of neural networks.
www.jst.go.jp/kisoken/act-...
When a paper has more than 40 figures, I can really feel the author's dedication just by looking at it - it's energizing
arxiv.org/abs/2509.01440
Given that 'Mathematics of Neural Networks' (神経回路網の数理) has never been translated into English, a project to write a modern edition that maps it chapter by chapter onto today's deep learning, and to put that into English, sounds like fun. In the spirit of carrying my original motivation through, I wonder if there is something I could do somewhere to support that.
31.08.2025 05:56
The single biggest reason I chose a path in information science was wanting to experimentally verify Prof. Amari's 'Mathematics of Neural Networks' (神経回路網の数理) with modern models and large-scale computers. Opening the book again for the first time in ages, I was surprised to find that essentially the same thing as the model simplification and analysis I happen to be working on right now is already written there!
31.08.2025 05:54