
Satoki Ishikawa

@satoki-ishikawa.bsky.social

Institute of Science Tokyo / R. Yokota lab / Neural Network / Optimization https://riverstone496.github.io/

43 Followers  |  453 Following  |  65 Posts  |  Joined: 25.11.2024

Latest posts by satoki-ishikawa.bsky.social on Bluesky

The Potential of Second-Order Optimization for LLMs: A Study with Full Gauss-Newton
Recent efforts to accelerate LLM pretraining have focused on computationally-efficient approximations that exploit second-order structure. This raises a key question for large-scale training: how much...

A practical upper bound is an interesting concept. What other kinds of practical upper bounds would be interesting besides this one?
arxiv.org/abs/2510.09378
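For context, the full Gauss-Newton step the paper treats as a practical upper bound looks (in my own shorthand, with a damping term λ added purely for illustration) like

w_{t+1} = w_t - η (G_t + λI)^{-1} ∇L(w_t),    G_t = E[ J_f(w_t)^T H_ℓ J_f(w_t) ],

where J_f is the Jacobian of the network outputs with respect to the weights and H_ℓ is the Hessian of the loss with respect to those outputs; Kronecker-factored methods like K-FAC can then be read as structured approximations of G_t.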

15.10.2025 01:04 | 👍 1  🔁 0  💬 0  📌 0

If I'm going to turn the Muon work into a paper, I need to write it up soon or it may overlap with someone else's work again; at the same time, I still can't find that one extra push of originality...

13.10.2025 11:52 | 👍 1  🔁 0  💬 0  📌 0
Jeremy Cohen on X: "This nice, thorough paper on LLM pretraining shows that quantization error rises sharply when the learning rate is decayed. But, why would that be? The answer is likely related to curvature dynamics. https://t.co/cdkt3DU1iw"

While work on generalization and implicit bias has focused on robustness to sample-wise noise, the rise of large-scale models suggests that robustness to parameter-wise noise (e.g., from quantization) might now be just as important?

x.com/deepcohen/st...
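A rough way to connect parameter-wise noise to curvature (my own back-of-the-envelope, not from the linked thread): for a small isotropic weight perturbation ε ~ N(0, σ²I), a second-order expansion gives

E_ε[ L(w + ε) ] ≈ L(w) + (σ²/2) · tr(∇²L(w)),

so quantization noise behaves like a sharpness penalty, and if curvature grows when the learning rate is decayed, the same rounding error starts to hurt more.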

13.10.2025 05:40 | 👍 0  🔁 0  💬 0  📌 0

So many papers… it's a bit overwhelming. Wish there were a field with fewer of them...

10.10.2025 13:12 | 👍 1  🔁 0  💬 0  📌 0

I've been challenging myself to read a lot of NeurIPS 2025 papers, but maybe I should switch soon to reading ICLR 2025 submissions instead.

10.10.2025 13:11 | 👍 1  🔁 0  💬 1  📌 0

This might be one of the advantages of methods that skip curvature EMA (like Muon) or use the function gradient (like NGD).

10.10.2025 12:46 | 👍 1  🔁 0  💬 0  📌 0

This paper is really interesting.
NGD builds curvature from the function gradient df/dw, while optimizers like Adam and Shampoo use the loss gradient dL/dw.
I've always wondered which is better, since using the loss gradient with EMA might cause loss spikes later in training.
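To spell out the distinction in my own notation, with f the network output, ℓ the per-example loss, and J_f = ∂f/∂w: NGD-style curvature is built from the function Jacobian,

G = E[ J_f^T H_ℓ J_f ]    (Gauss-Newton / Fisher, with H_ℓ = ∂²ℓ/∂f²),

whereas Adam and Shampoo accumulate statistics of the loss gradient g = ∂L/∂w = J_f^T (∂ℓ/∂f), e.g. an EMA of g ⊙ g or g gᵀ. The latter shrinks as ∂ℓ/∂f shrinks late in training, which is one way an EMA'd preconditioner could go stale and set up a spike; that part is speculation on my side, though.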

10.10.2025 12:46 | 👍 2  🔁 0  💬 1  📌 0
Post image

This paper studies why Adam occasionally causes loss spikes, attributing them to the edge-of-stability phenomenon. As the figure shows, once training hits the edge of stability (panel b), a loss spike is triggered. An interesting experimental report!

arxiv.org/abs/2506.04805
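For reference, the classical edge-of-stability condition for plain gradient descent with step size η is that the top Hessian eigenvalue hovers at the stability threshold,

λ_max(∇²L) ≈ 2/η,

and for Adam the analogous quantity is a preconditioned sharpness; I'd have to check the paper for the exact criterion panel (b) tracks.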

10.10.2025 07:55 | 👍 5  🔁 1  💬 0  📌 0

I'm looking at ICLR submissions and I've noticed a significant number of papers related to Muon.

10.10.2025 03:42 | 👍 3  🔁 0  💬 0  📌 0

I'd be curious what kind of discussion we would end up having if we ran a cross-disciplinary workshop around the fields where the word "sparse" shows up...

08.10.2025 07:35 | 👍 0  🔁 0  💬 0  📌 0

How much overlap is there between the sparse structures that are desirable from the standpoint of learning dynamics and implicit bias, the sparse structures that are friendly to GPU matrix multiplication, and the sparse structures the brain has? 🤔 Several sparse-matrix patterns that are efficient for GPU matmuls are known, but I rarely hear about connections to learning theory or neuroscience. HPC people are interested in other fields too, and they do throw citations to seemingly related literature, but my impression is that things stall one step short of a real connection?

08.10.2025 07:29 | 👍 1  🔁 0  💬 1  📌 0
KEVIN CHEN – first round (19th Chopin Competition, Warsaw) | YouTube video by Chopin Institute

And he's a genius.
www.youtube.com/watch?v=iZAp...

06.10.2025 19:38 | 👍 0  🔁 0  💬 0  📌 0
SHIORI KUWAHARA – first round (19th Chopin Competition, Warsaw) | YouTube video by Chopin Institute

Everyone is giving a wonderful performance, but among the Japanese pianists, I'm particularly fond of Ushida-kun and Kuwahara-san.

www.youtube.com/watch?v=SPS4...
www.youtube.com/watch?v=DaY6...

06.10.2025 19:31 | 👍 0  🔁 0  💬 1  📌 0
ERIC LU – first round (19th Chopin Competition, Warsaw) | YouTube video by Chopin Institute

I'm watching the Chopin Competition.
I'm surprised that he's using an office chair, and at the same time I'm deeply impressed by his performance.

m.youtube.com/watch?v=fDsg...

06.10.2025 19:27 | 👍 0  🔁 0  💬 1  📌 0

ไบŒใคใฎๅ…จใ้•ใ†ใจใ“ใ‚ใงใ‚„ใฃใฆใ„ใŸ้•ใ†ใƒ†ใƒผใƒžไธกๆ–นใจใ‚ชใƒผใƒใƒผใƒฉใƒƒใƒ—ใŒๅคงใใ„็ ”็ฉถใŒๅ‡บใฆใใ‚‹ใจ๏ผŒ็ฒพ็ฅž็š„ใƒ€ใƒกใƒผใ‚ธใŒๅคงใใ„ใฎใงใ™ใŒ๏ผŒใƒ†ใƒผใƒž่จญๅฎšใŒๅฎ‰ๆ˜“ใ™ใŽใŸใฎใ‹ใ‚‚ใ—ใ‚Œใชใ„

04.10.2025 01:21 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

ใ„ใ„ใƒ†ใƒผใƒžใŒใชใ„ใ‹ๆ”พๆตชใ™ใ‚‹ๆ—…ใ‚’ใ—ใพใ™๐Ÿงณ

02.10.2025 18:40 โ€” ๐Ÿ‘ 0    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

ใ“ใ“2~3ใƒถๆœˆ่€ƒใˆใฆใ„ใŸ๏ผ’ใคใฎใƒ†ใƒผใƒžใซๆฅตใ‚ใฆ่ฟ‘ใ„่ซ–ๆ–‡ใŒ๏ผŒๆ˜จๆ—ฅๅŒๆ™‚ใซ๏ผ’ๆœฌๅ‡บใฆใ—ใพใ„๏ผŒ็ ”็ฉถใƒ†ใƒผใƒžใŒๆถˆๆป…ใ—ใฆใ—ใพใฃใŸ๏ผŽ๏ผŽ๏ผŽๆ—ฉใ‚ใซๅˆ‡ใ‚Šๆ›ฟใˆใฆ๏ผŒใ„ใ„็ ”็ฉถใƒ†ใƒผใƒžใ‚’ใ‚ผใƒญใ‹ใ‚‰่€ƒใˆ็›ดใ•ใชใ„ใจใ„ใ‘ใชใ„ใงใ™ใญ๏ผŽ๏ผŽ๏ผŽ

02.10.2025 18:06 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 1

I knew nothing about PINNs, but they look interesting as a target for optimization and scaling.

28.09.2025 05:59 | 👍 0  🔁 0  💬 0  📌 0
Video thumbnail

Not all scaling laws are nice power laws. This month's blog post: Zipf's law in next-token prediction and why Adam (ok, sign descent) scales better to large vocab sizes than gradient descent: francisbach.com/scaling-laws...
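The Zipf's-law setup in symbols, as I understand it: if the k-th most frequent token appears with probability

p_k ∝ k^{-α},  α ≈ 1,

then gradient magnitudes differ enormously across vocabulary rows, and sign descent (which Adam approximates) takes equal-sized coordinate-wise steps regardless, while plain gradient descent lets the rare-token rows move very slowly; the blog post has the precise scaling argument.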

27.09.2025 14:57 | 👍 47  🔁 12  💬 1  📌 0

I've made some small updates to the 'awesome list' for second-order optimization that I put together two years ago. It looks like Muon-related work and applications to PINNs have really taken off in the last couple of years.
github.com/riverstone49...

26.09.2025 12:18 | 👍 2  🔁 0  💬 1  📌 0

Fluid dynamics might serve as an interesting new benchmark for second-order optimization

23.09.2025 02:14 | 👍 2  🔁 0  💬 0  📌 0
Discovery of Unstable Singularities
Whether singularities can form in fluids remains a foundational unanswered question in mathematics. This phenomenon occurs when solutions to governing equations, such as the 3D Euler equations, develo...

I don't know anything about fluid dynamics, but I came across a paper that seemed to say that second-order optimization is key when using the power of neural networks to solve the Navier–Stokes equations. If so, there's something romantic about that.
arxiv.org/abs/2509.14185
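For anyone as new to this as I am, the rough setup (my sketch, not from the paper): a PINN u_w is trained on a nonlinear least-squares objective over collocation points,

L(w) = (1/N) Σ_i ‖ N[u_w](x_i) ‖² + boundary/initial-condition terms,

where N[·] is the residual of the governing PDE (e.g. the Euler or Navier–Stokes equations). That least-squares structure is exactly what Gauss-Newton-type second-order methods are built for, which presumably is why they matter here.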

23.09.2025 02:13 | 👍 5  🔁 0  💬 1  📌 1

This is not OK.

I don't submit to NeurIPS often, but I have reviewed papers for this conference almost every year. As a reviewer, why would I spend time trying to give a fair opinion on papers if this is what happens in the end???

20.09.2025 06:10 | 👍 51  🔁 11  💬 3  📌 1

ใ‚ใ‚ŠใŒใจใ†ใ”ใ–ใ„ใพใ™๏ผ

18.09.2025 10:50 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0

ใ‚ใ‚ŠใŒใจใ†ใ”ใ–ใ„ใพใ™๏ผใพใŸ่ญฐ่ซ–ใงใใŸใ‚‰ๅนธใ„ใงใ™๐Ÿ™

18.09.2025 10:50 โ€” ๐Ÿ‘ 1    ๐Ÿ” 0    ๐Ÿ’ฌ 0    ๐Ÿ“Œ 0
Post image

The keywords I submitted were around "natural gradient methods, Muon, distributed/continual learning, GPUs." First, on the optimization side, I'll do my best to get out one paper that feels like a combination of these three books I've been reading recently.

18.09.2025 05:50 | 👍 2  🔁 0  💬 2  📌 0
FY2025: New Research Projects and Evaluators for the Strategic Basic Research Program (ACT-X) | ACT-X

My project was selected for ACT-X. I'll keep working toward a deeper understanding of neural network optimization.
www.jst.go.jp/kisoken/act-...

18.09.2025 05:43 | 👍 3  🔁 0  💬 1  📌 0
Benchmarking Optimizers for Large Language Model Pretraining
The recent development of Large Language Models (LLMs) has been accompanied by an effervescence of novel ideas and methods to better optimize the loss of deep learning models. Claims from those method...

When a paper has more than 40 figures, I can really feel the author's dedication just by looking at it; it's energizing.
arxiv.org/abs/2509.01440

03.09.2025 17:05 | 👍 1  🔁 0  💬 0  📌 0

Given that 『神経回路網の数理』 (Amari's Mathematics of Neural Networks) has never been translated into English, it would be fascinating if there were a project to write, and publish in English, a "modern edition" that maps each chapter onto today's deep learning; and in the spirit of staying true to my original motivation, I'd like to do something, somewhere, that supports that.

31.08.2025 05:56 | 👍 2  🔁 0  💬 0  📌 0

็งใŒๆƒ…ๅ ฑ็ณปใฎ้€ฒ่ทฏใ‚’้ธใ‚“ใ ไธ€็•ชใฎใใฃใ‹ใ‘ใฏ๏ผŒใ€Œ็”˜ๅˆฉๅ…ˆ็”Ÿใฎใ€Ž็ฅž็ตŒๅ›ž่ทฏ็ถฒใฎๆ•ฐ็†ใ€ใ‚’็พไปฃ็š„ใชใƒขใƒ‡ใƒซใจๅคง่ฆๆจก่จˆ็ฎ—ๆฉŸใงๅฎŸ้จ“ๆคœ่จผใ‚’ใ—ใฆใฟใŸใ„ใ€ใชใฎใงใ™ใŒ๏ผŒใ€Œ็ฅž็ตŒๅ›ž่ทฏ็ถฒใฎๆ•ฐ็†ใ€ใ‚’ไน…ใ—ใถใ‚Šใซ้–‹ใ„ใฆ่ชญใ‚“ใงใฟใŸใ‚‰๏ผŒไปŠใกใ‚‡ใ†ใฉใ‚„ใฃใฆใ„ใŸใƒขใƒ‡ใƒซใฎ็ฐก็ด„ๅŒ–๏ผ†่งฃๆžใจๅฎŸ่ณชๅŒใ˜ใ“ใจใŒๆ›ธใ‹ใ‚Œใฆใ„ใŸใ“ใจใซๆฐ—ใŒใคใ้ฉšใ„ใฆใ„ใ‚‹๏ผŽ

31.08.2025 05:54 โ€” ๐Ÿ‘ 3    ๐Ÿ” 0    ๐Ÿ’ฌ 1    ๐Ÿ“Œ 0
