@oymak.bsky.social
EECS Prof @UMich, Research on the Foundations of ML+RL+LLM https://sota.engin.umich.edu/
Some SAC context for NeurIPS 2025 review scores - out of my batch of 100 papers:
- 1 paper ≥5.0
- 6 papers ≥4.5
- 11 papers ≥4.0
- 25 papers ≥3.75
- 42 papers ≥3.5
Good luck to all with rebuttals!

I will be at #NeurIPS between Dec 10-15. Looking forward to catching up with friends and colleagues!
09.12.2024 20:21 — 👍 5 🔁 0 💬 0 📌 0

I was actually discussing SimPO a few weeks ago in my LLM class. Solid work!
03.12.2024 17:37 — 👍 2 🔁 0 💬 0 📌 0

I like "slay" here. Makes theory research more RPG-like
28.11.2024 14:59 — 👍 3 🔁 0 💬 0 📌 0

NeurIPS Test of Time Awards:
Generative Adversarial Nets
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
Sequence to Sequence Learning with Neural Networks
Ilya Sutskever, Oriol Vinyals, Quoc V. Le

All credit goes to my amazing students :)
21.11.2024 22:19 — 👍 1 🔁 0 💬 0 📌 0

Our method uniformly improves language modeling evals with negligible compute overhead. During evals, we just plug in SSA and don't touch hyperparams/architecture, so there is likely further headroom.
21.11.2024 22:19 — 👍 0 🔁 0 💬 2 📌 0

We can also see the approximation benefit directly from the quality/sharpness of the attention maps.
21.11.2024 22:19 — 👍 0 🔁 0 💬 1 📌 0

Why is this useful? Consider the tokens "Hinton" and "Scientist". These have high cosine similarity, yet we wish to assign them different spikiness levels. We show that this is provably difficult for vanilla attention to achieve: its weights have to grow much larger than under our method.
21.11.2024 22:19 — 👍 0 🔁 0 💬 1 📌 0
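
A toy numerical illustration of that claim, with made-up scores (mine, not the paper's): dividing the softmax logits by a temperature tau is exactly equivalent to multiplying all logits, and hence the query/key weights, by 1/tau. So vanilla attention can only reach a spikier map by growing its weights globally, for every token at once, whereas a per-token temperature can sharpen "Hinton" while leaving "Scientist" untouched.

```python
import torch

# Toy logits: one query ("Hinton") scored against four keys.
scores = torch.tensor([2.0, 1.8, 0.5, 0.1])

for tau in (1.0, 0.5, 0.1):
    attn = torch.softmax(scores / tau, dim=-1)
    print(f"tau={tau}: max attention weight = {attn.max():.2f}")

# A small temperature is indistinguishable from uniformly scaled-up
# logits, i.e., from query/key weights grown by a factor of 1/tau.
assert torch.allclose(torch.softmax(scores / 0.1, dim=-1),
                      torch.softmax(scores * 10.0, dim=-1))
```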

The method adds a temperature-scaling step (scalar gating) after the K/Q/V embeddings and before the softmax. The temperature is a function of the token embedding and its position. Notably, this can be done:
- by fine-tuning rather than pretraining
- with very few additional parameters
21.11.2024 22:19 — 👍 0 🔁 0 💬 1 📌 0
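
A minimal sketch of how such a gate could be wired up, assuming a single head and a hypothetical parameterization (a linear map of the token embedding plus a learned positional bias producing one inverse-temperature scalar per token); the paper's exact SSA design may differ:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaledAttention(nn.Module):
    """Single-head attention with a token- and position-dependent
    temperature (hypothetical sketch, not the paper's exact SSA)."""

    def __init__(self, d_model: int, max_len: int = 2048):
        super().__init__()
        self.wq = nn.Linear(d_model, d_model, bias=False)
        self.wk = nn.Linear(d_model, d_model, bias=False)
        self.wv = nn.Linear(d_model, d_model, bias=False)
        # The only new parameters: a linear map of the token embedding
        # plus a positional bias, yielding one scalar per token.
        self.tau_tok = nn.Linear(d_model, 1)
        self.tau_pos = nn.Embedding(max_len, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); causal mask omitted for brevity.
        b, n, d = x.shape
        pos = torch.arange(n, device=x.device)
        # Inverse temperature from token embedding + position;
        # softplus keeps it positive. Larger values -> spikier rows.
        inv_tau = F.softplus(self.tau_tok(x) + self.tau_pos(pos))  # (b, n, 1)
        q = self.wq(x) * inv_tau  # scalar gating before the softmax
        scores = q @ self.wk(x).transpose(-1, -2) / math.sqrt(d)
        return scores.softmax(dim=-1) @ self.wv(x)

# Example: out = TemperatureScaledAttention(64)(torch.randn(2, 16, 64))
```

Since only tau_tok and tau_pos are new (roughly d_model + max_len scalars per layer in this sketch), one could freeze a pretrained model's Q/K/V weights and fine-tune just the temperature parameters, consistent with the two bullet points above.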

The intuition is that specific tokens like "Hinton" should receive a spikier attention map than generalist tokens like "Scientist". Learning token-dependent temperatures yields the colormap above, where (arguably) more specific words receive lower temperatures.
21.11.2024 22:19 — 👍 1 🔁 0 💬 1 📌 0

Hello world! Unfortunately, my first post happens to be a paper (thre)ad 😊: Our "Selective Attention" is a simple but effective method that dynamically adjusts the sparsity of the attention maps through temperature scaling: arxiv.org/pdf/2411.12892 (#neurips2024)
21.11.2024 22:19 — 👍 7 🔁 0 💬 1 📌 0