With other folks at Apple, @brunokm.bsky.social has worked on a complete(d) parameterisation for NNs that can *transfer* locally tuned hyperparameters: tune optimizers' parameters (e.g. LR) *per module/depth* using an evolutionary search on small models → the performance gains transfer to much larger models
06.01.2026 16:33
Huge thank you to my co-authors at Apple MLR! @marcocuturi.bsky.social @pierreablin.bsky.social @louisbethune.bsky.social, @danbusbridge, Michal Klein & Jason Ramapuram
06.01.2026 15:25
Most importantly: with the right parameterisation, the same per-module HPs can be re-used at 7B-parameter/140B-token scale (10,000× more FLOPs), giving a 1.32× speed-up over well-tuned global HPs
06.01.2026 15:21
Finding good per-module HPs is not easy: the per-module HP loss landscape (how the final loss changes as we vary these HPs) is highly irregular and non-smooth. We make some recommendations for which kinds of HP-search algorithms appear to work well in this setting.
06.01.2026 15:21
With the right parameterisation, we can answer: do per-module HPs matter at scale?
We thoroughly searched AdamW per-module HPs (LR, WD, betas, eps, init. scale) on a 50M-parameter transformer model.
At this scale you can get huge speed-ups over a well-tuned global-HP baseline: over 2× in wall-clock time.
06.01.2026 15:21
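For intuition, per-module HPs just mean one optimizer group per (set of) module(s). A purely illustrative sketch of how such overrides could be wired up; all the names and values below are made up, not the paper's searched values:

```python
# Illustrative sketch: map parameter names to AdamW settings via
# substring overrides. Names and values are hypothetical.

DEFAULTS = {"lr": 3e-4, "weight_decay": 0.1, "betas": (0.9, 0.95), "eps": 1e-8}

# Hypothetical per-module overrides, e.g. found by searching on a small model.
OVERRIDES = {
    "embed": {"lr": 1e-3, "weight_decay": 0.0},
    "attn.qkv": {"lr": 1e-4},
    "mlp": {"betas": (0.9, 0.98)},
}

def param_groups(param_names):
    """Build one optimizer group per parameter, applying matching overrides."""
    groups = []
    for name in param_names:
        hps = dict(DEFAULTS)
        for pattern, override in OVERRIDES.items():
            if pattern in name:
                hps.update(override)
        groups.append({"params": [name], **hps})
    return groups

groups = param_groups(["embed.weight", "blocks.0.attn.qkv.weight", "blocks.0.mlp.w1"])
```

The resulting list of dicts has exactly the shape that e.g. PyTorch optimizers accept as per-parameter-group options.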
We extend Depth-µP/CompleteP in several ways:
• Adapting to QK norms (non-trivial because of weight sharing) & reparameterising for Cut Cross-Entropy (+ fixes)
• A new scaling rule in the token horizon
• Scaling rules for weight decay in batch size (by extending the Malladi et al. SDE analysis to AdamW)
06.01.2026 15:21
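For reference, the Adam part of the batch-size story is the square-root learning-rate rule from Malladi et al.'s SDE analysis; a tiny sketch of that rule only (the weight-decay extension to AdamW is the paper's contribution and is not reproduced here):

```python
import math

def scale_adam_lr(lr, batch_size, base_batch_size):
    """Square-root batch-size scaling rule for Adam's learning rate,
    following Malladi et al.'s SDE analysis: scaling the batch by kappa
    scales the learning rate by sqrt(kappa)."""
    kappa = batch_size / base_batch_size
    return lr * math.sqrt(kappa)

new_lr = scale_adam_lr(3e-4, batch_size=1024, base_batch_size=256)  # 4x batch
```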
Extending μP and CompleteP, we've compiled a unified parameterisation that allows for zero-shot HP transfer across some of the most important scaling axes:
β’ Width
β’ Depth
β’ Batch Size
β’ Token Horizon
The table summarises the "all-in-one" recipe:
06.01.2026 15:21
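As a flavour of what such a recipe contains, here is only the standard width-scaling slice of μP for Adam hidden weights (a sketch under textbook μP assumptions; the paper's table additionally covers depth, batch size and token horizon):

```python
import math

def mup_width_rules(base_width, width, base_lr, base_init_std):
    """Width-scaling slice of a muP-style recipe for Adam hidden weights
    (a sketch, not the paper's full table): the learning rate scales like
    1/width and the init standard deviation like 1/sqrt(width), relative
    to a base model whose HPs were tuned at base_width."""
    m = width / base_width  # width multiplier
    return {
        "lr": base_lr / m,
        "init_std": base_init_std / math.sqrt(m),
    }

rules = mup_width_rules(base_width=256, width=1024, base_lr=1e-3, base_init_std=0.02)
```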
We set out to find the benefit of per-module HPs. Since tuning 100s of HPs on a 7B model is infeasible, we opted for a "tune small, transfer to big" approach. For this to work, we needed a parameterisation that enables HP transfer in modern transformer training.
06.01.2026 15:21
In our new work, Complete(d)P, we try to answer 3 questions about hyperparameter (HP) scaling:
• How to transfer across model size, tokens & batch size? → Complete(d)P
• Do per-module HPs matter? ✔ 2× speed-ups possible
• Do they transfer to larger scale? ✔ With the right parameterisation
06.01.2026 15:21
It's that time of the year!
The Apple Machine Learning Research (MLR) team in Paris is hiring a few interns to do cool research for ±6 months & work towards publications/OSS.
Check requirements and apply: jobs.apple.com/en-us/detail...
More: mlr_paris_internships@group.apple.com
17.10.2025 13:07
For example: for even moderately sized datasets, the trained diffusion model's marginal probability distribution stays the same irrespective of which examples were removed from the training data, potentially making the influence-functions task vacuous.
16.04.2025 12:44
We also point out several empirical challenges to the use of influence functions in diffusion models.
16.04.2025 12:44
In our paper, we empirically show that the choice of GGN and K-FAC approximation is crucial for the performance of influence functions, and that following our recommended design principles leads to better-performing approximations.
16.04.2025 12:44
Influence functions require the training loss Hessian matrix. Typically, a K-FAC approximation to a Generalised Gauss-Newton (GGN) matrix is used instead of the Hessian. However, it's not immediately obvious which GGN and K-FAC approximations to use in the diffusion setting.
16.04.2025 12:44
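For background: for a single linear layer, K-FAC replaces the expectation of a Kronecker product with the Kronecker product of expectations, so only two small factors need to be stored and inverted. A toy numpy illustration of generic K-FAC (not the paper's diffusion-specific choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 2048, 4, 3

# Per-example inputs a_i and output-side gradients g_i of one linear layer.
A = rng.normal(size=(n, d_in))
G = rng.normal(size=(n, d_out))

# "Exact" empirical curvature block over vec(W): the mean of outer
# products of per-example gradients kron(a_i, g_i).
per_example = np.stack([np.kron(A[i], G[i]) for i in range(n)])
exact = per_example.T @ per_example / n

# K-FAC: swap the expectation of a Kronecker product for the Kronecker
# product of expectations -- two small factors instead of one big matrix.
A_factor = A.T @ A / n  # ~ E[a a^T], shape (d_in, d_in)
G_factor = G.T @ G / n  # ~ E[g g^T], shape (d_out, d_out)
kfac = np.kron(A_factor, G_factor)
```

Here the inputs and gradients are sampled independently, so the K-FAC factorisation is close to the exact block; in real networks the approximation quality depends on how correlated they are.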
Influence functions are already being used in deep learning, from classification and regression through to autoregressive LLMs. What's the challenge in adapting them to the diffusion setting?
16.04.2025 12:44
β’ Identifying and removing data responsible for undesirable behaviours (e.g. generating explicit content)
β’ Data valuation (how much did each training datapoint contribute towards generating the samples my users pay me for?)
16.04.2025 12:44
Answering how a model's behaviour changes upon removing training datapoints could help with:
β’ Quantifying impact of copyrighted data on a given sample (how much less likely is it that the model would generate this image if not for the works of a given artist?)
16.04.2025 12:44
Influence functions attempt to answer: how would the model's behaviour (e.g. the probability of generating an image) change if the model were trained from scratch with some training datapoints removed?
They give an approximate answer, but without actually retraining the model.
16.04.2025 12:44
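The leave-one-out approximation behind influence functions can be sanity-checked on a toy ridge-regression problem, where retraining without a point is cheap enough to do exactly. Everything below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 1e-2  # ridge regularisation strength

def fit(X, y):
    """Minimise (1/2)||Xw - y||^2 + (lam/2)||w||^2 in closed form."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w = fit(X, y)

# Influence-function estimate of the parameter change from deleting point i:
# delta_w ~= H^{-1} grad_i, with H the total-loss Hessian and grad_i the
# deleted point's loss gradient at the trained parameters.
i = 0
H = X.T @ X + lam * np.eye(d)
grad_i = X[i] * (X[i] @ w - y[i])
delta_est = np.linalg.solve(H, grad_i)

# Ground truth: actually retrain without point i.
w_loo = fit(np.delete(X, i, axis=0), np.delete(y, i))
delta_true = w_loo - w
```

In this quadratic case the only error comes from reusing the full-data Hessian; for deep networks the Hessian-inverse-vector product is what the K-FAC approximations discussed in this thread make tractable.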
How do you identify training data responsible for an image generated by your diffusion model? How could you quantify how much copyrighted works influenced the image?
In our ICLR oral paper we propose how to approach such questions scalably with influence functions.
16.04.2025 12:44
It's an awesome piece of work, done on a surprisingly small budget compared to the performance.
21.03.2025 13:54
AI-driven weather prediction breakthrough reported
Researchers say Aardvark Weather uses thousands of times less computing power and is much faster than current systems
Rich Turner, with other members of our group, recently published a paper on Aardvark (end-to-end weather prediction with deep learning) in Nature, and it was just featured in The Guardian and Financial Times!
www.theguardian.com/technology/2...
21.03.2025 13:54
James, Shreyas and I will be at NeurIPS presenting this work. Come chat with us if you're interested!
05.12.2024 18:23
Denoising Diffusion Probabilistic Models in Six Simple Steps
Denoising Diffusion Probabilistic Models (DDPMs) are a very popular class of deep generative model that have been successfully applied to a diverse range of problems including image and video generati...
Diffusion models are so ubiquitous, but it's difficult to find an introduction that is concise, simple and comprehensive.
My supervisor Rich Turner (with me & some other students) has written an introduction to diffusion models that fills this gap:
arxiv.org/abs/2402.04384
15.11.2024 01:40