With other folks at Apple, @brunokm.bsky.social has worked on a complete(d) parameterisation for NNs that can *transfer* locally tuned hyperparameters: tune optimizers' parameters (e.g. LR) *per module/depth* using an evolutionary search on small models → the performance gains transfer to much larger models
06.01.2026 16:33
Huge thank you to my co-authors at Apple MLR! @marcocuturi.bsky.social @pierreablin.bsky.social @louisbethune.bsky.social, @danbusbridge, Michal Klein & Jason Ramapuram
06.01.2026 15:25
Most importantly: with the right parameterisation, the same per-module HPs can be re-used at 7B-parameter/140B-token scale (10,000× more FLOPs), giving a 1.32× speed-up over well-tuned global HPs
06.01.2026 15:21
Finding good per-module HPs is not easy: the per-module HP loss landscape (how the final loss changes as we vary these HPs) is highly irregular and non-smooth. We make some recommendations for which kinds of HP-search algorithms appear to work well in this setting.
06.01.2026 15:21
With the right parameterisation, we can answer: do per-module HPs matter at scale?
We thoroughly searched AdamW per-module HPs (LR, WD, betas, eps, init. scale) on a 50M-parameter transformer model.
At this scale you can get huge speed-ups over a well-tuned global-HP baseline: over 2× in wall-clock time.
06.01.2026 15:21
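For intuition, per-module HPs just mean one optimizer group per (set of) module(s). A purely illustrative sketch of how such overrides could be wired up; all the names and values below are made up, not the paper's searched values:

```python
# Illustrative sketch: map parameter names to AdamW settings via
# substring overrides. Names and values are hypothetical.

DEFAULTS = {"lr": 3e-4, "weight_decay": 0.1, "betas": (0.9, 0.95), "eps": 1e-8}

# Hypothetical per-module overrides, e.g. found by searching on a small model.
OVERRIDES = {
    "embed": {"lr": 1e-3, "weight_decay": 0.0},
    "attn.qkv": {"lr": 1e-4},
    "mlp": {"betas": (0.9, 0.98)},
}

def param_groups(param_names):
    """Build one optimizer group per parameter, applying matching overrides."""
    groups = []
    for name in param_names:
        hps = dict(DEFAULTS)
        for pattern, override in OVERRIDES.items():
            if pattern in name:
                hps.update(override)
        groups.append({"params": [name], **hps})
    return groups

groups = param_groups(["embed.weight", "blocks.0.attn.qkv.weight", "blocks.0.mlp.w1"])
```

The resulting list of dicts has exactly the shape that e.g. PyTorch optimizers accept as per-parameter-group options.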
We extend Depth-µP/CompleteP in several ways:
• Adapting to QK norms (non-trivial because of weight sharing) & reparameterising for Cut Cross-Entropy (+ fixes)
• A new scaling rule in the token horizon
• Scaling rules for weight decay in batch size (by extending the Malladi et al. SDE analysis to AdamW)
06.01.2026 15:21
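For reference, the Adam part of the batch-size story is the square-root learning-rate rule from Malladi et al.'s SDE analysis; a tiny sketch of that rule only (the weight-decay extension to AdamW is the paper's contribution and is not reproduced here):

```python
import math

def scale_adam_lr(lr, batch_size, base_batch_size):
    """Square-root batch-size scaling rule for Adam's learning rate,
    following Malladi et al.'s SDE analysis: scaling the batch by kappa
    scales the learning rate by sqrt(kappa)."""
    kappa = batch_size / base_batch_size
    return lr * math.sqrt(kappa)

new_lr = scale_adam_lr(3e-4, batch_size=1024, base_batch_size=256)  # 4x batch
```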
Extending μP and CompleteP, we've compiled a unified parameterisation that allows for zero-shot HP transfer across some of the most important scaling axes:
β’ Width
β’ Depth
β’ Batch Size
β’ Token Horizon
The table summarises the "all-in-one" recipe:
06.01.2026 15:21
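As a flavour of what such a recipe contains, here is only the standard width-scaling slice of μP for Adam hidden weights (a sketch under textbook μP assumptions; the paper's table additionally covers depth, batch size and token horizon):

```python
import math

def mup_width_rules(base_width, width, base_lr, base_init_std):
    """Width-scaling slice of a muP-style recipe for Adam hidden weights
    (a sketch, not the paper's full table): the learning rate scales like
    1/width and the init standard deviation like 1/sqrt(width), relative
    to a base model whose HPs were tuned at base_width."""
    m = width / base_width  # width multiplier
    return {
        "lr": base_lr / m,
        "init_std": base_init_std / math.sqrt(m),
    }

rules = mup_width_rules(base_width=256, width=1024, base_lr=1e-3, base_init_std=0.02)
```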
We set out to find the benefit of per-module HPs. Since tuning 100s of HPs on a 7B model is infeasible, we opted for a "tune small, transfer to big" approach. For this to work, we needed a parameterisation that enables HP transfer in modern transformer training.
06.01.2026 15:21
In our new work, Complete(d)P, we try to answer 3 questions about hyperparameter (HP) scaling:
• How to transfer across model size, tokens & batch size? → Complete(d)P
• Do per-module HPs matter? ✔ 2× speed-ups possible
• Do they transfer to larger scale? ✔ With the right parameterisation
06.01.2026 15:21
It's that time of the year!
The Apple Machine Learning Research (MLR) team in Paris is hiring a few interns to do cool research for ±6 months & work towards publications/OSS.
Check requirements and apply: jobs.apple.com/en-us/detail...
More: mlr_paris_internships@group.apple.com
17.10.2025 13:07
For example: for even moderately sized datasets, the trained diffusion model's marginal probability distribution stays the same irrespective of which examples were removed from the training data, potentially making the influence-functions task vacuous.
16.04.2025 12:44
We also point out several empirical challenges to the use of influence functions in diffusion models.
16.04.2025 12:44
In our paper, we empirically show that the choice of GGN and K-FAC approximation is crucial for the performance of influence functions, and that following our recommended design principles leads to better-performing approximations.
16.04.2025 12:44
Influence functions require the training loss Hessian matrix. Typically, a K-FAC approximation to a Generalised Gauss-Newton (GGN) matrix is used instead of the Hessian. However, it's not immediately obvious which GGN and K-FAC approximations to use in the diffusion setting.
16.04.2025 12:44
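For background: for a single linear layer, K-FAC replaces the expectation of a Kronecker product with the Kronecker product of expectations, so only two small factors need to be stored and inverted. A toy numpy illustration of generic K-FAC (not the paper's diffusion-specific choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 2048, 4, 3

# Per-example inputs a_i and output-side gradients g_i of one linear layer.
A = rng.normal(size=(n, d_in))
G = rng.normal(size=(n, d_out))

# "Exact" empirical curvature block over vec(W): the mean of outer
# products of per-example gradients kron(a_i, g_i).
per_example = np.stack([np.kron(A[i], G[i]) for i in range(n)])
exact = per_example.T @ per_example / n

# K-FAC: swap the expectation of a Kronecker product for the Kronecker
# product of expectations -- two small factors instead of one big matrix.
A_factor = A.T @ A / n  # ~ E[a a^T], shape (d_in, d_in)
G_factor = G.T @ G / n  # ~ E[g g^T], shape (d_out, d_out)
kfac = np.kron(A_factor, G_factor)
```

Here the inputs and gradients are sampled independently, so the K-FAC factorisation is close to the exact block; in real networks the approximation quality depends on how correlated they are.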
Influence functions are already being used in deep learning, from classification and regression through to autoregressive LLMs. What's the challenge in adapting them to the diffusion setting?
16.04.2025 12:44
β’ Identifying and removing data responsible for undesirable behaviours (e.g. generating explicit content)
β’ Data valuation (how much did each training datapoint contribute towards generating the samples my users pay me for?)
16.04.2025 12:44
Answering how a model's behaviour changes upon removing training datapoints could help with:
β’ Quantifying impact of copyrighted data on a given sample (how much less likely is it that the model would generate this image if not for the works of a given artist?)
16.04.2025 12:44
Influence functions attempt to answer: how would the model's behaviour (e.g. the probability of generating an image) change if the model were trained from scratch with some training datapoints removed?
They give an approximate answer, but without actually retraining the model.
16.04.2025 12:44
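The leave-one-out approximation behind influence functions can be sanity-checked on a toy ridge-regression problem, where retraining without a point is cheap enough to do exactly. Everything below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 1e-2  # ridge regularisation strength

def fit(X, y):
    """Minimise (1/2)||Xw - y||^2 + (lam/2)||w||^2 in closed form."""
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w = fit(X, y)

# Influence-function estimate of the parameter change from deleting point i:
# delta_w ~= H^{-1} grad_i, with H the total-loss Hessian and grad_i the
# deleted point's loss gradient at the trained parameters.
i = 0
H = X.T @ X + lam * np.eye(d)
grad_i = X[i] * (X[i] @ w - y[i])
delta_est = np.linalg.solve(H, grad_i)

# Ground truth: actually retrain without point i.
w_loo = fit(np.delete(X, i, axis=0), np.delete(y, i))
delta_true = w_loo - w
```

In this quadratic case the only error comes from reusing the full-data Hessian; for deep networks the Hessian-inverse-vector product is what the K-FAC approximations discussed in this thread make tractable.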
How do you identify training data responsible for an image generated by your diffusion model? How could you quantify how much copyrighted works influenced the image?
In our ICLR oral paper we propose how to approach such questions scalably with influence functions.
16.04.2025 12:44
It's an awesome piece of work, done on a surprisingly small budget compared to the performance.
21.03.2025 13:54
AI-driven weather prediction breakthrough reported
Researchers say Aardvark Weather uses thousands of times less computing power and is much faster than current systems
Rich Turner, with other members of our group, recently published a paper on Aardvark (end-to-end weather prediction with deep learning) in Nature, and it was just featured in The Guardian and Financial Times!
www.theguardian.com/technology/2...
21.03.2025 13:54
James, Shreyas and I will be at NeurIPS presenting this work. Come chat with us if you're interested!
05.12.2024 18:23
Denoising Diffusion Probabilistic Models in Six Simple Steps
Denoising Diffusion Probabilistic Models (DDPMs) are a very popular class of deep generative model that have been successfully applied to a diverse range of problems including image and video generati...
Diffusion models are so ubiquitous, but it's difficult to find an introduction that is concise, simple and comprehensive.
My supervisor Rich Turner (with me & some other students) has written an introduction to diffusion models that fills this gap:
arxiv.org/abs/2402.04384
15.11.2024 01:40