
Pierre Ablin

@pierreablin.bsky.social

Research scientist at Apple | machine learning, optimization, language modeling pierreablin.com

253 Followers  |  216 Following  |  4 Posts  |  Joined: 21.11.2024

Latest posts by pierreablin.bsky.social on Bluesky

Paper 🧵 (cross-posted at X): When does composition of diffusion models "work"? Intuitively, the reason dog+hat works and dog+horse doesn't has something to do with independence between the concepts being composed. The tricky part is to formalize exactly what this means. 1/
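
A hedged aside for context: the standard composition rule in this line of work is product-of-experts, which simply adds the conditional scores, and it is exact precisely when the concepts are conditionally independent given the image. Whether the paper formalizes "independence" this way is not stated in the post.

```latex
% Product-of-experts composition of two concepts c_1 (dog) and c_2 (hat),
% assuming c_1 and c_2 are conditionally independent given x:
%   p(x \mid c_1, c_2) \propto \frac{p(x \mid c_1)\, p(x \mid c_2)}{p(x)},
% so the composed score used when sampling is
\nabla_x \log p(x \mid c_1, c_2)
  = \nabla_x \log p(x \mid c_1) + \nabla_x \log p(x \mid c_2) - \nabla_x \log p(x).
```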

11.02.2025 05:59 — 👍 39    🔁 15    💬 2    📌 2
Preview
The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedul...

Learning rate schedules seem mysterious? Why is the loss going down so fast during cooldown?
Turns out that this behaviour can be described with a bound from *convex, nonsmooth* optimization.

A short thread on our latest paper 🚞

arxiv.org/abs/2501.18965
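
For reference, a minimal sketch of the kind of constant-then-cooldown schedule discussed here (illustrative only; the exact schedules and the convex, nonsmooth bound are in the paper):

```python
def constant_plus_cooldown(step, total_steps, base_lr=1e-3, cooldown_frac=0.2):
    """Constant learning rate followed by a linear cooldown to zero.

    Illustrative schedule of the kind discussed in the post; the schedules and
    the theoretical bound actually analyzed are in arxiv.org/abs/2501.18965.
    """
    cooldown_start = int((1 - cooldown_frac) * total_steps)
    if step < cooldown_start:
        return base_lr
    # Linear decay from base_lr down to 0 over the cooldown phase.
    return base_lr * (total_steps - step) / (total_steps - cooldown_start)

lrs = [constant_plus_cooldown(t, total_steps=1000) for t in range(1000)]
print(lrs[0], lrs[800], lrs[999])  # constant phase, start of cooldown, near zero
```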

05.02.2025 10:13 — 👍 31    🔁 6    💬 2    📌 0

Excited to share Soup-of-Experts, a new neural network architecture that, for any given task, can instantiate in a flash a small model that performs very well on it.

Made with ❤️ at Apple

Thanks to my co-authors David Grangier, Angelos Katharopoulos, and Skyler Seto!

arxiv.org/abs/2502.01804
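
A hedged reading of the name, as a toy sketch only (the actual architecture and training recipe are in the paper): instantiate a task-specific model by averaging expert parameter sets with task-dependent weights.

```python
import torch

def instantiate_soup(expert_params, weights):
    """Toy parameter 'soup': one model whose weights are a convex combination of experts.

    This only illustrates a plausible reading of the name Soup-of-Experts;
    the actual architecture is described in arxiv.org/abs/2502.01804.
    """
    return {
        name: sum(w * p[name] for w, p in zip(weights, expert_params))
        for name in expert_params[0]
    }

# Hypothetical example: 4 experts sharing the same parameter shapes.
experts = [{"w": torch.randn(16, 16), "b": torch.randn(16)} for _ in range(4)]
task_weights = torch.softmax(torch.randn(4), dim=0)  # e.g. predicted from task features
small_model = instantiate_soup(experts, task_weights)
print(small_model["w"].shape)  # torch.Size([16, 16])
```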

05.02.2025 09:32 — 👍 12    🔁 4    💬 0    📌 0

Really proud of these two companion papers by our team at GDM:

1) Joint Learning of Energy-based Models and their Partition Function
arxiv.org/abs/2501.18528

2) Loss Functions and Operators Generated by f-Divergences
arxiv.org/abs/2501.18537

A thread.
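
For readers outside this area, the object that makes the first title non-trivial (standard background, not specific to these papers):

```latex
% An energy-based model and its (usually intractable) partition function:
p_\theta(x) = \frac{\exp\bigl(-E_\theta(x)\bigr)}{Z_\theta},
\qquad
Z_\theta = \int \exp\bigl(-E_\theta(x)\bigr)\,\mathrm{d}x .
% Maximum likelihood requires \log Z_\theta; as the first title says, the idea is to
% learn an estimate of the partition function jointly with the energy E_\theta.
```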

31.01.2025 12:06 — 👍 14    🔁 3    💬 1    📌 1

How do tokens evolve as they are processed by a deep Transformer?

With José A. Carrillo, @gabrielpeyre.bsky.social and @pierreablin.bsky.social, we tackle this in our new preprint: A Unified Perspective on the Dynamics of Deep Transformers arxiv.org/abs/2501.18322

ML and PDE lovers, check it out!
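
A hedged pointer for context: a common way to set this up is to view the n tokens as interacting particles whose evolution with depth follows an attention-driven ODE. The generic form below is standard in this literature; the preprint's precise model may differ.

```latex
% Tokens x_1, ..., x_n seen as particles evolving with depth t under self-attention:
\dot{x}_i(t) = \sum_{j=1}^{n}
  \frac{\exp\bigl(\langle Q x_i(t), K x_j(t)\rangle\bigr)}
       {\sum_{k=1}^{n} \exp\bigl(\langle Q x_i(t), K x_k(t)\rangle\bigr)}\, V x_j(t),
\qquad i = 1, \dots, n.
% A residual self-attention block is one explicit Euler step of such a system.
```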

31.01.2025 16:56 — 👍 95    🔁 16    💬 2    📌 0

Byte Pair Encoding is a tokenization method that starts with the individual characters (or bytes) of the text as the initial tokens. It iteratively merges the most frequent adjacent pair of tokens, adding each merged pair to the vocabulary as a new token, until the vocabulary reaches a predefined size. The output is a sequence of tokens. https://buff.ly/42oG80f
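
A minimal sketch of the merge loop described above (toy, character-level code, not a production tokenizer):

```python
from collections import Counter

def train_bpe(text: str, vocab_size: int):
    """Toy Byte Pair Encoding trainer following the description above."""
    tokens = list(text)          # start with every character as its own token
    vocab = set(tokens)
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent pairs and pick the most frequent one.
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break
        new_token = a + b
        merges.append((a, b))
        vocab.add(new_token)
        # Merge every occurrence of the pair in the token sequence.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(new_token)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = train_bpe("low lower lowest low low", vocab_size=20)
print(merges[:3], tokens)
```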

30.01.2025 06:00 — 👍 14    🔁 2    💬 1    📌 1

🎓 💫 We are opening post-doc positions at the intersection of AI, data science, and medicine:
• Large Language Models for French medical texts
• Evaluating digital medical devices: statistics and causal inference

29.01.2025 08:19 — 👍 27    🔁 16    💬 1    📌 0

Mixtures of experts are all the rage when it comes to shipping low-latency LLMs.

Check out this awesome work by Samira et al. on scaling laws for mixtures of experts!
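
For context, hedged: an MoE layer keeps many expert networks but routes each token through only a few of them, so parameter count grows without a matching growth in per-token compute. A minimal top-k router might look like this (illustrative, not the cited paper's setup):

```python
import torch
import torch.nn.functional as F

class TinyMoE(torch.nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative only)."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = torch.nn.Linear(d_model, n_experts)
        self.experts = torch.nn.ModuleList(
            [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
        )
        self.top_k = top_k

    def forward(self, x):                      # x: (n_tokens, d_model)
        scores = self.router(x)                # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(5, 64)).shape)  # torch.Size([5, 64])
```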

28.01.2025 10:15 — 👍 3    🔁 0    💬 0    📌 0

🚨 One question that has always intrigued me is the role of different ways to increase a model's capacity: parameters, parallelizable compute, or sequential compute?

We explored this through the lens of MoEs:

28.01.2025 06:25 — 👍 18    🔁 8    💬 1    📌 3

Thrilled to share the latest work from our team at @Apple, where we achieve interpretable and fine-grained control of LLMs and Diffusion models via Activation Transport 🔥

📄 arxiv.org/abs/2410.23054
🛠️ github.com/apple/ml-act

0/9 🧵
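
A heavily hedged toy illustration of the general idea the name suggests, steering a model by moving its activation distribution toward a target one; the actual estimator in arxiv.org/abs/2410.23054 and github.com/apple/ml-act may differ substantially.

```python
import numpy as np

# Toy sketch: transport activations from a "source" distribution toward a "target"
# one with a per-dimension affine map (the 1-D optimal transport map between
# Gaussians). This only illustrates the general idea of activation transport;
# it is not the method of the paper or the ml-act repository.

def fit_affine_transport(src_acts: np.ndarray, tgt_acts: np.ndarray):
    """Return per-dimension (scale, shift) mapping src statistics onto tgt statistics."""
    scale = tgt_acts.std(axis=0) / (src_acts.std(axis=0) + 1e-8)
    shift = tgt_acts.mean(axis=0) - scale * src_acts.mean(axis=0)
    return scale, shift

def steer(acts: np.ndarray, scale, shift, strength: float = 1.0):
    """Interpolate between the original and fully transported activations."""
    transported = scale * acts + shift
    return (1 - strength) * acts + strength * transported

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(1024, 16))   # e.g. activations on neutral prompts
tgt = rng.normal(2.0, 0.5, size=(1024, 16))   # e.g. activations on target-concept prompts
scale, shift = fit_affine_transport(src, tgt)
print(steer(src, scale, shift, strength=0.5).mean().round(2))
```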

10.12.2024 13:09 — 👍 47    🔁 15    💬 3    📌 5
Preview
Theory, Analysis, and Best Practices for Sigmoid Self-Attention Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as...

Excited to see Sigmoid Attention accepted at ICLR 2025!!

Make attention ~18% faster with a drop-in replacement 🚀

Code:
github.com/apple/ml-sig...

Paper:
arxiv.org/abs/2409.04431
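
The drop-in change, roughly: keep the score matrix but replace the row-wise softmax with an element-wise sigmoid. The sketch below only shows the idea; the recommended bias and the fast kernels are in the paper and repo above.

```python
import torch

def softmax_attention(q, k, v):
    """Standard attention: weights are a row-wise softmax of the scores."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def sigmoid_attention(q, k, v, bias=None):
    """Sigmoid attention sketch: element-wise sigmoid instead of softmax.

    `bias` shifts the scores; see the paper for the recommended,
    sequence-length-dependent choice.
    """
    n = k.shape[-2]
    if bias is None:
        bias = -torch.log(torch.tensor(float(n)))
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return torch.sigmoid(scores + bias) @ v

q = k = v = torch.randn(2, 8, 16)      # (batch, seq_len, head_dim)
print(softmax_attention(q, k, v).shape, sigmoid_attention(q, k, v).shape)
```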

24.01.2025 18:46 — 👍 28    🔁 5    💬 1    📌 0

The Apple Machine Learning Research (MLR) team in Paris has openings for both FTE roles and a short-term post-doc position to contribute to our team's research agenda. Researchers at Apple's MLR (led by Samy Bengio) target impactful publications in top-tier ML venues and OSS.

18.12.2024 17:05 — 👍 13    🔁 3    💬 1    📌 2

Congratulations on these new models!!

22.11.2024 10:33 — 👍 4    🔁 0    💬 0    📌 0

Does autoregressive pre-training work for vision? 🤔
Delighted to share AIMv2, a family of strong, scalable, and open vision encoders that excel at multimodal understanding, recognition, and grounding 🧵

paper: arxiv.org/abs/2411.14402
code: github.com/apple/ml-aim
HF: huggingface.co/collections/...
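
For context on the question above, a generic sketch of autoregressive pre-training on image patches (not the AIMv2 recipe itself, which is multimodal and described in the paper):

```python
import torch
import torch.nn as nn

# Generic sketch: a causal Transformer predicts each image patch from the previous
# ones. This illustrates the question in the post, not AIMv2 (arxiv.org/abs/2411.14402).

class TinyAutoregressiveImageModel(nn.Module):
    def __init__(self, patch_dim=48, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, patch_dim)   # regress the next patch's pixels

    def forward(self, patches):                     # (batch, n_patches, patch_dim)
        n = patches.shape[1]
        causal_mask = torch.full((n, n), float("-inf")).triu(1)
        h = self.encoder(self.embed(patches), mask=causal_mask)
        return self.head(h)

model = TinyAutoregressiveImageModel()
patches = torch.randn(4, 16, 48)                    # e.g. 4 images, 16 patches of 4x4x3
pred = model(patches)
# Next-patch regression loss: predict patch t+1 from patches up to t.
loss = ((pred[:, :-1] - patches[:, 1:]) ** 2).mean()
print(loss.item())
```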

22.11.2024 08:32 — 👍 59    🔁 19    💬 3    📌 1
YouTube video by probabl: Why the MinHashEncoder is great for boosted trees

Great video explaining a clever vectorization for learning on strings and dirty categories:

the MinHashEncoder is fast, stateless, and excellent with tree-based learners.
It's in @skrub-data.bsky.social
youtu.be/ZMQrNFef8fg
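
The gist of the trick, as a toy re-implementation (not skrub's actual code): hash the character n-grams of each string with several hash functions and keep the minimum per function, so strings sharing n-grams get similar feature vectors that trees can split on.

```python
import numpy as np

# Toy min-hash encoder for strings: hash character 3-grams with several seeded
# hash functions and keep the minimum hash per function. Similar strings share
# n-grams, hence similar minima. This mirrors the idea behind skrub's
# MinHashEncoder but is not its actual implementation.

def minhash_encode(strings, n_components=8, ngram=3):
    out = np.empty((len(strings), n_components), dtype=np.float64)
    for i, s in enumerate(strings):
        grams = {s[j:j + ngram] for j in range(max(1, len(s) - ngram + 1))}
        for c in range(n_components):
            # One cheap "hash function" per component, via a seed mixed into hash().
            out[i, c] = min(hash((c, g)) % (2**31) for g in grams)
    return out

X = minhash_encode(["senior software engineer", "software engineer", "accountant"])
print(np.round(X[:2] / 2**31, 2))  # the first two rows share many minima
```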

21.11.2024 10:12 — 👍 75    🔁 8    💬 2    📌 0
