Maybe. But both should just converge to the true posterior with infinite pretraining scale.
24.02.2026 22:16 — 👍 1 🔁 0 💬 0 📌 0
@dholzmueller.bsky.social
Postdoc in machine learning with Francis Bach & @GaelVaroquaux: neural networks, tabular data, uncertainty, active learning, atomistic ML, learning theory. https://dholzmueller.github.io
I tried 512 and 999 in small-scale experiments and there didn't seem to be a significant difference. I don't think a smaller choice would be better for small datasets, since we have a lot of small datasets in pretraining. But you never know...
24.02.2026 20:55 — 👍 1 🔁 0 💬 1 📌 0
To piggy-back a bit on the foundation models for structured data discussion here:
My colleagues at Yandex Research just updated the GraphPFN paper. It's a Graph Foundation Model that works on graph datasets with tabular features, and shows SOTA results both in ICL regimes and when fine-tuned.
Super hyped that it's finally out!
12.02.2026 14:05 — 👍 16 🔁 1 💬 2 📌 0
My 2025 highlights for AI research and code:
▪ Unpacking the AI scale narrative
▪ Tabular-learning research
- TabICL: table foundation model
- Retrieve, merge, predict: data lakes
▪ Better software
- Skrub: machine learning with tables
- Fundamentals in scikit-learn
gael-varoquaux.info/science/2025...
Sacha Braun, David Holzmüller, Michael I. Jordan, Francis Bach
Conditional Coverage Diagnostics for Conformal Prediction
https://arxiv.org/abs/2512.11779
Let's kick off 2026 with a workshop on Survival Analysis and Foundation Models, co-organized w. Julie Alberge Linus Bleistein Clément Berenfeld Agathe Guilloux and Julie Josse on January 27th at PariSanté Campus !
Registrations and submissions are open!!! www.linusbleistein.com/ramh
Still using temperature scaling?
With @dholzmueller.bsky.social, Michael I. Jordan and @bachfrancis.bsky.social, we argue that, with well-designed regularization, more expressive models like matrix scaling can outperform simpler ones across calibration set sizes, data dimensions, and applications.
⚡ Release 0.6.2 is out ⚡
github.com/skrub-data/s...
Daniel Beaglehole, David Holzmüller, Adityanarayanan Radhakrishnan, Mikhail Belkin: xRFM: Accurate, scalable, and interpretable feature learning models for tabular data https://arxiv.org/abs/2508.10053
15.08.2025 06:32 — 👍 3 🔁 3 💬 0 📌 0
Thanks!
30.07.2025 12:35 — 👍 0 🔁 0 💬 0 📌 0
Solution write-up with additional insights: kaggle.com/competitions...
🥈2nd place used stacking with diverse models
🥇1st place found a larger dataset
I got 3rd out of 691 in a tabular kaggle competition – with only neural networks! 🥉
My solution is short (48 LOC) and relatively general-purpose – I used skrub to preprocess string and date columns, and pytabkit to create an ensemble of RealMLP and TabM models. Link below👇
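The broad shape of that kind of solution can be sketched with plain scikit-learn stand-ins. To be clear about assumptions: the real solution used skrub's preprocessing and pytabkit's RealMLP/TabM models; the toy data, the `VotingClassifier`, and the `MLPClassifier` below are generic substitutes, not the actual 48-line code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy table with a string column and a date column, the kinds of
# features skrub preprocesses automatically.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["paris", "berlin", "rome"], size=300),
    "signup": pd.to_datetime("2024-01-01")
              + pd.to_timedelta(rng.integers(0, 365, 300), "D"),
    "amount": rng.normal(size=300),
})
y = (df["amount"] + (df["city"] == "paris")).gt(0.5).astype(int)

# Expand the date into numeric features by hand (skrub's datetime
# handling does this more thoroughly).
df["signup_month"] = df["signup"].dt.month
df["signup_dow"] = df["signup"].dt.dayofweek
X = df.drop(columns="signup")

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ("num", StandardScaler(), ["amount", "signup_month", "signup_dow"]),
])

# Soft-voting ensemble of differently seeded neural nets, standing in
# for the RealMLP + TabM ensemble from pytabkit.
ensemble = VotingClassifier(
    [(f"mlp{s}", MLPClassifier(hidden_layer_sizes=(64,), max_iter=300,
                               random_state=s))
     for s in range(3)],
    voting="soft",
)
model = make_pipeline(pre, ensemble)
model.fit(X, y)
print("train accuracy:", model.score(X, y))
```

The point of the pattern is the separation of concerns: generic table preprocessing up front, then an ensemble of diverse neural models, with no dataset-specific feature engineering.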
Excited to have co-contributed the SquashingScaler, which implements the robust numerical preprocessing from RealMLP!
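The idea behind that preprocessing, roughly: scale each feature by robust quantile-based statistics, then smoothly squash outliers into a bounded range instead of hard-clipping them. A minimal numpy sketch of the concept follows; it is not skrub's exact formula, and `max_abs=3.0` is an illustrative choice.

```python
import numpy as np

def squashing_scale(x, max_abs=3.0, eps=1e-30):
    """Robustly scale a 1-D feature, then smoothly squash outliers.

    Scaling uses the interquartile range, so a single extreme value
    cannot blow up the scale; the squashing step maps the real line
    into [-max_abs, max_abs] while staying nearly linear near zero.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    q1, q3 = np.quantile(x, [0.25, 0.75])
    scale = max(q3 - q1, eps)
    z = (x - med) / scale
    # Smooth soft clipping: approximately the identity for small |z|,
    # bounded by max_abs for large |z|.
    return z / np.sqrt(1.0 + (z / max_abs) ** 2)

x = np.array([0.0, 1.0, 2.0, 3.0, 1e9])  # one extreme outlier
z = squashing_scale(x)
print(z)
```

Compared with a plain StandardScaler, the single 1e9 outlier neither dominates the scale of the other values nor survives as a huge input to the network.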
24.07.2025 16:00 — 👍 8 🔁 4 💬 0 📌 0
Is it because mathematicians think in terms of the number of assumptions that are satisfied, while physicists think in terms of the number of things that satisfy them?
23.07.2025 21:04 — 👍 1 🔁 0 💬 1 📌 0
👨🎓🧾✨#icml2025 Paper: TabICL, A Tabular Foundation Model for In-Context Learning on Large Data
With Jingang Qu, @dholzmueller.bsky.social, and Marine Le Morvan
TL;DR: a well-designed architecture and pretraining give the best tabular learner, and one that scales better
On top, it's 100% open source
1/9
🚨What is SOTA on tabular data, really? We are excited to announce 𝗧𝗮𝗯𝗔𝗿𝗲𝗻𝗮, a living benchmark for machine learning on IID tabular data with:
📊 an online leaderboard (submit!)
📑 carefully curated datasets
📈 strong tree-based, deep learning, and foundation models
🧵
Missed the school? We have uploaded recordings of most talks to our YouTube Channel www.youtube.com/@AutoML_org 🙌
20.06.2025 15:35 — 👍 4 🔁 2 💬 1 📌 0
📝 The skrub TextEncoder brings the power of HuggingFace language models to embed text features in tabular machine learning, for all those use cases that involve text-based columns.
28.05.2025 08:43 — 👍 6 🔁 4 💬 2 📌 0
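To show where such a text encoder slots into a tabular pipeline, here is a hedged sketch using scikit-learn's `TfidfVectorizer` as a lightweight stand-in: the real skrub TextEncoder would instead produce dense HuggingFace embeddings for the text column, and the toy data and model choices below are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy table mixing a free-text column with a numeric one.
df = pd.DataFrame({
    "review": ["great product", "terrible service", "great service",
               "terrible product", "great value", "terrible value"],
    "price": [10.0, 20.0, 15.0, 30.0, 12.0, 25.0],
})
y = [1, 0, 1, 0, 1, 0]

# TfidfVectorizer stands in for the TextEncoder; note it receives the
# column name as a string (not a list), since it expects raw text.
pre = ColumnTransformer([
    ("text", TfidfVectorizer(), "review"),
    ("num", StandardScaler(), ["price"]),
])
clf = make_pipeline(pre, LogisticRegression())
clf.fit(df, y)
print(clf.predict(df))
```

The benefit of a language-model encoder over TF-IDF is that semantically similar strings get similar embeddings even when they share no tokens.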
Poster: iclr.cc/virtual/2025...
Hall 3 + Hall 2B #32, 10am Singapore time
Paper: arxiv.org/abs/2408.01536
🚨ICLR poster in 1.5 hours, presented by @danielmusekamp.bsky.social :
Can active learning help to generate better datasets for neural PDE solvers?
We introduce a new benchmark to find out!
Featuring 6 PDEs, 6 AL methods, 3 architectures and many ablations - transferability, speed, etc.!
The Skrub TableReport is a lightweight tool that gives you a rich overview of a table quickly and easily.
✅ Filter columns
🔎 Look at each column's distribution
📊 Get a high level view of the distributions through stats and plots, including correlated columns
🌐 Export the report as html
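As a rough idea of what the report surfaces, here is a plain-pandas approximation of its per-column summary and correlation view; the DataFrame is made up for illustration, and the real report is interactive HTML rather than printed tables.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["paris", "berlin", "paris", None, "rome"],
    "temp_c": [12.0, 8.5, 13.1, 9.9, 15.2],
    "humidity": [0.71, 0.65, 0.74, 0.68, 0.60],
})

# Per-column summary: dtype, missing values, cardinality.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "n_missing": df.isna().sum(),
    "n_unique": df.nunique(),
})
print(summary)

# Correlations between numeric columns, one of the report's highlights.
print(df.select_dtypes("number").corr())
```

The report adds on top of this the per-column distribution plots and the HTML export mentioned above.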
#ICLR2025 Marine Le Morvan presents "Imputation for prediction: beware of diminishing returns": poster Thu 24th
arxiv.org/abs/2407.19804
Concludes 6 years of research on prediction with missing values: Imputation is useful but improvements are expensive, while better learners yield easier gains.
Details about the seminar talk titled TabICL: A Tabular Foundation Model for In-Context Learning on Large Data by Marine Le Morvan
Excited to share the new monthly Table Representation Learning (TRL) Seminar under the ELLIS Amsterdam TRL research theme! To recur every 2nd Friday.
Who: Marine Le Morvan, Inria (in-person)
When: Friday 11 April 4-5pm (+drinks)
Where: L3.36 Lab42 Science Park / Zoom
trl-lab.github.io/trl-seminar/
We are excited to announce #FMSD: "1st Workshop on Foundation Models for Structured Data" has been accepted to #ICML 2025!
Call for Papers: icml-structured-fm-workshop.github.io
Trying something new:
A 🧵 on a topic I find many students struggle with: "why do their 📊 look more professional than my 📊?"
It's *lots* of tiny decisions that aren't the defaults in many libraries, so let's break down 1 simple graph by @jburnmurdoch.bsky.social
🔗 www.ft.com/content/73a1...
🚀Continuing the spotlight series with the next @iclr-conf.bsky.social MLMP 2025 Oral presentation!
📝LOGLO-FNO: Efficient Learning of Local and Global Features in Fourier Neural Operators
📷 Join us on April 27 at #ICLR2025!
#AI #ML #ICLR #AI4Science
Links:
www.kaggle.com/competitions...
www.kaggle.com/competitions...
Link to the repo: github.com/dholzmueller...
PS: The newest pytabkit version now includes multiquantile regression for RealMLP and a few other improvements.
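For context on what multiquantile regression means: the model predicts several conditional quantiles of the target at once, each trained with the pinball (quantile) loss at its level. RealMLP does this with one shared network; the sketch below instead fits one scikit-learn quantile model per level, purely to illustrate the objective (the synthetic data and quantile levels are assumptions).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

quantiles = [0.1, 0.5, 0.9]
# One model per quantile level; a multiquantile network would share one
# backbone and emit all levels from a single forward pass.
models = {q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
          for q in quantiles}

X_test = np.array([[2.0], [5.0]])
preds = {q: m.predict(X_test) for q, m in models.items()}
for q in quantiles:
    print(f"q={q}: {preds[q]}")
```

Predicting the 0.1 and 0.9 quantiles together yields an 80% prediction interval, which is the usual reason to want several quantiles rather than just the conditional mean.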
bsky.app/profile/dhol...
Practitioners are often sceptical of academic tabular benchmarks, so I am elated to see that our RealMLP model outperformed boosted trees in two 2nd place Kaggle solutions, for a $10,000 forecasting challenge and a research competition on survival analysis.
10.03.2025 15:53 — 👍 7 🔁 1 💬 1 📌 0
The benchmark is limited to classification with AUC as a metric, which is one of RealMLP's weaker points. Datasets are from the CC-18 benchmark, and the benchmark uses nested cross-validation, unlike many other benchmarks. Link: arxiv.org/abs/2402.039...
04.03.2025 14:25 — 👍 1 🔁 0 💬 0 📌 0