
Gaël Varoquaux

@gaelvaroquaux.bsky.social

Research & code: Research director @inria ►Data, Health, & Computer science ►Python coder, (co)founder of scikit-learn, joblib, & @probabl.bsky.social ►Sometimes does art photography ►Physics PhD

13,551 Followers  |  206 Following  |  444 Posts  |  Joined: 26.08.2023

Latest posts by gaelvaroquaux.bsky.social on Bluesky

Messaging apps, and software in general, only succeed if they are used by a lot of people.

A policy of digital isolationism will fail in the long run.
Why not @signal.org, hosted in France?

In the meantime, my video calls with state agencies run on Teams...

01.08.2025 12:19 — 👍 15    🔁 1    💬 2    📌 0

EurIPS includes a call for both Workshops and Affinity Workshops!
We look forward to making #EurIPS a diverse and inclusive event with you.

The submission deadlines are August 22nd, AoE.

More information at:
eurips.cc/call-for-wor...
eurips.cc/call-for-aff...

28.07.2025 08:51 — 👍 34    🔁 18    💬 0    📌 2

📢 Talk Announcement

"PyPI in the face: running jokes that PyPI download stats can play on you", by Loïc Estève.

📜 Talk info: pretalx.com/pydata-paris-2025/talk/DSHHZK
📅 Schedule: pydata.org/paris2025/schedule
🎟 Tickets: pydata.org/paris2025/tickets

28.07.2025 10:27 — 👍 2    🔁 1    💬 0    📌 0

Excited to have co-contributed the SquashingScaler, which implements the robust numerical preprocessing from RealMLP!

24.07.2025 16:00 — 👍 7    🔁 4    💬 0    📌 0

Huge release, and the first one where I felt like I actually contributed a lot to the final result.

I really think DataOps are a game changer, and I can't wait to see what people come up with using them.

I also ended up rewriting most of the user guide, hopefully improving it along the way 😂

24.07.2025 16:05 — 👍 3    🔁 3    💬 0    📌 0

✨️💥skrub: machine learning with dataframes

New release 💫 0.6
A huge one, with the super powerful new "DataOps", and many improvements all over the library.
Exciting!!

24.07.2025 16:16 — 👍 15    🔁 3    💬 0    📌 0
GitHub - skrub-data/skrub: Machine learning with dataframes. Contribute to skrub-data/skrub development by creating an account on GitHub.

And even more changes and improvements!

We hope you enjoy the new release, and if you do, don't forget to 🌟 the repo 😉

github.com/skrub-data/s...

24.07.2025 15:55 — 👍 0    🔁 1    💬 0    📌 0
StringEncoder Gallery examples: Various string encoders: a sentiment analysis example Hands-On with Column Selection and Transformers Multiples tables: building machine learning pipelines with DataOps Hyperparam...

🔢 We changed the default high-cardinality encoder: the StringEncoder is now used by default.

skrub-data.org/stable/refer...

24.07.2025 15:55 — 👍 1    🔁 1    💬 1    📌 0
config Configuration:

⚙️ A global Skrub config has been introduced, which allows setting a number of parameters to customize the behavior of Skrub.

skrub-data.org/stable/refer...

24.07.2025 15:55 — 👍 0    🔁 1    💬 1    📌 0
DropUninformative

🗑️ DropUninformative is a transformer that uses various heuristics to remove columns that are unlikely to bring information for training a model.

skrub-data.org/dev/referenc...

24.07.2025 15:55 — 👍 0    🔁 1    💬 1    📌 0
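To make the idea in the DropUninformative post above concrete, here is a minimal pure-Python sketch of two such heuristics: dropping constant columns and dropping columns that are mostly missing. It is an illustration only, not skrub's implementation; the function name `drop_uninformative`, the `max_null_fraction` parameter, and the dict-of-lists table are all hypothetical.

```python
def drop_uninformative(table, max_null_fraction=0.9):
    """Return a copy of `table` (column name -> list of values) without
    columns that are constant or almost entirely missing.

    Hypothetical helper for illustration; not the skrub API.
    """
    kept = {}
    for name, values in table.items():
        non_null = [v for v in values if v is not None]
        # Fraction of missing entries; an empty column counts as fully missing.
        null_fraction = (1.0 - len(non_null) / len(values)) if values else 1.0
        if null_fraction > max_null_fraction:
            continue  # almost entirely missing
        if len(set(non_null)) <= 1:
            continue  # a single distinct value carries no signal
        kept[name] = values
    return kept


table = {
    "age": [31, 45, 27, 52],
    "country": ["FR", "FR", "FR", "FR"],  # constant
    "notes": [None, None, None, None],    # all missing
    "city": ["Paris", "Lyon", None, "Lille"],
}
cleaned = drop_uninformative(table)
```

Here `cleaned` keeps only the `age` and `city` columns. The real transformer may use different or additional heuristics and thresholds.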
SquashingScaler: Robust numerical preprocessing for neural networks The following example illustrates the use of the SquashingScaler, a transformer that can rescale and squash numerical features to a range that works well with neural networks and perhaps also other...

📏 The SquashingScaler has been added: it robustly rescales and smoothly clips numerical columns, enabling more robust handling of numerical columns with neural networks.

skrub-data.org/dev/auto_exa...

24.07.2025 15:55 — 👍 0    🔁 1    💬 1    📌 0
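As a rough illustration of what "robustly rescales and smoothly clips" can mean in the SquashingScaler post above, the sketch below centers a column on its median, scales it by the interquartile range, and then applies a smooth saturating function. This is an assumption-laden sketch, not skrub's code: the function name `robust_squash`, the exact squashing formula, and the bound `max_abs=3.0` are chosen here for illustration.

```python
import math
import statistics


def robust_squash(values, max_abs=3.0):
    """Median/IQR scaling followed by a smooth clip to (-max_abs, max_abs).

    Hypothetical helper for illustration; not the skrub API.
    """
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    scale = iqr if iqr > 0 else 1.0  # avoid dividing by zero on flat columns
    squashed = []
    for v in values:
        z = (v - median) / scale
        # Close to the identity near 0, saturating towards +/- max_abs.
        squashed.append(z / math.sqrt(1.0 + (z / max_abs) ** 2))
    return squashed


# Twenty ordinary values plus one extreme outlier.
column = [float(v) for v in range(1, 21)] + [1000.0]
squashed = robust_squash(column)
```

The outlier ends up just below `max_abs` instead of dominating the scale, while the ordering of the values is preserved. That bounded, outlier-tolerant range is the property the post points to for neural networks.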
Hands-On with Column Selection and Transformers In previous examples, we saw how skrub provides powerful abstractions like TableVectorizer and tabular_learner() to create pipelines. In this new example, we show how to create more flexible pipeli...

🛠️ selectors, ApplyToCols and ApplyToFrame are now available, providing utilities for selecting columns to which a transformer should be applied in a flexible way.

skrub-data.org/dev/auto_exa...

24.07.2025 15:55 — 👍 0    🔁 1    💬 1    📌 0

📊 The TableReport has been improved with many new features: series are now supported directly, plot generation can be skipped when the number of columns in the dataframe exceeds a user-defined threshold, and columns with high cardinality and sorted columns are now highlighted.

24.07.2025 15:55 — 👍 0    🔁 1    💬 1    📌 0
Skrub DataOps: fit, tune, and validate arbitrary data wrangling What are Skrub DataOps, and why do we need them?: Skrub provides an easy way to build complex, flexible machine learning pipelines. There are several needs that are not easily addressed with standa...

Get started with DataOps in the user guide:

skrub-data.org/dev/userguid...

24.07.2025 15:55 — 👍 0    🔁 1    💬 1    📌 0

Form complex DataOps plans to train and tune machine learning models, then export the plans as learners, standalone objects that can be used on new data.

Tune hyperparameters where they're defined, and explore the resulting space with a parallel coordinate plot

24.07.2025 15:55 — 👍 0    🔁 1    💬 1    📌 0

🌟 Major feature! Skrub DataOps are a powerful new way of combining dataframe transformations over multiple tables with machine learning pipelines.

24.07.2025 15:55 — 👍 1    🔁 1    💬 1    📌 0
Release history Release 0.6.0: Highlights: Major feature! Skrub DataOps are a powerful new way of combining dataframe transformations over multiple tables, and machine learning pipelines. DataOps can be combined t...

📖 Keep reading for the highlights, or check out the full changelog here: skrub-data.org/stable/CHANG...

24.07.2025 15:55 — 👍 0    🔁 1    💬 1    📌 0
User Guide Skrub is a library that eases machine learning with dataframes. Starting from rich, complex data stored in one or several dataframes, it helps performing the data wrangling nec...

📚 On top of that, we revamped most of the user guide, documentation, and API reference to make it easier to learn how to use the features of Skrub.

skrub-data.org/dev/document...

24.07.2025 15:55 — 👍 0    🔁 1    💬 1    📌 0

⚡ Release 0.6.0 is now out! ⚡

🚀 Major update! Skrub DataOps, various improvements for the TableReport, new tools for applying transformers to the columns, and a new robust transformer for numerical features are only some of the features included in this release.

24.07.2025 15:55 — 👍 5    🔁 3    💬 1    📌 3

Nope

The trick is to delete email.

Reading is optional, and quite unproductive.

21.07.2025 20:42 — 👍 3    🔁 0    💬 0    📌 0
This campaign needs you: Let's save the Palais de la découverte

Let's save the Palais de la découverte!

chng.it/fjzpCWfw5t

12.07.2025 08:20 — 👍 36    🔁 24    💬 0    📌 1

📢 Present your NeurIPS paper in Europe!

Join EurIPS 2025 + ELLIS UnConference in Copenhagen for in-person talks, posters, workshops and more. Registration opens soon; save the date:

📅 Dec 2–7, 2025
📍 Copenhagen 🇩🇰
🔗eurips.cc

#EurIPS
@euripsconf.bsky.social

16.07.2025 23:00 — 👍 60    🔁 17    💬 2    📌 3

Yes, life isn't fair, humans aren't rational (kisses to the neoclassical economists), and sociology is a powerful force.
But we have to accept that and fight on that terrain. The balance can be shifted (bravo @valmasdel.bsky.social)

11.07.2025 10:30 — 👍 1    🔁 0    💬 0    📌 0
An open mindset The commitments required for fully open source machine learning

Fully open machine learning requires not only GPU access but a community commitment to openness. (Some nostalgic lessons from the ImageNet decade.)

10.07.2025 14:28 — 👍 27    🔁 4    💬 1    📌 1

Bluesky is cool

10.07.2025 05:19 — 👍 5    🔁 0    💬 0    📌 0
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data

This work is presented at ICML next week.
• The paper arxiv.org/html/2502.05...
• The Python package: pypistats.org/packages/tab... (try it out 🐍)
• The source code github.com/soda-inria/t... (100% open source, including pre-training 💞)

Longer read (5 min): gael-varoquaux.info/science/tabi...
8/9

09.07.2025 18:41 — 👍 12    🔁 2    💬 0    📌 0

We find that TabICL is state-of-the-art, performing as well as TabPFNv2 (unpublished when we submitted TabICL) or even better in independent studies (figure from tabarena).
Also, we find that it gains an edge over TabPFNv2 for larger datasets.
7/9

09.07.2025 18:41 — 👍 2    🔁 0    💬 1    📌 0

With this architecture, both scalable and flexible, we can do intense pretraining on rich simulated datasets, including large ones, baking in subtle implicit biases.

For instance, in a comparison of classifiers, we see that TabICL has some of the axis-aligned behavior of trees
6/9

09.07.2025 18:41 — 👍 0    🔁 0    💬 1    📌 0

We also add positional encoding at the input using a "fingerprint" of the column distribution computed with a set-transformer.
Implicitly, this representation ends up capturing aspects of the distribution of input columns.
5/9

09.07.2025 18:41 — 👍 0    🔁 0    💬 1    📌 0

In TabICL, we do row-wise encoding, with a transformer across columns, before using a transformer across rows for in-context learning.
As a result, the cost is O(n p² + n²), which is more scalable. Also, the architecture is more amenable to memory offloading and caching.
4/9

09.07.2025 18:41 — 👍 0    🔁 0    💬 1    📌 0
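To get a feel for the O(n p² + n²) cost quoted in post 4/9 above, the small sketch below evaluates the two terms for a few table shapes, with n the number of rows and p the number of columns. Reading n p² as the column-wise transformer applied to each row and n² as the row-wise transformer is an interpretation of the post, and constant factors are ignored; the function name `tabicl_cost_terms` is made up for this example.

```python
def tabicl_cost_terms(n_rows, n_cols):
    """Return the two terms of the stated cost O(n p^2 + n^2), up to constants.

    Hypothetical helper for illustration; not part of the tabicl package.
    """
    column_term = n_rows * n_cols ** 2  # n p^2: attention across columns, per row
    row_term = n_rows ** 2              # n^2: attention across rows
    return column_term, row_term


for n_rows, n_cols in [(1_000, 100), (10_000, 100), (100_000, 100)]:
    column_term, row_term = tabicl_cost_terms(n_rows, n_cols)
    dominant = "n p^2" if column_term >= row_term else "n^2"
    print(f"n={n_rows:>7}, p={n_cols}: n p^2={column_term:.1e}, "
          f"n^2={row_term:.1e}, dominant: {dominant}")
```

The two terms are equal when n = p², so with 100 columns the row-wise n² term only takes over beyond 10,000 rows.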
