
Skrub

@skrub-data.bsky.social

skrub is a Python library to ease preprocessing and feature engineering for tabular machine learning. Our long-term goal is to directly connect database tables to machine learning estimators. https://skrub-data.org https://discord.gg/ABaPnm7fDC

611 Followers  |  48 Following  |  107 Posts  |  Joined: 19.11.2024

Latest posts by skrub-data.bsky.social on Bluesky

Skrub learning materials – Skrub

Slides:
skrub-data.org/skrub-materi...

07.10.2025 14:36 | 👍 0    🔁 0    💬 0    📌 0

Thanks to @riccardocappuzzo.com, @glemaitre58.bsky.social and Jérôme Dockès for preparing the talk and mentoring at the sprint!

07.10.2025 14:36 | 👍 0    🔁 0    💬 1    📌 0

The sprint was also a big hit, with both new and old contributors working on issues and getting to know the repository.

And to cap it all off, thanks to P16 we have stickers now 🚀

07.10.2025 14:36 | 👍 0    🔁 0    💬 1    📌 0
The skrub sticker on the back of a laptop


@pydataparis.bsky.social 2025 is over, and it was a big success!

Our talk was very well received, and we got a lot of great questions, especially about scalability and how to interface with other libraries in production environments.

07.10.2025 14:36 | 👍 3    🔁 0    💬 1    📌 1

What a banger is skrub @skrub-data.bsky.social !

Big thumbs up for the sklearn team & the maintainer of this package

01.10.2025 08:23 | 👍 12    🔁 4    💬 1    📌 0

📅 Less than a week away! The talk will be on Oct 1st at 10:05 AM in room Louis Armand 1 - Est.

If you want to contribute to skrub, we will also have a sprint on Thursday.

See you there!

26.09.2025 08:50 | 👍 4    🔁 1    💬 0    📌 1

🛠️ Main bugfixes
- Fixed the display of DataOp objects in Google Colab cell outputs.
- Fixed the range from which choose_float and choose_int sample values when log=False and n_steps is None.
- The SkrubLearner used to run a prediction on the training set during fit(); this has been fixed.

26.09.2025 08:48 | 👍 1    🔁 1    💬 0    📌 0

👀 Changes and deprecations
- Ken embeddings are now deprecated.
- The accepted values for the parameter how of .skb.apply() have changed. The new values are "auto", "cols", "frame", and "no_wrap".
- The parameter splitter of .skb.train_test_split() has been renamed split_func.

26.09.2025 08:48 | 👍 1    🔁 1    💬 1    📌 0

🚀 New features
- The DataOp.skb.full_report() now displays the time each node took to evaluate.
- The User guide has been reworked and expanded.

26.09.2025 08:48 | 👍 1    🔁 1    💬 1    📌 0
Release 0.6.2 · skrub-data/skrub New features The DataOp.skb.full_report() now displays the time each node took to evaluate. #1596 by Jérôme Dockès. The User guide has been reworked and expanded. Changes and deprecations Ken em...

⚑ Release 0.6.2 is out ⚑

github.com/skrub-data/s...

26.09.2025 08:48 | 👍 5    🔁 4    💬 1    📌 0

Reminder: skrub == cool

12.09.2025 13:34 | 👍 8    🔁 4    💬 0    📌 0
Hyperparameter tuning with DataOps A machine-learning pipeline typically contains some values or choices which may influence its prediction performance, such as hyperparameters (e.g. the regularization parameter alpha of a RidgeClas...

Here's another example of how to tune ML models with skrub DataOps: skrub-data.org/stable/auto_...

12.09.2025 12:56 | 👍 1    🔁 0    💬 1    📌 0
Skrub DataOps applied to forecasting timeseries

The plot in the video was created for our EuroSciPy 2025 tutorial on forecasting time series: skrub-data.org/EuroSciPy202...

12.09.2025 12:56 | 👍 1    🔁 1    💬 1    📌 0

The plot is interactive: you can select a range of results, and it will highlight only the runs within that range, enabling you to refine your search further. It also tracks fit and score times, so you can identify which parameters most impact runtime.

12.09.2025 12:56 | 👍 0    🔁 0    💬 1    📌 0
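
The range selection described in this post can be pictured with a small pure-Python sketch. It does not use skrub or its plotting code: the `runs` records, their values, and the helper `select_runs` are all hypothetical and made up for illustration. Each record stands for one line of a parallel coordinate plot, and selecting a range on one axis keeps only the runs that pass through it.

```python
# Hypothetical search results: one record per run. All values are invented
# for illustration and do not come from any real experiment.
runs = [
    {"alpha": 0.01, "n_components": 10, "score": 0.71, "fit_time": 0.8},
    {"alpha": 0.1, "n_components": 10, "score": 0.78, "fit_time": 0.9},
    {"alpha": 0.1, "n_components": 50, "score": 0.84, "fit_time": 3.1},
    {"alpha": 1.0, "n_components": 50, "score": 0.86, "fit_time": 3.0},
    {"alpha": 1.0, "n_components": 100, "score": 0.85, "fit_time": 6.2},
]


def select_runs(runs, axis, low, high):
    """Keep the runs whose value on `axis` lies in the closed range [low, high]."""
    return [run for run in runs if low <= run[axis] <= high]


# Select the best-scoring runs, then see which parameter values they share.
best = select_runs(runs, "score", 0.80, 1.00)
n_components_in_best = sorted({run["n_components"] for run in best})

# The same selection works on the timing axes, for example to find slow fits.
slow = select_runs(runs, "fit_time", 5.0, float("inf"))
```

Applying the same filter to a timing axis such as `fit_time` is one way to spot which parameter choices dominate runtime.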

skrub DataOps help you construct complex and extensive hyperparameter search spaces. However, interpreting results from large grids can be challenging.
To address this, skrub generates a parallel coordinate plot that visualizes all runs and the parameters used to achieve specific results.

12.09.2025 12:56 | 👍 6    🔁 0    💬 1    📌 1
SquashingScaler Gallery examples: SquashingScaler: Robust numerical preprocessing for neural networks

skrub-data.org/stable/refer...

05.09.2025 08:47 | 👍 0    🔁 0    💬 0    📌 0

Do you have to deal with numerical features that involve large outliers, and need to train linear models or neural networks?

Then you might want to try the skrub SquashingScaler. The SquashingScaler behaves like scikit-learn's RobustScaler, but smoothly clips outliers to predefined boundaries.

05.09.2025 08:47 | 👍 1    🔁 1    💬 1    📌 0
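
To make the description above concrete, here is a minimal pure-Python sketch of the idea: center and scale with robust statistics, then clip smoothly instead of abruptly. It is not skrub's implementation. The function name `squash`, the choice of median and interquartile range, and the soft-clipping formula z / sqrt(1 + (z / B)^2) are assumptions chosen for illustration.

```python
import math
import statistics


def squash(values, max_abs=3.0):
    """Robustly rescale `values`, then smoothly clip them into (-max_abs, max_abs).

    Illustrative sketch only; the formula is an assumption, not skrub's code.
    """
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    scale = iqr if iqr > 0 else 1.0  # avoid dividing by zero on constant input
    squashed = []
    for value in values:
        z = (value - median) / scale  # robust centering and scaling
        # Soft clip: close to z for small |z|, approaches +/- max_abs for large |z|.
        squashed.append(z / math.sqrt(1.0 + (z / max_abs) ** 2))
    return squashed


data = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]  # toy column with one large outlier
squashed = squash(data)
```

In this toy input the outlier 1000.0 ends up just below the bound of 3, and the ordering of the values is preserved because the soft clip is monotonic.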

Our first talk tonight is from @gaelvaroquaux.bsky.social on @skrub-data.bsky.social.

Real tables are too messy for sklearn - skrub preprocesses them for you.

02.09.2025 18:28 | 👍 8    🔁 3    💬 1    📌 1

Had a great PyData London tonight! Was a real treat to hear from @gaelvaroquaux.bsky.social on @skrub-data.bsky.social and the real-world data pains it's solving. (Try it if you haven’t already; super easy to get going!)

03.09.2025 00:06 | 👍 3    🔁 1    💬 0    📌 0
Release Skrub release 0.6.1 · skrub-data/skrub Bugfixes get_feature_names_out now works correctly when used by GapEncoder, DropCols, SelectCols: from within a scikit-learn Pipeline. In addition, DropCols’s get_feature_names_out method now retu...

⚑Maintenance release ⚑

Release 0.6.1 fixes a bug that may happen when combining certain column-based skrub transformers with the scikit-learn ColumnTransformer.

github.com/skrub-data/s...

29.08.2025 15:59 | 👍 1    🔁 0    💬 0    📌 0

We had a great tutorial at #EuroScipy2025!

We had the opportunity to show off the features of skrub to a wide audience, and to show how they can be used in a fairly complex use case.

29.08.2025 15:57 | 👍 1    🔁 0    💬 0    📌 0
GitHub - skrub-data/skrub: Machine learning with dataframes Machine learning with dataframes. Contribute to skrub-data/skrub development by creating an account on GitHub.

And even more changes and improvements!

We hope you enjoy the new release, and if you do, don't forget to 🌟 the repo 😉

github.com/skrub-data/s...

24.07.2025 15:55 | 👍 0    🔁 1    💬 0    📌 0
StringEncoder Gallery examples: Various string encoders: a sentiment analysis example Hands-On with Column Selection and Transformers Multiples tables: building machine learning pipelines with DataOps Hyperparam...

🔒 We changed the default high-cardinality encoder: the StringEncoder is now used by default.

skrub-data.org/stable/refer...

24.07.2025 15:55 | 👍 1    🔁 1    💬 1    📌 0
config Configuration:

⚙️ A global Skrub config has been introduced; it lets you set a number of parameters that customize the behavior of Skrub.

skrub-data.org/stable/refer...

24.07.2025 15:55 | 👍 0    🔁 1    💬 1    📌 0
DropUninformative

🗑️ DropUninformative is a transformer that uses various heuristics to remove columns that are unlikely to carry useful information for training a model.

skrub-data.org/dev/referenc...

24.07.2025 15:55 | 👍 1    🔁 1    💬 1    📌 0
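
As a rough illustration of what such heuristics can look like, the sketch below drops columns that are constant or almost entirely null. It is pure Python on a dict of lists and does not call skrub; the function `drop_uninformative`, its `max_null_fraction` parameter, and the two rules it applies are assumptions made for this example, not the library's actual behavior or defaults.

```python
def drop_uninformative(table, max_null_fraction=0.9):
    """Return a copy of `table` (column name -> list of values) without columns
    that are constant or mostly null.

    Hypothetical helper for illustration; the rules and threshold are assumptions.
    """
    kept = {}
    for name, values in table.items():
        non_null = [v for v in values if v is not None]
        null_fraction = 1.0 - len(non_null) / len(values) if values else 1.0
        is_constant = len(set(non_null)) <= 1
        if null_fraction > max_null_fraction or is_constant:
            continue  # unlikely to help a model, so skip this column
        kept[name] = list(values)
    return kept


table = {
    "city": ["Paris", "London", "Paris", "Rome"],
    "region": ["EU", "EU", "EU", "EU"],  # constant column
    "notes": [None, None, None, None],   # entirely null column
    "price": [10.5, 12.0, None, 9.9],    # a few nulls, still informative
}
cleaned = drop_uninformative(table)
```

Here only `city` and `price` survive; a column with a few missing values is kept because its null fraction stays under the threshold.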
SquashingScaler: Robust numerical preprocessing for neural networks The following example illustrates the use of the SquashingScaler, a transformer that can rescale and squash numerical features to a range that works well with neural networks and perhaps also other...

📏 The SquashingScaler has been added: it robustly rescales and smoothly clips numerical columns, enabling more robust handling of numerical columns with neural networks.

skrub-data.org/dev/auto_exa...

24.07.2025 15:55 | 👍 1    🔁 1    💬 1    📌 0
Hands-On with Column Selection and Transformers In previous examples, we saw how skrub provides powerful abstractions like TableVectorizer and tabular_learner() to create pipelines. In this new example, we show how to create more flexible pipeli...

🛠️ selectors, ApplyToCols and ApplyToFrame are now available. They provide flexible utilities for selecting the columns to which a transformer should be applied.

skrub-data.org/dev/auto_exa...

24.07.2025 15:55 | 👍 0    🔁 1    💬 1    📌 0

📊 The TableReport has been improved with many new features: series are now supported directly, and plot generation can be skipped when the number of columns in the dataframe exceeds a user-defined threshold. Columns with high cardinality and sorted columns are now highlighted.

24.07.2025 15:55 | 👍 0    🔁 1    💬 1    📌 0
Skrub DataOps: fit, tune, and validate arbitrary data wrangling What are Skrub DataOps, and why do we need them?: Skrub provides an easy way to build complex, flexible machine learning pipelines. There are several needs that are not easily addressed with standa...

Get started with DataOps in the user guide:

skrub-data.org/dev/userguid...

24.07.2025 15:55 | 👍 0    🔁 1    💬 1    📌 0

Form complex DataOps plans to train and tune machine learning models, then export the plans as learners, standalone objects that can be used on new data.

Tune hyperparameters where they're defined, and explore the resulting space with a parallel coordinate plot.

24.07.2025 15:55 | 👍 0    🔁 1    💬 1    📌 0
