Skrub

@skrub-data.bsky.social

skrub is a Python library to ease preprocessing and feature engineering for tabular machine learning. Our long-term goal is to directly connect database tables to machine learning estimators. https://skrub-data.org https://discord.gg/ABaPnm7fDC

591 Followers  |  47 Following  |  90 Posts  |  Joined: 19.11.2024  |  2.2505

Latest posts by skrub-data.bsky.social on Bluesky

GitHub - skrub-data/skrub: Machine learning with dataframes Machine learning with dataframes. Contribute to skrub-data/skrub development by creating an account on GitHub.

And even more changes and improvements!

We hope you enjoy the new release, and if you do, don't forget to 🌟 the repo 😉

github.com/skrub-data/s...

24.07.2025 15:55 - 👍 0    🔁 1    💬 0    📌 0
StringEncoder Gallery examples: Various string encoders: a sentiment analysis example Hands-On with Column Selection and Transformers Multiples tables: building machine learning pipelines with DataOps Hyperparam...

🔢 We changed the default high cardinality encoder: the StringEncoder is now used by default.

skrub-data.org/stable/refer...

24.07.2025 15:55 - 👍 1    🔁 1    💬 1    📌 0
config Configuration:

⚙️ A global Skrub config has been introduced, which lets you set a number of parameters to customize the behavior of Skrub.

skrub-data.org/stable/refer...

24.07.2025 15:55 - 👍 0    🔁 1    💬 1    📌 0
DropUninformative

🗑️ DropUninformative is a transformer that uses various heuristics to remove columns that are unlikely to bring information for training a model.

skrub-data.org/dev/referenc...

24.07.2025 15:55 - 👍 0    🔁 1    💬 1    📌 0
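
To give a rough idea of the kind of heuristics such a transformer can rely on, here is a minimal sketch that drops constant and mostly-missing columns from a table stored as a plain dictionary of lists. The function `drop_uninformative_columns` and its `null_threshold` parameter are hypothetical names chosen for this illustration; they are not skrub's API, and the actual heuristics and defaults of DropUninformative are described in its documentation.

```python
def drop_uninformative_columns(table, null_threshold=0.9):
    """Return a copy of `table` without constant or mostly-missing columns.

    `table` maps column names to equal-length lists; None marks a missing
    value. Both heuristics and the threshold are illustrative assumptions.
    """
    kept = {}
    for name, values in table.items():
        non_null = [v for v in values if v is not None]
        null_fraction = 1 - len(non_null) / len(values) if values else 1.0
        if null_fraction >= null_threshold:
            continue  # almost every entry is missing
        if len(set(non_null)) <= 1:
            continue  # a single repeated value carries no signal
        kept[name] = values
    return kept


table = {
    "city": ["Paris", "Rome", "Paris", "Oslo"],
    "country_code": ["EU", "EU", "EU", "EU"],  # constant column
    "notes": [None, None, None, None],         # entirely missing column
    "price": [10.5, 12.0, None, 9.9],
}
cleaned = drop_uninformative_columns(table)
print(list(cleaned))
```

Only the columns that vary and are mostly populated survive, which is the general intent described in the post above.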
SquashingScaler: Robust numerical preprocessing for neural networks The following example illustrates the use of the SquashingScaler, a transformer that can rescale and squash numerical features to a range that works well with neural networks and perhaps also other...

The SquashingScaler has been added: it robustly rescales and smoothly clips numerical columns, enabling more robust handling of numerical columns with neural networks.

skrub-data.org/dev/auto_exa...

24.07.2025 15:55 - 👍 0    🔁 1    💬 1    📌 0
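
The idea of "robust rescaling followed by smooth clipping" can be sketched with the standard library alone. The function `squash` below is hypothetical: centering on the median, scaling by the interquartile range, and squashing with z / sqrt(1 + (z / B)^2) are assumptions made for illustration, not a copy of skrub's implementation. See the SquashingScaler documentation for the exact definition.

```python
import math
import statistics


def squash(values, max_absolute_value=3.0):
    """Robustly rescale `values`, then smoothly clip them into (-B, B)."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    scale = (q3 - q1) or 1.0  # guard against a zero interquartile range
    b = max_absolute_value
    squashed = []
    for v in values:
        z = (v - median) / scale                          # robust rescaling
        squashed.append(z / math.sqrt(1 + (z / b) ** 2))  # soft clipping
    return squashed


# One extreme outlier among otherwise small values.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]
squashed = squash(data)
print([round(s, 3) for s in squashed])
```

The outlier ends up just below the bound instead of dominating the scale, while the ordering of the values is preserved.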
Hands-On with Column Selection and Transformers In previous examples, we saw how skrub provides powerful abstractions like TableVectorizer and tabular_learner() to create pipelines. In this new example, we show how to create more flexible pipeli...

🛠️ selectors, ApplyToCols and ApplyToFrame are now available, providing flexible utilities for selecting the columns to which a transformer should be applied.

skrub-data.org/dev/auto_exa...

24.07.2025 15:55 - 👍 0    🔁 1    💬 1    📌 0

📊 The TableReport has been improved with many new features: series are now supported directly, and plot generation can be skipped when the number of columns in the dataframe exceeds a user-defined threshold. Columns with high cardinality and sorted columns are now highlighted.

24.07.2025 15:55 - 👍 0    🔁 1    💬 1    📌 0
Skrub DataOps: fit, tune, and validate arbitrary data wrangling What are Skrub DataOps, and why do we need them?: Skrub provides an easy way to build complex, flexible machine learning pipelines. There are several needs that are not easily addressed with standa...

Get started with DataOps in the user guide:

skrub-data.org/dev/userguid...

24.07.2025 15:55 - 👍 0    🔁 1    💬 1    📌 0

Form complex DataOps plans to train and tune machine learning models, then export the plans as learners, standalone objects that can be used on new data.

Tune hyperparameters where they're defined, and explore the resulting space with a parallel coordinate plot.

24.07.2025 15:55 - 👍 0    🔁 1    💬 1    📌 0

🌟 Major feature! Skrub DataOps are a powerful new way of combining dataframe transformations over multiple tables with machine learning pipelines.

24.07.2025 15:55 - 👍 1    🔁 1    💬 1    📌 0
Release history Release 0.6.0: Highlights: Major feature! Skrub DataOps are a powerful new way of combining dataframe transformations over multiple tables, and machine learning pipelines. DataOps can be combined t...

📖 Keep reading for the highlights, or check out the full changelog here: skrub-data.org/stable/CHANG...

24.07.2025 15:55 - 👍 0    🔁 1    💬 1    📌 0
User Guide Skrub is a library that eases machine learning with dataframes. Starting from rich, complex data stored in one or several dataframes, it helps perform the data wrangling nec...

📚 On top of that, we revamped most of the user guide, documentation, and API reference to make it easier to learn how to use the features of Skrub.

skrub-data.org/dev/document...

24.07.2025 15:55 - 👍 0    🔁 1    💬 1    📌 0

⚡ Release 0.6.0 is now out! ⚡

🚀 Major update! Skrub DataOps, various improvements for the TableReport, new tools for applying transformers to the columns, and a new robust transformer for numerical features are only some of the features included in this release.

24.07.2025 15:55 - 👍 5    🔁 3    💬 1    📌 3
to_datetime

skrub-data.org/dev/referenc...

19.06.2025 12:45 - 👍 1    🔁 0    💬 0    📌 0
DatetimeEncoder Gallery examples: Encoding: from a dataframe to a numerical matrix for machine learning Handling datetime features with the DatetimeEncoder

skrub-data.org/dev/referenc...

19.06.2025 12:45 - 👍 0    🔁 0    💬 1    📌 0

🔄 Finally, the DatetimeEncoder can also add periodic features: trigonometric (or circular) features and B-spline features can be generated directly by setting the corresponding parameter. 4/4

Docs below!

19.06.2025 12:45 - 👍 0    🔁 0    💬 1    📌 0
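
To illustrate why periodic features help, the following standard-library sketch encodes the hour of the day as a point on the unit circle. The helper names `circular_features` and `distance` are hypothetical and only show the general trigonometric technique; they are not the DatetimeEncoder's implementation.

```python
import math
from datetime import datetime


def circular_features(value, period):
    """Map a periodic quantity to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


late = circular_features(datetime(2025, 6, 19, 23, 0).hour, period=24)
early = circular_features(datetime(2025, 6, 20, 1, 0).hour, period=24)
noon = circular_features(datetime(2025, 6, 19, 12, 0).hour, period=24)

# 23:00 and 01:00 are two hours apart and end up close together,
# whereas the raw hour values 23 and 1 differ by 22.
print(distance(late, early), distance(late, noon))
```

With the raw hour as a single number, a model would see 23:00 and 01:00 as far apart; the circular encoding removes that artificial jump at midnight.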

🔢 For more extensive feature engineering, the DatetimeEncoder parses datetimes, then converts each datetime part into a numerical column (hours, minutes, seconds, days etc.). Additional features such as weekdays and time since epoch can also be added. 3/

19.06.2025 12:45 - 👍 1    🔁 0    💬 1    📌 0
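
As a small illustration of what "converting each datetime part into a numerical column" means, the sketch below splits one timestamp into numbers using only the standard library. The function `datetime_features` and the feature names it returns are hypothetical; the columns actually produced by the DatetimeEncoder may be named and selected differently.

```python
from datetime import datetime, timezone


def datetime_features(dt):
    """Split a timezone-aware datetime into numerical features."""
    return {
        "year": dt.year,
        "month": dt.month,
        "day": dt.day,
        "hour": dt.hour,
        "minute": dt.minute,
        "weekday": dt.isoweekday(),       # Monday=1 ... Sunday=7
        "total_seconds": dt.timestamp(),  # seconds since the Unix epoch
    }


features = datetime_features(datetime(2025, 6, 19, 12, 45, tzinfo=timezone.utc))
print(features)
```

Each entry of the dictionary would correspond to one numerical column once the function is applied to every row of a datetime column.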

⏱️ If you need to convert columns from strings to datetimes, to_datetime() does that for you. It's also available as a scikit-learn compatible transformer, ToDatetime(). Both objects parse most common formats automatically, but can accept a specific time format if needed. 2/

19.06.2025 12:45 - 👍 0    🔁 0    💬 1    📌 0
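
The difference between automatic parsing and an explicit format can be sketched with the standard library. The helper `parse_datetimes` is hypothetical and far simpler than skrub's to_datetime(): as an assumption for this example, it only understands ISO 8601 strings unless a `fmt` string is supplied.

```python
from datetime import datetime


def parse_datetimes(strings, fmt=None):
    """Parse strings to datetimes, using `fmt` if given, else ISO 8601."""
    if fmt is not None:
        # An explicit format removes any ambiguity (e.g. day-first dates).
        return [datetime.strptime(s, fmt) for s in strings]
    return [datetime.fromisoformat(s) for s in strings]


iso = parse_datetimes(["2025-06-19 12:45:00", "2025-07-24 15:55:00"])
custom = parse_datetimes(["19.06.2025 12:45"], fmt="%d.%m.%Y %H:%M")
print(iso[0], custom[0])
```

Both calls yield the same datetime for the first entry, even though the input strings are written in different conventions.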

📅 The skrub API includes various functions and objects that help deal with datetime strings. 1/

19.06.2025 12:45 - 👍 3    🔁 1    💬 1    📌 1
Release history Release 0.5.4: Maintenance: Make skrub compatible with scikit-learn 1.7. #1434 by Vincent Maladiere. Release 0.5.3: Changes: The SimpleCleaner has been renamed to Cleaner. Use of the name SimpleCle...

🚀⚡ Release: 0.5.4:
Maintenance release!
This release makes skrub compatible with scikit-learn 1.7.

Changelog:
skrub-data.org/stable/CHANG...

07.06.2025 16:06 - 👍 1    🔁 1    💬 0    📌 0
Tuning pipelines Our machine-learning pipeline typically contains some values or choices which may influence its prediction performance, such as hyperparameters (e.g. the regularization parameter alpha of a RidgeCl...

In the dev docs you will find examples on how to use expressions, like this:
skrub-data.org/dev/auto_exa...

04.06.2025 12:46 - 👍 2    🔁 1    💬 0    📌 0

⚠️ As a disclaimer, expressions are still under development and things may change. However, if you're interested in learning more or testing them out, you can do so by checking the dev docs and examples, or by cloning the main branch of the skrub repo.

04.06.2025 12:46 - 👍 1    🔁 1    💬 1    📌 0

Finally, results can be shown with a parallel coordinate plot to find out the impact of different hyperparameters on the prediction task.

04.06.2025 12:46 - 👍 2    🔁 1    💬 1    📌 0

Once you're happy with the parameter grid, it's possible to either cross-validate it with default values, or run a full randomized or grid search on the parameter grid.

04.06.2025 12:46 - 👍 1    🔁 1    💬 1    📌 0

Even better, choices can be nested: an estimator in a choose_from can be defined with its own set of choices, which are then expanded by the library.

04.06.2025 12:46 - 👍 1    🔁 1    💬 1    📌 0
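
What "expanding nested choices" means can be sketched in plain Python. The `Choice` class and the `expand` function below are hypothetical stand-ins written for this illustration; they are not skrub's choose_from or its internals, only a toy model of how a choice between estimators, each with its own choices, unfolds into concrete configurations.

```python
from itertools import product


class Choice:
    """Hypothetical placeholder meaning 'pick one of these options'."""

    def __init__(self, *options):
        self.options = options


def expand(spec):
    """Return every concrete configuration described by `spec`."""
    if isinstance(spec, Choice):
        # A choice expands to the union of its (possibly nested) options.
        return [cfg for option in spec.options for cfg in expand(option)]
    if isinstance(spec, dict):
        keys = list(spec)
        expanded_values = [expand(spec[key]) for key in keys]
        # Combine every expanded value of every key.
        return [dict(zip(keys, combo)) for combo in product(*expanded_values)]
    return [spec]  # a plain value is already concrete


grid = Choice(
    {"model": "ridge", "alpha": Choice(0.1, 1.0, 10.0)},
    {"model": "forest", "n_estimators": Choice(50, 100), "max_depth": Choice(3, None)},
)
configs = expand(grid)
print(len(configs))
for cfg in configs:
    print(cfg)
```

Each model only receives the parameters that apply to it, so the ridge configurations never carry a `max_depth` and the forest ones never carry an `alpha`.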

With skrub expressions, it will be possible to build complex hyperparameter grids by composing "choose_" functions: choose from a list of values or estimators, generate a linear or logarithmic distribution of integers or floats, or select boolean flags.

04.06.2025 12:46 - 👍 1    🔁 1    💬 1    📌 0

In scikit-learn, parameter grids are often built by setting all required parameters in a dictionary, then passing the dictionary to GridSearchCV or RandomizedSearchCV. This process adds a lot of redundant code and may lead to missing some configurations.

04.06.2025 12:46 - 👍 2    🔁 1    💬 1    📌 0

👀 This week's post will be another sneak peek into skrub expressions, an upcoming feature that will ease the preparation and execution of machine learning pipelines on dataframes.

This time we will focus on how expressions can simplify the construction of complex hyperparameter grids.

04.06.2025 12:46 - 👍 4    🔁 1    💬 1    📌 1
TextEncoder Gallery examples: Various string encoders: a sentiment analysis example

Docs: skrub-data.org/stable/refer...

28.05.2025 08:43 - 👍 2    🔁 1    💬 0    📌 0

🐌 Just remember that language models are expensive to run. In the worst case, there's always the skrub StringEncoder 😉

28.05.2025 08:43 - 👍 1    🔁 1    💬 1    📌 0

@skrub-data is following 20 prominent accounts