Yeah, shredding is a very clever optimization
27.02.2026 14:57
Here is a new blog post about Parquet Variant, including use cases and shredding examples
parquet.apache.org/blog/2026/02...
The Parquet sync today raised the question of whether anyone has practical experience comparing FastLanes encoding with "classic" bit packing (without the unified shuffled layouts). If you have, I would love to hear about your experience.
25.02.2026 19:17
I suggest getting comfortable with rm -rf every few days -- it works wonders for me :)
25.02.2026 19:17
Simply applying basic linting rules (like don't compress pages where it doesn't help) reduces Parquet file sizes by 5% and decreases decode time by 20%.
@xiangpeng.systems shows how in his latest blog
blog.xiangpeng.systems/posts/parque...
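The rule quoted above (don't compress pages where it doesn't help) can be sketched roughly as follows. This is only a minimal illustration, not the method from the linked blog or the behavior of any Parquet writer: the names `encode_page` and `decode_page`, the `min_saving` threshold, and the choice of zlib are all assumptions made for the example.

```python
import random
import zlib


def encode_page(page: bytes, min_saving: float = 0.10) -> tuple[str, bytes]:
    """Return (codec, payload), keeping the compressed form only if it pays off.

    `min_saving` is a hypothetical threshold: the fraction of bytes that
    compression must save before paying the decode cost is worthwhile.
    """
    if not page:
        return "uncompressed", page
    compressed = zlib.compress(page)
    saving = 1.0 - len(compressed) / len(page)
    if saving >= min_saving:
        return "zlib", compressed
    # Compression did not help enough: store the raw bytes so readers
    # can skip decompression entirely for this page.
    return "uncompressed", page


def decode_page(codec: str, payload: bytes) -> bytes:
    """Inverse of encode_page."""
    return zlib.decompress(payload) if codec == "zlib" else payload


# Highly repetitive page: compresses well.
repetitive = b"abcd" * 1024
# High-entropy page (seeded pseudo-random bytes): does not compress.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(4096))

print(encode_page(repetitive)[0])
print(encode_page(noisy)[0])
```

The per-page decision means a single column chunk can mix compressed and uncompressed pages, so only the pages that actually shrink pay for decompression at read time.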
Great inaugural post about the geospatial types on the Parquet blog.
Thank you Jia Yu, Dewey Dunnington, Kristin Cowalcijk, Feng Zhang.
More posts coming!
parquet.apache.org/blog/2026/02...
Just a few more weeks until the Stockholm DataFusion meetup: luma.com/ctqtiqap
14.02.2026 12:36
Apache Parquet recently added native support for Geospatial. This post explains what that means and why it is important: parquet.apache.org/blog/2026/02...
13.02.2026 13:56
kellabyte.substack.com/p/building-a... -- Exactly the kind of thing that shows the power of DataFusion. You can build the database and not (re)build the core query engine
12.02.2026 11:13
You can use Apache Parquet for Vector Search with embedded indexes:
> We don't change the file format; we just tune it.
Xiangpeng Hao explains how in blog.xiangpeng.systems/posts/vector...
Different techniques are needed to max out modern NVMe SSDs.
@westonpace.bsky.social LanceDB blog is so good if you want the industrial version: lancedb.com/blog/one-mil...
Viktor Leis's LeanStore paper is great if you want the academic version: vldb.org/pvldb/vol16/...
A somewhat academic talk about the AI use cases driving changes in Apache Parquet and new formats: "Column Storage for the AI Era"
Recording: youtu.be/k9uhw7yqPsQ
Slides: docs.google.com/presentation...
What I really need is to focus more on reviews / getting stuff merged as now the coding is even easier
02.02.2026 14:12
Optimized implementation of SQL CASE expressions in column stores requires careful engineering. The latest Apache DataFusion blog from Pepijn Van Eeckhoudt and Raz Luvaton explains how it works
datafusion.apache.org/blog/2026/02...
One downside of tools like Codex is that they enable even more "side quests" -- I was already pretty bad at focusing, and now the ability to write the equivalent of a ticket and have some code to review in 10 minutes makes the problem far worse.
30.01.2026 11:29
DataFusion 52 Release Blog is Published: datafusion.apache.org/blog/2026/01...
28.01.2026 20:17
I love it when I see a whole pile of commits I didn't review go to DataFusion main
github.com/apache/dataf...
1/5 ⚡️ Why Rust Will Help You Deliver Better Low-latency Systems and Happier Developers with Andrew Lamb
bit.ly/47kgmwU
@andrewlamb1111.bsky.social
#RustLang #LowLatency
I have been working on a talk about the future of table formats, specifically what is needed for AI workloads, and I found Weston's blogs on LanceDB well written and super helpful: lancedb.com/blog/designi...
26.01.2026 11:51
We're excited to share the complete list of speakers joining us at TokioConf 2026 covering performance tricks, architecture patterns, and more.
See all our speakers: www.tokioconf.com/speakers
(Schedule coming soon)
Tickets are on sale: www.eventbrite.com/e/tokioconf-...
Apache DataFusion meetup in San Francisco: luma.com/p7r6fp2z Thursday, February 19. We are looking for more speakers and attendees!
17.01.2026 11:17
DataFusion blog from Geoffrey Claude explains how to extend DataFusion to support:
-- Postgres style operators
SELECT payload->'user'->>'id'
FROM logs;
-- Statistical sampling
SELECT * FROM sensor_data
TABLESAMPLE BERNOULLI(10 PERCENT);
datafusion.apache.org/blog/2026/01...
Since you don't seem to want to cite your own work, I will do it for you: db.cs.cmu.edu/mmap-cidr2022/
12.01.2026 13:29
Stoked to be attending North East Database Day on Friday. It is a great mini conference and highlights some of the great research work going on in this area: nedbday.github.io/2026/
12.01.2026 13:28
Great paper about pruning from Snowflake: arxiv.org/pdf/2504.11540
The LIMIT pruning they describe is 🤯 (so clever once you get it)
We have implemented almost all of the techniques in Apache DataFusion, FWIW
Come meet fellow Apache DataFusion users and committers at the Stockholm meetup March 5, luma.com/ctqtiqap
07.01.2026 21:05
Latest Apache DataFusion blog: more efficient plans and how to efficiently contribute: datafusion.apache.org/blog/output/...
20.12.2025 12:37
Qiwei Huang explains how we use Late Materialization (LM) in the Apache Rust Parquet reader to accelerate filtering. LM can describe several techniques, but this is a core one (also applies to joins, Top-K, etc)
arrow.apache.org/blog/2025/12...
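As a rough illustration of the late materialization idea for filtering mentioned above: decode the filter column first, evaluate the predicate, and only then decode the other columns at the rows that survived. This is a toy sketch, not the Rust Parquet reader's implementation; `EncodedColumn`, `late_materialize`, and the `decoded` counter are hypothetical names invented for this example.

```python
from typing import Callable


class EncodedColumn:
    """Toy stand-in for an encoded column chunk; counts decoded values."""

    def __init__(self, values: list):
        self._values = values
        self.decoded = 0  # number of values materialized so far

    def decode_all(self) -> list:
        self.decoded += len(self._values)
        return list(self._values)

    def decode_rows(self, rows: list[int]) -> list:
        self.decoded += len(rows)
        return [self._values[i] for i in rows]


def late_materialize(
    columns: dict[str, EncodedColumn],
    filter_col: str,
    predicate: Callable[[object], bool],
    projection: list[str],
) -> dict[str, list]:
    # Step 1: decode only the filter column and evaluate the predicate.
    filter_values = columns[filter_col].decode_all()
    selected = [i for i, v in enumerate(filter_values) if predicate(v)]
    # Step 2: decode the remaining columns only at the selected rows.
    out: dict[str, list] = {}
    for name in projection:
        if name == filter_col:
            out[name] = [filter_values[i] for i in selected]
        else:
            out[name] = columns[name].decode_rows(selected)
    return out


table = {
    "status": EncodedColumn([200, 500, 200, 404, 500]),
    "url": EncodedColumn(["/a", "/b", "/c", "/d", "/e"]),
}
result = late_materialize(table, "status", lambda s: s == 500, ["status", "url"])
print(result)
print(table["url"].decoded)
```

In this toy run only the two matching `url` values are decoded instead of all five; the more selective the predicate, the more decoding work is avoided.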
Thanks to funnel.io, we are hosting a DataFusion meetup in Stockholm
Date: Thursday March 5, 2026: 17:30 - 20:00
Signup: luma.com/ctqtiqap
There is some crazy (good) activity on the Apache Parquet mailing list for new encodings. A sample: PFOR, FSST, ALP, Strings and Cascaded Encodings. 🤯 Huge kudos to Arnav Balyan, Prateek Gaur, and Micah Kornfield for driving this.
lists.apache.org/list.html?de...