We are doing another DataFusion meetup in Boston Wednesday Nov 12, 2025: lu.ma/w9pw5rce
30.07.2025 21:41
@andrewlamb1111.bsky.social
In my opinion, the only actual criticism of Parquet that cannot be solved with more software engineering (rather than changing the format) is the lack of newer encodings.
FastLanes, FSST and the BtrBlocks-style cascaded encodings are great candidates. Now we need to get them adopted into Parquet
"EDB claimed the new engine, which pushes queries to open source @apachedatafusion.bsky.social , returned queries 30x faster than standard Postgres while tiering offloads cold transactional data to storage is up 18x more cost-efficient."
www.theregister.com/2025/06/20/e...
@apachedatafusion.bsky.social 48.0.0 release. Spark Compatible functions, ORDER BY ALL, FFI for aggregates and window functions: datafusion.apache.org/blog/2025/07...
16.07.2025 12:46
It is a common misconception that Apache Parquet files are restricted to basic statistics. Footer metadata and offset-based addressing permit user-defined index structures today.
@apachedatafusion.bsky.social blog from Qi Zhi, Jigao Luo and myself explains how: datafusion.apache.org/blog/2025/07...
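To make the idea of footer metadata plus offset-based addressing concrete, here is a minimal sketch in Python. It is an assumption-laden toy, not the real Parquet format and not the code from the blog: the `TOY1` magic, the JSON footer, and every function name below are hypothetical stand-ins. It only shows the general shape: index bytes sit in the file body, the footer records where they are, and a reader that does not know about the index can still read the data.

```python
import io
import json
import struct

MAGIC = b"TOY1"  # stand-in for Parquet's "PAR1"; this toy layout is NOT real Parquet


def write_file(buf, rows):
    """Write data, then a custom index, then a footer recording both offsets."""
    buf.write(MAGIC)
    data_offset = buf.tell()
    data = "\n".join(rows).encode()
    buf.write(data)
    # Custom index: the sorted distinct values, stored in the file body.
    index_offset = buf.tell()
    index = json.dumps(sorted(set(rows))).encode()
    buf.write(index)
    # Footer: key/value metadata holding (offset, length) pairs.
    footer = json.dumps({
        "data": [data_offset, len(data)],
        "distinct_index": [index_offset, len(index)],
    }).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # 4-byte little-endian footer length
    buf.write(MAGIC)


def read_footer(buf):
    """Locate the footer by reading the trailing length and magic."""
    buf.seek(-8, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", buf.read(4))
    if buf.read(4) != MAGIC:
        raise ValueError("bad trailing magic")
    buf.seek(-(8 + footer_len), io.SEEK_END)
    return json.loads(buf.read(footer_len))


def read_span(buf, span):
    """Read the bytes addressed by an (offset, length) pair."""
    offset, length = span
    buf.seek(offset)
    return buf.read(length)


def read_rows(buf):
    """Index-unaware reader: uses only the 'data' entry and ignores other keys."""
    return read_span(buf, read_footer(buf)["data"]).decode().split("\n")


def read_distinct_index(buf):
    """Index-aware reader: returns the distinct values, or None if absent."""
    span = read_footer(buf).get("distinct_index")
    return None if span is None else json.loads(read_span(buf, span))


buf = io.BytesIO()
write_file(buf, ["a", "b", "a", "c"])
print(read_rows(buf))
print(read_distinct_index(buf))
```

An index-aware reader can consult `read_distinct_index` to skip a file that cannot contain a value, while `read_rows` keeps working unchanged because it never looks at the extra footer key.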
Belated DataFusion 47.0.0 release blog: datafusion.apache.org/blog/2025/07...
11.07.2025 11:10
Apache Iceberg compaction in Rust via github.com/nimtable/ice.... (based on @apachedatafusion.bsky.social ). Thanks to RisingWave CEO Yingjun Wu for the shout out
10.07.2025 22:02
Apache Arrow Summit 25 is happening! Join us in person on October 2nd, in Paris (hosted by @pydataparis.bsky.social ). The Call For Proposal is open, submit your talks before July 26th:
sessionize.com/arrow-summit...
Sweet VLDB paper from TUM (Mateusz Gienieczko / github.com/v0ldek) proposing extending Apache Parquet using user defined encodings (via WASM).
Favorite image shows the ease of integrating into @apachedatafusion.bsky.social
gienieczko.com/anyblox-paper
I am speaking at the #ApacheIceberg NYC Meetup on July 10th about Variant in #ApacheParquet, which enables more efficient processing of semi-structured data such as that found in JSON.
lu.ma/95a5qys1
Quite a list of contributors already to the Rust Apache Parquet implementation of Variant (support for semi structured data). I was making some slides to explain what Variant is and made up a list I wanted to share. The feature will be amazing
github.com/apache/arrow...
I publicly apologize for snapping today with @andypavlo.bsky.social @pateljm.bsky.social. "You need to have a push based scheduler to do ..."
TUM / DuckDB (CWI) created group-think in Databases where push schedulers are required, ClickHouse, Spark, DataFusion, etc. notwithstanding 🤦
New blog post about cooperative scheduling using tokio and Rust async, and how cancellation works in
@apachedatafusion.bsky.social
datafusion.apache.org/blog/2025/06...
So many of those systems are not open source. How much "BenchMaxxxing" is going on beyond "prying" eyes?
We can at least **see** optimizations of dubious applicability like github.com/duckdb/duckd... in open source systems
(Thank you to @danielheres.bsky.social for introducing me to BenchMaxxxing)
Two new (reposted) blogs about Optimizing SQL and DataFrames: datafusion.apache.org/blog/2025/06...
datafusion.apache.org/blog/2025/06...
This is so cool -- an example of embedding a special index (a DistinctValues index no less) inside a Parquet file: github.com/apache/dataf... (coming in DataFusion 49.0.0) @apachedatafusion.bsky.social
20.06.2025 14:19
Here is another product built on @apachedatafusion.bsky.social :
blog.cloudflare.com/logexplorer-...
Intro to @apachedatafusion.bsky.social : Technology, Community and Not Quite Enough Time: www.youtube.com/watch?v=3per...
15.06.2025 09:31
Hot off the presses: I recorded my talk from this week about Accelerating Query Performance of Apache Parquet using Specialized Indexes: youtu.be/74YsJT1-Rdk
11.06.2025 14:02
Reminder: San Francisco @ApacheDataFusio meetup tomorrow: lu.ma/uuxd443e
09.06.2025 03:03
Generating TPCH SF 10 (almost SF-100) in Apache Parquet format on a 4GB Raspberry Pi
whateverforever.computer/blog/tpchgen...
Also found it could fry an 🍳
Given the scripts are all open source, maybe someone could run that experiment
Another thing is that the IO characteristics for local disk vs object store are quite different, and DataFusion's parquet reader has several optimizations for object store
We are organizing another @apachedatafusion.bsky.social meetup in San Francisco that coincides with Data and AI. Come hear about Sail, Embucket and Apache Parquet filtering. Signup: lu.ma/uuxd443e
23.05.2025 10:28
BTW the delta between DuckDB and DataFusion is almost entirely explained by filter pushdown and late materialization, which we are working on: github.com/apache/dataf...
22.05.2025 13:04
ClickBench keeps me convinced that Parquet can be quite fast. There is only a 2.3x performance difference between @duckdb 's own format and unoptimized Parquet: tinyurl.com/5aexvsfw
I am surprised that the (closed source) Umbra only reports 3.3x faster than DuckDB on parquet
New DataFusion is looking pretty sweet performance wise. Thanks to github.com/MrPowers for running the numbers
More details github.com/apache/dataf...
DataFusion -> Open Telemetry integration, courtesy of Geoffrey Claude at @datadoghq.com github.com/datafusion-c...
05.05.2025 19:51
InfluxData Staff Engineer @andrewlamb1111.bsky.social breaks down how @apachedatafusion.bsky.social lets you build database prototypes fast and innovate faster.
Get a behind-the-scenes look at modern database stacks. bit.ly/4jypZLB #InfluxDB
@apachedatafusion.bsky.social meetup June 9, in San Francisco.
If you are in town for Data and AI summit, perhaps you can join us
lu.ma/uuxd443e
Huge thanks to @krisajenkins.bsky.social for the wonderful discussion about @apachedatafusion.bsky.social on Developer Voices: pod.link/developer-vo.... Kris's smooth voice and wide ranging knowledge made it most pleasurable.
28.04.2025 11:31