@andrewlamb1111.bsky.social

592 Followers  |  21 Following  |  102 Posts  |  Joined: 17.11.2024  |  1.5009

Latest posts by andrewlamb1111.bsky.social on Bluesky

Boston Apache DataFusion Meetup · Luma Join us for an evening of talks, panel discussion, and community discussion about Apache DataFusion and its growing role in modern data infrastructure. This…

We are doing another DataFusion meetup in Boston Wednesday Nov 12, 2025: lu.ma/w9pw5rce

30.07.2025 21:41 - 👍 3    🔁 0    💬 0    📌 0

In my opinion, the only real criticism of Parquet that cannot be solved with more software engineering (and instead requires changing the format) is the lack of newer encodings.

FastLanes, FSST and the BtrBlocks-style cascaded encodings are great candidates. Now we need to get them adopted into Parquet

30.07.2025 12:19 - 👍 3    🔁 0    💬 0    📌 0

"EDB claimed the new engine, which pushes queries to open source @apachedatafusion.bsky.social , returned queries 30x faster than standard Postgres while tiering offloads cold transactional data to storage is up 18x more cost-efficient."
www.theregister.com/2025/06/20/e...

30.07.2025 12:12 - 👍 4    🔁 0    💬 0    📌 0

@apachedatafusion.bsky.social 48.0.0 release. Spark Compatible functions, ORDER BY ALL, FFI for aggregates and window functions: datafusion.apache.org/blog/2025/07...

16.07.2025 12:46 - 👍 3    🔁 0    💬 0    📌 0

It is a common misconception that Apache Parquet files are restricted to basic statistics. Footer metadata and offset-based addressing permit user-defined index structures today.

A new @apachedatafusion.bsky.social blog post from Qi Zhi, Jigao Luo and myself explains how: datafusion.apache.org/blog/2025/07...

14.07.2025 13:30 - 👍 25    🔁 4    💬 0    📌 0
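
To make the "footer metadata plus offset-based addressing" idea concrete, here is a minimal sketch in Python. It is not real Parquet: it is a toy file layout that only mimics the general shape (data first, then a footer, then a 4-byte footer length and magic bytes at the end). The `TOY1` magic, the JSON footer, and the `my_index` key are all assumptions made up for illustration. The point is that a reader that only follows offsets recorded in the footer never looks at the extra index bytes, while an index-aware reader can find them through a footer key.

```python
import io
import json
import struct
from typing import Optional

MAGIC = b"TOY1"  # hypothetical magic bytes for this toy layout (not Parquet's)


def write_file(data: bytes, index: bytes) -> bytes:
    """Write: magic, data, custom index, JSON footer, footer length, magic."""
    buf = io.BytesIO()
    buf.write(MAGIC)
    data_offset = buf.tell()
    buf.write(data)
    index_offset = buf.tell()
    buf.write(index)  # extra bytes that ordinary readers never address
    footer = json.dumps({
        "data": [data_offset, len(data)],
        # hypothetical key/value entry pointing at the custom index
        "key_value_metadata": {"my_index": [index_offset, len(index)]},
    }).encode()
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # little-endian footer length
    buf.write(MAGIC)
    return buf.getvalue()


def read_footer(blob: bytes) -> dict:
    """Locate the footer from the end of the file, as footer-based formats do."""
    assert blob[-4:] == MAGIC
    (footer_len,) = struct.unpack("<I", blob[-8:-4])
    return json.loads(blob[-8 - footer_len:-8])


def read_data(blob: bytes) -> bytes:
    """An index-unaware reader: follows only the data offset."""
    offset, length = read_footer(blob)["data"]
    return blob[offset:offset + length]


def read_index(blob: bytes) -> Optional[bytes]:
    """An index-aware reader: looks up the custom key, if present."""
    entry = read_footer(blob)["key_value_metadata"].get("my_index")
    if entry is None:
        return None
    offset, length = entry
    return blob[offset:offset + length]
```

For example, `read_data(write_file(b"rows", b"distinct:a,b"))` returns the data bytes without touching the index, and `read_index` on the same bytes returns the index blob.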

Belated DataFusion 47.0.0 release blog: datafusion.apache.org/blog/2025/07...

11.07.2025 11:10 - 👍 6    🔁 0    💬 0    📌 0

Apache Iceberg compaction in Rust via github.com/nimtable/ice.... (based on @apachedatafusion.bsky.social ). Thanks to RisingWave CEO Yingjun Wu for the shout-out

10.07.2025 22:02 - 👍 5    🔁 1    💬 0    📌 0

Arrow Summit 2025: Call for Speakers

Apache Arrow Summit 25 is happening! Join us in person on October 2nd in Paris (hosted by @pydataparis.bsky.social ). The Call for Proposals is open; submit your talks before July 26th:
sessionize.com/arrow-summit...

07.07.2025 07:37 - 👍 9    🔁 8    💬 0    📌 1

Sweet VLDB paper from TUM (Mateusz Gienieczko / github.com/v0ldek) proposing extending Apache Parquet using user-defined encodings (via WASM).
Favorite image shows the ease of integrating into @apachedatafusion.bsky.social
gienieczko.com/anyblox-paper

08.07.2025 00:24 - 👍 5    🔁 2    💬 1    📌 0

NYC Apache Iceberg™ Community Meetup · Luma 🧊 Apache Iceberg Meetup is coming to the Big Apple! 🗽 Join us in NYC for an afternoon of ideas, innovation, and Iceberg. Whether you're building lakehouses…

I am speaking at the #ApacheIceberg NYC Meetup on July 10th about Variant in #ApacheParquet, which enables more efficient processing of semi-structured data such as that found in JSON.

lu.ma/95a5qys1

03.07.2025 09:43 - 👍 4    🔁 0    💬 0    📌 0

Quite a list of contributors already to the Rust Apache Parquet implementation of Variant (support for semi-structured data). I was making some slides to explain what Variant is and put together a list I wanted to share. The feature will be amazing
github.com/apache/arrow...

01.07.2025 15:29 - 👍 2    🔁 0    💬 0    📌 0

I publicly apologize for snapping today with @andypavlo.bsky.social @pateljm.bsky.social. "You need to have a push based scheduler to do ..."

TUM / DuckDB (CWI) created group-think in databases where push schedulers are required, ClickHouse, Spark, DataFusion, etc. notwithstanding 🤦

30.06.2025 22:04 - 👍 0    🔁 0    💬 0    📌 0

New blog post about cooperative scheduling using tokio and Rust async, and how cancellation works in
@apachedatafusion.bsky.social
datafusion.apache.org/blog/2025/06...

30.06.2025 11:27 - 👍 5    🔁 0    💬 0    📌 0

SUM(x + C) rewrite by Mytherin · Pull Request #15017 · duckdb/duckdb Suppose there is a query that contains the aggregate SUM(x + 1), this aggregate can be decomposed into SUM(x) + COUNT(x). In particular if there are multiple of such clauses, e.g. SUM(x + 1), SUM(x...

So many of those systems are not open source. How much "BenchMaxxxing" is going on beyond "prying" eyes?

We can at least **see** optimizations of dubious applicability like github.com/duckdb/duckd... in open source systems

(Thank you to @danielheres.bsky.social for introducing me to BenchMaxxxing)

25.06.2025 14:39 - 👍 3    🔁 0    💬 1    📌 0
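
For readers unfamiliar with the rewrite mentioned in the linked pull request, the identity is SUM(x + C) = SUM(x) + C * COUNT(x), where both aggregates skip NULLs. The following is a minimal Python sketch of that identity only, not DuckDB's implementation; the function names are hypothetical, `None` stands in for SQL NULL, and fixed-width integer overflow is ignored because Python integers are unbounded.

```python
from typing import Optional, Sequence


def sum_direct(values: Sequence[Optional[int]], c: int) -> Optional[int]:
    """Evaluate SUM(x + c) row by row; NULL (None) inputs are skipped."""
    shifted = [v + c for v in values if v is not None]
    # SQL SUM over zero non-NULL rows yields NULL
    return sum(shifted) if shifted else None


def sum_rewritten(values: Sequence[Optional[int]], c: int) -> Optional[int]:
    """Evaluate the rewritten form SUM(x) + c * COUNT(x)."""
    non_null = [v for v in values if v is not None]
    count_x = len(non_null)  # COUNT(x) counts only non-NULL values
    if count_x == 0:
        return None
    return sum(non_null) + c * count_x
```

The appeal of the rewrite is that several aggregates such as SUM(x + 1) and SUM(x + 2) can share a single SUM(x) and COUNT(x); whether real queries contain such patterns often enough is exactly the "dubious applicability" question raised above.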

Two new (reposted) blogs about Optimizing SQL and DataFrames: datafusion.apache.org/blog/2025/06...
datafusion.apache.org/blog/2025/06...

25.06.2025 12:29 - 👍 3    🔁 0    💬 0    📌 0

Add an example of embedding indexes inside a parquet file by zhuqi-lucas · Pull Request #16395 · apache/datafusion Which issue does this PR close? Closes Add an example of embedding indexes *inside* a parquet file #16374 Rationale for this change //! Example: embedding a "distinct values" index in ...

This is so cool -- an example of embedding a special index (a DistinctValues index no less) inside a Parquet file: github.com/apache/dataf... (coming in DataFusion 49.0.0) @apachedatafusion.bsky.social

20.06.2025 14:19 - 👍 4    🔁 1    💬 0    📌 0

Cloudflare Log Explorer is now GA, providing native observability and forensics We are happy to announce the General Availability of Cloudflare Log Explorer, a powerful product designed to bring observability and forensics capabilities directly into your Cloudflare dashboard.

Here is another product built on @apachedatafusion.bsky.social :
blog.cloudflare.com/logexplorer-...

18.06.2025 17:36 - 👍 2    🔁 0    💬 0    📌 0

Intro to @apachedatafusion.bsky.social : Technology, Community and Not Quite Enough Time: www.youtube.com/watch?v=3per...

15.06.2025 09:31 - 👍 6    🔁 4    💬 0    📌 0

YouTube video by Andrew Lamb: Accelerating Apache Parquet with metadata stores and specialized indexes using Apache DataFusion

Hot off the presses: I recorded my talk from this week about Accelerating Query Performance of Apache Parquet using Specialized Indexes: youtu.be/74YsJT1-Rdk

11.06.2025 14:02 - 👍 3    🔁 1    💬 0    📌 0

SF Apache DataFusion Meetup · Luma Join us for an evening of learning, networking, and diving into Apache DataFusion, the blazing-fast query execution framework for Rust-based data…

Reminder: San Francisco @ApacheDataFusio meetup tomorrow: lu.ma/uuxd443e

09.06.2025 03:03 - 👍 3    🔁 1    💬 0    📌 0

Generating TPCH SF 10 (almost SF-100) in Apache Parquet format on a 4GB Raspberry Pi
whateverforever.computer/blog/tpchgen...
Also found it could fry an 🍳

27.05.2025 16:36 - 👍 2    🔁 0    💬 0    📌 0

Given the scripts are all open source, maybe someone could run that experiment

Another thing is that the IO characteristics for local disk vs object store are quite different, and DataFusion's parquet reader has several optimizations for object store

23.05.2025 10:29 - 👍 1    🔁 0    💬 0    📌 0

We are organizing another @apachedatafusion.bsky.social meetup in San Francisco that coincides with Data and AI. Come hear about Sail, Embucket and Apache Parquet filtering. Signup: lu.ma/uuxd443e

23.05.2025 10:28 - 👍 4    🔁 1    💬 0    📌 0

Dynamic pruning filters from TopK state (optimize `ORDER BY LIMIT` queries) · Issue #15037 · apache/datafusion Is your feature request related to a problem or challenge? From discussion with @alamb yesterday the idea came up of optimizing queries like select * from data order by timestamp desc limit 10 for ...

BTW the delta between DuckDB and DataFusion is almost entirely explained by filter pushdown and late materialization, which we are working on: github.com/apache/dataf...

22.05.2025 13:04 - 👍 9    🔁 1    💬 0    📌 0
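
The linked issue is about using TopK state as a dynamic filter for `ORDER BY ... LIMIT` queries. As a rough illustration of the general idea (a sketch under stated assumptions, not DataFusion's implementation), the Python below keeps the current top `k` values in a min-heap and uses the smallest of them as a threshold: any batch whose maximum statistic cannot beat the threshold is skipped without being read. The batch representation `(max_value, values)` and the function name are hypothetical.

```python
import heapq
from typing import Iterable, List, Tuple


def top_k_desc(
    batches: Iterable[Tuple[int, List[int]]], k: int
) -> Tuple[List[int], int]:
    """Return the k largest values (descending) and how many batches were skipped.

    Each batch is (max_value, values), where max_value plays the role of a
    per-file or per-row-group max statistic.
    """
    if k <= 0:
        return [], 0
    heap: List[int] = []  # min-heap of the k largest values seen so far
    skipped = 0
    for batch_max, values in batches:
        # Once the heap is full, heap[0] is the dynamic threshold: a batch
        # whose max is not larger cannot change the result, so prune it.
        if len(heap) == k and batch_max <= heap[0]:
            skipped += 1
            continue
        for v in values:
            if len(heap) < k:
                heapq.heappush(heap, v)
            elif v > heap[0]:
                heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True), skipped
```

The threshold only tightens as more data is seen, which is why the filter is "dynamic": later batches are pruned more aggressively than earlier ones.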

ClickBench keeps me convinced that Parquet can be quite fast. There is only a 2.3x performance difference between @duckdb's own format and unoptimized Parquet: tinyurl.com/5aexvsfw
I am surprised that the (closed source) Umbra only reports being 3.3x faster than DuckDB on Parquet

22.05.2025 13:03 - 👍 4    🔁 0    💬 2    📌 0

New DataFusion is looking pretty sweet performance-wise. Thanks to github.com/MrPowers for running the numbers

More details github.com/apache/dataf...

16.05.2025 12:23 - 👍 4    🔁 0    💬 0    📌 0

GitHub - datafusion-contrib/datafusion-tracing: Integration of opentelemetry with the tracing crate

DataFusion -> OpenTelemetry integration, courtesy of Geoffrey Claude at @datadoghq.com github.com/datafusion-c...

05.05.2025 19:51 - 👍 5    🔁 0    💬 0    📌 0

Developer Voices Deep-dive discussions with the smartest developers we know, explaining what they're working on, how they're trying to move the industry forward, and what we can learn from them. You might find the solu...

InfluxData Staff Engineer @andrewlamb1111.bsky.social breaks down how @apachedatafusion.bsky.social lets you build database prototypes fast and innovate faster.

Get a behind-the-scenes look at modern database stacks. 🎧 bit.ly/4jypZLB #InfluxDB

05.05.2025 17:37 - 👍 3    🔁 1    💬 0    📌 0

@apachedatafusion.bsky.social meetup June 9, in San Francisco.

If you are in town for Data and AI summit, perhaps you can join us
lu.ma/uuxd443e

30.04.2025 11:38 - 👍 1    🔁 0    💬 0    📌 0

Huge thanks to @krisajenkins.bsky.social for the wonderful discussion about @apachedatafusion.bsky.social on Developer Voices: pod.link/developer-vo.... Kris's smooth voice and wide-ranging knowledge made it most pleasurable.

28.04.2025 11:31 - 👍 3    🔁 1    💬 0    📌 1

@andrewlamb1111 is following 20 prominent accounts