Andrew Lamb

@andrewlamb1111.bsky.social

Apache {DataFusion PMC}, Database Internals

698 Followers  |  22 Following  |  174 Posts  |  Joined: 17.11.2024

Posts by Andrew Lamb (@andrewlamb1111.bsky.social)

Yeah, shredding is a very clever optimization

27.02.2026 14:57 | 👍 1    🔁 0    💬 0    📌 0

Here is a new blog about Parquet Variant, including use cases and shredding examples

parquet.apache.org/blog/2026/02...

27.02.2026 14:21 | 👍 7    🔁 2    💬 1    📌 1

On today's Parquet sync, the question came up whether anyone has practical experience comparing FastLanes encoding with "classic" bit packing (without the unified shuffled layouts). If you have, I would love to hear about your experience

25.02.2026 19:17 | 👍 2    🔁 0    💬 0    📌 0

I suggest getting comfortable with rm -rf every few days -- it works wonders for me :)

25.02.2026 19:17 | 👍 3    🔁 0    💬 0    📌 0
parquet-linter: A better Parquet is Parquet itself – Xiangpeng's blog Unleash the performance potential of your Parquet files

Simply applying basic linting rules (like don't compress pages where it doesn't help) reduces Parquet file sizes by 5% and decreases decode time by 20%.

@xiangpeng.systems shows how in his latest blog
blog.xiangpeng.systems/posts/parque...

23.02.2026 15:26 | 👍 18    🔁 2    💬 0    📌 0
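
The linting rule quoted above ("don't compress pages where it doesn't help") can be illustrated with a small sketch. This is not parquet-linter's actual implementation or API: the function name `encode_page` and the 5% `min_saving` threshold are hypothetical, and `zlib` from the Python standard library stands in for whatever codec a real writer would use.

```python
import os
import zlib
from typing import Tuple


def encode_page(raw: bytes, min_saving: float = 0.05) -> Tuple[str, bytes]:
    """Return (codec, payload) for one page.

    The page is stored compressed only if compression shrinks it by at
    least `min_saving` (a hypothetical threshold, not taken from the blog).
    """
    compressed = zlib.compress(raw)
    if len(compressed) <= len(raw) * (1.0 - min_saving):
        return "zlib", compressed
    # Compression did not pay off: keep the raw bytes and skip the
    # decompression cost at read time.
    return "uncompressed", raw


repetitive_page = b"abc" * 1000   # highly compressible
random_page = os.urandom(3000)    # effectively incompressible

codec_a, payload_a = encode_page(repetitive_page)
codec_b, payload_b = encode_page(random_page)
```

With this rule the repetitive page is stored compressed, while the random page is stored as-is, so a reader does not pay for decompression that saved no space.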
Native Geospatial Types in Apache Parquet

Great inaugural post about the geospatial types on the Parquet blog.

Thank you Jia Yu, Dewey Dunnington, Kristin Cowalcijk, Feng Zhang.

More posts coming!

parquet.apache.org/blog/2026/02...

14.02.2026 00:36 | 👍 8    🔁 2    💬 0    📌 0
Stockholm Apache DataFusion Meetup · Luma Join us for an evening of talks, panel discussions, and community discussions about Apache DataFusion and its growing role in modern data infrastructure. This…

Just a few more weeks until the Stockholm DataFusion meetup: luma.com/ctqtiqap

14.02.2026 12:36 | 👍 5    🔁 0    💬 0    📌 0

📖 Apache Parquet recently added native support for Geospatial. This post explains what that means and why it is important: parquet.apache.org/blog/2026/02...

13.02.2026 13:56 | 👍 13    🔁 2    💬 0    📌 0
Building A Distributed SQL Database in 30 Days with AI My journey building HoloStore a distributed key/value store and HoloFusion a distributed SQL DB using AI using the Accord consensus protocol from Cassandra.

kellabyte.substack.com/p/building-a... -- Exactly the kind of thing that shows the power of DataFusion. You can build the database and not (re)build the core query engine

12.02.2026 11:13 | 👍 8    🔁 2    💬 0    📌 2

You can use Apache Parquet for Vector Search with embedded indexes:

> We don't change the file format; we just tune it.

Xiangpeng Hao explains how in blog.xiangpeng.systems/posts/vector...

10.02.2026 12:17 | 👍 6    🔁 1    💬 0    📌 0
The Quest for One Million IOPS: Benchmarking Storage at LanceDB Learn how LanceDB benchmarks storage and how we achieved one million disk reads per second.

Different techniques are needed to max out modern NVMe SSDs.

@westonpace.bsky.social's LanceDB blog is so good if you want the industrial version: lancedb.com/blog/one-mil...

Viktor Leis's LeanStore paper is great if you want the academic version: vldb.org/pvldb/vol16/...

07.02.2026 11:59 | 👍 13    🔁 1    💬 0    📌 0

A somewhat academic talk about the AI use cases driving changes in Apache Parquet and new formats in "Column Storage for the AI Era"

Recording: youtu.be/k9uhw7yqPsQ
Slides: docs.google.com/presentation...

03.02.2026 19:35 | 👍 8    🔁 0    💬 0    📌 0

What I really need is to focus more on reviews / getting stuff merged as now the coding is even easier 😅

02.02.2026 14:12 | 👍 2    🔁 0    💬 0    📌 0

Optimized implementation of SQL CASE expressions in column stores requires careful engineering. The latest Apache DataFusion blog from Pepijn Van Eeckhoudt and Raz Luvaton explains how it works

datafusion.apache.org/blog/2026/02...

02.02.2026 14:08 | 👍 7    🔁 0    💬 0    📌 0

One downside of tools like Codex is that they enable even more "side quests" -- I was already pretty bad at focusing, and now being able to write the equivalent of a ticket and have some code to review in 10 minutes makes the problem far worse.

30.01.2026 11:29 | 👍 7    🔁 0    💬 2    📌 0

DataFusion 52 Release Blog is Published datafusion.apache.org/blog/2026/01...

28.01.2026 20:17 | 👍 5    🔁 0    💬 0    📌 0

I love it when I see a whole pile of commits I didn't review go to DataFusion main
github.com/apache/dataf...

27.01.2026 21:55 | 👍 10    🔁 0    💬 0    📌 0
Why Rust Will Help You Deliver Better Low-latency Systems and Happier Developers Andrew Lamb, a veteran of database engine development, shares his thoughts on why Rust is the right tool for developing low-latency systems, not only from the perspective of the code's performance, bu...

1/5 ➡️ Why Rust Will Help You Deliver Better Low-latency Systems and Happier Developers with Andrew Lamb
bit.ly/47kgmwU

@andrewlamb1111.bsky.social

#RustLang #LowLatency

23.01.2026 11:37 | 👍 5    🔁 2    💬 1    📌 0
Designing a Table Format for ML Workloads Explore designing a table format for ML workloads with practical insights and expert guidance from the LanceDB team.

I have been working on a talk about the future of table formats, specifically what is needed for AI workloads, and I found Weston's blogs on LanceDB well written and super helpful: lancedb.com/blog/designi...

26.01.2026 11:51 | 👍 7    🔁 1    💬 1    📌 0
Meet the Speakers - TokioConf 2026 Discover the speakers behind TokioConf 2026. From core maintainers to community leaders, our lineup shares real-world experience and insights.

We're excited to share the complete list of speakers joining us at TokioConf 2026 covering performance tricks, architecture patterns, and more.

See all our speakers: www.tokioconf.com/speakers
(Schedule coming soon)

Tickets are on sale: www.eventbrite.com/e/tokioconf-...

09.01.2026 22:11 | 👍 12    🔁 4    💬 0    📌 1
San Francisco Apache DataFusion Meetup · Luma Join us for an evening of talks and community discussion about Apache DataFusion and its growing role in modern data infrastructure. This year's meetup will…

Apache DataFusion meetup in San Francisco: luma.com/p7r6fp2z Thursday, February 19. We are looking for more speakers and attendees!

17.01.2026 11:17 | 👍 8    🔁 1    💬 0    📌 0

DataFusion blog from Geoffrey Claude explains how to extend DataFusion to support:

-- Postgres style operators
SELECT payload->'user'->>'id'
FROM logs;

-- Statistical sampling
SELECT * FROM sensor_data
TABLESAMPLE BERNOULLI(10 PERCENT);

datafusion.apache.org/blog/2026/01...

14.01.2026 21:04 | 👍 10    🔁 0    💬 0    📌 0
Are You Sure You Want to Use MMAP in Your Database Management System? MMAP Databases = 💩

Since you don't seem to want to cite your own work, I will do it for you: db.cs.cmu.edu/mmap-cidr2022/

12.01.2026 13:29 | 👍 4    🔁 0    💬 0    📌 0

Stoked to be attending North East Database Day on Friday. It is a great mini conference and highlights some of the great research work going on in this area nedbday.github.io/2026/

12.01.2026 13:28 | 👍 4    🔁 1    💬 0    📌 0

Great paper about pruning from Snowflake: arxiv.org/pdf/2504.11540

The LIMIT pruning they describe is 🤯 (so clever once you get it)

We have implemented almost all of the techniques in Apache DataFusion, FWIW

08.01.2026 16:17 | 👍 13    🔁 1    💬 0    📌 0
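
For readers who have not seen LIMIT pruning before, here is a rough sketch of the general idea, assuming per-partition min/max statistics and a query of the form `SELECT * FROM t WHERE x > threshold LIMIT k`. It is neither Snowflake's nor DataFusion's implementation; `Partition`, `plan_scan`, and the sample data are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partition:
    name: str
    min_x: int      # minimum of column x in this partition
    max_x: int      # maximum of column x in this partition
    num_rows: int


def plan_scan(partitions: List[Partition], threshold: int, limit: int) -> List[str]:
    """Pick partitions to scan for: SELECT * FROM t WHERE x > threshold LIMIT limit."""
    # Ordinary min/max pruning: drop partitions that cannot contain a match.
    candidates = [p for p in partitions if p.max_x > threshold]
    # "Fully matching" partitions: the statistics alone prove every row matches.
    fully_matching = [p for p in candidates if p.min_x > threshold]

    chosen, guaranteed_rows = [], 0
    for p in fully_matching:
        chosen.append(p.name)
        guaranteed_rows += p.num_rows
        if guaranteed_rows >= limit:
            # Enough guaranteed rows to satisfy the LIMIT: skip everything else.
            return chosen
    # Not enough guaranteed rows: fall back to scanning every candidate.
    return [p.name for p in candidates]


parts = [
    Partition("p0", 0, 5, 100),    # cannot match x > 10
    Partition("p1", 3, 50, 100),   # may partially match
    Partition("p2", 20, 90, 100),  # fully matches
    Partition("p3", 11, 30, 100),  # fully matches
]
small_limit = plan_scan(parts, threshold=10, limit=50)
medium_limit = plan_scan(parts, threshold=10, limit=150)
large_limit = plan_scan(parts, threshold=10, limit=500)
```

In this toy example a LIMIT of 50 needs only `p2`, a LIMIT of 150 needs `p2` and `p3`, and a LIMIT of 500 cannot be satisfied from fully matching partitions alone, so the plan falls back to all three candidates.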

Come meet fellow Apache DataFusion users and committers at the Stockholm meetup March 5, luma.com/ctqtiqap

07.01.2026 21:05 | 👍 4    🔁 0    💬 0    📌 0

Latest Apache DataFusion blog: more efficient plans and how to efficiently contribute: datafusion.apache.org/blog/output/...

20.12.2025 12:37 | 👍 10    🔁 1    💬 0    📌 1

Qiwei Huang explains how we use Late Materialization (LM) in the Apache Rust Parquet reader to accelerate filtering. LM can describe several techniques, but this is a core one (also applies to joins, Top-K, etc)

arrow.apache.org/blog/2025/12...

12.12.2025 11:40 | 👍 10    🔁 1    💬 0    📌 0
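
As a rough illustration of late materialization for filtering (not the Rust Parquet reader's actual code), the sketch below decodes the predicate column first and then decodes the remaining columns only at the rows that passed. `LazyColumn` and `scan_with_late_materialization` are hypothetical names, and in-memory lists stand in for encoded column chunks.

```python
from typing import Callable, Dict, List


class LazyColumn:
    """Stand-in for an encoded column chunk; counts how many values get decoded."""

    def __init__(self, values: List) -> None:
        self._values = list(values)
        self.decoded = 0

    def decode_all(self) -> List:
        self.decoded += len(self._values)
        return list(self._values)

    def decode_rows(self, rows: List[int]) -> List:
        self.decoded += len(rows)
        return [self._values[i] for i in rows]


def scan_with_late_materialization(
    table: Dict[str, LazyColumn],
    predicate_col: str,
    predicate: Callable,
    projection: List[str],
) -> Dict[str, List]:
    # Step 1: decode only the predicate column and build a row selection.
    values = table[predicate_col].decode_all()
    selection = [i for i, v in enumerate(values) if predicate(v)]
    # Step 2: decode the projected columns only at the selected rows.
    return {name: table[name].decode_rows(selection) for name in projection}


table = {
    "status": LazyColumn([200, 500, 200, 404, 500]),
    "url": LazyColumn(["/a", "/b", "/c", "/d", "/e"]),
}
result = scan_with_late_materialization(table, "status", lambda s: s >= 500, ["url"])
```

Here only two of the five `url` values are decoded, which is where the savings come from when the filter is selective and the projected columns are expensive to decode.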
Funnel | The leading marketing intelligence platform Use Funnel to aggregate data from all your marketing platforms. Access powerful reporting and data modeling, and seamlessly export to any destination.

Thanks to funnel.io, we are hosting a DataFusion meetup in Stockholm
Date: Thursday March 5, 2026: 17:30 - 20:00
Signup: luma.com/ctqtiqap

10.12.2025 14:05 | 👍 3    🔁 0    💬 0    📌 0

There is some crazy (good) activity on the Apache Parquet mailing list for new encodings. A sample: PFOR, FSST, ALP, Strings and Cascaded Encodings. 🤯 Huge kudos to Arnav Balyan, Prateek Gaur, and Micah Kornfield for driving this.

lists.apache.org/list.html?de...

08.12.2025 14:37 | 👍 7    🔁 0    💬 0    📌 0