Weston Pace

@westonpace.bsky.social

Software developer working on all things Arrow and columnar storage; currently, Lance.

120 Followers  |  363 Following  |  253 Posts  |  Joined: 18.11.2024

Latest posts by westonpace.bsky.social on Bluesky

Ah, I ran into something very similar yesterday with an async "find or insert" cache. The first caller canceled the request while the insert future was in progress (dropped the future) and that cache key was forever blocked.

31.10.2025 19:27 | 👍 1    🔁 0    💬 0    📌 0
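
The post above does not include code. As a rough illustration of one way to avoid that failure mode, here is a minimal Python asyncio sketch in which the cache itself owns the in-flight load, so a canceled caller cannot leave the key stuck. The names `FindOrInsertCache`, `get`, `loader`, and `demo` are hypothetical, and this is not the code from the original incident.

```python
import asyncio


class FindOrInsertCache:
    """Hypothetical async "find or insert" cache (illustrative only)."""

    def __init__(self):
        # key -> asyncio.Task that produces (or has produced) the value
        self._tasks = {}

    async def get(self, key, loader):
        task = self._tasks.get(key)
        if task is None:
            # The cache owns the load as its own task, so it keeps running
            # even if the caller that triggered it is canceled.
            task = asyncio.create_task(loader(key))
            self._tasks[key] = task
            task.add_done_callback(lambda t: self._evict_if_failed(key, t))
        # shield() means canceling this caller cancels only the wait,
        # not the shared load that other callers depend on.
        return await asyncio.shield(task)

    def _evict_if_failed(self, key, task):
        # Drop failed or canceled loads so a later caller can retry.
        if task.cancelled() or task.exception() is not None:
            if self._tasks.get(key) is task:
                del self._tasks[key]


async def demo():
    calls = []

    async def loader(key):
        calls.append(key)
        await asyncio.sleep(0.01)
        return key.upper()

    cache = FindOrInsertCache()
    first = asyncio.create_task(cache.get("a", loader))
    await asyncio.sleep(0)  # let the first caller start the load
    first.cancel()  # the first caller gives up mid-insert
    value = await cache.get("a", loader)  # the second caller still gets a value
    return value, calls, first.cancelled()


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The design choice here is that no caller owns the insert future; callers only wait on it, so dropping any one of them leaves the entry usable.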

Conclusion of a little halloween tradition. If I'm going to traumatize the kids it might as well be interesting.

31.10.2025 15:39 | 👍 2    🔁 0    💬 0    📌 0
Announcing Columnar Back to the future of data connectivity

The future of data connectivity is columnar. Today we launched @columnar.tech to accelerate the shift from slow, row-oriented APIs like ODBC and JDBC to >10x faster alternatives powered by @arrow.apache.org. Learn more 👇

29.10.2025 22:51 | 👍 29    🔁 7    💬 0    📌 4

Nice definition! This matches my use. I also usually have a touch of "please don't hate me I'm doing my best"

29.10.2025 20:05 | 👍 2    🔁 0    💬 1    📌 0

A bittersweet story but glad to see a principled stance!

27.10.2025 19:11 | 👍 2    🔁 2    💬 0    📌 0

However - a word that exists because someone decided we aren't allowed to start a sentence with "but"

27.10.2025 13:18 | 👍 1    🔁 0    💬 1    📌 0
The awkward monkey puppet meme with the text "Well..." from a maintainer of Lance, a lake house format that might just happen to be what the author is describing...

17.10.2025 18:01 | 👍 0    🔁 0    💬 0    📌 0
ALT: a black and white photo of a woman wearing a turtleneck sweater and a dress.

Your coworkers about to flood the channel because "I guess he doesn't want threads for this one"

16.10.2025 14:56 | 👍 2    🔁 0    💬 1    📌 0

Douglas squirrels are 1/3 the size of gray squirrels but six times more ferocious.

10.10.2025 15:52 | 👍 0    🔁 0    💬 0    📌 0

I suspect this will change as caching layers become more mature. The selectivity threshold for cloud storage is something like "one in a million" but more like "one in a thousand" for NVMe.

Also, a self-promotional shout-out: you might want to look at Lance (lancedb.github.io/lance/format...)

08.10.2025 21:46 | 👍 2    🔁 0    💬 1    📌 0

They do a bit of both. The base model is unsupervised and is generally described as "learning the language". The model is then fine-tuned with supervision for a specific task.

The "suck up as much data as you can" is for the first part.

07.10.2025 23:37 | 👍 1    🔁 0    💬 0    📌 0

Yesterday, OP responded to my 11 year old comment on their 13 year old post with a pedantic correction.

07.10.2025 11:26 | 👍 1    🔁 0    💬 0    📌 0

Though I think the "we can't change Parquet" problem is a bit of a false problem. 90% of Parquet users are probably fine to just keep using Parquet. I'm not sure I agree that "the long time archival format" and the "database storage format" need to be the same thing.

03.10.2025 21:38 | 👍 2    🔁 0    💬 0    📌 0

That might be next week's blog post ;). Short answer is I see it as a table format problem and not a file format problem. Change "decoder" to "file reader". Change "stored in the page" to "stored in a folder on the table" and change "wasm" to "pluggable" (native or wasm).

03.10.2025 21:38 | 👍 2    🔁 0    💬 1    📌 0

Hope this helps, it's fun to see so much exciting innovation in a space that's been relatively quiet for many years!

03.10.2025 17:18 | 👍 0    🔁 0    💬 0    📌 0

F3 is from a joint project between CMU and Tsinghua University. They have tackled the "forwards compatibility" problem by storing WASM decoders with the data so that old readers can read data written by futuristic writers.

03.10.2025 17:18 | 👍 0    🔁 0    💬 2    📌 0

FastLanes comes from CWI. They're the group that's designed some of the new lightweight compression algorithms (e.g. FSST). They definitely focus on compression and they likely have the best layout for processing data already in memory.

03.10.2025 17:18 | 👍 0    🔁 0    💬 1    📌 0

Vortex comes from SpiralDB. They've done a good job explaining what they do and writing about it. They've put a big focus on compression and, especially, on pushing compute down to run against compressed data.

03.10.2025 17:18 | 👍 0    🔁 0    💬 1    📌 0

Nimble comes from Meta, and sadly not much has been written about it publicly. The best I can say at the moment is that Nimble has placed perhaps the biggest emphasis of all the formats on extremely wide schemas (again, all formats have done some work here).

03.10.2025 17:18 | 👍 0    🔁 0    💬 1    📌 0

I work on Lance! So I'm most biased here. We focus on balancing random access and full scans. All formats have focused on better random access / large data, but none to the extent that we have, especially for tensors / embeddings.

03.10.2025 17:18 | 👍 0    🔁 0    💬 1    📌 0

Lots of work being done on columnar file formats lately. I count 5 new formats so far (Lance, Nimble, Vortex, FastLanes, F3).

It's definitely something we follow at LanceDB and it can be confusing to track. So here is my very biased head-canon (trying to stay positive)

03.10.2025 17:18 | 👍 7    🔁 1    💬 1    📌 0

Newest house mate is an industrious spider that spends every day building a beautiful web right at eye level so I can blearily walk face first into it every morning.

03.10.2025 15:38 | 👍 0    🔁 0    💬 0    📌 0

Son got mad and told me he wouldn't take me to the creamery when I died. I have some questions.

14.08.2025 01:42 | 👍 3    🔁 0    💬 0    📌 0

Discussion points so far...

Should we slap `urn:` in the front so that users get a free parser?

Should the coordinates be repeated in the file itself?

13.08.2025 17:49 | 👍 1    🔁 0    💬 0    📌 0

We're trying to figure out "Substrait coordinates" (e.g. organization, name, version tuples) for Substrait functions. Is anyone out there actually passionate about the topic, or does anyone have lessons or advice?

At the moment, leaning towards `organization:name:version` (and forbidding colons in each field)

13.08.2025 17:48 | 👍 0    🔁 0    💬 1    📌 0
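
To make the `organization:name:version` proposal concrete, here is a small Python sketch of a parser and formatter under that rule. The names `Coordinate`, `parse_coordinate`, and `format_coordinate` are hypothetical, the sample coordinate is made up, and rejecting empty fields is an added assumption rather than something stated in the post. Nothing here is part of Substrait itself.

```python
from typing import NamedTuple


class Coordinate(NamedTuple):
    """Hypothetical (organization, name, version) tuple."""
    organization: str
    name: str
    version: str


def parse_coordinate(text: str) -> Coordinate:
    # Because colons are forbidden inside fields, a plain split is
    # unambiguous: a valid coordinate has exactly three parts.
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected organization:name:version, got {text!r}")
    # Assumption: empty fields are not allowed.
    if any(part == "" for part in parts):
        raise ValueError(f"empty field in coordinate {text!r}")
    return Coordinate(*parts)


def format_coordinate(coord: Coordinate) -> str:
    # Enforce the "no colon in any field" rule when writing.
    for field in coord:
        if ":" in field:
            raise ValueError(f"colon not allowed in field {field!r}")
    return ":".join(coord)


if __name__ == "__main__":
    coord = parse_coordinate("example_org:my_functions:1.2.0")
    print(coord)
    print(format_coordinate(coord))
```

Forbidding the separator inside fields is what keeps the parser this short; no escaping or quoting rules are needed.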

Hmm, it shouldn't be _that_ slow. DuckDB is going to do one query to get column values (O(N), pretty fast) and another with a "case when" for each possible value (O(C*N)). I wonder if there's some optimization opportunity for hundreds of "case when" statements collapsing into a dict lookup.

12.08.2025 00:53 | 👍 1    🔁 0    💬 1    📌 0
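
The cost difference described above can be sketched in plain Python. This is only an illustration of O(C*N) versus O(N) work, not how DuckDB executes anything; the function names and the sample rows are hypothetical.

```python
def totals_case_when(rows, categories):
    # One "case when" per category: every row is compared against every
    # category, so the work grows as O(C * N).
    return [
        sum(value for category, value in rows if category == wanted)
        for wanted in categories
    ]


def totals_dict_lookup(rows, categories):
    # Build the category -> output slot mapping once, then make a single
    # pass over the rows: O(C + N) with average O(1) lookups.
    slot = {category: i for i, category in enumerate(categories)}
    totals = [0] * len(categories)
    for category, value in rows:
        i = slot.get(category)
        if i is not None:  # rows with other categories are ignored
            totals[i] += value
    return totals


if __name__ == "__main__":
    rows = [("a", 1), ("b", 2), ("a", 3), ("z", 9)]
    categories = ["a", "b", "c"]
    print(totals_case_when(rows, categories))
    print(totals_dict_lookup(rows, categories))
```

Both functions return the same totals; only the number of comparisons per row differs.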

768 is very common, probably the most common I see from users. 128 is still around but rare. One user even has 1536.

09.08.2025 15:21 | 👍 3    🔁 0    💬 1    📌 0
concurrent TLS connection segfault in x509 storage (regression on 3.0.17) · Issue #28171 · openssl/openssl During multiple connection the TLS serving side of openssl crashes: ref: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1110254 ref: https://jira.mariadb.org/browse/MDEV-37361 The debian bug has...

I'm not crazy! Just unlucky...

github.com/openssl/open...

06.08.2025 03:22 | 👍 1    🔁 0    💬 0    📌 0

Time to double check my reservation

05.08.2025 23:51 | 👍 1    🔁 0    💬 0    📌 0

Sounds like you got yourself a new DIY project 😉

04.08.2025 19:04 | 👍 0    🔁 0    💬 1    📌 0