DataFrames at Scale Comparison: TPC-H
Hendrik Makait, Sarah Johnson, Matthew Rocklin
2024-05-14, 14 min read
We run benchmarks derived from the TPC-H benchmark suite on a variety of scales, hardware architectures, and dataframe projects...
Benchmarks Show This:
- @DuckDB beats @Spark for small queries.
- Even at 700GB, DuckDB (native files) is competitive.
- Spark scales dynamically for 1TB+ workloads.
Details: https://buff.ly/47UvlMc
The lesson? If data fits on one node, go single-node for speed. Scale to MPP only when needed.
10.12.2024 10:51
MPP vs. Single-Node Engines
Small workloads? Use @DuckDB or @Polars for faster in-memory performance.
Massive datasets? MPP systems like @Spark or @Snowflake scale dynamically.
Experiment: @DuckDB outperformed Spark at <100GB.
Don't go grocery shopping in a tank!
10.12.2024 10:51
Why Are Object Stores So Attractive?
1. Scalability: Handle massive amounts of data.
2. Flexibility: Open formats like Iceberg for interoperability.
3. Advanced Features: Replication, immutability, and consistency.
They have become the backbone of modern distributed systems.
08.12.2024 10:51
What Are "One-Way Door" Risks?
One-way doors = irreversible decisions.
In tech: adopting new tools or models without clear exit paths.
08.12.2024 10:51
The Future of Distributed Systems
Object stores like S3 have become databases: scalable and efficient for both transactional and analytical workloads.
Emerging programming models:
1. Distributed DBs: On files
2. Serverless: Focus on code
3. Wasm: Portable execution
Challenge: "one-way-door" innovation
08.12.2024 10:51
The Iceberg Effect
Modern data is evolving:
- Iceberg now leads open table formats (Snowflake & Databricks adoption confirms it).
- Cloud-native storage is a must (legacy systems won't keep up).
- AI thrives on scalable, open architectures.
More innovation. Less vendor lock-in.
Ready to shift?
06.12.2024 15:14
GitHub - resource-disaggregation/snowset: Snowflake dataset containing statistics for 70 million queries over 14 day period
Curious where the data comes from?
Snowset (Snowflake's dataset): https://buff.ly/4eULXoQ
Redset (Redshift's dataset): https://buff.ly/3CScB4x
Both share real-world query samples, packed with insights into how data warehouses are used. Check them out!
04.12.2024 10:51
The plot breaks down the cost of various query types in Redshift and Snowflake data warehouses:
Ingest: Involves bringing new data into the system and merging it with existing data.
Transformation: Converts raw data into simplified, easy-to-query views, making it more usable for business applications.
Read: Focuses on business intelligence dashboards and data science workloads, extracting insights from the data.
Export: Sends data out of the data warehouse for use in other systems.
Other: Consists mainly of system maintenance functions necessary to keep the data warehouse operational.
Of queries that scan at least 1 MB, the median query scans around 100 MB, while the 99.9th percentile reaches 300 GB. Despite being 'massively parallel processing' systems, databases like Snowflake and Redshift mainly handle queries that could easily fit on a single large node.
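For anyone who wants to reproduce this kind of figure on their own query logs, here is a minimal sketch of the calculation described above: filter out queries that scan less than 1 MB, then take the median and the 99.9th percentile of what remains. The function name `scan_summary`, the 1 MB = 10^6 bytes convention, and the sample values are assumptions for illustration only; they are not taken from Snowset or Redset.

```python
import statistics

MB = 1_000_000  # assumption: decimal megabytes


def scan_summary(scan_bytes, min_bytes=MB):
    """Summarize bytes scanned per query, ignoring queries below min_bytes."""
    kept = sorted(b for b in scan_bytes if b >= min_bytes)
    if len(kept) < 2:
        raise ValueError("need at least two qualifying queries")
    # quantiles(n=1000) returns 999 cut points; the last one is the
    # 99.9th percentile. "inclusive" treats the data as the full population.
    cuts = statistics.quantiles(kept, n=1000, method="inclusive")
    return {
        "count": len(kept),
        "median": statistics.median(kept),
        "p999": cuts[-1],
    }


# Hypothetical per-query scan sizes in bytes (made up, not real warehouse data).
sample = [500_000, 2 * MB, 100 * MB, 300 * MB]
summary = scan_summary(sample)
print(summary)
```

On a real log you would pass the per-query bytes-scanned column instead of `sample`; if the resulting `p999` fits comfortably in the memory of one large machine, that supports the single-node argument.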
What Do Data Warehouses Really Do?
- $300K/year on Snowflake, and 90% is spent on queries.
- Most queries are tiny (median: 100 MB; 99.9th percentile: 300 GB).
- Most workloads = ingestion + transformation (not analytics).
Small Data > Massive Complexity.
Are we overpaying for simplicity?
04.12.2024 10:51
Think Small. Make Big Impact.
More Data ≠ Better Results.
- Recent data is the most valuable.
- Smaller AI models deliver bigger impact.
- Local-first development works.
Stop relying on distributed complexity when single machines get the job done.
The #SmallData Movement is here. Are you in?
02.12.2024 10:51
Big Data isn't the problem, and it never was.
Most enterprises have <100GB in active data but overpay for tools designed for massive scale (#Snowflake, #Databricks, etc.).
Focus on #SmallData:
- Easier to analyze
- Cheaper to manage
- Faster insights
Time to rethink your data strategy. #SmallData
30.11.2024 15:44