Joachim Rosskopf @jrosskopf

DataFrames at Scale Comparison: TPC-H
Hendrik Makait, Sarah Johnson, Matthew Rocklin · 2024-05-14 · 14 min read
We run benchmarks derived from the TPC-H benchmark suite on a variety of scales, hardware architectures, and dataframe projects...

Benchmarks Show This:

→ @DuckDB beats @Spark for small queries.
→ Even at 700GB, DuckDB (native files) is competitive.
→ Spark scales dynamically for 1TB+ workloads.

Details: https://buff.ly/47UvlMc

🔍 The lesson? If data fits on one node, go single-node for speed. Scale to MPP only when needed.
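
A rough sketch of that rule of thumb in Python. The `Node` type, the `pick_engine` function, and the 50% headroom factor are hypothetical choices for illustration and are not taken from the benchmark write-up:

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass(frozen=True)
class Node:
    """Capacity of the largest single machine you are willing to run (assumed values)."""
    memory_bytes: int
    disk_bytes: int


def pick_engine(dataset_bytes: int, node: Node, headroom: float = 0.5) -> str:
    """Suggest an engine class for a dataset of the given size.

    headroom is the fraction of memory or disk the data may occupy,
    leaving the rest for intermediates. 0.5 is an arbitrary assumption.
    """
    if dataset_bytes <= node.memory_bytes * headroom:
        return "single-node (in-memory)"
    if dataset_bytes <= node.disk_bytes * headroom:
        return "single-node (out-of-core)"
    return "mpp"


if __name__ == "__main__":
    big_node = Node(memory_bytes=256 * GB, disk_bytes=2048 * GB)
    for size_gb in (50, 700, 2000):
        print(size_gb, "GB ->", pick_engine(size_gb * GB, big_node))
```

The thresholds should come from your own hardware and measurements; the point is only that the check is cheap to make before reaching for a cluster.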

10.12.2024 10:51 — 👍 3 🔁 0 💬 2 📌 0

MPP vs. Single-Node Engines

Small workloads? Use @DuckDB or @Polars for faster in-memory performance.
Massive datasets? MPP systems like @Spark or @Snowflake scale dynamically.

Experiment: @DuckDB outperformed Spark at <100GB.

💡 Don't go grocery shopping in a tank!

10.12.2024 10:51 — 👍 5 🔁 0 💬 1 📌 0

Why Are Object Stores So Attractive?

1️⃣ Scalability: Handle massive amounts of data.
2️⃣ Flexibility: Open formats like Iceberg for interoperability.
3️⃣ Advanced Features: Replication, immutability, and consistency.

They became the backbone of modern distributed systems.

08.12.2024 10:51 — 👍 0 🔁 0 💬 0 📌 0

What Are "One-Way Door" Risks?

❌ One-way doors = irreversible decisions.
In tech: adopting new tools or models without clear exit paths.

08.12.2024 10:51 — 👍 0 🔁 0 💬 1 📌 0

The Future of Distributed Systems

Object stores like S3 have effectively become databases: scalable and efficient for both transactional and analytical workloads.

Emerging programming models:
1️⃣ Distributed DBs: On files
2️⃣ Serverless: Focus on code
3️⃣ Wasm: Portable execution

Challenge: "one-way door" innovation

08.12.2024 10:51 — 👍 2 🔁 0 💬 1 📌 0

The Iceberg Effect

Modern data is evolving:
→ Iceberg now leads open table formats (Snowflake & Databricks adoption confirms it).
→ Cloud-native storage is a must (legacy systems won’t keep up).
→ AI thrives on scalable, open architectures.

More innovation. Less vendor lock-in.
Ready to shift?

06.12.2024 15:14 — 👍 2 🔁 0 💬 0 📌 0

GitHub - resource-disaggregation/snowset: Snowflake dataset containing statistics for 70 million queries over 14 day period

Curious where the data comes from?
🔗 Snowset (Snowflake's dataset): https://buff.ly/4eULXoQ
🔗 Redset (Redshift's dataset): https://buff.ly/3CScB4x

Both share real-world query samples, packed with insights into how data warehouses are used. Check them out!

04.12.2024 10:51 — 👍 1 🔁 0 💬 0 📌 0

The plot breaks down the cost of various query types in Redshift and Snowflake data warehouses:
→ Ingest: Involves bringing new data into the system and merging it with existing data.
→ Transformation: Converts raw data into simplified, easy-to-query views, making it more usable for business applications.
→ Read: Focuses on business intelligence dashboards and data science workloads, extracting insights from the data.
→ Export: Sends data out of the data warehouse for use in other systems.
→ Other: Consists mainly of system maintenance functions necessary to keep the data warehouse operational.

Of queries that scan at least 1 MB, the median query scans around 100 MB, while the 99.9th percentile reaches 300 GB. Despite being 'massively parallel processing' systems, databases like Snowflake and Redshift mainly handle queries that could easily fit on a single large node.
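
For anyone who wants to recompute this kind of summary from a query log, here is a minimal sketch using only the Python standard library. The function names, the nearest-rank percentile method, and the plain list of bytes scanned per query are assumptions for illustration; they do not reflect the actual schema of Snowset or Redset:

```python
import math
from typing import Sequence, Tuple

MB = 10 ** 6


def nearest_rank_percentile(values: Sequence[int], p: float) -> int:
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def scan_summary(bytes_scanned: Sequence[int], min_bytes: int = MB) -> Tuple[int, int]:
    """Median and 99.9th percentile of scan size, ignoring queries below min_bytes."""
    kept = [b for b in bytes_scanned if b >= min_bytes]
    if not kept:
        raise ValueError("no queries scan at least min_bytes")
    return nearest_rank_percentile(kept, 50), nearest_rank_percentile(kept, 99.9)
```

Feed it the per-query bytes-scanned column from your own warehouse's query history to see how your workload compares.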

What Do Data Warehouses Really Do?

→ $300K/year on Snowflake, and 90% is spent on queries.
→ Most queries are tiny (median: 100MB, 99.9% <300GB).
→ Most workloads = ingestion + transformation (not analytics).

💡 Small Data > Massive Complexity.
Are we overpaying for simplicity?

04.12.2024 10:51 — 👍 2 🔁 0 💬 1 📌 0

Think Small. Make Big Impact.

More Data ≠ Better Results.
→ Recent data is the most valuable.
→ Smaller AI models deliver bigger impact.
→ Local-first development works.

Stop relying on distributed complexity when single machines get the job done.

The #SmallData Movement is here. Are you in?

02.12.2024 10:51 — 👍 1 🔁 0 💬 0 📌 0

BigData isn’t the problem—it never was.

Most enterprises have <100GB in active data but overpay for tools designed for massive scale (#Snowflake, #Databricks, etc.).

Focus on #SmallData:
→ Easier to analyze
→ Cheaper to manage
→ Faster insights

Time to rethink your data strategy. #SmallData

30.11.2024 15:44 — 👍 5 🔁 0 💬 0 📌 0

Latest posts by jrosskopf.bsky.social on Bluesky