Rahul Jain rahulj51 - Bluesky Statics

26.04.2025 14:05 — 👍 2 🔁 0 💬 0 📌 0

Cursor having an identity crisis.

26.04.2025 14:04 — 👍 1 🔁 0 💬 0 📌 0

It takes about two weeks of using these fancy table formats to realize that what's documented is just the tip of the iceberg. Most of the operational knowledge is undocumented and buried in GitHub issues and Slack threads.

01.02.2025 02:24 — 👍 3 🔁 0 💬 0 📌 0

I will make my types so so safe this year. You just watch.

11.01.2025 02:25 — 👍 5 🔁 0 💬 0 📌 0

The logs provide little to no information on what's going on, and the errors are so random that the last 200 people who had the same error all solved it differently. There are a gazillion parameters to configure and most of them are undocumented tribal knowledge. /2

09.01.2025 14:18 — 👍 0 🔁 0 💬 0 📌 0

Spent two days fiddling with all kinds of Spark settings to solve a network timeout issue. It finally worked, but this is my biggest crib about Spark. Most of the time, the only way to fix stuff is trial and error. /1

09.01.2025 14:18 — 👍 1 🔁 0 💬 1 📌 0

Can someone explain why the Spark DataFrame API doesn't have the concept of create-if-not-exists but the SQL API does?

07.01.2025 01:39 — 👍 0 🔁 0 💬 0 📌 0

*org

02.01.2025 05:00 — 👍 0 🔁 0 💬 0 📌 0

Your org may have a set of 20 well-written values, but even well-meaning techies usually care about only two traits in their colleagues:
- Technical skills
- Being nice to each other

02.01.2025 03:59 — 👍 3 🔁 0 💬 1 📌 0

Attended a big fat Indian wedding and went berserk with the food.

01.01.2025 08:57 — 👍 4 🔁 0 💬 0 📌 0

I didn't know that Iceberg also creates hive style partitioned folders. This is surprising. I thought the whole idea was to manage partitions at metadata level.

31.12.2024 10:02 — 👍 1 🔁 0 💬 1 📌 0

The worst resumé writing advice is that it should be contained within 1-2 pages.

Add more pages. Tell your story. But tell it well.

23.12.2024 09:14 — 👍 3 🔁 0 💬 1 📌 0

2 years of TypeScript made me a better Python programmer.

18.12.2024 15:16 — 👍 5 🔁 0 💬 0 📌 0

I have a very specific scenario, though I bet this has more general-purpose use. Mine is about maintaining a lambda-architecture ingestion pipeline where the batch and the streaming tables can drift apart with respect to their schemas. We have a union view on top of these tables, which fails when this happens.

11.12.2024 16:07 — 👍 2 🔁 0 💬 1 📌 0

A SQL command I'd love to see is

```
ALTER TABLE <table2> MERGE SCHEMA <table1>
WHEN MATCHED CHANGE DATATYPE
WHEN NOT MATCHED BY SOURCE DROP COLUMN
WHEN NOT MATCHED BY TARGET ADD COLUMN
```
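As a rough illustration of the semantics that wished-for command implies, here is a minimal Python sketch. It assumes each schema is available as a plain column-name-to-type mapping; `plan_schema_merge`, the action labels, and the sample `batch`/`stream` schemas are hypothetical names made up for this example, not part of any engine's API. It only plans the changes and does not execute them.

```python
from typing import Optional

# One planned change: (action, column name, new data type or None).
Action = tuple[str, str, Optional[str]]


def plan_schema_merge(source: dict[str, str], target: dict[str, str]) -> list[Action]:
    """Plan the changes that would make `target`'s schema match `source`'s.

    Mirrors the three clauses of the hypothetical MERGE SCHEMA command:
    matched columns get the source data type, columns missing from the
    target are added, and columns missing from the source are dropped.
    """
    plan: list[Action] = []
    for name, dtype in source.items():
        if name not in target:
            # WHEN NOT MATCHED BY TARGET ADD COLUMN
            plan.append(("ADD COLUMN", name, dtype))
        elif target[name] != dtype:
            # WHEN MATCHED CHANGE DATATYPE (only when the types differ)
            plan.append(("CHANGE DATATYPE", name, dtype))
    for name in target:
        if name not in source:
            # WHEN NOT MATCHED BY SOURCE DROP COLUMN
            plan.append(("DROP COLUMN", name, None))
    return plan


# Hypothetical batch (source) and streaming (target) schemas that have drifted.
batch = {"id": "bigint", "amount": "double", "country": "string"}
stream = {"id": "bigint", "amount": "float", "legacy_flag": "boolean"}

plan = plan_schema_merge(source=batch, target=stream)
for action in plan:
    print(action)
```

Whether a given type change or column drop is actually safe depends on the engine and table format, so a real implementation would need to validate each planned action before applying it.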

11.12.2024 12:48 — 👍 7 🔁 1 💬 2 📌 0

What Python type checker is everyone using - mypy or pyright?

04.12.2024 12:15 — 👍 0 🔁 0 💬 0 📌 0

Indian grey mornings have a post-apocalyptic feel to them. A brownish, heavy, noisy grey as opposed to the silent steel grey of Berlin.

30.11.2024 05:01 — 👍 3 🔁 0 💬 0 📌 0

Data lake architectures are uncannily similar to farm life simulators.

29.11.2024 04:37 — 👍 2 🔁 0 💬 0 📌 0

In Data, I often see this when an analyst is trying to fix that 0.0001% bad data in a dataset because it feels like a pebble in the shoe that you must get rid of.

28.11.2024 02:10 — 👍 3 🔁 0 💬 1 📌 0

There is something about certain coding practices that is deeply satisfying to a programmer at a dopamine-release level - which explains why they go OCD about these. Functional programming, TDD, refactoring and type design all have this quality. It's like an itch you can't stop scratching.

28.11.2024 02:09 — 👍 18 🔁 0 💬 2 📌 1

The majority of data orgs in the world are still woefully unaware of the latest in data tech. They are still sending each other CSVs over FTP and have never heard of Airflow.

23.11.2024 02:25 — 👍 15 🔁 0 💬 2 📌 0

Yeah. Np. We can and should surely have both. I just find it curious how different folks interpret data democracy differently.

22.11.2024 03:44 — 👍 1 🔁 0 💬 1 📌 0

Depending on who you talk to, there are two versions of data democracy vis-a-vis technology:

1. My data, my choice. I want to be able to use my choice of tool/tech for my data.

2. Inclusivity. Everyone should have easy access to data. Therefore, I'll choose generic tools that anyone can use easily.

21.11.2024 07:23 — 👍 1 🔁 0 💬 0 📌 1

A good tip for data engineering architecture is to treat a "Table" as a logical entity and not tie it with the way it is materialized or mapped to other subsystems. A table can be anything - a set of files, a view, the result of a query. Its definition and physical manifestation should be decoupled.
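One way to picture that decoupling is the minimal Python sketch below. It assumes a table is just a name, a schema, and something that can produce rows; `LogicalTable`, `Materialization`, `InMemoryRows`, and `DerivedQuery` are hypothetical names invented for this illustration, not classes from any catalog or engine.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Protocol

Row = dict[str, Any]


class Materialization(Protocol):
    """Anything that can produce rows: files, a view, a query result."""

    def read(self) -> Iterable[Row]: ...


@dataclass
class InMemoryRows:
    """Stand-in for a physical store such as a set of files."""

    rows: list[Row]

    def read(self) -> Iterable[Row]:
        return iter(self.rows)


@dataclass
class DerivedQuery:
    """Stand-in for a view: rows are computed from another logical table."""

    source: "LogicalTable"
    predicate: Callable[[Row], bool]

    def read(self) -> Iterable[Row]:
        return (row for row in self.source.read() if self.predicate(row))


@dataclass
class LogicalTable:
    """The logical entity: callers see a name and schema, never the storage."""

    name: str
    schema: dict[str, str]
    materialization: Materialization

    def read(self) -> Iterable[Row]:
        return self.materialization.read()


schema = {"id": "bigint", "amount": "double"}
orders = LogicalTable(
    "orders", schema, InMemoryRows([{"id": 1, "amount": 50.0}, {"id": 2, "amount": 250.0}])
)
# Same interface, different physical manifestation: a filter over `orders`.
big_orders = LogicalTable(
    "big_orders", schema, DerivedQuery(orders, lambda row: row["amount"] > 100)
)

big_rows = list(big_orders.read())
print(big_rows)
```

Swapping the materialization of `orders` for another implementation would not change any code that reads `big_orders`, which is the point of keeping definition and physical form separate.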

18.11.2024 03:00 — 👍 20 🔁 0 💬 1 📌 0

Does anyone know if BigQuery uses table metadata to speed up MIN/MAX of a partition column? It looks like it always does a column scan.

07.11.2024 17:12 — 👍 0 🔁 0 💬 0 📌 0

Data pipelines are difficult to generalize because a lot depends on the specific characteristics of a table - there is no one size fits all. Cloud warehouses offer high level abstractions which work but cost $$$. Lakes offer hundreds of levers to pull but at the cost of generalizability.

04.11.2024 19:17 — 👍 4 🔁 0 💬 0 📌 0

Yes, versioning/snapshots are the way to manage this yourself and manually roll back if things go wrong.

02.11.2024 13:35 — 👍 0 🔁 0 💬 0 📌 0

For example, writing to two related tables (say dimensions and facts), or "moving" data from one table to another. In my case, I have two tables A and B participating in a lambda-like architecture, and I frequently compact one table by moving data to the other table. Such operations require atomicity.

02.11.2024 13:22 — 👍 0 🔁 0 💬 1 📌 0

The poor support for multi-table transactions in the DE world is somewhat limiting. We have just gotten used to not having it and replacing it with data integrity checks and manual rollbacks. But with table formats and catalogs, this shouldn't be too difficult to implement (or so I think).

02.11.2024 09:59 — 👍 5 🔁 0 💬 2 📌 0

What's a good way to visually organize a flow chart?

01.11.2024 19:13 — 👍 0 🔁 0 💬 0 📌 0

Posts by Rahul Jain (@rahulj51.bsky.social)