Apr 24 · 4 min read · Most of us love building machine learning models. We tune hyperparameters, try different algorithms, and chase better accuracy. But there’s one part we quietly ignore: How the data actually gets to th
Apr 19 · 27 min read · TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global mi
Apr 19 · 28 min read · TLDR: Spark's architecture is a precise chain of responsibility. The Driver converts user code into a DAG, the DAGScheduler breaks it into stages at shuffle boundaries, the TaskScheduler dispatches ta
Apr 19 · 23 min read · 📖 The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The tea
Apr 19 · 24 min read · TLDR: Catalyst is Spark's query compiler. It takes any DataFrame operation or SQL string, parses it into an abstract syntax tree, resolves column references against the catalog, applies a library of a
Apr 19 · 27 min read · 📖 The 15-Minute Gap: How a Fraud Team Discovered They Needed Real-Time Streaming A fintech team runs payment fraud detection with a well-tuned Spark batch job. Every 15 minutes it reads a day's worth
Apr 19 · 37 min read · TLDR: Spark executor OOMs are almost never caused by insufficient total cluster RAM — they are caused by misallocating memory across five distinct JVM regions while ignoring GC behavior and memoryOver