Apr 24 · 4 min read · Most of us love building machine learning models. We tune hyperparameters, try different algorithms, and chase better accuracy. But there’s one part we quietly ignore: How the data actually gets to th
Apr 19 · 27 min read · TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global mi
Apr 19 · 28 min read · TLDR: Spark's architecture is a precise chain of responsibility. The Driver converts user code into a DAG, the DAGScheduler breaks it into stages at shuffle boundaries, the TaskScheduler dispatches ta
Apr 19 · 23 min read · 📖 The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The tea
Apr 19 · 24 min read · TLDR: Catalyst is Spark's query compiler. It takes any DataFrame operation or SQL string, parses it into an abstract syntax tree, resolves column references against the catalog, applies a library of a