Kushneet Kaur in cloudnativebykushneet.hashnode.dev · Jun 20 · 5 min read
What is a Lakehouse?
This article is part of the Databricks from Scratch series. Start from the beginning: Stop Optimising Your Prompts. Fix Your Data Pipelines. Picture this. It's IPL ticket booking day. 10 AM. 1 crore
Andrea Paris in andreaparisdata.hashnode.dev · Jun 11 · 6 min read
The Moment I Realised a Database Is Not a Data Warehouse
Context: The platform combines real-time cryptocurrency market data from the CoinGecko API with sentiment analysis derived from cryptocurrency-related YouTube discussions. Apache Kafka, Spark Structure
Andrea Paris in andreaparisdata.hashnode.dev · Jun 10 · 6 min read
Building a Real-Time Crypto Analytics Platform
From Streaming Pipeline to Analytics Platform: When I started learning data engineering, I wanted a project that would force me to use the technologies I was studying in a realistic setting. Tutorials
Vishnu D in thedatatrench.hashnode.dev · May 29 · 9 min read
Spark Architecture Simply Explained
You have been using Spark for months -- running notebooks, submitting jobs, reading docs. But when someone asks you to explain what actually happens when a job runs, you find yourself stalling. The co
Aishwarya Patankar in aishwaryapatankar.hashnode.dev · Apr 30 · 5 min read
Behind Every Payment: The Data Pipelines You Don’t See
The Problem: Payments Look Simple, But Aren’t. When you send money via UPI or receive your salary, it feels instant and effortless. But behind that single action, multiple systems exchange structured d
Madhusmita Talukdar in giiki.hashnode.dev · Apr 24 · 4 min read
Stop Ignoring Data Pipelines: ETL vs ELT Explained Using a Real ML Workflow
Most of us love building machine learning models. We tune hyperparameters, try different algorithms, and chase better accuracy. But there’s one part we quietly ignore: How the data actually gets to th
Abstract Algorithms in abstractalgorithms.dev · Apr 19 · 28 min read
Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained
TLDR: Spark's architecture is a precise chain of responsibility. The Driver converts user code into a DAG, the DAGScheduler breaks it into stages at shuffle boundaries, the TaskScheduler dispatches ta
Abstract Algorithms in abstractalgorithms.dev · Apr 19 · 37 min read
Spark Executor Sizing: Memory Model, Core Tuning, and GC Strategy
TLDR: Spark executor OOMs are almost never caused by insufficient total cluster RAM; they are caused by misallocating memory across five distinct JVM regions while ignoring GC behavior and memoryOver
Abstract Algorithms in abstractalgorithms.dev · Apr 19 · 28 min read
Stateful Aggregations in Spark Structured Streaming: mapGroupsWithState
TLDR: mapGroupsWithState gives each streaming key its own mutable state object, persisted in a fault-tolerant state store that checkpoints to object storage on every micro-batch. Where window aggregat
Abstract Algorithms in abstractalgorithms.dev · Apr 19 · 27 min read
Spark Structured Streaming: Micro-Batch vs Continuous Processing
📖 The 15-Minute Gap: How a Fraud Team Discovered They Needed Real-Time Streaming. A fintech team runs payment fraud detection with a well-tuned Spark batch job. Every 15 minutes it reads a day's worth