Biju Devassy in bijudevassy.hashnode.dev
Caching vs Persistence in Spark (PySpark)
Introduction: Apache Spark is built on lazy evaluation. Transformations such as select, filter, join, and groupBy do not execute immediately. Instead, Spark builds a logical plan (DAG) and executes it...
5d ago · 5 min read
Dishant Sharma in dishantsharma.hashnode.dev
Codex Spark vs Codex 5.3 vs Claude: Which AI Coding Tool Wins?
A developer on X posted last week that Codex Spark generated a full SpriteKit game in 20 minutes. He called it "INSANELY FAST". Same day, another engineer warned the model "trades brains for speed". Both were right. OpenAI dropped GPT-5.3-Codex-Spark...
Feb 15 · 6 min read
Biju Devassy in bijudevassy.hashnode.dev
Broadcast Join vs Sort Merge Join vs Shuffle Hash Join in Apache Spark
When working with large-scale data in Apache Spark, understanding join strategies is critical for performance tuning. Spark does not always execute joins the same way. Depending on dataset size and co...
Feb 12 · 3 min read
Tech Insights Hub in topperblog.hashnode.dev
Spark Streaming: Real-Time Processing
Apache Spark Streaming: Real-Time Processing Guide for Modern Data Platforms. Real-time data processing has become non-negotiable for modern enterprises. When your fraud detection system takes 30 seconds to flag a suspicious transaction, you've alread...
Feb 12 · 10 min read
Tech Insights Hub in topperblog.hashnode.dev
Spark Job Optimization: Partition Skew
Why Traditional Partition Skew Solutions Fail at Scale. Legacy approaches to handling data skew in Spark relied heavily on manual intervention and static configurations. Data engineers would analyze job metrics, identify skewed keys, and hardcode salt...
Feb 12 · 9 min read
Numur in numur.hashnode.dev
Reflections on My First Year as an AI Engineer: From "Trial by Fire" to a Sustainable Rhythm
Introduction: 2025 marked my first year in the workforce as a full-time professional. Though I’ve been in the industry for less than a year, I’ve managed to find a steady rhythm that works for me. I decided to take a moment to look back and organize m...
Feb 10 · 7 min read
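Numur in numur.hashnode.dev
From Memory Blowups to Building a Recommender System on My Own: The First Year of Finding My Own Pace
Preface: 2025 was my first year in a full-time job. Although my career is still less than a year old, I have settled into a steady working rhythm, so I decided to pause and put this journey in order. In my early days on the job, I kept cycling between "trial and frustration" and had no room to sort any of it out. 1. An unexpected start: I specialized in natural language processing at university, so when I saw a game company hiring a natural language engineer, I naturally sent in my résumé. Then I received an online written test, and it was full of recommender-system questions, with not a single one about natural language! I didn't think much of it at the time; my first thought was: did I misremember the job description? After all, I had also applied for other recommender-system positions, and I had seriously studied and built recommender sys...
Feb 9 · 2 min read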
Sameer Shukla in freecodecamp.org
How to Optimize PySpark Jobs: Real-World Scenarios for Understanding Logical Plans
In the world of big data, performance isn't just about bigger clusters – it's about smarter code. Spark is deceptively simple to write but notoriously difficult to optimize, because what you write isn't what Spark executes. Between your transformatio...
Feb 5 · 70 min read
Deepankar Yadav in bytesofdeepankar.hashnode.dev
Understanding Spark Memory Failures by Breaking a Cluster on Purpose
I recently built an Apache Spark standalone cluster on a single Raspberry Pi 5 (8 GB RAM) using Docker. The cluster had: 1 Spark Master; 4 Spark Workers: harvey, mike, donna, louis (named after Suits characters 😄); strict memory limits per container...
Dec 25, 2025 · 8 min read
Deepankar Yadav in bytesofdeepankar.hashnode.dev
Postmortem of Spark Executor's death
A question I kept coming back to while comparing Spark with BigQuery was this: If Spark executors write shuffle data to disk, and that disk still exists, why can’t other executors read that data when one executor dies? At first glance, it feels lik...
Dec 22, 2025 · 4 min read