Trung Thành in thanh-de.hashnode.dev · May 6 · 6 min read
I spent 6 hours studying PySpark join strategies. Here's what I learned
[...] match keys between two tables and boom, you get results. That mindset worked fine in SQL databases. Then I started working with Spark on large datasets, and my jobs started failing, timing out, or grinding for hours. The reality: Spark join performanc...

Trung Thành in thanh-de.hashnode.dev · May 6 · 4 min read
I spent 8 hours learning Spark partitioning and bucketing. Here's what I discovered
[Here']s one thing I've noticed: most Spark pipelines waste 30-60% of their compute time reading data they don't need or shuffling data that could have been pre-organized. During my recent deep-dive, I spent 8 hours learning two important optimization techn...