Handling Data Skew and Broadcast Joins in PySpark · Biju Devassy in bijudevassy.hashnode.dev · Feb 22 · 5 min read · Introduction: Joins are often the most expensive operations in Apache Spark. When they are not handled properly, they can lead to long-running jobs, uneven task execution, excessive shuffling, and even…
Caching vs Persistence in Spark (PySpark) · Biju Devassy in bijudevassy.hashnode.dev · Feb 22 · 5 min read · Introduction: Apache Spark is built on lazy evaluation. Transformations such as select, filter, join, and groupBy do not execute immediately. Instead, Spark builds a logical plan (DAG) and executes it…
Broadcast Join vs Sort Merge Join vs Shuffle Hash Join in Apache Spark · Biju Devassy in bijudevassy.hashnode.dev · Feb 12 · 3 min read · When working with large-scale data in Apache Spark, understanding join strategies is critical for performance tuning. Spark does not always execute joins the same way. Depending on dataset size and co…