Feb 12 · 3 min read · When working with large-scale data in Apache Spark, understanding join strategies is critical for performance tuning. Spark does not always execute joins the same way. Depending on dataset size and co...
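The teaser above is cut off, but the core idea behind Spark's size-dependent join choice is that when one side of a join is small enough (by default, under `spark.sql.autoBroadcastJoinThreshold`), Spark broadcasts it to every executor instead of shuffling both sides. Here is a minimal plain-Python sketch of that broadcast-hash-join idea; the function name and data shapes are illustrative, not Spark's actual internals:

```python
# Sketch of a broadcast hash join, no Spark required:
# the small side is hashed once and "shipped" to every partition
# of the large side, so the large table is never shuffled.

def broadcast_hash_join(large_partitions, small_table, key):
    # Build phase: hash the small table (this is what gets broadcast).
    lookup = {row[key]: row for row in small_table}
    # Probe phase: each partition of the large side probes the map locally.
    joined = []
    for partition in large_partitions:
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:
                joined.append({**row, **match})
    return joined

orders = [[{"id": 1, "amount": 10}],
          [{"id": 2, "amount": 20}, {"id": 3, "amount": 5}]]
customers = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
result = broadcast_hash_join(orders, customers, "id")
```

In Spark SQL itself you can nudge the planner the same way with a hint, e.g. `df_large.join(broadcast(df_small), "id")` using `pyspark.sql.functions.broadcast`.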
Apr 6, 2025 · 8 min read · When I started using AWS Glue, I was impressed by how quickly I could spin up a serverless data pipeline without worrying about managing infrastructure. But that excitement didn’t last long. As my data grew and the workflows became more complex, my G...
Feb 22, 2025 · 5 min read · Imagine that you built a beautiful Spark application: it looks great on paper, but when you run it on a huge dataset it just crawls. The promising job turns into a time-consuming slog with high resource utilization, and you are left wondering wh...
Sep 29, 2024 · 12 min read · Introduction to Spark Optimization: Optimizing Spark can dramatically improve performance and reduce resource consumption. Typically, optimization in Spark can be approached from three distinct levels: cluster level, code level, and CPU/memory level. ...
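The cluster level mentioned in the teaser is usually tuned at submission time. As an illustrative config fragment (the resource numbers here are placeholders, not recommendations), a `spark-submit` invocation might pin executor sizing and shuffle parallelism like this:

```shell
# Cluster-level tuning knobs passed at submit time.
# The values below are example placeholders; size them for your cluster.
spark-submit \
  --num-executors 10 \
  --executor-cores 4 \
  --executor-memory 8g \
  --conf spark.sql.shuffle.partitions=200 \
  my_job.py
```

Code-level and CPU/memory-level tuning then happen inside the application itself (join strategies, caching, serialization) and through the memory-related `spark.memory.*` settings.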
May 26, 2024 · 5 min read · Spark is an in-memory processing engine: all of the computation a task performs happens in memory. So it is important to understand Spark memory management; this will help us develop Spark applications and perform performance tuning. In Apache...
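To make the memory-management point concrete, here is a back-of-the-envelope sketch of Spark's unified memory model. The constants reflect the documented defaults (a fixed 300 MB reservation, `spark.memory.fraction=0.6`, `spark.memory.storageFraction=0.5`), but verify them against your Spark version before relying on the numbers:

```python
# Rough model of how an executor's JVM heap is carved up under
# Spark's unified memory management. Defaults are assumptions;
# check spark.memory.fraction / spark.memory.storageFraction.

RESERVED_MB = 300          # fixed reservation for Spark internals
MEMORY_FRACTION = 0.6      # spark.memory.fraction (default)
STORAGE_FRACTION = 0.5     # spark.memory.storageFraction (default)

def unified_memory_pools(executor_heap_mb):
    usable = executor_heap_mb - RESERVED_MB
    unified = usable * MEMORY_FRACTION       # shared by execution + storage
    storage = unified * STORAGE_FRACTION     # evictable pool for cached data
    execution = unified - storage            # shuffles, joins, sorts, aggregations
    user = usable - unified                  # user data structures and UDF objects
    return {"storage": storage, "execution": execution, "user": user}

pools = unified_memory_pools(4096)  # a 4 GB executor heap
```

For a 4 GB heap this leaves roughly 1.1 GB each for storage and execution, with the rest reserved for user memory; execution can also borrow from storage (and evict cached blocks) when it needs more.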