xiaohan · xiaohan.hashnode.dev · Jul 21, 2024 · How to Get the TaskId in a Spark RDD · Use TaskContext.getPartitionId() to get the partition ID. rdd.partitionBy(partitioner).foreachPartition(iter -> { int id = TaskContext.getPartitionId(); StringBuilder sb = new StringBuilder(); sb.append("Pid=").append(id); while (iter.hasNext()) { ... · spark
Rahul Rathod · codeok.hashnode.dev · Jul 14, 2024 · The Revolutionary Journey of Apache Spark: From Academic Roots to Industry Dominance · In the world of big data, speed and efficiency are paramount. Among the many technologies that have emerged to address these needs, Apache Spark stands out as a revolutionary force. Born from academic innovation and nurtured by a growing community, S... · apache
Mehul Kansal · mehulkansal.hashnode.dev · Jul 8, 2024 · Week 8: Spark Performance Tuning 🎶 · Hey data enthusiasts! 👋 In this week's blog, we delve into Spark Performance Tuning, focusing on optimizing aggregate operations and understanding the intricacies of Spark's logical and physical plans. We explore how sort and hash aggregations diffe... · spark
manvendra singh · manvendra-singh.hashnode.dev · Jul 1, 2024 · The Ultimate Productivity Setup - 2024 · Hey Everyone! 👋 For the past 4 years, I have been watching self-improvement content, and it has been an awesome learning experience. Adopting good habits and working towards my own goals to better myself. 🌟 One of the key components of this journey... · 2 likes · Lifeat
Mehul Kansal · mehulkansal.hashnode.dev · Jul 1, 2024 · Week 7: Spark Optimization Unlocked 🔓 · Hello fellow data engineers! This week, we delve into the intricacies of Apache Spark optimizations, exploring how transformations like groupBy(), join types, partitioning, and adaptive query execution (AQE) enhance the performance and efficiency of data... · 28 reads · spark
Debashis Adak · adak.hashnode.dev · Jun 29, 2024 · Databricks Variant Data · The VARIANT data type is a recent introduction in Databricks (available in Databricks Runtime 15.3 and above) designed specifically for handling semi-structured data. It offers an efficient and flexible way to store and process this kind of data, whi... · big data
Mehul Kansal · mehulkansal.hashnode.dev · Jun 24, 2024 · Week 6: Spark Internals Demystified 🔮 · Hey there, fellow data enthusiasts! This week, we will explore the intricacies of Spark Internals, from DataFrame Writer API and various write modes to advanced partitioning and bucketing techniques. We will discover how to optimize query performance... · spark
Mehul Kansal · mehulkansal.hashnode.dev · Jun 10, 2024 · Week 4: The Power of Caching - Boosting Spark Data Processing 🚀 · Introduction: Hey data enthusiasts! This week, we're exploring one of the most powerful features of Apache Spark: Caching. We'll see various caching strategies, from basic DataFrame caching to advanced techniques like using persist with customizable s... · spark
Mehul Kansal · mehulkansal.hashnode.dev · Jun 3, 2024 · Week 3: Spark Transformations - Navigating Schema and Data Types 🧭 · Introduction: Welcome back, data enthusiasts! This week, we'll unravel the intricacies of schema inference and enforcement, data type handling, creating and refining DataFrames, and removing duplicates. Let's get started right away! Schema inference, ... · 26 reads · data-engineering
Harshita Chaudhary · harshita.hashnode.dev · May 30, 2024 · Slowly Changing Dimensions with PySpark and Delta Lake · Slowly Changing Dimensions (SCDs) are a vital concept in data warehousing, particularly in managing data that changes over time. As the entities evolve over time, it’s crucial to track and manage these changes effectively. This is where Slowly Changi... · data-engineering