Yash Srivastava in blog.yashsrivastava.link · Jan 24, 2023 · 2 min read
Spark Stages, Tasks, and Jobs
There are three main components in the Spark UI. Jobs: a Spark application can have multiple jobs, based on the number of actions (#jobs = #actions) in the application. These jobs can share a common RDD somewhere in the execution map, or they can be separate al...

Yash Srivastava in blog.yashsrivastava.link · Jan 18, 2023 · 4 min read
Basic Spark RDD transformations
RDDs (resilient distributed datasets) are the basic unit of storage in Spark. You can think of an RDD as a collection distributed over multiple machines. Most of the time, higher-level structured APIs are used in Spark applications, which under the hood g...

Yash Srivastava in blog.yashsrivastava.link · Jan 9, 2023 · 2 min read
Spark on YARN architecture
When we talk about Spark on top of Hadoop, it's generally Hadoop core with the Spark compute engine instead of MapReduce, i.e. (HDFS, Spark, YARN). Spark follows a master-slave architecture where the master is called the Driver and is responsible for ...

Yash Srivastava in blog.yashsrivastava.link · Jan 9, 2023 · 2 min read
Shared variables in Spark
Sometimes in a Spark application, we need to share a small amount of data across all the machines for processing. For example, we may want to filter a set of words out of a large dataset residing in a data lake, or we may simply want to know how many blank l...

Yash Srivastava in blog.yashsrivastava.link · Jan 4, 2023 · 2 min read
What is Apache Spark?
In simple terms, Apache Spark is an in-memory, unified, parallel compute engine. In memory: most operations in Apache Spark happen in memory with very little disk I/O, which gives faster data transformation and computation, unlik...