navinkumarnotes123.hashnode.dev · 12 hours ago
How to decide bucket count in Hive
Steps — Calculate the expected bucket size: divide the table size by the Hadoop block size to get an initial estimate (Expected Bucket Size = Table Size / Block Size on Hadoop). Find the nearest power of 2: take the base-2 logarithm of the ini...
Tags: hive
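The excerpt cuts off mid-step, but the two steps it does name can be sketched in plain Python. The sizes below are hypothetical (128 MB is a common Hadoop block-size default), and rounding the log to the nearest integer is one reasonable reading of "find the nearest power of 2":

```python
import math

# Hypothetical sizes, both in MB.
table_size = 10 * 1024   # ~10 GB table
block_size = 128         # common HDFS block-size default

# Step 1: initial estimate of the bucket count.
expected_buckets = table_size / block_size   # 10240 / 128 = 80.0

# Step 2: snap the estimate to the nearest power of 2.
bucket_count = 2 ** round(math.log2(expected_buckets))  # 2**round(6.32) = 2**6 = 64
print(bucket_count)
```

The power-of-2 count would then go into the table DDL's `CLUSTERED BY (...) INTO n BUCKETS` clause.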
Vaishnave Subbramanian · vaishnave.page · Mar 16, 2024
Dabbling with Spark Essentials
Embarking on the journey of understanding Apache Spark marks the beginning of an exciting series designed for both newcomers and myself, as we navigate the complexities of big data processing together. Apache Spark, with its unparalleled capabilities...
Series: Dabbling with Apache Spark · Tags: spark
Gaurav Vishwakarma · gauravoncloud.hashnode.dev · Feb 26, 2024
Unveiling the Powerhouse: Data Engineering in the Digital Epoch
In the vast landscape of technology, where information reigns supreme, the unsung hero orchestrating the symphony of data is none other than data engineering. This field, often hidden behind the glitz of data science and analytics, plays a crucial ro...
Tags: data-engineering
Deepankar Yadav · bytesofdeepankar.hashnode.dev · Feb 26, 2024
Join Strategies in Apache Spark
Although we are quite familiar with join operations in Spark, did you know that Spark has some inbuilt tricks for doing joins efficiently without letting you know — unless you tame Spark and make it do things the way you want? PREREQUISITE: TERMINOLOGY:...
Tags: spark
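The excerpt stops before naming the strategies, but one of Spark's best-known inbuilt join tricks — the broadcast hash join — can be sketched as a toy model in plain Python (this is not Spark's implementation, and the rows are made up): the small side is "broadcast" as an in-memory hash map, and the large side probes it row by row.

```python
# Toy broadcast hash join: (key, value) rows, hypothetical data.
small = [(1, "a"), (2, "b")]                      # small side, fits in memory
large = [(1, "x"), (2, "y"), (3, "z"), (1, "w")]  # large side, streamed

# Build phase: hash the broadcast (small) side once.
lookup = {k: v for k, v in small}

# Probe phase: each large-side row does an O(1) lookup; no shuffle needed.
joined = [(k, lv, lookup[k]) for k, lv in large if k in lookup]
print(joined)  # [(1, 'x', 'a'), (2, 'y', 'b'), (1, 'w', 'a')]
```

This is why broadcasting a small table avoids the shuffle a sort-merge join would require.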
KALINGA SWAIN · kalingaswain.hashnode.dev · Feb 11, 2024
EMR with EKS
Hi, welcome to the event! Amazon EMR is like the rockstar of cloud big data. Picture this: petabyte-scale data parties, interactive analytics shindigs, and even machine learning raves — all happening with cool open-source crews like Apache Spark, Apach...
Tags: #AWSConsole
Ronil Rodrigues · ronilrodrigues.hashnode.dev · Feb 9, 2024
Apache Spark !!
Intro: Apache Spark has emerged as a leading big data processing framework due to its speed, ease of use, and versatility. At the heart of Spark are its core functionalities and commands, which enable users to perform a wide range of data processing tasks e...
Tags: spark
Anees Shaikh · aneesshaikh.hashnode.dev · Jan 18, 2024
Replace withColumn with withColumns to speed up your Spark applications
Disclaimer: the views and opinions expressed in this blog post are my own. Practical takeaways: The .withColumn() function in Spark has been a popular way of adding and manipulating columns. In my experience, it is far more common than adding columns ...
Tags: dataengineering
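The usual argument for `.withColumns()` (available since Spark 3.3) over repeated `.withColumn()` calls is that each `.withColumn()` adds another projection to the query plan, while one `.withColumns()` batches all the new columns into a single projection. A toy plan class in plain Python (not the real Spark API) makes the difference in plan depth concrete:

```python
# Toy model of plan growth — NOT PySpark; only the node count is modeled.
class Plan:
    def __init__(self, depth=0):
        self.depth = depth  # number of projection nodes in the "plan"

    def with_column(self, name, expr):
        # One call = one new projection node, like DataFrame.withColumn.
        return Plan(self.depth + 1)

    def with_columns(self, cols):
        # Many columns, still one projection node, like DataFrame.withColumns.
        return Plan(self.depth + 1)

p = Plan()
for i in range(100):
    p = p.with_column(f"c{i}", i)                          # 100 nodes
q = Plan().with_columns({f"c{i}": i for i in range(100)})  # 1 node
print(p.depth, q.depth)  # 100 1
```

In real PySpark the chained version forces the analyzer to re-walk an ever-deeper plan, which is the slowdown the article describes.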
Kiran Bhandari · kiranbhandari.hashnode.dev · Jan 16, 2024
Incremental Loading using Spark Partition
We can achieve incremental loading using Spark data partitions. I will demonstrate with a simple example here, and you can adapt it to your business needs. Let's consider the following data frame. (This could be your CSV, or any formatt...
Tags: #incremental-loading
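The excerpt stops before its example, but the core of any incremental load — keep a high-water mark, pick up only rows newer than it, then advance the mark — can be sketched in plain Python (the rows, column names, and dates here are hypothetical, not from the article):

```python
from datetime import date

# Hypothetical source rows with an update-date column (the partition key
# in the Spark version would typically be this date).
rows = [
    {"id": 1, "updated": date(2024, 1, 10)},
    {"id": 2, "updated": date(2024, 1, 14)},
    {"id": 3, "updated": date(2024, 1, 16)},
]

watermark = date(2024, 1, 14)  # high-water mark saved by the previous run

# Incremental slice: only rows strictly newer than the watermark.
increment = [r for r in rows if r["updated"] > watermark]

# Advance the watermark for the next run.
new_watermark = max(r["updated"] for r in increment)
print([r["id"] for r in increment], new_watermark)  # [3] 2024-01-16
```

With date-partitioned data, the same filter prunes to just the new partitions instead of scanning the whole table.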
Deepankar Yadav · bytesofdeepankar.hashnode.dev · Jan 11, 2024
Stop "WithColumn" Chain
Attention, PySpark wranglers! We've uncovered a hidden culprit that's been slowing down your DataFrames without you even knowing it. It's time to shed light on the stealthy menace of chaining withColumn calls. Let's dive into the issue and reveal how...
Tags: apache
Hitek · hitek.hashnode.dev · Jan 9, 2024
Delta table with change data capture (CDF)
What is CDF: The Change Data Feed (CDF) feature allows Delta tables to track row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records "change events" for all the data written into the table. This include...
Tags: delta table