Nextwebb · nextwebb.hashnode.dev · Jan 10, 2025 · Data Engineering Foundations: A Hands-On Guide · Hey there! If you’ve been curious about data engineering, this guide will help you understand the basics and walk you through practical examples. Whether it’s setting up storage, processing data, automat... · 31 reads · ETLPipelines
Rahul Das · schemasensei.hashnode.dev · Jan 7, 2025 · Unlocking Real-Time Data with Change Data Capture (CDC) · In this guide, we will cover CDC, its importance, and the setup of a CDC stack using Kafka, Debezium, and other services. Additionally, we will configure a PostgreSQL connector using the Confluent Control Center web UI to capture changes from a Postg... · 26 reads · kafka
Rajanand Ilangovan · blog.rajanand.org · Dec 28, 2024 · Predicate Pushdown in Spark · What is predicate pushdown? Predicate pushdown is an optimization technique in Apache Spark where the filtering logic (predicates) is pushed closer to the data source. Instead of Spark loading all the data into memory and applying the filters, the fi... · spark
Renjitha K · renjithak.hashnode.dev · Dec 25, 2024 · Crafting a Seamless Data Journey: Navigating the Medallion Architecture Pipeline · Have you ever thought about how modern data systems manage data efficiently? The Medallion Architecture is a smart, structured approach that ensures your data is reliable, scalable, and ready to use. Let’s dive into it step-by-step and explore how it... · 49 reads · MedallionArchitecture
Sharath Kumar Thungathurthi · sharaththungathurthi.hashnode.dev · Dec 22, 2024 · Impala Interview Questions · Here are some questions and answers related to Impala, covering various aspects of its usage, architecture, and functionality. These are useful for interview preparation or as a study guide. 1. What is Impala? Answer: Impala is an MPP (Massively Paral... · Impala
Varas Vishwanadhula · sparkcache.hashnode.dev · Dec 12, 2024 · Unlocking the Power of Bucketing in Spark: Optimize Your Data Processing · Bucketing is the process of shuffling and sorting data and storing it in a physical location. It follows that bucketing is useful when we need the data to be shuffled and sorted. The most general case where the data... · spark
Sandeep Pawar · fabric.guru · Dec 11, 2024 · Upgrade Fabric Workspaces To The Latest GA Runtime · It’s always a good idea to use the latest GA runtime for the Spark pool in Fabric. Unless you change it manually, the workspace will always use the previously set runtime even if a new version is available. To help identify the runtime that workspace... · 394 reads · spark
Shrinivas Vishnupurikar · data-engineering-with-shrinivasv73.hashnode.dev · Dec 11, 2024 · The Spark Eco-System with not very often elaborated Components · The article’s title is justified because when I was studying Spark’s ecosystem, I had a lot of questions about which components are siblings to each other and which are nested within one another, and also had “Does such a component actually exist... · #apache-spark
Tanupriya Singh · tanupriya.com · Dec 9, 2024 · A Beginner's Guide to Spark: Insights from an MLE · Introduction to Spark: Spark is a distributed computing system designed for large-scale data processing. Here are key reasons why Spark is suitable for ML pipelines: High Volume of Data: Many ML problems require processing large datasets, which mig... · 176 reads · spark
Sandeep Pawar · fabric.guru · Dec 7, 2024 · Delta Lake Tables For Optimal Direct Lake Performance In Fabric Python Notebook · In my last blog, I showed how to use Polars in a Fabric Python notebook to read, write and transform data. What I did not cover was how to write a Delta Lake table in a Python notebook that’s optimal for DirectLake performance. All engines in Fabric ... · 2 likes · 1.9K reads · microsoftfabric