Shrinivas Vishnupurikar · data-engineering-with-shrinivasv73.hashnode.dev · Dec 11, 2024
The Spark Eco-System with not very often elaborated Components
The article's title is justified because, when I was studying the Spark ecosystem, I had a lot of questions about which components are siblings of each other and which are nested within one another, and I often wondered, "Does such a component actually exist...
#apache-spark
Jitender Kaushik · jitenderkaushik.com · Nov 8, 2024
Exploring Microsoft Fabric: Notebooks vs. Spark Jobs and How Java Fits In
Microsoft Fabric offers a versatile platform for data processing, blending interactive notebooks with powerful Spark jobs. While both tools serve different purposes, understanding their distinctions can optimize your workflows, especially with Java c...
microsoft fabric notebook
Alex Merced · alexmerced.hashnode.dev · Oct 19, 2024
Orchestrating Airflow DAGs with GitHub Actions - A Lightweight Approach to Data Curation Across Spark, Dremio, and Snowflake
Free Copy of Apache Iceberg: The Definitive Guide · Free Apache Iceberg Crash Course · Iceberg Lakehouse Engineering Video Playlist
Maintaining a persistent Airflow deployment can often add significant overhead to data engineering teams, especially when ...
GitHub
Sharath Kumar Thungathurthi · sharaththungathurthi.hashnode.dev · Oct 19, 2024
Unlock PySpark's Power: Techniques for Parallelizing
Conceptual Questions: What is parallelize in PySpark? parallelize is a method in PySpark used to convert a Python collection (like a list or a tuple) into an RDD (Resilient Distributed Dataset). This allows you to perform parallel processing on the ...
PySpark
Gyuhang Shim · plto001.hashnode.dev · Oct 14, 2024
Lambda vs Kappa Architecture in Data Pipeline (Korean)
Lambda Architecture components: the Batch Layer periodically processes large volumes of historical data (e.g., daily or hourly), guaranteeing high accuracy and data completeness while handling complex data transformations. The Speed Layer processes real-time data to deliver low-latency results. When the Batch Layer processes the same data...
kappa architecture
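The excerpt describes the Lambda Architecture's split between a batch layer (accurate, periodic) and a speed layer (fresh, low-latency). A toy sketch of the serving side, where the two views are merged at query time; all names and numbers here are hypothetical, not from the article:

```python
# Toy Lambda-architecture serving sketch: a query merges the (stale but
# complete) batch view with the (fresh but partial) speed view.

batch_view = {"clicks": 1000}   # recomputed daily/hourly over historical data
speed_view = {"clicks": 42}     # incremental real-time counts since last batch run

def serve(metric: str) -> int:
    """Answer a query by combining both layers' views."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(serve("clicks"))  # 1042
```

When the next batch run completes, its view absorbs the data the speed layer counted, and the speed view is reset; Kappa architecture avoids this dual-path complexity by replaying a single stream instead.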
Gyuhang Shim · plto001.hashnode.dev · Oct 10, 2024
Trino (TSF) Installation and Configuration
Understanding the Clear Reasons for Using Trino: since the installation and configuration methods of Trino vary depending on how you use it, it's essential to first clearly understand the reasons for using it. Determine if you need to fetch data from...
ZGC
Gyuhang Shim · plto001.hashnode.dev · Sep 27, 2024
Trino (TSF) Installation and Configuration (Korean)
Clearly identify why you are using Trino: because Trino's installation and configuration differ depending on how it is used, the first step is to clarify your reasons for using it. Determine whether you need to pull data from diverse data sources (MySQL, PostgreSQL, HDFS, Amazon S3, Cassandra, Kafka, etc.) and analyze it through a single unified query. Business Intelligence Platform...
ZGC
Ilham Oulakbir · ensuringdataqualityandgovernance.hashnode.dev · Sep 23, 2024
Best Practices for Data Engineers: Ensuring Data Quality and Governance
Introduction: In the world of data engineering, ensuring the quality and governance of data is as important as building robust pipelines and scalable architectures. Without proper governance and quality measures, the data you work with can lead to in...
Data Tools
Vishal Barvaliya · vishalbarvaliya.hashnode.dev · Sep 18, 2024
Why Does the "Executor Out of Memory" Error Happen in Apache Spark?
Apache Spark is a tool used to process large amounts of data. It's fast, scalable, and great for big data tasks. However, sometimes when working with Spark, you might run into a common issue: the "Executor Out of Memory" error. If you've seen this er...
#apache-spark
Vishal Barvaliya · vishalbarvaliya.hashnode.dev · Sep 17, 2024
How to Remove Leading Zeros from a Column in SQL
When working with SQL databases, you might encounter numbers that contain leading zeros, such as 000123 or 0000456. While these zeros don't change the actual value of the number, they can make your data look unclean or cause issues when you're using ...
SQL
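The excerpt is truncated before the techniques themselves, but the two standard approaches are casting the text to a numeric type or trimming the `0` characters from the left. A sketch using SQLite via Python's `sqlite3` (the table and column names are made up for illustration; exact function availability varies by SQL dialect):

```python
import sqlite3

# In-memory database with a text column holding zero-padded values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codes (raw TEXT)")
conn.executemany("INSERT INTO codes VALUES (?)", [("000123",), ("0000456",)])

# Approach 1: CAST to INTEGER drops leading zeros, but only works
# when every value is purely numeric.
rows = conn.execute("SELECT CAST(raw AS INTEGER) FROM codes").fetchall()
print(rows)        # [(123,), (456,)]

# Approach 2: LTRIM strips leading '0' characters while keeping the
# value as text, which is safer for alphanumeric codes.
rows_text = conn.execute("SELECT LTRIM(raw, '0') FROM codes").fetchall()
print(rows_text)   # [('123',), ('456',)]

conn.close()
```

One caveat with `LTRIM`: a value that is all zeros (e.g. `'000'`) trims down to an empty string, whereas `CAST` would yield `0`, so pick the approach that matches how your data should behave.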