Mehul Kansal · mehulkansal.hashnode.dev · Jul 15, 2024
Week 9: Lending Club Project - Part 1 💸
Hey there, data engineering folks! 👋 In this two-part series on the Lending Club project, we delve into the process of creating, cleaning and transforming various datasets derived from a large dataset containing over 2 million records. The first par...
dataengineering

Victor Ohachor for Zero-Stack Engineer · zerostackengineer.hashnode.dev · Jul 9, 2024
Optimizing Query Performance in Snowflake
Overview: Snowflake's architecture is made up of three (3) layers: storage layer, compute layer, and cloud services layer. The cloud services layer has query optimization built into it, which helps whip out some caveats in your queries to yield quicke...
ByteBite Wisdom · snowflake

Victor Ohachor for Zero-Stack Engineer · zerostackengineer.hashnode.dev · Jul 7, 2024
Snowflake SHOW
Similar to MySQL, you can use SHOW DATABASES to get information about your databases in Snowflake. While I love PostgreSQL, it's not as straightforward. In Postgres, you need to query the pg_database table or use \l or \list in the psql shell. Databa...
ByteBite Wisdom · dataengineering

Constantin Lungu · datawise.dev · Jul 4, 2024
Ingestion-time partitioning in BigQuery
Have you ever used ingestion-time partitioning in BigQuery? It's a separate type of partitioning that distributes rows into partitions based on the time they land in BQ. Once such a table is defined, you can query the pseudocolumns _PARTITIONDATE ...
Practical BigQuery · bigquery

Mehul Kansal · mehulkansal.hashnode.dev · Jun 17, 2024
Week 5: PySpark Playground - Aggregates and Windows 🏀
Hey there, fellow data engineers! This week's blog aims to delve into the various methods of accessing columns in PySpark, explore the different types of aggregate functions, and understand the utility of window functions. By the end of this post, yo...
Data Science

Mehul Kansal · mehulkansal.hashnode.dev · Jun 10, 2024
Week 4: The Power of Caching - Boosting Spark Data Processing 🚀
Introduction: Hey data enthusiasts! This week, we're exploring one of the most powerful features of Apache Spark: Caching. We'll see various caching strategies, from basic DataFrame caching to advanced techniques like using persist with customizable s...
spark

Ahmed Shaaban · ahmedshaaban1999.hashnode.dev · May 22, 2024
Observability: What you need to know and more
The word "Observability" (or its abbreviation o11y) gets thrown around a lot these days. So what is it? And is it really just a fancy word for "Monitoring" that was made trendy just because of those "Thought Leaders"? Monitoring is a well-explored concept. Y...
36 reads · observability

Alex Merced · alexmerced.hashnode.dev · May 15, 2024
3 Reasons Data Engineers Should Embrace Apache Iceberg
Data engineers are constantly seeking ways to streamline workflows and enhance data management efficiency. Apache Iceberg, a high-performance table format for huge analytic datasets, has emerged as a game-changer in the field. By offering powerful fe...
apacheiceberg

Turboline LTD · blog.turboline.ai · May 6, 2024
AI-Based Data Transformation: A Comparison of LLM-Generated PySpark Code (Using GPT-4)
Welcome to the first installment of our series comparing data transformation code generated by various large language models (LLMs). In this series, we aim to explore how different AI models approach data engineering and analytical tasks under ident...
42 reads · chatgpt

Sven Eliasson · sveneliasson.de · Apr 25, 2024
2.5x Performance: PostgreSQL to Clickhouse without Kafka with MaterializedPostgreSQL
TLDR: Stream PostgreSQL data to ClickHouse without Kafka using MaterializedPostgreSQL for real-time data replication and optimization, achieving significant data compression and 2.5x query performance improvements. Dealing with billions of sensor rec...
ClickHouse