#pyspark
Overview: In a previous article, we covered the basics of Apache Spark. Now that the foundational concepts and workflow architecture of Spark are covered, we'll be exploring PySpark and its convent…
Test Driven Development is a software development practice in which a failing test is written first, and code is then written to make that test pass. It enables the development of automated tests…
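A minimal sketch of that red-green cycle using pytest (the function and test names here are invented for illustration, not taken from the article):

```python
# Step 1 -- write the test first. Before add_suffix() exists, running
# pytest fails with a NameError: that is the "red" phase of TDD.
def test_add_suffix():
    assert add_suffix("sales") == "sales_clean"

# Step 2 -- write just enough code to make the test pass ("green"),
# then refactor with the test as a safety net.
def add_suffix(name: str) -> str:
    return f"{name}_clean"
```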
Introduction: Apache Spark is one of the most widely used distributed computing frameworks that allow for fast and efficient processing of large datasets. It provides various APIs to process data in di…
PySpark SQL is a powerful module for processing structured data using SQL queries in the Python programming language. In addition to the basic functionality, PySpark SQL also provides several advanced fea…
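As a minimal sketch of the basic workflow (the view name and sample rows are invented for illustration): register a DataFrame as a temporary view, then query it with ordinary SQL.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Register a small DataFrame as a temporary view, then query it with SQL.
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 40").show()
```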
ETL (Extract, Transform, Load) is the process of integrating data from various sources, transforming it into a format that can be analysed, and loading it into a data warehouse for business intelligence…
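A minimal PySpark sketch of those three stages (the file paths, column names, and aggregation are hypothetical placeholders, not from the article):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Extract: read raw data (placeholder path).
raw = spark.read.option("header", True).csv("/data/raw/orders.csv")

# Transform: fix types, then aggregate per customer.
orders = (raw
          .withColumn("amount", F.col("amount").cast("double"))
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_spent")))

# Load: write the result to the warehouse layer (placeholder path).
orders.write.mode("overwrite").parquet("/data/warehouse/customer_totals")
```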
👋 Jupyter Notebook is a powerful tool that allows us to write and run code in an interactive environment. It is widely used by data scientists, researchers, and developers to explore and analyze data, build and test machine learning models…
(Note: this is adapted from my talk at 2021 Scale by the Bay, Location-Based Data Engineering for Good) If you are a data scientist, chances are you are coding in Python and most likely using pandas. You…
Here we are going to learn Spark memory management. Before starting, we need to understand the following points clearly: one core will process one partition of data at a time, and a Spark partition is equivalent…
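A rough sketch of that core/partition relationship (the 4-core local master and the partition count are assumptions chosen for illustration):

```python
from pyspark.sql import SparkSession

# With 4 cores and 8 partitions, each core processes one partition at a
# time, so the 8 tasks run in two waves of 4.
spark = (SparkSession.builder
         .master("local[4]")              # 4 cores
         .appName("partitions-demo")
         .getOrCreate())

df = spark.range(1_000_000).repartition(8)
print(df.rdd.getNumPartitions())          # 8 partitions -> 8 tasks
```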
☑️ expr() is a SQL function in PySpark to execute SQL-like expressions.
🔵 Syntax: expr(str)
☑️ It will take a SQL expression as a string argument and pe…
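A minimal sketch of expr() in use (the DataFrame and column names are invented for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("expr-demo").getOrCreate()
df = spark.createDataFrame([("alice", 3000), ("bob", 4000)], ["name", "salary"])

# expr() parses a SQL-like expression passed as a string.
df.withColumn("bonus", expr("salary * 0.10")).show()
df.filter(expr("salary > 3500")).show()
```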
Simple Random Sampling or sample():
☑️ In simple random sampling, we pick records randomly and every record has an equal chance of being picked.
🔵 Syntax: sample(withRepl…
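A short sketch of sample() with its full argument list, sample(withReplacement, fraction, seed) (the DataFrame here is invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sample-demo").getOrCreate()
df = spark.range(100)

# Take roughly 10% of the rows without replacement; a fixed seed makes
# the draw repeatable across runs.
sampled = df.sample(withReplacement=False, fraction=0.1, seed=42)
print(sampled.count())   # approximately 10, not exactly -- fraction is probabilistic
```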