Raghuveer Sriraman · raghuveer.me · a day ago
Getting answers from data using PySpark
This post attempts to document a small part of a Data Engineer's workflow, along with some techniques that help answer data questions from a dataset. On the technical side, we will deal with nested JSON data, touch upon data cleaning and data explo...
Tagged: spark
Pinei · pinei.hashnode.dev · Oct 1, 2023
Running PySpark in JupyterLab on a Raspberry Pi
While researching materials for installing a JupyterLab instance with Spark support (via PySpark), I noticed a lot of outdated content. That was until I came across an up-to-date Docker image provided by the Jupyter Docker Stacks project. https://jup...
Tagged: Jupyter Notebook
Musaib Shaikh · musaib.hashnode.dev · Sep 28, 2023
Transforming Semistructured Data with PySpark and Storing in Hive Using Cloudera ETL
In the vast landscape of data engineering and analysis, one common challenge is transforming raw semi-structured or unstructured data into meaningful insights. In this blog, we will transform semistructured data, i.e. a CSV file, into a structure...
Tagged: Cloudera
Harshita Chaudhary · harshita.hashnode.dev · Sep 9, 2023
Approaches to Remove Duplicate Rows
Create a data frame with columns Name and Age:
data = [("Alice", 25), ("Bob", 30), ("Alice", 25), ("Kate", 22)]
cols = ["Name", "Age"]
df = spark.createDataFrame(data, cols)
Approach 1: using dropDuplicates()
# Approach 1
dedup_df = df.dropDuplicates(sub...
Tagged: data
Anish Machamasi · macha7.hashnode.dev · Sep 5, 2023
Spark Architecture
A cluster is a collection of nodes. A Databricks cluster has one driver node and one or more worker nodes. A node is an individual machine; in the cloud, it acts as a virtual machine. Each of the Azure virtual machines generally has at le...
Tagged: PySpark
Sandeep Pawar · fabric.guru · Aug 13, 2023
Recursively Loading Files From Multiple Workspaces And Lakehouses in Fabric
Imagine a scenario in which you are collaborating with two colleagues who have stored sales data in two separate lakehouses, each within a different workspace and under distinct names but with the same schema. The data is saved in folders containing numerous ...
Tagged: lakehouse
Anuoluwapo Balogun · pinkdatahub.hashnode.dev · Jul 16, 2023
Loading Data from MongoDB Database with PySpark
The last database we will connect to with PySpark is MongoDB. MongoDB is a NoSQL database that usually outputs data in JSON format. We start by installing the MongoDB driver for Python:
pip install pymongo
To set up MongoDB you can download the c...
Tagged: Data Science
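As a sketch of what such a read can look like with the MongoDB Spark Connector (10.x series): the URI, database, and collection names below are placeholders, not values from the post, and the actual `load()` call is left in a comment because it needs a live MongoDB instance and the connector on Spark's classpath.

```python
# Reader options for the MongoDB Spark Connector (10.x).
# All values here are illustrative placeholders.
mongo_opts = {
    "connection.uri": "mongodb://localhost:27017",
    "database": "mydb",
    "collection": "sales",
}

# With a running MongoDB and the connector available, the options
# would be applied like this:
#   df = (spark.read.format("mongodb")
#         .options(**mongo_opts)
#         .load())
```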
Anuoluwapo Balogun · pinkdatahub.hashnode.dev · Jul 16, 2023
Loading Data from MySQL Database with PySpark
Today, we are going to load data from a MySQL database with PySpark; previously, I have written on both PostgreSQL and Microsoft SQL Server. We start by installing the MySQL connector:
pip install mysql-connector-python
Next, we start the Spark session. fr...
Tagged: PySpark
Anuoluwapo Balogun · pinkdatahub.hashnode.dev · Jul 8, 2023
Loading Data from Microsoft SQL Server Database with PySpark
In this article, we are going to learn how to load a dataset from Microsoft SQL Server with PySpark. We start by downloading the pyodbc library with pip:
pip install pyodbc
We also need to set up our Microsoft SQL Server localhost server. This guide is ...
Tagged: PySpark
Anuoluwapo Balogun · pinkdatahub.hashnode.dev · Jul 6, 2023
Loading Data From PostgreSQL to PySpark
What is PySpark? PySpark is the Python API for Apache Spark; it allows real-time and large-scale data processing in Python. It has its own PySpark shell to interactively analyze your data. PySpark supports all of Spark's features: Spark SQL, DataFra...
Tagged: PySpark
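The PostgreSQL, MySQL, and SQL Server posts in this series all follow the same pattern: Spark's built-in JDBC data source. A sketch of the reader options for the PostgreSQL case is below; the host, database, table, and credentials are placeholders, and the `load()` call is left in a comment because it needs a live database and the PostgreSQL JDBC driver on Spark's classpath.

```python
# JDBC options for PySpark's DataFrameReader.
# Every value here is an illustrative placeholder.
jdbc_url = "jdbc:postgresql://localhost:5432/mydb"
jdbc_opts = {
    "url": jdbc_url,
    "dbtable": "public.customers",
    "user": "postgres",
    "password": "secret",
    "driver": "org.postgresql.Driver",
}

# With the driver jar available to Spark, the read is:
#   df = spark.read.format("jdbc").options(**jdbc_opts).load()
```

Swapping the `url` and `driver` values (e.g. `jdbc:mysql://...` with `com.mysql.cj.jdbc.Driver`) adapts the same code to the other databases covered in the series.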