Spark With Python (PySpark)
PySpark is a Spark library written in Python that lets you run Python applications using Apache Spark capabilities; with PySpark, applications run in parallel on a distributed cluster (multiple nodes). PySpark is the Python API for Apache Spark, an analytical processing engine for large-scale, powerful distributed data processing and machine learning applications.
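As a quick illustration of what that looks like in practice, here is a minimal sketch of a PySpark program (it assumes PySpark is already installed; the data and app name are made-up examples):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to Spark.
spark = SparkSession.builder.appName("HelloPySpark").getOrCreate()
sc = spark.sparkContext

# The list is split into partitions and the squares are computed
# in parallel across the available cores / cluster nodes.
squares = sc.parallelize(range(10)).map(lambda x: x * x).collect()
print(squares)

spark.stop()
```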
PySpark Installation
Install Jupyter Notebook. [pip install jupyter notebook]
Install Java 8
To run PySpark applications, you need Java 8 or a later version, so download the JDK from Oracle and install it on your system.
Post-installation, set the JAVA_HOME and PATH variables.
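These are normally set through the Windows system settings, but if you prefer to configure them from within a Python session instead, a minimal sketch (assuming an example JDK install path; adjust to where your JDK actually lives) looks like this:

```python
import os

# Assumed example install path; change it to your actual JDK location.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jdk1.8.0_201"
# Prepend the JDK's bin folder to PATH so java.exe can be found.
os.environ["PATH"] = os.environ["JAVA_HOME"] + r"\bin;" + os.environ["PATH"]
```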
Install Apache Spark
Download Apache Spark from the Apache Spark download page.
After the download completes, extract the archive using 7zip and copy the extracted folder spark-3.2.1-bin-hadoop2.7 to C:/apps.
Setup winutils.exe
Download the winutils.exe file from the winutils repository and copy it to the %SPARK_HOME%\bin folder. winutils.exe is different for each Hadoop version, so download the right version from [github.com/steveloughran/winutils].
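As with JAVA_HOME, the Spark and Hadoop locations can also be exposed to a Python session through environment variables; a minimal sketch, assuming the install location from the steps above (adjust the paths to your layout):

```python
import os

# Assumed paths matching the install steps above.
os.environ["SPARK_HOME"] = r"C:\apps\spark-3.2.1-bin-hadoop2.7"
# winutils.exe was copied into %SPARK_HOME%\bin, so HADOOP_HOME can point there too.
os.environ["HADOOP_HOME"] = os.environ["SPARK_HOME"]
os.environ["PATH"] = os.environ["SPARK_HOME"] + r"\bin;" + os.environ["PATH"]
```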
PySpark shell
Now open the command prompt and type the pyspark command to run the PySpark shell.
The PySpark shell also creates a Spark context web UI, which by default can be accessed at [localhost:4040].
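Inside the shell, a SparkSession (spark) and SparkContext (sc) are already created for you, so you can run commands directly; for example:

```python
# Inside the PySpark shell, `spark` and `sc` already exist.
spark.range(100).count()   # runs a simple job that shows up in the web UI
spark.range(5).show()      # prints a small DataFrame of ids 0..4
```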
Jupyter Notebook
To write PySpark applications you need an IDE; there are dozens of IDEs to work with, and I chose to use Jupyter Notebook.
Now, set the following environment variables so that running the pyspark command launches Jupyter Notebook instead of the plain shell: PYSPARK_DRIVER_PYTHON=jupyter and PYSPARK_DRIVER_PYTHON_OPTS=notebook.
Now let’s start the Jupyter Notebook.
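In a new notebook cell you can then create a SparkSession (getOrCreate() will reuse one if it already exists) and run a quick sanity check; a minimal sketch:

```python
from pyspark.sql import SparkSession

# Entry point for DataFrame and SQL functionality, running locally on all cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("JupyterPySpark") \
    .getOrCreate()

# Quick sanity check: build a tiny DataFrame and display it.
spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"]).show()
```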
DataFrame from external data sources