Up and running with PySpark on Windows


If you, too, are coming from an R (or Python/pandas) background like me, you probably feel right at home processing CSV files. Whether it is time series sensor data, random CSV files, or something else, R and pandas can take it! And if you can step away from that mindset for a moment, you will find that Spark goes to great lengths to make R and pandas users feel welcome.
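
To give you a taste, reading a CSV in PySpark feels a lot like pandas. A minimal sketch, with a made-up file path, once you have everything below set up:

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("csv-taste").getOrCreate()

# header=True takes column names from the first row; inferSchema=True
# guesses column types, much like pandas.read_csv does by default
df = spark.read.csv("C:/Code/data.csv", header=True, inferSchema=True)
df.show(5)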

Lately I have been working extremely closely with AWS EMR. I am not talking about a couple of trivial notebooks with a 5×5 data frame of fruit names; the data set I am working with is tens of gigabytes stored away in the cloud, and it is far from clean. I need an ETL pipeline to retrieve historical information, which I then use to train my machine learning models. How I run predictive analysis on new incoming data is probably a post (or series of posts) for a later date. Today, I want to get you up and running with PySpark in no time!

Why am I writing this post?

There is already a plethora of blog posts and forum threads on installing PySpark on Windows, but they mainly focus on setting up just PySpark. What if you want to use Anaconda or Jupyter Notebooks, or do not wish to use the Oracle JDK? This post picks up where most of that content leaves off. I want to help you connect the dots and save you a lot of time, agony, and frustration, whether you are new to Windows, to Spark/PySpark, or to development in general.

This process is as easy as ABC!

Benefit

The main benefit of the approach I suggest in this post is that you do not have to install anything (for the most part), and you can switch Spark, Hadoop, and Java versions in seconds!

Let’s get started!

A ) What do you need to download?

  1. Microsoft Build of OpenJDK
  2. Anaconda
  3. Apache Hadoop
  4. Winutils (GitHub: cdarlint/winutils – winutils.exe, hadoop.dll, and hdfs.dll binaries for Hadoop on Windows)
    • match the version to the Hadoop version above
  5. Apache Spark

B ) Set up: Extract & install

  1. Install Anaconda in C:\apps.
    • I recommend selecting the checkbox shown in the image below while you are installing Anaconda.
  2. Extract the rest to C:\apps (see the tar sketch below).
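
For the extraction itself, Windows 10 ships with a bsdtar-based tar command (build 1803 and later), so you can unpack the downloaded archives without extra tools. A sketch, assuming the archives sit in your Downloads folder and use the version numbers I use later in this post (run these in Command Prompt so %USERPROFILE% expands):

tar -xzf %USERPROFILE%\Downloads\spark-3.1.1-bin-hadoop2.7.tgz -C C:\apps
tar -xzf %USERPROFILE%\Downloads\hadoop-2.7.2.tar.gz -C C:\apps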

Finally, we need to create an Anaconda virtual environment to complete the setup. To help you get started quickly, I have created a virtual environment definition that you can import on your machine.

Note:

  • If you want, you can skip this next, optional step and make the necessary adjustments from this point forward. The list in dev38.yml is by no means comprehensive; you might end up needing more packages, but it should give you the basics for writing code in an IDE with linting and formatting, and for running PySpark applications on your machine.
  • If you do not want to call the environment dev38, just change the first line (and the prefix) to something else. To reflect this change, issue the conda env list command to find the new environment path, and adjust the %PYSPARK_PYTHON% user environment variable in section C accordingly.

Copy the following block as is, and save it as C:\apps\dev38.yml.

name: dev38
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8
  - pandas
  - numpy
  - pyspark
  - jupyter
  - pylint
  - fastparquet
  - autopep8
  - findspark
prefix: C:\apps\Anaconda3\envs\dev38

Now, launch the Anaconda Prompt, PowerShell Core, PowerShell, Windows Terminal, or Command Prompt, and issue the following command (see the conda.io docs) to create an environment from the dev38.yml file you just created:

conda env create -f dev38.yml
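
To verify that the environment was created and that it works, you can list your environments and try activating the new one; these are standard conda commands:

conda env list
conda activate dev38
python --version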

C ) Environment Variables

The final step is to set up your Windows 10 environment variables. You do not need to create system variables; user variables are enough.

[Screenshot: a glimpse of the user environment variables on my Windows 10 machine, showing the Path entries to set up. The Anaconda installer created some of the entries; I added the rest myself. Use the "Move Down" button so the entries end up in the order shown in the table below.]

Assuming you have installed and extracted it all in C:\apps, you need to make the following changes to your Windows 10 user environment variables.

HADOOP_HOME = C:\apps\hadoop-2.7.2
JAVA_HOME = C:\apps\jdk-11.0.10+9
Path = (the following entries, in this order)
  C:\apps\Anaconda3\envs\dev38
  C:\apps\Anaconda3\envs\dev38\Library\mingw-w64\bin
  C:\apps\Anaconda3\envs\dev38\Library\usr\bin
  C:\apps\Anaconda3\envs\dev38\Library\bin
  C:\apps\Anaconda3\envs\dev38\Scripts
  C:\apps\Anaconda3\envs\dev38\bin
  C:\apps\Anaconda3\condabin
  C:\apps\Anaconda3
  C:\apps\Anaconda3\Library\mingw-w64\bin
  C:\apps\Anaconda3\Library\usr\bin
  C:\apps\Anaconda3\Library\bin
  C:\apps\Anaconda3\Scripts
  %JAVA_HOME%\bin
  %SPARK_HOME%\bin
  %HADOOP_HOME%\bin
PYSPARK_DRIVER_PYTHON = jupyter
PYSPARK_DRIVER_PYTHON_OPTS = notebook
PYSPARK_PYTHON = C:\apps\Anaconda3\envs\dev38\python.exe
SPARK_HOME = C:\apps\spark-3.1.1-bin-hadoop2.7
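
If you prefer the command line to the GUI dialog, you can set the single-value variables with Windows' built-in setx command. A sketch using the paths above; I would still edit the long Path variable through the GUI, because setx truncates values longer than 1024 characters:

setx HADOOP_HOME "C:\apps\hadoop-2.7.2"
setx JAVA_HOME "C:\apps\jdk-11.0.10+9"
setx SPARK_HOME "C:\apps\spark-3.1.1-bin-hadoop2.7"
setx PYSPARK_PYTHON "C:\apps\Anaconda3\envs\dev38\python.exe"
setx PYSPARK_DRIVER_PYTHON "jupyter"
setx PYSPARK_DRIVER_PYTHON_OPTS "notebook"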

Copy the Winutils files (winutils.exe, hadoop.dll, and hdfs.dll) for your Hadoop version to

%HADOOP_HOME%\bin
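
From a Command Prompt, that boils down to something like the following; I am assuming here that you downloaded the cdarlint/winutils repository to C:\apps\winutils, and you should adjust the folder name to whichever Hadoop version the repository provides for your setup:

copy C:\apps\winutils\hadoop-2.7.2\bin\* "%HADOOP_HOME%\bin"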

Note:

If you prefer not to use PySpark in Jupyter Notebook, or if you would like to launch Jupyter Notebook manually, you can omit the PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS variables.
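
If you go that route, launching the notebook server yourself is straightforward, since jupyter is part of the dev38.yml environment:

conda activate dev38
jupyter notebook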

Note that I do not set the following environment variable. That is because I do not want to use the PySpark libraries shipped inside the Spark distribution; I want to use the pyspark package from the Anaconda dev38 virtual environment!

PYTHONPATH = %SPARK_HOME%\python;%SPARK_HOME%\python\lib\py4j-0.10.9-src.zip;%PYTHONPATH%
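
Incidentally, this is also why findspark is in dev38.yml: if you ever start a Python or Jupyter session that cannot import pyspark, findspark can wire Spark up at runtime by reading SPARK_HOME, instead of you maintaining PYTHONPATH by hand. A minimal sketch:

import findspark
findspark.init()  # reads SPARK_HOME and puts Spark's Python libraries on sys.path

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("findspark-demo").getOrCreate()
print(spark.version)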

That’s it, you are good to go!

Start Windows Terminal, Command Prompt, or PowerShell, and run the following commands one by one:

cd C:\
spark-shell
mkdir C:\Code
cd C:\Code
conda activate dev38
pyspark
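
The spark-shell command is a quick sanity check that Spark and Java are wired up correctly (exit it with :q before moving on). Because PYSPARK_DRIVER_PYTHON is set to jupyter, the pyspark command should open a Jupyter Notebook. As a final smoke test, run something like this in the first cell; a minimal sketch with made-up sample rows, where getOrCreate() picks up any session the pyspark launcher already started:

from pyspark.sql import SparkSession

# Reuse the session created by the pyspark launcher, or start a new one
spark = SparkSession.builder.getOrCreate()

# A tiny DataFrame to confirm that Spark jobs actually execute
df = spark.createDataFrame([(1, "apple"), (2, "banana")], ["id", "fruit"])
df.show()
print(spark.version)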

The end … ding!

