Install PySpark on Linux


You have probably heard about it; wherever there is talk about big data, the name eventually comes up. In layman’s terms, Apache Spark is a large-scale data processing engine. Apache Spark provides APIs for several languages to perform big data processing on its engine, and PySpark is the Python API, exposing the Spark programming model to Python applications. In my previous blog post, I talked about how to set it up on Windows. This time, we shall do it on Red Hat Enterprise Linux 8 or 7. You can follow along with a free AWS EC2 instance, your hypervisor (VirtualBox, VMware, Hyper-V, etc.) or a container on almost any Linux distribution. The commands we discuss below might change slightly from one distribution to the next.

Like most of my blog posts, my objective is to write a comprehensive post on a real-world, end-to-end configuration, rather than talking about just one step. On Red Hat 7, I ran into a problem along the way; I describe it, and how I worked around it, later in this post.


Environment

  • Linux (I am using Red Hat Enterprise Linux 8 and 7)
  • Java
  • Hadoop
  • Spark
  • Anaconda or pip based virtual python environment

I have broken out the process into steps. Please feel free to skip a section as you deem appropriate.


EPEL Repository

Extra Packages for Enterprise Linux, or EPEL for short, is a free, open-source, community-supported project from Fedora that strives to be a reliable source of up-to-date packages for Red Hat Enterprise Linux (RHEL), CentOS, Scientific Linux (SL), Oracle Linux (OL) and other enterprise distributions. Needless to say, it provides packages for both yum and dnf. If you want to know more, please check out their official wiki page and the blog post from Red Hat. We will download the rpm using wget and install it.

How to determine RHEL version?
// option 1
$ cat /etc/redhat-release
// option 2
$ cat /etc/os-release
// option 3 (RHEL 7.x or newer)
$ hostnamectl

Depending on the version you use, please see the wiki page to properly configure it.

// RHEL/CentOS 8 64 bit
$ wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
// RHEL/CentOS 7 64 bit
$ wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
// RHEL/CentOS 6 64 bit
$ wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
// RHEL/CentOS 6 32 bit
$ wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm

For example, you could install the rpm package after downloading it, or you could do it all in a single step.

// install the downloaded rpm file above
$ sudo rpm -ivh epel-release-6-8.noarch.rpm
// single step alternative
$ sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

To verify, let’s quickly run the repolist command, which lists the currently active repositories.

$ yum repolist | grep epel

If you need it, the EPEL configuration file is located at /etc/yum.repos.d/epel.repo. If you have followed the steps, you should have yum-config-manager installed on your machine at this point. Should you choose, you can use the following commands to disable or enable EPEL, respectively.

$ sudo yum-config-manager --disable epel
$ sudo yum-config-manager --enable epel

Update Linux software

If you are using AWS EC2 or equivalent, chances are you have a bare-bones base image and you are using SSH to log into your Linux machine. Let’s start by updating the base image. Depending on the number of packages, your machine configuration and internet speed, this might take a few minutes. The last three commands install the development toolchain (gcc, flex, autoconf, etc.) and the Python headers, which we will need to install fastparquet using pip, especially if you are on RHEL 7.x.

$ sudo yum clean all
$ sudo yum -y update
$ sudo yum groupinstall "Development tools"
$ sudo yum install gcc
$ sudo yum install python3-devel

Set Password

If you do not know your current username, you can find out by executing the whoami command:
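
$ whoami

Now, to set the password: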

$ sudo passwd <user_name>

To set root password, you can use the following command:

$ sudo su
# passwd

Install GUI and set it as default

If you are not planning to use a GUI on this machine, you can skip this step; it would save you time, internet data and storage space.

// download gui components
$ sudo yum groupinstall -y "Server with GUI"
// set gui as default
$ sudo systemctl set-default graphical.target 
$ sudo systemctl default
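
You can confirm the default target with:

$ systemctl get-default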

Set up VNC and Remote Desktop Protocol

In this step we install VNC server and Remote Desktop Protocol. If you would rather stay with SSH, skip to the next step.

$ sudo yum install -y xrdp tigervnc-server

Now, we can adjust the SELinux context of the xrdp binaries as below.

$ sudo chcon --type=bin_t /usr/sbin/xrdp
$ sudo chcon --type=bin_t /usr/sbin/xrdp-sesman

We are now ready to start xrdp.service, check its status and enable it.

$ sudo systemctl start xrdp.service
$ sudo systemctl status xrdp.service
$ sudo systemctl enable xrdp.service

The default port for the Remote Desktop Protocol is 3389. This port should be open through the firewall to make the machine reachable over RDP. To ensure we can actually log in, we have to do two things:

  • Check the port using ss or netstat
  • Add a firewall rule to “open” the port

If you want to make this port accessible through the internet, not just your local network, then you want to enable port forwarding. That is uncommon and outside the scope of this post; when doing so, please use extra caution.

$ netstat -antp
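
Or, using the newer ss utility:

$ sudo ss -tlnp | grep 3389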

To add this port permanently to firewall, we want to check if firewall service is running. If it isn’t, we want to kick-start it.

$ sudo systemctl status firewalld.service
// if not running
$ sudo systemctl enable firewalld.service
$ sudo systemctl start firewalld.service

Add port and then reboot your machine. When the machine boots up again, you should be able to use Remote Desktop.

$ sudo firewall-cmd --permanent --add-port=3389/tcp
$ sudo firewall-cmd --reload
// reboot
$ sudo reboot
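
After the machine comes back up, you can confirm the rule persisted:

$ sudo firewall-cmd --list-ports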

Tip: If your remote desktop connection terminates as soon as you log in …

If you are not facing any issues, please feel free to quickly skim through this section.

Start by checking the X session error log:

$ cat ~/.xsession-errors

In my case, on RHEL 7 with Anaconda, removing (commenting out) the Anaconda initialization block from ~/.bashrc solved the issue. Since I needed remote desktop, I removed Anaconda and switched to a pip based virtual environment.

Version Compatibility Warning

A lot of questions end up on the forums whose root cause is a version mismatch. Please do your due diligence to make sure the Java, Spark, Hadoop and Python versions are compatible with each other. For example, if you are using Spark 2.4.4, then install OpenJDK 1.8.0 instead of 11, and Python 3.6.8 instead of a newer version.

Install Java

You may want to install either of the following. If you need OpenJDK builds for other platforms, download a no-cost, long-term-supported distribution of the JDK.

// OpenJDK 11
$ sudo dnf install java-11-openjdk-devel
// OpenJDK 1.8.0 - Option 1
$ sudo yum install java-1.8.0-openjdk-devel.x86_64
// OpenJDK 1.8.0 - Option 2
$ sudo dnf install java-1.8.0-openjdk-devel
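
To verify the installation:

$ java -version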

Spark

Download Spark

$ cd /opt
$ sudo wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
$ sudo tar -xf spark-2.4.4-bin-hadoop2.7.tgz
$ sudo ln -s /opt/spark-2.4.4-bin-hadoop2.7 /opt/spark
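
As a quick sanity check, you can ask Spark to print its version:

$ /opt/spark/bin/spark-submit --version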

Configure Spark Master and Slave services

$ sudo su
# useradd spark
# chown -R spark:spark /opt/spark*

Create a file using $ sudo touch /etc/systemd/system/spark-master.service with the following content.

[Unit]
Description=Apache Spark Master
After=network.target
[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh
[Install]
WantedBy=multi-user.target

To configure the Spark slave service, create a file using $ sudo touch /etc/systemd/system/spark-slave.service and add the following content. Replace IP-ADDRESS-or-hostname with the IP address or hostname of your machine.

[Unit]
Description=Apache Spark Slave
After=network.target
[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-slave.sh spark://IP-ADDRESS-or-hostname:7077
ExecStop=/opt/spark/sbin/stop-slave.sh
[Install]
WantedBy=multi-user.target

Load newly created Spark service files

$ sudo systemctl daemon-reload

Start Spark Service

// master service
$ sudo systemctl start spark-master.service
$ sudo systemctl status spark-master.service
// slave service
$ sudo systemctl start spark-slave.service
$ sudo systemctl status spark-slave.service
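
If you want both services to start automatically on boot, enable them as well:

$ sudo systemctl enable spark-master.service
$ sudo systemctl enable spark-slave.service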

Hadoop

$ cd /opt
$ sudo wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
$ sudo tar -xf hadoop-2.10.1.tar.gz
$ sudo ln -s /opt/hadoop-2.10.1 /opt/hadoop
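
As a sanity check, you can ask Hadoop for its version; note that the hadoop script needs JAVA_HOME to be set (or Java otherwise discoverable), so run this after the environment variables discussed later are in place:

$ /opt/hadoop/bin/hadoop version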

Python

Spark ships with PySpark, and most Linux distributions come with Python. So, why would you need an Anaconda or pip based virtual environment? Well, you don’t, until you do. 😀

Option 1) Anaconda or Miniconda

Miniconda is a minimal installer for conda. Like Anaconda, it includes the conda package manager and Python with its dependencies, but no other packages. Once conda is installed, via either Anaconda or Miniconda, other software packages may be installed directly from the command line with conda install. Let’s first download Anaconda. I am using a version which is right for me, but take a moment to check the Anaconda website and the pages linked below if you need a different version or want to better understand the commands I have used.

Download and install conda

$ cd /opt
$ sudo wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh

To ensure the integrity of the file we downloaded, we should verify the SHA256 checksum with the hash Anaconda has published.

$ sha256sum /opt/Anaconda3-2020.11-Linux-x86_64.sh

If the hash matches we can start installation and follow the on screen instructions.

$ sh /opt/Anaconda3-2020.11-Linux-x86_64.sh
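
Once the installer finishes (assuming you let it initialize your shell), reload your profile and confirm conda is available:

$ source ~/.bashrc
$ conda --version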

Create or import conda virtual environment

When you are ready, let’s create the conda virtual environment. You can either import the file below or create one on your own. I will call this environment dev376; you can call it whatever you like. If you want to quickly import an environment, create a file by running $ touch ~/dev376.yml and add the following content to it.

name: dev376
channels:
  - defaults
dependencies:
  - numpy
  - jupyter
  - fastparquet
  - autopep8
  - pandas
  - snakeviz
  - python=3.7.6
  - pyspark
  - pylint
prefix: /home/<user_name>/anaconda3/envs/dev376

Now to automatically create an environment with just these packages above, we would execute the following command:

$ cd ~
$ conda env create --file dev376.yml
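
When it finishes, activate the environment and confirm the interpreter:

$ conda activate dev376
$ python --version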

In some cases, depending on your network settings at work or home, conda might have trouble downloading metadata and packages from the internet. You can temporarily disable the SSL check; when you are done, remember to reset it to the default value.

$ conda config --set ssl_verify false
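
And when you are done, re-enable verification:

$ conda config --set ssl_verify true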

Option 2) pip virtual environment

If you skipped the section Update Linux software at the beginning, check that you have gcc installed; pip and wheel might need it, and your setup would likely fail without it. For pip, we first install only the packages we need to set up the virtual environment.

$ pip install virtualenv virtualenvwrapper

Now is the time to pick the version of Python you want to use. Most probably you have multiple versions of Python available to you. To find them, execute the following command and settle on a path you want to use. FWIW, I am going to use /usr/bin/python3.6.

$ whereis python

We would now append the following to ~/.bashrc file:

export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.6
export WORKON_HOME=/var/home/<user_name>/Envs
export VIRTUALENVWRAPPER_SCRIPT=/var/home/<user_name>/.local/bin/virtualenvwrapper.sh
source /var/home/<user_name>/.local/bin/virtualenvwrapper.sh

After you save and close the file, we would need to load these newly added variables. To do so, run the following command:

$ source ~/.bashrc

We are now ready to create virtual environment dev368.

$ mkvirtualenv -p /usr/bin/python3.6 dev368
$ workon dev368

Be a good citizen and upgrade your pip first.

$ /var/home/<user_name>/Envs/dev368/bin/python -m pip install --upgrade pip

I am providing a ~/requirements.txt file which you can use to bootstrap your virtual Python environment, or you can certainly create your own as you please.

$ pip install -r ~/requirements.txt
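
The exact file isn’t reproduced here, but as a sketch, a requirements.txt mirroring the conda environment above might look like this:

numpy
pandas
jupyter
fastparquet
pyspark
autopep8
pylint
snakeviz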

Finally, if you want to activate this virtual environment every time you start your bash from PuTTY or GNOME Terminal, add the following line to ~/.bashrc file:

workon dev368

Environment variables for Hadoop, Spark & PySpark

Add a few more variables to your ~/.bashrc file so that it looks like the following:

export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=/usr/bin/python3.6
export PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.6
export WORKON_HOME=/var/home/<user_name>/Envs
export VIRTUALENVWRAPPER_SCRIPT=/var/home/<user_name>/.local/bin/virtualenvwrapper.sh
source /var/home/<user_name>/.local/bin/virtualenvwrapper.sh
workon dev368

Save and close the file. Take a deep breath and check if Spark shell works:

$ spark-shell

And if that runs without any problem, then get out of it by pressing Ctrl + D and test PySpark:

$ pyspark
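
Inside the PySpark shell, a one-line sanity check (Spark 2.x exposes a ready-made SparkSession named spark):

>>> spark.range(5).count()
5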

Make sure that the Spark master and slave services are running (start them if they are not), then navigate to the Spark UI in your browser. By default, the standalone master UI listens on port 8080, e.g. http://IP-ADDRESS-or-hostname:8080.


Just realized it is May 4, 2021. Awesome!


May the 4th be with you!
