You have probably heard about it; wherever there is talk about big data, the name eventually comes up. In layman’s terms, Apache Spark is a large-scale data processing engine. Spark provides various APIs so applications can run big data processing on its engine, and PySpark is the Python API, exposing the Spark programming model to Python applications. In a previous blog post, I covered how to set it up on Windows. This time, we shall do it on Red Hat Enterprise Linux 8 or 7. You can follow along with a free AWS EC2 instance, your hypervisor (VirtualBox, VMware, Hyper-V, etc.) or a container on almost any Linux distribution. The commands we discuss below might change slightly from one distribution to the next.
Like most of my blog posts, my objective is to write a comprehensive post on a real world, end to end configuration, rather than talking about just one step. On Red Hat 7, I ran into a problem with remote desktop; I share the workaround below. Here is what you will need:
- Linux (I am using Red Hat Enterprise Linux 8 and 7)
- Anaconda or pip based virtual python environment
I have broken out the process into steps. Please feel free to skip a section as you deem appropriate.
Table of Contents
- EPEL Repository
- Update Linux software
- Set Password
- Install GUI and set it as default
- Set up VNC and Remote Desktop Protocol
- Tip: If your remote desktop connection terminates as soon as you login
- Version Compatibility Warning
- Install Java
- Environment variables for Hadoop, Spark & PySpark
EPEL stands for Extra Packages for Enterprise Linux. It serves Red Hat Enterprise Linux (RHEL), CentOS, Scientific Linux (SL), Oracle Linux (OL) and others. In other words, EPEL is an open source, free, community supported project from Fedora that strives to be a reliable source of up to date packages, which it provides through yum and dnf. If you want to know more, please check out their official wiki page and the blog post from Red Hat. We will download the rpm using wget and install it.
How to determine RHEL version?
// option 1
$ cat /etc/redhat-release
// option 2
$ cat /etc/os-release
// option 3 (RHEL 7.x or newer)
$ hostnamectl
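If you ever need the version from a script rather than by eye, /etc/os-release is a simple KEY=value file. Here is a minimal Python sketch; the helper name and the sample content are mine, for illustration only:

```python
def parse_os_release(text):
    """Parse /etc/os-release style KEY=value lines into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip('"')  # values may or may not be quoted
    return info

# Illustrative sample of the kind of content RHEL ships:
sample = 'NAME="Red Hat Enterprise Linux"\nID="rhel"\nVERSION_ID="8.3"'
print(parse_os_release(sample)["VERSION_ID"])  # prints 8.3
```

On a real machine you would read the file with `open("/etc/os-release").read()` instead of the sample string.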
Depending on the version you use, please see the wiki page to properly configure it.
// RHEL/CentOS 8 64 bit
$ wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
// RHEL/CentOS 7 64 bit
$ wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
// RHEL/CentOS 6 64 bit
$ wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
// RHEL/CentOS 6 32 bit
$ wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
You can install the rpm package after downloading it, or do it all in a single step.
// install the downloaded rpm file above
$ sudo rpm -ivh epel-release-6-8.noarch.rpm
// single step alternative
$ sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
To verify, let’s quickly run the repolist command, which lists the currently enabled repositories.
$ yum repolist | grep epel
If you need it, the EPEL configuration file is located at /etc/yum.repos.d/epel.repo. If you have followed the steps so far, you should have yum-config-manager installed on your machine at this point. Should you choose, you can use the following commands to disable or enable EPEL, respectively.
$ sudo yum-config-manager --disable epel
$ sudo yum-config-manager --enable epel
Update Linux software
If you are using AWS EC2 or equivalent, chances are you have a bare bones base image and you are using SSH to log into your Linux machine. Let’s start by updating the base image. Depending on the number of packages, your machine configuration and internet speed, this might take a few minutes. The groupinstall command below installs gcc, flex, autoconf, etc., which we will need to install fastparquet using pip, especially if you are on RHEL 7.x.
$ sudo yum clean all
$ sudo yum -y update
$ sudo yum groupinstall "Development tools"
$ sudo yum install gcc
$ sudo yum install python3-devel
Set Password

If you do not know your current username, you can find out by executing the whoami command. Now, to set the password:
$ sudo passwd <user_name>
To set root password, you can use the following command:
$ sudo su
# passwd
Install GUI and set it as default
If you are not planning to use GUI on this machine, you can skip this step. It would save you time, internet data and storage space.
// download gui components
$ sudo yum groupinstall -y "Server with GUI"
// set gui as default
$ sudo systemctl set-default graphical.target
$ sudo systemctl default
Set up VNC and Remote Desktop Protocol
In this step we install the VNC server and xrdp for Remote Desktop Protocol. If you would rather stay with SSH, skip to the next step.
$ sudo yum install -y xrdp tigervnc-server
Now, set the SELinux context so the xrdp binaries are allowed to run:

$ sudo chcon --type=bin_t /usr/sbin/xrdp
$ sudo chcon --type=bin_t /usr/sbin/xrdp-sesman
We are now ready to start, enable and check the status of xrdp.service
$ sudo systemctl start xrdp.service
$ sudo systemctl status xrdp.service
$ sudo systemctl enable xrdp.service
The default port for the Remote Desktop Protocol is 3389. This port should be open through the firewall to make the machine accessible over RDP. To ensure we can actually log in, we have to do two things:
- Check that the port is listening using netstat
- Add firewall rule to “open” the port
If you want to make this port accessible from the internet, not just your local network, you will need to enable port forwarding. That is uncommon and outside the scope of this post; when doing so, please use extra caution.
$ netstat -antp
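As an alternative sanity check, you can test whether something is accepting connections on a port from Python; this is a small sketch of my own, not part of any xrdp tooling:

```python
import socket

def is_port_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# After xrdp is up, this should report True when run on the server itself:
print(is_port_listening("127.0.0.1", 3389))
```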
To add this port permanently to firewall, we want to check if firewall service is running. If it isn’t, we want to kick-start it.
$ sudo systemctl status firewalld.service
// if not running
$ sudo systemctl enable firewalld.service
$ sudo systemctl start firewalld.service
Add port and then reboot your machine. When the machine boots up again, you should be able to use Remote Desktop.
$ sudo firewall-cmd --permanent --add-port=3389/tcp
$ sudo firewall-cmd --reload
// reboot
$ sudo reboot
Tip: If your remote desktop connection terminates as soon as you login …
If you are not facing this issue, feel free to skip this section.
Start by checking the X session error log:

$ cat ~/.xsession-errors
In my case, on RHEL 7 with Anaconda, removing (or commenting out) the Anaconda initialization block from ~/.bashrc solved the issue. Since I needed remote desktop, I removed Anaconda and switched to a pip based virtual environment.
Version Compatibility Warning
A lot of questions end up on the forums with version mismatch as the root cause. Please do the due diligence to make sure your Java, Spark, Hadoop and Python versions are compatible with each other. For example, if you are using Spark 2.4.4, then install OpenJDK 1.8.0 instead of 11, and Python 3.6.8 instead of a newer version.
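To make the idea concrete, here is a tiny illustrative sketch. The table below encodes only the single combination mentioned in this post (Spark 2.4.4 with Java 1.8 and Python 3.6); it is not an authoritative compatibility matrix, so consult the official Spark release notes for your version:

```python
# Illustrative only -- not an official compatibility matrix.
COMPAT = {
    "2.4.4": {"java": "1.8", "python": "3.6"},
}

def is_compatible(spark, java, python):
    """Check a (java, python) pair against the illustrative table above."""
    entry = COMPAT.get(spark)
    if entry is None:
        return False  # unknown Spark version; verify manually
    return java.startswith(entry["java"]) and python.startswith(entry["python"])

print(is_compatible("2.4.4", "1.8.0", "3.6.8"))  # True: the combination in this post
```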
You may want to install one of the following. If you need OpenJDK for other platforms, download a no-cost, long-term supported distribution of the JDK.
// OpenJDK 11
$ sudo dnf install java-11-openjdk-devel
// OpenJDK 1.8.0 - Option 1
$ sudo yum install java-1.8.0-openjdk-devel.x86_64
// OpenJDK 1.8.0 - Option 2
$ sudo dnf install java-1.8.0-openjdk-devel
Download and install Spark

$ cd /opt
$ sudo wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
$ sudo tar -xf spark-2.4.4-bin-hadoop2.7.tgz
$ sudo ln -s /opt/spark-2.4.4-bin-hadoop2.7 /opt/spark
Configure Spark Master and Slave services
$ sudo su
# useradd spark
# chown -R spark:spark /opt/spark*
Create a file using

$ sudo touch /etc/systemd/system/spark-master.service

and add the following content.
[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh

[Install]
WantedBy=multi-user.target
To configure the Spark slave service, create a file using

$ sudo touch /etc/systemd/system/spark-slave.service

and add the following content. Replace IP-ADDRESS-or-hostname with the IP address or the hostname of your machine.
[Unit]
Description=Apache Spark Slave
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-slave.sh spark://IP-ADDRESS-or-hostname:7077
ExecStop=/opt/spark/sbin/stop-slave.sh

[Install]
WantedBy=multi-user.target
Load newly created Spark service files
$ sudo systemctl daemon-reload
Start Spark Service
// master service
$ sudo systemctl start spark-master.service
$ sudo systemctl status spark-master.service
// slave service
$ sudo systemctl start spark-slave.service
$ sudo systemctl status spark-slave.service
Download and install Hadoop

$ cd /opt
$ sudo wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
$ sudo tar -xf hadoop-2.10.1.tar.gz
$ sudo ln -s /opt/hadoop-2.10.1 /opt/hadoop
Spark comes with Python and most Linux distributions come with Python. So, why would you need Anaconda or pip virtual environment? Well you don’t, until you do. 😀
Option 1) Anaconda or Miniconda
Miniconda is a minimal installer for conda. Like Anaconda, it includes the conda package manager and Python with its dependencies, but no other packages. Once conda is installed, via either Anaconda or Miniconda, other software packages may be installed directly from the command line with conda install. Let’s first download Anaconda. I am using a version which is right for me, but take a moment to check the Anaconda website and the pages linked below if you need to better understand the commands I have used.
- Installing on Linux — Anaconda documentation
- Hashes for all files — Anaconda documentation
- Managing environments — Anaconda documentation
Download and install conda
$ cd /opt
$ sudo wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh
To ensure the integrity of the file we downloaded, we should verify the SHA256 checksum with the hash Anaconda has published.
$ sha256sum /opt/Anaconda3-2020.11-Linux-x86_64.sh
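If you prefer to compare the checksum programmatically instead of by eye, here is a short Python sketch; the helper name is mine, and the expected hash must come from Anaconda’s published hashes page:

```python
import hashlib

def sha256_matches(path, expected_hex):
    """Compute the file's SHA-256 in chunks and compare to the published hash."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()

# Usage (paste the value from Anaconda's hashes page in place of the placeholder):
# sha256_matches("/opt/Anaconda3-2020.11-Linux-x86_64.sh", "<published-sha256>")
```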
If the hash matches, we can start the installation and follow the on-screen instructions.
$ sh /opt/Anaconda3-2020.11-Linux-x86_64.sh
Create or import conda virtual environment
When you are ready, let’s create the conda virtual environment. You can either import the file below or create one of your own. I will call this environment dev376; you can call it whatever you like. To quickly import an environment, create a file by running

$ touch ~/dev376.yml

and add the following content to it.
name: dev376
channels:
  - defaults
dependencies:
  - numpy
  - jupyter
  - fastparquet
  - autopep8
  - pandas
  - snakeviz
  - python=3.7.6
  - pyspark
  - pylint
prefix: /home/<user_name>/anaconda3/envs/dev376
Now, to automatically create an environment with just the packages above, we would execute the following command:

$ cd ~
$ conda env create --file dev376.yml
Tip: If you are getting SSL related error
In some cases, based on your network settings at work or home, conda might have trouble downloading metadata and packages from the internet. You can temporarily disable SSL verification, but when you are done, you should reset it to the default value.
// temporarily disable
$ conda config --set ssl_verify false
// when done, re-enable
$ conda config --set ssl_verify true
Option 2) pip virtual environment
If you skipped the section Update Linux software at the beginning, check that you have gcc installed; pip and wheel might need it, and your setup would likely fail without it. For pip, we first install only the packages we need to set up the virtual environment.
$ pip install virtualenv virtualenvwrapper
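The gcc check mentioned above can also be scripted; here is a small sketch of my own using the standard library:

```python
import shutil

def missing_tools(tools):
    """Return the subset of tools that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# gcc is needed to build packages such as fastparquet from source
missing = missing_tools(["gcc", "python3"])
if missing:
    print("Install these before continuing:", ", ".join(missing))
else:
    print("All build prerequisites found.")
```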
Now is the time to pick the version of Python you want to use. Most probably you have multiple versions of Python available to you. To find them, execute the following command and settle on a path you want to use. FWIW, I am going to use /usr/bin/python3.6.
$ whereis python
We would now append the following to ~/.bashrc.
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.6
export WORKON_HOME=/var/home/<user_name>/Envs
export VIRTUALENVWRAPPER_SCRIPT=/var/home/<user_name>/.local/bin/virtualenvwrapper.sh
source /var/home/<user_name>/.local/bin/virtualenvwrapper.sh
After you save and close the file, we would need to load these newly added variables. To do so, run the following command:
$ source ~/.bashrc
We are now ready to create virtual environment dev368.
$ mkvirtualenv -p /usr/bin/python3.6 dev368
$ workon dev368
Be a good citizen and upgrade your pip first.
$ /var/home/<user_name>/Envs/dev368/bin/python -m pip install --upgrade pip
I am providing a ~/requirements.txt file which you can use to bootstrap your virtual Python environment, or you can certainly create your own as you please.
$ pip install -r ~/requirements.txt
Finally, if you want to activate this virtual environment every time you start bash from PuTTY or GNOME Terminal, add the line workon dev368 to your ~/.bashrc, as shown in the next section.
Environment variables for Hadoop, Spark & PySpark
Add a few more variables to your ~/.bashrc file so that it looks like the following:
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=/usr/bin/python3.6
export PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.6
export WORKON_HOME=/var/home/<user_name>/Envs
export VIRTUALENVWRAPPER_SCRIPT=/var/home/<user_name>/.local/bin/virtualenvwrapper.sh
source /var/home/<user_name>/.local/bin/virtualenvwrapper.sh
workon dev368
Save and close the file. Take a deep breath and check if the Spark shell works:

$ spark-shell
And if that runs without any problem, exit it by pressing Ctrl + D and test PySpark:

$ pyspark
Make sure the Spark master and slave services are running (start them if they are not), and navigate to the Spark UI in your browser; the standalone master UI listens on port 8080 by default.
Just realized it is May 4, 2021. Awesome!
May the 4th be with you!