You have probably heard about it; wherever there is talk about big data, the name eventually comes up. In layman’s terms, Apache Spark is a large-scale data processing engine. Spark provides various APIs so applications can run big data processing on its engine, and PySpark is the Python API, exposing the Spark programming model to Python applications. In a previous blog post, I covered how to set it up on Windows. This time, we shall do it on Red Hat Enterprise Linux 8 or 7. You can follow along with a free AWS EC2 instance, your hypervisor (VirtualBox, VMware, Hyper-V, etc.) or a container on almost any Linux distribution. The commands we discuss below might change slightly from one distribution to the next.
Like most of my blog posts, my objective is to write a comprehensive post on a real world, end to end configuration, rather than talking about just one step. On Red Hat 7, I ran into a problem with remote desktop; I share the workaround below. Here is what you will need:
- Linux (I am using Red Hat Enterprise Linux 8 and 7)
- Anaconda or pip based virtual python environment
I have broken out the process into steps. Please feel free to skip a section as you deem appropriate.
Table of Contents
- EPEL Repository
- Update Linux software
- Set Password
- Install GUI and set it as default
- Set up VNC and Remote Desktop Protocol
- Tip: If your remote desktop connection terminates as soon as you login
- Version Compatibility Warning
- Install Java
- Environment variables for Hadoop, Spark & PySpark
EPEL stands for Extra Packages for Enterprise Linux. It serves Red Hat Enterprise Linux (RHEL), CentOS, Scientific Linux (SL), Oracle Linux (OL) and others. In other words, EPEL is an open source, free, community supported project from Fedora that strives to be a reliable source of up to date packages, which it provides through yum and dnf. If you want to know more, please check out their official wiki page and the blog post from Red Hat. We will download the rpm using wget and install it.
How to determine RHEL version?
// option 1
$ cat /etc/redhat-release
// option 2
$ cat /etc/os-release
// option 3 (RHEL 7.x or newer)
$ hostnamectl
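If you ever need the version from a script rather than by eye, /etc/os-release is a simple KEY=value file. Here is a minimal Python sketch; the helper name and the sample content are mine, for illustration only:

```python
def parse_os_release(text):
    """Parse /etc/os-release style KEY=value lines into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        info[key] = value.strip('"')  # values may or may not be quoted
    return info

# Illustrative sample of the kind of content RHEL ships:
sample = 'NAME="Red Hat Enterprise Linux"\nID="rhel"\nVERSION_ID="8.3"'
print(parse_os_release(sample)["VERSION_ID"])  # prints 8.3
```

On a real machine you would read the file with `open("/etc/os-release").read()` instead of the sample string.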
Depending on the version you use, please see the wiki page to properly configure it.
// RHEL/CentOS 8 64 bit
$ wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
// RHEL/CentOS 7 64 bit
$ wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
// RHEL/CentOS 6 64 bit
$ wget http://download.fedoraproject.org/pub/epel/6/x86_64/epel-release-6-8.noarch.rpm
// RHEL/CentOS 6 32 bit
$ wget http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
You can install the rpm package after downloading it, or do it all in a single step.
// install the downloaded rpm file above
$ sudo rpm -ivh epel-release-6-8.noarch.rpm
// single step alternative
$ sudo yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
To verify, let’s quickly run the repolist command, which lists the currently enabled repositories.
$ yum repolist | grep epel
If you need it, the EPEL configuration file is located at /etc/yum.repos.d/epel.repo. If you have followed the steps so far, you should have yum-config-manager installed on your machine at this point. Should you choose, you can use the following commands to disable or enable EPEL, respectively.
$ sudo yum-config-manager --disable epel
$ sudo yum-config-manager --enable epel
Update Linux software
If you are using AWS EC2 or equivalent, chances are you have a bare bones base image and you are using SSH to log into your Linux machine. Let’s start by updating the base image. Depending on the number of packages, your machine configuration and internet speed, this might take a few minutes. The groupinstall command below installs gcc, flex, autoconf, etc., which we will need to install fastparquet using pip, especially if you are on RHEL 7.x.
$ sudo yum clean all
$ sudo yum -y update
$ sudo yum groupinstall "Development tools"
$ sudo yum install gcc
$ sudo yum install python3-devel
Set Password

If you do not know your current username, you can find out by executing the whoami command. Now, to set the password:
$ sudo passwd <user_name>
To set root password, you can use the following command:
$ sudo su
# passwd
Install GUI and set it as default
If you are not planning to use GUI on this machine, you can skip this step. It would save you time, internet data and storage space.
// download gui components
$ sudo yum groupinstall -y "Server with GUI"
// set gui as default
$ sudo systemctl set-default graphical.target
$ sudo systemctl default
Set up VNC and Remote Desktop Protocol
In this step we install the VNC server and xrdp for Remote Desktop Protocol. If you would rather stay with SSH, skip to the next step.
$ sudo yum install -y xrdp tigervnc-server
Now, set the SELinux context so the xrdp binaries are allowed to run:

$ sudo chcon --type=bin_t /usr/sbin/xrdp
$ sudo chcon --type=bin_t /usr/sbin/xrdp-sesman
We are now ready to start, enable and check the status of xrdp.service
$ sudo systemctl start xrdp.service
$ sudo systemctl status xrdp.service
$ sudo systemctl enable xrdp.service
The default port for the Remote Desktop Protocol is 3389. This port should be open through the firewall to make the machine accessible over RDP. To ensure we can actually log in, we have to do two things:
- Check that the port is listening using netstat
- Add firewall rule to “open” the port
If you want to make this port accessible from the internet, not just your local network, you will need to enable port forwarding. That is uncommon and outside the scope of this post; when doing so, please use extra caution.
$ netstat -antp
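As an alternative sanity check, you can test whether something is accepting connections on a port from Python; this is a small sketch of my own, not part of any xrdp tooling:

```python
import socket

def is_port_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# After xrdp is up, this should report True when run on the server itself:
print(is_port_listening("127.0.0.1", 3389))
```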
To add this port permanently to firewall, we want to check if firewall service is running. If it isn’t, we want to kick-start it.
$ sudo systemctl status firewalld.service
// if not running
$ sudo systemctl enable firewalld.service
$ sudo systemctl start firewalld.service
Add port and then reboot your machine. When the machine boots up again, you should be able to use Remote Desktop.
$ sudo firewall-cmd --permanent --add-port=3389/tcp
$ sudo firewall-cmd --reload
// reboot
$ sudo reboot
Tip: If your remote desktop connection terminates as soon as you login …
If you are not facing this issue, feel free to skip this section.
Start by checking the X session error log:

$ cat ~/.xsession-errors
In my case, on RHEL 7 with Anaconda, removing (or commenting out) the Anaconda initialization block from ~/.bashrc solved the issue. Since I needed remote desktop, I removed Anaconda and switched to a pip based virtual environment.
Version Compatibility Warning
A lot of questions end up on the forums with version mismatch as the root cause. Please do the due diligence to make sure your Java, Spark, Hadoop and Python versions are compatible with each other. For example, if you are using Spark 2.4.4, then install OpenJDK 1.8.0 instead of 11, and Python 3.6.8 instead of a newer version.
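To make the idea concrete, here is a tiny illustrative sketch. The table below encodes only the single combination mentioned in this post (Spark 2.4.4 with Java 1.8 and Python 3.6); it is not an authoritative compatibility matrix, so consult the official Spark release notes for your version:

```python
# Illustrative only -- not an official compatibility matrix.
COMPAT = {
    "2.4.4": {"java": "1.8", "python": "3.6"},
}

def is_compatible(spark, java, python):
    """Check a (java, python) pair against the illustrative table above."""
    entry = COMPAT.get(spark)
    if entry is None:
        return False  # unknown Spark version; verify manually
    return java.startswith(entry["java"]) and python.startswith(entry["python"])

print(is_compatible("2.4.4", "1.8.0", "3.6.8"))  # True: the combination in this post
```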
You may want to install one of the following. If you need OpenJDK for other platforms, download a no-cost, long-term supported distribution of the JDK.
// OpenJDK 11
$ sudo dnf install java-11-openjdk-devel
// OpenJDK 1.8.0 - Option 1
$ sudo yum install java-1.8.0-openjdk-devel.x86_64
// OpenJDK 1.8.0 - Option 2
$ sudo dnf install java-1.8.0-openjdk-devel
Download and install Spark

$ cd /opt
$ sudo wget https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
$ sudo tar -xf spark-2.4.4-bin-hadoop2.7.tgz
$ sudo ln -s /opt/spark-2.4.4-bin-hadoop2.7 /opt/spark
Configure Spark Master and Slave services
$ sudo su
# useradd spark
# chown -R spark:spark /opt/spark*
Create a file using

$ sudo touch /etc/systemd/system/spark-master.service

and add the following content.
[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh

[Install]
WantedBy=multi-user.target
To configure the Spark slave service, create a file using

$ sudo touch /etc/systemd/system/spark-slave.service

and add the following content. Replace IP-ADDRESS-or-hostname with the IP address or the hostname of your machine.
[Unit]
Description=Apache Spark Slave
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-slave.sh spark://IP-ADDRESS-or-hostname:7077
ExecStop=/opt/spark/sbin/stop-slave.sh

[Install]
WantedBy=multi-user.target
Load newly created Spark service files
$ sudo systemctl daemon-reload
Start Spark Service
// master service
$ sudo systemctl start spark-master.service
$ sudo systemctl status spark-master.service
// slave service
$ sudo systemctl start spark-slave.service
$ sudo systemctl status spark-slave.service
Download and install Hadoop

$ cd /opt
$ sudo wget https://archive.apache.org/dist/hadoop/common/hadoop-2.10.1/hadoop-2.10.1.tar.gz
$ sudo tar -xf hadoop-2.10.1.tar.gz
$ sudo ln -s /opt/hadoop-2.10.1 /opt/hadoop
Spark comes with Python and most Linux distributions come with Python. So, why would you need Anaconda or pip virtual environment? Well you don’t, until you do. 😀
Option 1) Anaconda or Miniconda
Miniconda is a minimal installer for conda. Like Anaconda, it includes the conda package manager and Python with its dependencies, but no other packages. Once conda is installed, via either Anaconda or Miniconda, other software packages may be installed directly from the command line with conda install. Let’s first download Anaconda. I am using a version which is right for me, but take a moment to check the Anaconda website and the pages linked below if you need to better understand the commands I have used.
- Installing on Linux — Anaconda documentation
- Hashes for all files — Anaconda documentation
- Managing environments — Anaconda documentation
Download and install conda
$ cd /opt
$ sudo wget https://repo.anaconda.com/archive/Anaconda3-2020.11-Linux-x86_64.sh
To ensure the integrity of the file we downloaded, we should verify the SHA256 checksum with the hash Anaconda has published.
$ sha256sum /opt/Anaconda3-2020.11-Linux-x86_64.sh
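If you prefer to compare the checksum programmatically instead of by eye, here is a short Python sketch; the helper name is mine, and the expected hash must come from Anaconda’s published hashes page:

```python
import hashlib

def sha256_matches(path, expected_hex):
    """Compute the file's SHA-256 in chunks and compare to the published hash."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # 1 MiB at a time
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()

# Usage (paste the value from Anaconda's hashes page in place of the placeholder):
# sha256_matches("/opt/Anaconda3-2020.11-Linux-x86_64.sh", "<published-sha256>")
```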
If the hash matches, we can start the installation and follow the on-screen instructions.
$ sh /opt/Anaconda3-2020.11-Linux-x86_64.sh
Create or import conda virtual environment
When you are ready, let’s create the conda virtual environment. You can either import the file below or create one of your own. I will call this environment dev376; you can call it whatever you like. To quickly import an environment, create a file by running

$ touch ~/dev376.yml

and add the following content to it.
name: dev376
channels:
  - defaults
dependencies:
  - numpy
  - jupyter
  - fastparquet
  - autopep8
  - pandas
  - snakeviz
  - python=3.7.6
  - pyspark
  - pylint
prefix: /home/<user_name>/anaconda3/envs/dev376
Now, to automatically create an environment with just the packages above, we would execute the following command:

$ cd ~
$ conda env create --file dev376.yml
Tip: If you are getting SSL related error
In some cases, based on your network settings at work or home, conda might have trouble downloading metadata and packages from the internet. You can temporarily disable SSL verification, but when you are done, you should reset it to the default value.
// temporarily disable
$ conda config --set ssl_verify false
// when done, re-enable
$ conda config --set ssl_verify true
Option 2) pip virtual environment
If you skipped the section Update Linux software at the beginning, check that you have gcc installed; pip and wheel might need it, and your setup would likely fail without it. For pip, we first install only the packages we need to set up the virtual environment.
$ pip install virtualenv virtualenvwrapper
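The gcc check mentioned above can also be scripted; here is a small sketch of my own using the standard library:

```python
import shutil

def missing_tools(tools):
    """Return the subset of tools that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

# gcc is needed to build packages such as fastparquet from source
missing = missing_tools(["gcc", "python3"])
if missing:
    print("Install these before continuing:", ", ".join(missing))
else:
    print("All build prerequisites found.")
```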
Now is the time to pick the version of Python you want to use. Most probably you have multiple versions of Python available to you. To find them, execute the following command and settle on a path you want to use. FWIW, I am going to use /usr/bin/python3.6.
$ whereis python
We would now append the following to ~/.bashrc.
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.6
export WORKON_HOME=/var/home/<user_name>/Envs
export VIRTUALENVWRAPPER_SCRIPT=/var/home/<user_name>/.local/bin/virtualenvwrapper.sh
source /var/home/<user_name>/.local/bin/virtualenvwrapper.sh
After you save and close the file, we would need to load these newly added variables. To do so, run the following command:
$ source ~/.bashrc
We are now ready to create virtual environment dev368.
$ mkvirtualenv -p /usr/bin/python3.6 dev368
$ workon dev368
Be a good citizen and upgrade your pip first.
$ /var/home/<user_name>/Envs/dev368/bin/python -m pip install --upgrade pip
I am providing a ~/requirements.txt file which you can use to bootstrap your virtual Python environment, or you can certainly create your own as you please.
$ pip install -r ~/requirements.txt
Finally, if you want to activate this virtual environment every time you start bash from PuTTY or GNOME Terminal, add the line workon dev368 to your ~/.bashrc, as shown in the next section.
Environment variables for Hadoop, Spark & PySpark
Add a few more variables to your ~/.bashrc file so that it looks like the following:
export HADOOP_HOME=/opt/hadoop
export SPARK_HOME=/opt/spark
export PYSPARK_PYTHON=/usr/bin/python3.6
export PATH=$PATH:$SPARK_HOME/bin:$HADOOP_HOME/bin
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.7-src.zip
export VIRTUALENVWRAPPER_PYTHON=/usr/bin/python3.6
export WORKON_HOME=/var/home/<user_name>/Envs
export VIRTUALENVWRAPPER_SCRIPT=/var/home/<user_name>/.local/bin/virtualenvwrapper.sh
source /var/home/<user_name>/.local/bin/virtualenvwrapper.sh
workon dev368
Save and close the file. Take a deep breath and check if the Spark shell works:

$ spark-shell
And if that runs without any problem, exit it by pressing Ctrl + D and test PySpark:

$ pyspark
Make sure the Spark master and slave services are running (start them if they are not), and navigate to the Spark UI in your browser; the standalone master UI listens on port 8080 by default.
Just realized it is May 4, 2021. Awesome!
May the 4th be with you!