You have probably heard of it; wherever there is talk about big data, the name eventually comes up. In layman's terms, Apache Spark is a large-scale data processing engine. Spark provides APIs in several languages for performing big data processing on its engine, and PySpark is the Python API, exposing the Spark programming model to Python applications. In my previous post, I covered how to set it up on Windows. This time, we shall do it on Red Hat Enterprise Linux 8 or 7. You can follow along with a free AWS EC2 instance, a VM on your hypervisor (VirtualBox, VMware, Hyper-V, etc.), or a container on almost any Linux distribution. The commands we discuss below might change slightly from one distribution to the next.
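To make that concrete, here is a minimal PySpark sketch of what we will be able to run by the end of this post. It assumes the pyspark package is already installed and importable; the app name and sample data are just placeholders:

```python
# Minimal PySpark example: start a session, build a tiny DataFrame, show it.
# Assumes pyspark is installed and a compatible Java is on PATH.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hello-pyspark").getOrCreate()
df = spark.createDataFrame([(1, "spark"), (2, "pyspark")], ["id", "name"])
df.show()  # prints the rows as a small ASCII table
spark.stop()
```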
Like most of my blog posts, my objective is to write a comprehensive post on a real-world, end-to-end configuration rather than covering just one step. On Red Hat 7, I ran into a problem, which I was able to work around without having to solve it outright.
Environment
- Linux (I am using Red Hat Enterprise Linux 8 and 7)
- Java
- Hadoop
- Spark
- Anaconda or a pip-based virtual Python environment
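If you want to check which of these components are already present on your machine, a quick sketch like the one below can help. It only shells out to the standard version commands, so anything not yet installed (or not yet on PATH) is reported rather than crashing the script:

```python
# Quick environment sanity check: print the version of each component.
# Assumes java, hadoop, and spark-submit will be on PATH once installed.
import subprocess
import sys

print("Python:", sys.version.split()[0])
for cmd in (["java", "-version"],
            ["hadoop", "version"],
            ["spark-submit", "--version"]):
    try:
        subprocess.run(cmd)  # note: java and spark-submit print to stderr
    except FileNotFoundError:
        print(f"{cmd[0]}: not found (not installed or not on PATH yet)")
```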
I have broken the process into steps. Please feel free to skip sections as you deem appropriate.