You have probably heard of it: wherever big data comes up, the name eventually follows. In layman’s terms, Apache Spark is a large-scale data processing engine. Spark exposes APIs in several languages for running big data workloads on its engine; PySpark is the Python API, exposing the Spark programming model to Python applications. In my previous post, I covered how to set it up on Windows. This time, we shall do it on Red Hat Enterprise Linux 8 or 7. You can follow along with a free-tier AWS EC2 instance, your hypervisor (VirtualBox, VMware, Hyper-V, etc.), or a container on almost any Linux distribution. The commands we discuss below might change slightly from one distribution to the next.
Like most of my blog posts, my objective is to write a comprehensive post on a real-world, end-to-end configuration, rather than covering just one step. On Red Hat 7, I ran into a problem, and I will show how I worked around it without having to solve it directly.
Linux (I am using Red Hat Enterprise Linux 8 and 7)
Anaconda- or pip-based virtual Python environment
I have broken out the process into steps. Please feel free to skip a section as you deem appropriate.
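As a rough sketch of what the steps amount to on Red Hat Enterprise Linux 8, here is a hedged outline; the environment name `pyspark-env` and the OpenJDK package version are my own assumptions, not from the post, so adjust them to your system:

```shell
# Option A: conda-based environment (if Anaconda/Miniconda is installed)
# Environment name "pyspark-env" is hypothetical; pick your own.
conda create -y -n pyspark-env python=3.8
conda activate pyspark-env

# Option B: pip with the standard-library venv module
python3 -m venv ~/pyspark-env
source ~/pyspark-env/bin/activate

# PySpark needs a JDK; on RHEL 8, OpenJDK 11 is one option
sudo yum install -y java-11-openjdk

# Install PySpark into the active environment
pip install pyspark

# Quick smoke test: print the installed PySpark version
python -c "import pyspark; print(pyspark.__version__)"
```

If the last command prints a version number, the Python side of the setup is working; the rest of the post deals with distribution-specific wrinkles.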
Today, I want to share a utility program I built. This little program takes subtitle files and spits out a nice, crisp paragraph. I tested it against a directory containing the subtitles of two movies, and within 8 seconds I was looking at their transcripts. To stress the file I/O, I took the transcripts of a graduate-level computer science course, organized into sections and sub-topics and totaling about 200 small clips. Within 3 seconds I had my class notes, which I could share with the class to highlight the important things said in the lectures, without typing a word!
One more thing: it is free and open source, so you can literally clone it and start using it right away.
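The core idea can be sketched in a few lines of Python. This is a minimal sketch, not the tool's actual code, and the function name `srt_to_paragraph` is my own: an SRT subtitle file is a series of blocks (an index line, a timestamp line, then one or more text lines), and the transcript is simply the text lines joined together.

```python
import re

def srt_to_paragraph(srt_text: str) -> str:
    """Collapse SRT subtitle blocks into one plain paragraph.

    Each SRT block looks like:
        1
        00:00:01,000 --> 00:00:04,000
        Hello and welcome.
    We keep only the text lines and join them with spaces.
    """
    kept = []
    for line in srt_text.splitlines():
        line = line.strip()
        if not line:
            continue                      # blank separator between blocks
        if line.isdigit():
            continue                      # block index line
        if re.match(r"\d{2}:\d{2}:\d{2},\d{3}\s*-->", line):
            continue                      # timestamp line
        kept.append(line)
    return " ".join(kept)

sample = """1
00:00:01,000 --> 00:00:04,000
Hello and welcome to the lecture.

2
00:00:04,500 --> 00:00:08,000
Today we discuss file I/O.
"""
print(srt_to_paragraph(sample))
# prints: Hello and welcome to the lecture. Today we discuss file I/O.
```

The real utility adds directory traversal and per-file output on top of this, but the transcript extraction itself is little more than filtering out the index and timestamp lines.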
A few days back I started working on a utility project called Team Foundation Dev Tools ( http://ablaze8.github.io/TeamFoundationDevTools/ ). The goal is to extend the TFS API and offer features it does not provide out of the box, such as searching an entire TFS server for a file or file path (wildcard and exact search), or listing commits by a specific user to any one project, or all projects, on a TFS server.
I’m not planning anything serious with it, just trying to build something I always wanted to see in an ideal TFS tool.
This is an open-source tool, and it supports .NET 3.5 and 4.5.2, so it is compatible all the way back to Visual Studio 2008 running on Windows XP … up to the latest and greatest!
The main reason for preparing such a handy book is to give the reader a specific, to-the-point guide. I searched a lot for such a manual when I was creating my own live operating system, “Linux Spark”, but every resource I read was either missing something or too technical. With its lucid, simple language, this book will help you whether you are an experienced user, an administrator, a newbie, or a student looking for a quick reference manual. This book is ready for you: go to the terminal and just start commanding your UNIX/Linux system!
Your suggestions, comments, and guidance are gladly accepted. It is my strong wish that you share this book with more and more people who are genuinely interested, because I firmly believe that knowledge is free, and getting a definitive guide exactly when you need it should feel like magic!