You have probably heard of it: wherever big data is discussed, the name eventually comes up. In layman’s terms, Apache Spark is a large-scale data processing engine. Apache Spark provides various APIs for services to perform big data processing on its engine. PySpark is the Python API, exposing the Spark programming model to Python applications. In my previous blog post, I talked about how to set it up on Windows. This time, we shall do it on Red Hat Enterprise Linux 8 or 7. You can follow along with a free AWS EC2 instance, your hypervisor (VirtualBox, VMware, Hyper-V, etc.), or a container on almost any Linux distribution. The commands we discuss below might change slightly from one distribution to the next.
As with most of my blog posts, my objective is to write a comprehensive post on a real-world, end-to-end configuration, rather than talking about just one step. On Red Hat 7, I ran into a problem. I solved this problem without having to solve it.
Linux (I am using Red Hat Enterprise Linux 8 and 7)
Anaconda or pip-based virtual Python environment
I have broken out the process into steps. Please feel free to skip a section as you deem appropriate.
If you, too, are coming from an R (or Python/Pandas) environment like me, you probably feel highly comfortable processing CSV files with R or Python/Pandas. Whether you are using time series sensor data, random CSV files, or something else, R and Pandas can take it! If you can step away from the R and Python/Pandas mindset, Spark really goes to great lengths to make you feel welcome as an R and Python/Pandas user.
Lately, I have been working extremely closely with AWS EMR. I am not talking about creating a couple of trivial notebooks with a 5×5 data frame containing fruit names. The data set I am working with is tens of gigabytes, stored away in the cloud. The data is far from clean. I need to create an ETL pipeline to retrieve historical information, which I will use to train my machine learning models. How I am doing predictive analysis on the new incoming data with machine learning is probably a post (or series of posts) for a later date. Today, I want to get you up and running with PySpark in no time!
Why am I writing this post?
There is already a plethora of blogs and forums on the internet about how to install PySpark on Windows. These are mainly focused on setting up just PySpark. But what if I want to use Anaconda or Jupyter Notebooks, or do not wish to use Oracle JDK? This post picks up where most other content falls short. In this post, I want to help you connect the dots and save a lot of time, agony, and frustration, regardless of whether you are new to Windows, Spark/PySpark, or development in general.
This process is as easy as ABC!
The main benefit of following the approach I suggest in my blog post is that you do not have to install anything (for the most part), and you can switch Spark, Hadoop, and Java versions in seconds!
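To give a rough idea of why switching versions can be that quick, here is a minimal sketch. It assumes each Spark and Java version is simply an unpacked directory and that switching means pointing the usual SPARK_HOME, JAVA_HOME, and (optionally) HADOOP_HOME variables at a different directory. The function name `spark_env` and all paths are hypothetical, made up for illustration.

```python
import os


def spark_env(spark_home, java_home, hadoop_home=None, base_env=None):
    """Build an environment mapping that points at one set of unpacked versions.

    All directory arguments are hypothetical locations of extracted archives.
    """
    # Start from a copy so the caller's (or the process's) environment is untouched.
    env = dict(os.environ if base_env is None else base_env)
    env["SPARK_HOME"] = spark_home
    env["JAVA_HOME"] = java_home
    if hadoop_home:
        env["HADOOP_HOME"] = hadoop_home
    # Put the chosen versions' bin directories ahead of whatever PATH already had.
    parts = [os.path.join(spark_home, "bin"), os.path.join(java_home, "bin")]
    if env.get("PATH"):
        parts.append(env["PATH"])
    env["PATH"] = os.pathsep.join(parts)
    return env


# Switching versions is then just a matter of passing different directories.
env_a = spark_env("/opt/spark-a", "/opt/jdk-a", base_env={"PATH": "/usr/bin"})
env_b = spark_env("/opt/spark-b", "/opt/jdk-b", base_env={"PATH": "/usr/bin"})
```

The resulting mapping could be handed to `subprocess.run(..., env=env_a)`, or the same variables could be exported in a shell profile instead.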
So you are using, or have been using, Python on Windows. You know your way around setting up the PATH variable so that you can type “python” in your command prompt and it works. Now, say that you want to use Anaconda Python in bash. Let’s go one step further and say you want to use bash from your Visual Studio Code integrated shell. The process isn’t too different. There doesn’t seem to be a guide that covers all of these together, hence this post.
My goal is to show you one of the possible ways to configure your development environment quickly, to get you going in no time.
There is a straightforward way to update an existing or empty dictionary from a given list of keys. In the first example below, we update the dict only with keys that were not already present. Notice that the key ‘a’ did not get changed and ‘z’ did not get deleted; they were left alone. The second example basically initializes an empty dict object, whereas the third example creates a new dict object that did not exist before.
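A minimal sketch of the three cases might look like this. The default value of 0, the variable names, and the sample data are my own assumptions for illustration; `dict.setdefault` and `dict.fromkeys` are one possible way to do it, not necessarily the only one.

```python
keys = ["a", "b", "c"]

# Example 1: add only the keys that are missing; existing entries stay untouched.
existing = {"a": 1, "z": 26}
for key in keys:
    existing.setdefault(key, 0)  # no effect when the key is already present

# Example 2: the same loop on an empty dict simply initializes every key.
empty = {}
for key in keys:
    empty.setdefault(key, 0)

# Example 3: build a brand-new dict straight from the list of keys.
fresh = dict.fromkeys(keys, 0)

print(existing)
print(empty)
print(fresh)
```

In the first case, ‘a’ keeps its value of 1 and ‘z’ stays in place, while ‘b’ and ‘c’ are added with the default.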
This post tries to answer whether given numbers are comparatively close to each other. This can help if you are using Python for data science or in computer vision, doing computations on images. A quick Stack Overflow search shows discussions around finding the “nearest” value from a set of given values for any given number. However, there could be a need to limit how far away the nearest number can be. I’d call this limit a “threshold”.
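As a rough sketch of the idea, the nearest-value search can be combined with a threshold check. The function name `nearest_within` and its parameters are hypothetical, and this assumes the threshold is an absolute distance.

```python
def nearest_within(target, candidates, threshold):
    """Return the candidate closest to target, or None if it is farther than threshold."""
    # min() with default=None copes with an empty candidate list.
    best = min(candidates, key=lambda c: abs(c - target), default=None)
    if best is None or abs(best - target) > threshold:
        return None
    return best


print(nearest_within(10, [1, 8, 15], threshold=3))
print(nearest_within(10, [1, 20], threshold=3))
```

The first call finds 8, which is 2 away from 10 and therefore inside the threshold. The second call returns None, because even the nearest value is 9 away.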
Quite often in my Python programming (especially when it is computer vision related), I come across scenarios where I want to tell whether two numbers are close to each other. Some might ask, “Well, define close!?” or “How close?” Well… comparatively close.
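One possible reading of “comparatively close” is a relative tolerance: the allowed gap scales with the size of the numbers. Here is a small sketch using the standard library’s `math.isclose`. The wrapper name `comparatively_close` and the 5% default are assumptions for illustration only.

```python
import math


def comparatively_close(a, b, rel_tol=0.05):
    """True when a and b differ by at most rel_tol times the larger magnitude."""
    return math.isclose(a, b, rel_tol=rel_tol)


# The same gap of 4 is small next to 100 but large next to 1.
print(comparatively_close(100, 104))
print(comparatively_close(1, 5))
```

With a 5% tolerance, 100 and 104 count as close, while 1 and 5 do not, even though both pairs differ by 4.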