Install PySpark on Linux


You have probably heard about it; wherever there is talk about big data, the name eventually comes up. In layman’s terms, Apache Spark is a large-scale data processing engine. Apache Spark provides various APIs for services to perform big data processing on its engine. PySpark is the Python API, exposing the Spark programming model to Python applications. In my previous blog post, I talked about how to set it up on Windows. This time, we shall do it on Red Hat Enterprise Linux 8 or 7. You can follow along with a free AWS EC2 instance, your hypervisor (VirtualBox, VMware, Hyper-V, etc.) or a container on almost any Linux distribution. The commands we discuss below might change slightly from one distribution to the next.

As with most of my blog posts, my objective is to write a comprehensive post on real-world, end-to-end configuration, rather than talking about just one step. On Red Hat 7, I ran into a problem. I solved this problem without having to solve it.

Environment

  • Linux (I am using Red Hat Enterprise Linux 8 and 7)
  • Java
  • Hadoop
  • Spark
  • Anaconda or pip-based virtual Python environment

I have broken out the process into steps. Please feel free to skip a section as you deem appropriate.

Table of Contents

Continue reading “Install PySpark on Linux”

Migrate Perforce to git(hub) repo


Assumptions

  • You have done your due diligence to compare the two (from price, speed, workflow, onboarding, etc. perspectives) and have concluded that migrating from Perforce to git is beneficial for you.
  • Word of caution: do you have “large files”? Well, define “large”! Some cloud vendors limit the maximum size of a file you can upload to the cloud. Check out this PowerShell script: Powershell.P4Sizes · GitHub

Migration

To prepare for surprises, plan ahead how you would verify that the migration has succeeded!

Continue reading “Migrate Perforce to git(hub) repo”

Up and running with PySpark on Windows


If you, too, are coming from an R (or Python/Pandas) environment like me, you probably feel highly comfortable processing CSV files with R or Python/Pandas. Whether you are using time series sensor data, random CSV files, or something else, R and Pandas can take it! And if you can step away from the R and Python/Pandas mindset, Spark really goes to great lengths to make you feel welcome as an R or Python/Pandas user.

Lately, I have been working extremely closely with AWS EMR. I am not talking about creating a couple of trivial notebooks with a 5×5 data frame containing fruit names. The data set I am working with is tens of gigabytes, stored away in the cloud. The data is far from clean. I need to create an ETL pipeline to retrieve historical information, which I would use to train my machine learning models. How I am doing predictive analysis on new incoming data with machine learning is probably a post (or series of posts) for a later date. Today, I want to get you up and running with PySpark in no time!

Why am I writing this post?

There is already a plethora of blogs and forum threads on the internet about how to install PySpark on Windows. These mainly focus on setting up just PySpark. But what if I want to use Anaconda or Jupyter Notebooks, or do not wish to use Oracle JDK? This post picks up where most other content falls short. In this post, I want to help you connect the dots and save a lot of time, agony, and frustration, regardless of whether you are new to Windows, Spark/PySpark, or development in general.

This process is as easy as ABC!

Benefit

The main benefit of the approach I suggest in this post is that you do not have to install anything (for the most part), and you can switch Spark, Hadoop, and Java versions in seconds!

Let’s get started!

Continue reading “Up and running with PySpark on Windows”

UiPath Task Capture in 1 minute


What is it?

  1. Screenshot and documentation tool on steroids.
  2. It can take a series of screenshots and generate documentation in a Microsoft Word template.
  3. Just like Windows’ built-in Snip & Sketch or Snipping Tool, but with Microsoft Paint and OCR capabilities.
  4. Installs locally, just like Snappy.
  5. This tool can spit out a skeleton for RPA developers.
  6. This is to an RPA developer what Gherkin/Cucumber is to a C#/Java developer.

Benefit hypothesis

  1. Target audience:

Windows: configure VS Code integrated bash shell for Anaconda


So you are using (or have been using) Python on Windows. You know your way around setting up the PATH variable so that you can type “python” in your command prompt and it works. Now, say that you want to use Anaconda Python in bash. Let’s go one step further and say you want to use bash from the Visual Studio Code integrated shell. The process isn’t too different. There doesn’t seem to be a guide that covers all of these together – hence this post.

My goal is to show you one of the possible ways to configure your development environment quickly, to get you going in no time.

At the end you should have the following:

  • Bash shell working with Python, and
  • Visual Studio Code shell integration (optional)

Continue reading “Windows: configure VS Code integrated bash shell for Anaconda”

Create/Update dictionary from list


There is a straightforward way to update an existing or empty dictionary from a given list of keys. In the first example below, we update the dict only with keys that were not already present. Notice that the key ‘a’ did not get changed and ‘z’ did not get deleted – they were left alone. The second example basically initializes an empty dict object, whereas the third example creates a new dict object which did not exist before.
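
The three cases described above can be sketched as follows. This is a minimal illustration and not necessarily the code from the full post: the helper name `add_missing_keys`, the sample keys, and the `None` default value are all assumptions made for this example.

```python
def add_missing_keys(target, keys, default=None):
    """Add each key from `keys` to `target` only if it is not already present.

    Hypothetical helper for illustration; existing keys keep their values.
    """
    for key in keys:
        # setdefault() inserts the key only when it is missing.
        target.setdefault(key, default)
    return target


# Example 1: update an existing dict. 'a' keeps its value and 'z' stays put.
existing = {"a": 1, "z": 26}
add_missing_keys(existing, ["a", "b", "c"])

# Example 2: initialize an (already existing) empty dict from the list.
empty = {}
add_missing_keys(empty, ["a", "b", "c"])

# Example 3: create a brand-new dict directly from the list of keys.
fresh = dict.fromkeys(["a", "b", "c"])
```

Both `dict.setdefault` and `dict.fromkeys` are part of the standard `dict` type, so nothing needs to be imported. If the new keys should start with something other than `None`, pass that value as `default` (or as the second argument to `dict.fromkeys`).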

Continue reading “Create/Update dictionary from list”

Restarting ALSA Audio


Follow these steps:

sudo /etc/init.d/alsa-utils stop
sudo alsa force-reload
sudo /etc/init.d/alsa-utils start

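When I was running openSUSE 11.1 in the previous decade, the ALSA sound driver would sometimes throw an error while playing a video with VLC media player. (openSUSE 11.1 was the first Linux distro that got me hooked on Linux; to put things in perspective, openSUSE’s current version is 15.0 😉) The solution was simply to restart the ALSA sound driver by running the following command as super-user: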

/etc/init.d/alsasound restart

Virtual Box boot from USB


You may want to do this for a number of reasons. You may have a bootable USB thumb drive / USB flash drive / USB stick (whatever you call it) containing a Live CD, installation image, etc. that you want to try out before you actually use it on your computer, or maybe you don’t want to use that bootable USB on your computer at all, whatever the case might be.

Linux

The following are three different methods you could use.

Method # 1: Create a pointer to your USB

I am using Ubuntu 18.04 LTS, but it could be any Linux OS/distro/flavor. If you have a bootable USB that you want to boot your VM from, go ahead and insert it.

First, you need to find the logical device for your removable USB flash drive. One way to do it is to use the lshw command (ls for hardware, get it?). It is recommended that you run this command as super-user (sudo); otherwise, the warning “your output may be incomplete or inaccurate, you should run this program as super-user” would be displayed, which makes sense. If you need more information on lshw, including installation and basic usage, see this project website or this article.

Here is the raw command, which will output EVERYTHING:

# "sudo lshw" shows everything
$ sudo lshw -class volume -disable TEST -notime

Then look for the entry associated with your drive’s label. Alternatively, the following command is much more concise if you know what you are looking for:

$ sudo lshw -businfo -disable TEST | grep volume

In my case, from the first command above, it was /dev/sdb1.
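
If you would rather pick the device paths out of that output programmatically, the filtering step can be sketched in Python. This is only a rough sketch under stated assumptions: the function name `volume_devices` is hypothetical, and `SAMPLE` is invented text in the general shape of `lshw -businfo` output, not captured from a real machine. It also matches `volume` as a whole column value, which is a little stricter than the substring match `grep volume` performs.

```python
def volume_devices(businfo_text):
    """Return the /dev/... paths found on lines that contain a 'volume' column.

    Hypothetical helper: assumes whitespace-separated columns, as in the
    tabular output of `lshw -businfo`.
    """
    devices = []
    for line in businfo_text.splitlines():
        columns = line.split()
        if "volume" in columns:
            # Keep only the columns that look like device nodes.
            devices.extend(col for col in columns if col.startswith("/dev/"))
    return devices


# Invented sample text for illustration only (not real lshw output).
SAMPLE = """\
Bus info          Device      Class      Description
====================================================
scsi@0:0.0.0      /dev/sda    disk       500GB Hypothetical Disk
scsi@0:0.0.0,1    /dev/sda1   volume     465GiB EXT4 volume
scsi@4:0.0.0      /dev/sdb    disk       16GB Hypothetical Flash
scsi@4:0.0.0,1    /dev/sdb1   volume     14GiB Windows FAT volume
"""

print(volume_devices(SAMPLE))
```

On a real machine, the text could come from running the command above through Python’s standard `subprocess` module instead of from a hard-coded string. You would still need to decide which of the listed volumes is your USB drive.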

Next, Continue reading “Virtual Box boot from USB”