I’ve used Hadoop for several years now. One of the most frustrating parts of using Hadoop is the time it takes to start up the Java HDFS client to run simple tasks. Even listing a directory can take several seconds because of the startup cost of launching the JVM.
In 2013, Spotify open sourced a pure Python implementation of the HDFS client called snakebite. You can find it on Spotify’s GitHub. Exciting times, for sure. What was a series of slow, rickety (and error-prone) shell scripts wrapped around the hadoop fs commands could be turned into workable Python. The only downside was that the first implementation of snakebite did not support Kerberos. This was a major drawback for those of us whose security requirements mean running Kerberized Hadoop environments.
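As a rough illustration of what "workable Python" can look like, here is a minimal sketch built on snakebite's `Client` class. The `format_entry` and `list_directory` helpers are invented for this example, and the namenode host and port in the usage comment are placeholders. `use_sasl=True` only makes sense on a Kerberized cluster with snakebite's Kerberos support installed.

```python
def format_entry(entry):
    """Render one snakebite ls() result (a dict) as 'type size path'."""
    return "{0} {1:>12} {2}".format(
        entry["file_type"], entry["length"], entry["path"]
    )


def list_directory(host, port, path, use_sasl=False):
    """List an HDFS directory.

    Needs the snakebite package and a reachable namenode, so the import is
    done lazily to keep format_entry usable on its own.
    """
    from snakebite.client import Client

    client = Client(host, port, use_sasl=use_sasl)
    # ls() takes a list of paths and yields one dict per directory entry.
    return [format_entry(entry) for entry in client.ls([path])]


# Usage, with a placeholder namenode address:
#   for line in list_directory("namenode.example.com", 8020, "/user/hcoyote"):
#       print(line)
```

No JVM is started at any point, which is where the speed-up over the hadoop fs wrappers comes from.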
The coming of snakebite
A few months ago, Spotify released support for Kerberos. Happy days for Hadoop Operations folks like me having to deal with ghetto scripts. One of the first things I did was to figure out how to create a reliable and reproducible install of snakebite on my Cloudera Hadoop cluster.
I could go the simple route and do the pip install thing, but that would require installing all the development libraries and toolchains necessary for compiling the Kerberos backend in snakebite. For a one-time install on each system, this added a significant amount of unnecessary software on the cluster.
The second route was to use fpm and build RPMs that could be used both on my Hadoop cluster and on other systems in my environment. This was pretty simple (and the basis for my recent fpm and docker post). I added the RPMs to my yum repo, built a quick puppet module to load them as needed and …
… ran into an operational failure when attempting to install them on servers that happened to run the Mesosphere Mesos RPMs.
Mesosphere, we have a problem!
Looking at my snakebite RPMs, I found the following:
file /usr/lib/python2.6/site-packages/google/protobuf/text_format.py from install of python-protobuf-3.0.0b2-1.noarch conflicts with file from package mesos-0.23.1-0.2.61.centos65.x86_64
file /usr/lib/python2.6/site-packages/google/protobuf/text_format.pyc from install of python-protobuf-3.0.0b2-1.noarch conflicts with file from package mesos-0.23.1-0.2.61.centos65.x86_64
It turns out that Mesos includes its own copy of the Python protobuf library, installing it into the Python site-packages directory. But snakebite also wants its own version of protobuf installed (from a separate python-protobuf RPM that I built). Both packages try to own the same files, which is a significant problem, and not one I want to attempt to resolve with virtualenv.
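When two packages fight over the same files, it helps to see which copy of a module the interpreter would actually pick up. The sketch below is one way to check. It assumes a Python 3 interpreter (the hosts in this post ran Python 2.6, where `importlib.util.find_spec` is not available), and `module_origin` is a helper name made up for this example.

```python
import importlib.util


def module_origin(name):
    """Return the file a module would be imported from, or None if unavailable."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package (for example 'google') is missing.
        return None
    if spec is None:
        return None
    return spec.origin


# On a host with the conflict described above, this would point into the
# shared site-packages directory that both RPMs try to write to.
print(module_origin("google.protobuf"))
```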
Thankfully, I brought this issue up with one of my coworkers who suggested I try using Python pex.
What is pex?
pex is a tool for Python that implements the Python EXecutable environment. The easiest way to think of a pex file is as the Python virtualenv equivalent of a Java JAR or WAR file: it’s a compressed copy of everything needed to run a self-contained tool or app. Plus, it can be created in a way that is portable across operating systems. Neat, huh?
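To make the "executable archive" idea concrete, here is a small sketch of the underlying mechanism. pex itself is not used here: the standard library's `zipapp` module is swapped in because it produces the same basic shape (a zip file with a shebang line and a `__main__.py`), although unlike pex it does not resolve or bundle any dependencies. The file names and helper functions are invented for the example.

```python
import os
import subprocess
import sys
import tempfile
import zipapp


def build_demo_archive(workdir):
    """Create a tiny executable zip archive and return its path."""
    src = os.path.join(workdir, "app")
    os.mkdir(src)
    with open(os.path.join(src, "__main__.py"), "w") as handle:
        handle.write('print("hello from inside the archive")\n')
    target = os.path.join(workdir, "demo.pyz")
    # The interpreter argument becomes the shebang line of the archive.
    zipapp.create_archive(src, target, interpreter="/usr/bin/env python3")
    return target


def run_archive(path):
    """Run the archive with the current interpreter and return its stdout."""
    return subprocess.check_output([sys.executable, path]).decode().strip()


workdir = tempfile.mkdtemp()
print(run_archive(build_demo_archive(workdir)))
```

Python can execute a zip file directly when it contains a `__main__.py`. A pex file builds on that by also packing the resolved dependencies into the archive.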
Why use pex?
One of the driving ideas behind a pex file is the ability to easily deploy a bundle of code with something as simple as a /bin/cp. You get an isolated, executable environment without a lot of dependency fuss to go through.
If you still don’t understand, there is a great lightning talk from Twitter called WTF is PEX; it’s about 15 minutes long and breaks down the important parts of pex.
Prepping for pex
In an earlier post, I talked about using docker to create clean build environments using fpm. I re-use that docker image here because of how simple it is to spin up and add my build dependencies to it.
In this case, because the pex command is not already installed, I add it into the running docker container, then add in the development libraries necessary for building the snakebite library and its Kerberos dependencies.
$ docker run -ti -v /tmp/fpmbuild:/tmp/fpmbuild fpm-centos6
[root@b508004a7709 fpmbuild]# pip install pex
[root@b508004a7709 fpmbuild]# yum install -y python-devel cyrus-sasl-devel krb5-devel
Running pex on snakebite
Here, I’m going to build a pex-ified version of the snakebite library. A few things to note:
- I need to build snakebite with Kerberos support.
  - To get Kerberos support, you would normally do pip install snakebite[kerberos].
  - The [kerberos] string ends up being a build option, or “extra”, for snakebite’s setup.py.
- I want to build a pex-ified version of the snakebite command that comes with the snakebite library, so I need a copy of that script in my working dir.
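To show what the bracketed "extra" amounts to, here is a small sketch that splits a requirement string into its package name and extras, then looks up the additional packages each extra pulls in. Both helper functions and the `EXTRAS_REQUIRE` table are hypothetical: the table is only modelled on the packages that appear in the pex build output, and is not copied from snakebite's actual setup.py.

```python
import re


def split_requirement(requirement):
    """Split 'name[extra1,extra2]' into (name, [extras])."""
    match = re.match(
        r"^\s*([A-Za-z0-9._-]+)\s*(?:\[([^\]]*)\])?\s*$", requirement
    )
    if match is None:
        raise ValueError("unsupported requirement string: %r" % requirement)
    name, extras = match.group(1), match.group(2) or ""
    return name, [e.strip() for e in extras.split(",") if e.strip()]


def extra_dependencies(extras_require, requested):
    """Collect the additional packages pulled in by the requested extras."""
    deps = []
    for extra in requested:
        deps.extend(extras_require[extra])
    return deps


# Hypothetical extras table, in the style of a setup.py extras_require dict.
EXTRAS_REQUIRE = {"kerberos": ["python-krbV", "sasl"]}

name, extras = split_requirement("snakebite[kerberos]")
print(name, extras, extra_dependencies(EXTRAS_REQUIRE, extras))
```

Without the extra, only the base dependencies are installed. With it, the Kerberos-related packages come along as well, which is why the compilers and development headers are needed at build time.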
[root@b508004a7709 fpmbuild]# pex -v 'snakebite[kerberos]' -o snakebite.pex -c snakebite
pex: Building pex
pex: Building pex :: Resolving distributions
pex: Building pex :: Resolving distributions :: Packaging snakebite
pex: Building pex :: Resolving distributions :: Packaging sasl
pex: Building pex :: Resolving distributions :: Packaging python-krbV
argparse 1.4.0
snakebite 2.7.2
six 1.10.0
sasl 0.1.3
python-krbV 1.0.90
protobuf 3.0.0b2
setuptools 19.2
pex: Building pex: 124009.6ms
pex: Resolving distributions: 123946.4ms
pex: Packaging snakebite: 16495.9ms
pex: Packaging sasl: 17479.8ms
pex: Packaging python-krbV: 17198.1ms
Saving PEX file to snakebite.pex
[root@b508004a7709 fpmbuild]#
The options and arguments passed to pex do the following:
- turn on verbosity: -v
- specify the requirement to resolve and bundle, including its Kerberos extra: snakebite[kerberos]
- the name of the pex output file: -o snakebite.pex
- the name of the script to use as the default entry point for the pex file: -c snakebite
  - this is the script that will run when you run snakebite.pex
  - this is called snakebite.py in the working directory, but .py is not required in the command argument.
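If the build needs to be scripted rather than typed, the same invocation can be assembled from Python. This is only a sketch: `pex_command` and `build_pex` are names invented here, and only the `-v`, `-o` and `-c` flags already shown above are used. Passing the arguments as a list also avoids having to quote the square brackets for the shell.

```python
import subprocess


def pex_command(requirement, output, script, verbose=True):
    """Build the argv list for the pex invocation described above."""
    argv = ["pex"]
    if verbose:
        argv.append("-v")
    argv += [requirement, "-o", output, "-c", script]
    return argv


def build_pex(requirement, output, script):
    """Run the build; requires the pex command to be on PATH."""
    subprocess.check_call(pex_command(requirement, output, script))


print(" ".join(pex_command("snakebite[kerberos]", "snakebite.pex", "snakebite")))
```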
Testing out the pex-ified snakebite
Once the snakebite.pex is built, I tested to see what it was doing. I expected it to do the same thing as the snakebite command that comes in the library. That sets up a Python HDFS client and lets you do file operations. You can see that I did a directory listing below.
[hcoyote@hadoopclient ~]$ export HADOOP_CONF_DIR=/etc/hadoop-confs/cdh5/
[hcoyote@hadoopclient ~]$ ./snakebite.pex ls
Found 4 items
drwx------ - hcoyote users 0 2016-01-12 18:34 /user/hcoyote/.Trash
drwx------ - hcoyote users 0 2016-01-13 01:58 /user/hcoyote/.staging
-rw-r--r-- 3 hcoyote users 598758188 2014-04-25 15:03 /user/hcoyote/2014-04-25-fetchImage-fsimage.tsv
-rw-r--r-- 3 hcoyote users 5184637 2015-02-20 12:17 /user/hcoyote/SecurityAuth-hdfs.audit.gz
Next, I want to make sure that the snakebite.pex command isn’t lying to me, so I try doing the same operation using the Java HDFS client.
[hcoyote@hadoopclient ~]$ export HDP_DIR=/home/hcoyote/hadoop-2.6.0-cdh5.4.9
[hcoyote@hadoopclient ~]$ export HADOOP_CONF_DIR=/etc/hadoop-confs/cdh5
[hcoyote@hadoopclient ~]$ export CDH_MR2_HOME=${HDP_DIR}
[hcoyote@hadoopclient ~]$ export JAVA_HOME=/usr/java/jdk_x64
[hcoyote@hadoopclient ~]$ hadoop-2.6.0-cdh5.4.9/bin/hadoop fs -ls
Found 4 items
drwx------ - hcoyote users 0 2016-01-12 18:34 .Trash
drwx------ - hcoyote users 0 2016-01-13 02:05 .staging
-rw-r--r-- 3 hcoyote users 598758188 2014-04-25 15:03 2014-04-25-fetchImage-fsimage.tsv
-rw-r--r-- 3 hcoyote users 5184637 2015-02-20 12:17 SecurityAuth-hdfs.audit.gz
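Comparing the two listings by eye works for four entries, but the check can also be sketched in code. The `parse_listing` helper below is invented for this example. It assumes each entry has the eight whitespace-separated fields seen above (so paths containing spaces would not parse), and it compares only names and sizes, because snakebite prints absolute paths and the .staging timestamp changed between the two runs.

```python
import posixpath


def parse_listing(text):
    """Map each entry's base name to its size from ls-style output."""
    entries = {}
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the "Found N items" header and anything that is not an entry.
        if len(fields) != 8:
            continue
        size, path = int(fields[4]), fields[7]
        entries[posixpath.basename(path)] = size
    return entries


SNAKEBITE_OUTPUT = """
Found 4 items
drwx------ - hcoyote users 0 2016-01-12 18:34 /user/hcoyote/.Trash
drwx------ - hcoyote users 0 2016-01-13 01:58 /user/hcoyote/.staging
-rw-r--r-- 3 hcoyote users 598758188 2014-04-25 15:03 /user/hcoyote/2014-04-25-fetchImage-fsimage.tsv
-rw-r--r-- 3 hcoyote users 5184637 2015-02-20 12:17 /user/hcoyote/SecurityAuth-hdfs.audit.gz
"""

JAVA_OUTPUT = """
Found 4 items
drwx------ - hcoyote users 0 2016-01-12 18:34 .Trash
drwx------ - hcoyote users 0 2016-01-13 02:05 .staging
-rw-r--r-- 3 hcoyote users 598758188 2014-04-25 15:03 2014-04-25-fetchImage-fsimage.tsv
-rw-r--r-- 3 hcoyote users 5184637 2015-02-20 12:17 SecurityAuth-hdfs.audit.gz
"""

# True when both clients report the same names and sizes.
print(parse_listing(SNAKEBITE_OUTPUT) == parse_listing(JAVA_OUTPUT))
```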
Achievement Unlocked!
Some quick testing shows that the Java client consistently takes 3-4 seconds to return the directory listing, while the pex-ified Python client returns it roughly an order of magnitude faster. That’s a win for both me and my users when doing simple file system operations. The bigger win is that I now have a mechanism for creating portable tools that use the snakebite library on systems that may have conflicting dependencies, without having to build out Python virtual environments to get this working.
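For anyone who wants to repeat the timing comparison, a minimal harness might look like the sketch below. `time_command` is a name invented for this example, and the command it times here is a trivial stand-in so that the snippet runs anywhere. Substitute the real client invocations on a host that has them.

```python
import subprocess
import sys
import time


def time_command(argv, runs=3):
    """Return the best wall-clock time in seconds over several runs of argv."""
    best = None
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.check_call(argv, stdout=subprocess.DEVNULL)
        elapsed = time.perf_counter() - start
        best = elapsed if best is None else min(best, elapsed)
    return best


# Stand-in command; on a Hadoop client host, swap in for example
# ["./snakebite.pex", "ls"] and ["hadoop-2.6.0-cdh5.4.9/bin/hadoop", "fs", "-ls"].
print("%.3fs" % time_command([sys.executable, "-c", "pass"]))
```

Taking the best of several runs reduces noise from disk caching and other activity on the host.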