One of the biggest challenges I have running Hadoop clusters is constantly validating that the health and well-being of the cluster meets my standards for operation.  Hadoop, like any large software ecosystem, is composed of many layers of technologies, starting from the physical machine, up into the operating system kernel, the distributed filesystem layer, the compute layer, and beyond. As you add in more and more complexity to the cluster, you find it becomes more and more necessary to verify the ongoing state of the system.  Verifying Hadoop health must be checked at all layers to ensure your sanity stays intact.  Otherwise, you will find yourself putting out small fire after small fire, wondering when the next cinder will burn your cluster down.
A healthy node is a happy node
Every node is sacred.  Every node is great.  If a node lay wasted, Hadoop gets quite irate.  Ok, not really.  Within a Hadoop cluster of any reasonable size, you don’t need to fixate on a single node.  Hadoop can tolerate nodes coming in and out.  But, it’s a waste of resources to not have as many nodes up and running as optimally as possible.
Now, this isn’t specifically geared towards monitoring, but you could use the same methodology (and even the same tools) to do your monitoring.
The approach here is to define a set of test cases describing what you declare as healthy, and then run those tests to validate your expected outcome.  This is very similar to what you would find in test-driven development (TDD) practices.  More specifically, this is called behavior driven development (BDD), an outgrowth of TDD. In my case, the automation code is the development base.  I use Serverspec as the ruleset for defining tests and what the valid output should be.
Behavior driven development?
I won’t go into great detail about behavior driven development (BDD); there are many good articles about this. BDD has a few principles behind it.
- When you define a test, it should be in terms of the expected behavior of the software.
- Write behaviors as a simple narrative. This is known as Given-When-Then.
- For example: Given a Hadoop Datanode, when a Datanode process is present, then it should be listening on port 50075.
- Behaviors define the business logic you’re trying to validate. Validate the behavior, validate the code.
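As a concrete sketch of the Given-When-Then narrative above, here is the "then" clause in plain Ruby (not yet Serverspec). The `port_listening?` helper is our own illustration, and the port follows the Datanode convention used later in this article:

```ruby
require 'socket'

# Given a Hadoop Datanode, when the Datanode process is present, then it
# should be listening on its well-known port. This hypothetical helper
# implements the "then" clause: true if something on this host is
# accepting TCP connections on the given port.
def port_listening?(port, host = '127.0.0.1')
  Socket.tcp(host, port, connect_timeout: 1) { true }
rescue StandardError
  false
end

# The behavior then reduces to a single assertion, e.g.:
# raise 'Datanode is not listening!' unless port_listening?(50075)
```

Serverspec, covered next, lets you express exactly this kind of check as a one-line expectation instead of hand-rolled socket code.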
Simple? Good. Now, onto Serverspec.
What is Serverspec?
Serverspec is a BDD framework written on top of Ruby RSpec.  Its primary function is to let you write tests in RSpec that validate the actual behavior of your servers.  It is simple to install.
$ sudo gem install serverspec
It is simple to write tests for. For example, you can use it to validate that a particular filesystem is mounted:
describe file('/') do
  it { should be_mounted }
end
or that a kernel parameter is set:
describe 'Linux kernel parameters' do
  context linux_kernel_parameter('net.ipv4.tcp_syncookies') do
    its(:value) { should eq 1 }
  end
end
or even that a port is open and in the LISTEN state for a given process:

describe port(80) do
  it { should be_listening }
end
You can use it standalone or with your favorite Ruby-based automation tool, such as Puppet or Chef.
Using Serverspec to check Hadoop node health
So now that you have a very basic understanding of Serverspec, let’s talk about how this helps us validate a node’s health. Ordinarily, you don’t care how an individual node is working within Hadoop; you care about the overall cluster health and performance.  But over time, if you’re not paying attention to nodes as they drop off, you’ll find that the efficiency of your cluster usage drops with them.  You’ll want a standard way of validating that the physical attributes of each node are as you expect.
In my clusters, I want to validate that several things are always correctly set.  While automation tools like Puppet help a lot, we’ve hit situations where the runtime state of a node definitely did not match its configured state.
For this discussion, I will be focusing on compute nodes in the cluster. This would be the node that runs the Datanode process, the Tasktracker, the HBase Regionserver, and so on.
- Every node runs the Datanode process and should be configured to start on boot. The process should always be running and should be listening on port 50075.
- Every node runs the Tasktracker process and should be configured to start on boot. The process should always be running and should be listening on port 50060.
- Every node runs the Regionserver process and should be configured to start on boot. The process should always be running and should be listening on port 60020.
- `/etc/hosts` should contain a definition for localhost and the fully qualified domain name (and short name) for this host.
- Every node should have twelve separate mount points for the HDFS filesystems. Each filesystem should be `ext4` and have specific mount options and specific top-level file ownerships.
- Each node should have specific kernel configurations or settings for performance tuning or fault recovery:
  - the kernel’s `kernel.panic` setting configured for 300 seconds (to work around a specific type of hardware fault we encounter)
  - `vm.swappiness` set to zero to prevent swapping from occurring
  - the `intel_idle` driver disabled so that the highest CPU C-state is zero
  - Transparent Huge Pages disabled
- Every node should have a bonded network device using Link Aggregation Control Protocol (LACP). It should contain specific physical interfaces. Each interface should be at least 1Gbps.
- Every node should run a Link Layer Discovery Protocol daemon (LLDP) so we have an easier time determining which switch port(s) the node is connected to.
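A few of the kernel-level expectations in the list above can be checked directly against /proc and /sys. Here is a minimal plain-Ruby sketch; the helper names are our own, while the paths are the standard Linux ones:

```ruby
# Read an integer sysctl value straight from /proc, returning nil if the
# parameter doesn't exist on this machine (so the check degrades safely).
def sysctl_value(name)
  path = "/proc/sys/#{name.tr('.', '/')}"
  File.exist?(path) ? File.read(path).strip.to_i : nil
end

# Transparent Huge Pages state: the kernel brackets the active mode,
# e.g. "always madvise [never]" -> "never". Returns nil if THP support
# is absent from this kernel.
def thp_enabled
  path = '/sys/kernel/mm/transparent_hugepage/enabled'
  File.exist?(path) ? File.read(path)[/\[(\w+)\]/, 1] : nil
end

# Compare against the expected values from the requirements above.
failures = []
failures << 'kernel.panic should be 300' unless sysctl_value('kernel.panic') == 300
failures << 'vm.swappiness should be 0'  unless sysctl_value('vm.swappiness') == 0
failures << 'THP should be disabled'     unless thp_enabled == 'never'
puts failures.empty? ? 'healthy' : failures.join("\n")
```

Serverspec’s `linux_kernel_parameter` resource does the sysctl half of this for you; the sketch just shows what it boils down to under the hood.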
Tying Serverspec into Monitoring
All of these behaviors define the minimum required set of features that the system should be running with before we want to release it into the cluster.  Puppet will handle setting most (if not all) of these things.  With the combination of these two, I now have an easily expandable set of smoke tests for adding new systems to a cluster.  I also have the basis of a new monitoring framework to tack on to my monitoring system.  Both activities overlap in requirements, but the use of Serverspec allows me to define the behaviors only once.
There are two ways that I would tie this into our monitoring tools and Hadoop.  The first way is to write a wrapper around Serverspec so that we alert on individual node health failures.  But that goes against what we really care about in Hadoop:  the cluster matters, not the node.  Monitor the cluster health, figure out when it drops below a certain threshold of performance, and then alert on that.
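That threshold idea is simple to sketch. In this hypothetical helper, the per-node booleans would come from a wrapper around each node’s Serverspec run, and the threshold is an arbitrary example value:

```ruby
# Given a map of node name -> boolean health (as reported by per-node
# Serverspec runs), alert only when the fraction of healthy nodes drops
# below a threshold, rather than paging on any single node failure.
def cluster_alert?(node_health, healthy_fraction_threshold = 0.9)
  return false if node_health.empty?
  healthy = node_health.values.count { |ok| ok }
  (healthy.to_f / node_health.size) < healthy_fraction_threshold
end

# One dead node out of two drops the cluster to 50% healthy:
# cluster_alert?('hdn20' => true, 'hdn21' => false)  # => true
```

The interesting design choice is the threshold itself: it encodes how much degraded capacity you are willing to absorb before a human gets involved.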
The second way is to indirectly inject this into the Hadoop cluster in a way that proactively prevents compute tasks from running on a bad server. Basically, because Serverspec can tell us what the current health state is of a node, we can use that to leverage the Tasktracker blacklist to allow nodes to shoot themselves in the head if they go bad for some reason.
To do this, you configure the Tasktracker to periodically run a script defined in `mapred.healthChecker.script.path`. The output will be one of two things: nothing, or a line beginning with `ERROR`. If it is `ERROR`, the node is flagged for blacklisting.  Once in the blacklist, the Jobtracker stops sending MapReduce jobs to that node until the Tasktracker is restarted or returns to a healthy state.  This script needs to be fast and non-blocking.  What will probably work best is a small script, run by the Tasktracker, that reads a state file, and a second script, a wrapper around the Serverspec tests, run via cron on a frequent and periodic basis to write out the current health state.
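A sketch of that two-script split in plain Ruby. The file locations, function names, and spec checkout path are our own assumptions; only the `ERROR` output convention comes from Hadoop:

```ruby
# Script 1: run from cron; wraps the Serverspec suite and records the
# result. `rspec` exits 0 only when every example passed.
def record_health_state(state_file, spec_dir = '/opt/hadoop-serverspec') # assumed path
  ok = system('rspec', chdir: spec_dir)
  File.write(state_file, ok ? 'OK' : 'ERROR')
  ok
end

# Script 2: the mapred.healthChecker.script.path target. It must be fast
# and non-blocking, so it only reads the precomputed state file; a line
# starting with ERROR gets the node blacklisted by the Jobtracker.
def health_check_output(state_file)
  state = File.exist?(state_file) ? File.read(state_file).strip : 'OK'
  state == 'ERROR' ? 'ERROR node failed Serverspec health checks' : ''
end

# puts health_check_output('/var/run/node-health-state')  # assumed location
```

Note that a missing state file is treated as healthy here; you could just as reasonably treat it as an error, depending on how much you trust cron.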
You can read more about this in the Hadoop configuration documentation.
Where to go from here?
Right now, I have the start of the Serverspec tests written for the business logic I defined above.  I’m still determining the best way to break things up so that I can reuse common code across both Hadoop and non-Hadoop parts of my environment.  At the moment, I’m primarily targeting server acceptance testing of my new clusters, but I intend to move towards using it as part of our overall health monitoring at some point.
You can see the code and play with it yourself: https://github.com/hcoyote/hadoop-serverspec
$ git clone https://github.com/hcoyote/hadoop-serverspec
$ cd hadoop-serverspec
$ rspec

HDFS Filesystems
  File "/hdfs/01"
    should be mounted
    should be owned by "root"
    should be grouped into "root"
    should be mode 755
  [... identical checks repeat for "/hdfs/02" through "/hdfs/12" ...]

Hadoop Datanode Daemon
  Package "hadoop-hdfs-datanode"
    should be installed
  Service "hadoop-hdfs-datanode"
    should be enabled
    should be running
  Port "50020"
    should be listening

Hadoop requirements to run
  File "/etc/hosts"
    should contain /127.0.0.1 localhost localhost.localdomain/
    should contain "hdn20.example.net"
  Host "127.0.0.1"
    should be resolvable
  Host "localhost"
    should be resolvable
  Host "hdn20.example.net"
    should be resolvable

Hadoop Tasktracker Daemon
  Package "hadoop-0.20-mapreduce-tasktracker"
    should be installed
  Service "hadoop-0.20-mapreduce-tasktracker"
    should be enabled
    should be running

Hardware Configuration
  Linux kernel parameter "kernel.panic"
    value
      should eq 300

HBase Regionserver
  Package "hbase-regionserver"
    should be installed
  Service "hbase-regionserver"
    should be enabled
    should be running
  Port "60030"
    should be listening

Network Configuration
  Physical Network
    Interface "em1"
      should exist
      speed
        should eq 1000
    Interface "em2"
      should exist
      speed
        should eq 1000
    Interface "p3p1"
      should exist
      speed
        should eq 1000
  LACP
    Bond "bond0"
      should exist
      should have interface "em1"
      should have interface "em2"
      should have interface "p3p1"
  LLDP
    Package "lldpd"
      should be installed
    Service "lldpd"
      should be enabled
      should be running

CPU Kernel Configurations for Max Performance
  File "/proc/cmdline"
    content
      should match /intel_idle.max_cstate=0?/
  Linux kernel parameter "vm.swappiness"
    value
      should eq 0
  Transparent Huge Page Support
    File "/sys/kernel/mm/transparent_hugepage/defrag"
      content
        should match /\[never\]/
    File "/sys/kernel/mm/transparent_hugepage/enabled"
      content
        should match /\[never\]/

Finished in 20.38 seconds (files took 0.89765 seconds to load)
82 examples, 0 failures