Pig ‘local’ mode fails when Kerberos auth is enabled.

I ran across this interesting Kerberos authentication bug today on Cloudera’s CDH4. It appears to affect all versions of Pig, but only when running in local mode, which is exactly how I want to run it. Local mode means that Pig fires up everything it needs to run the MapReduce job on your local machine without having […]
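
For reference, local mode is just a flag on the pig command line. A minimal sketch, assuming a script named wordcount.pig; the PIG_OPTS line is a general workaround for secure-cluster settings bleeding into local runs, not a confirmed fix for this particular bug:

    # Local mode: no cluster, everything runs in a single JVM on this machine.
    pig -x local wordcount.pig

    # Hedged workaround: force simple auth for just this run.
    PIG_OPTS="-Dhadoop.security.authentication=simple" pig -x local wordcount.pig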

Followup on Cloudera HUE’s Kerberos kt_renewer

Just a short followup on the HUE kt_renewer problem I discovered. It turns out the issue was mine, not HUE’s. The fix was pretty simple once I saw the clue in a related report: it seems Cloudera Manager had the same issue. The problem ended up being a missing […]

Kerberos kt_renewer failures with HUE on CDH4

First off, I’m not exactly sure whether this is a Hadoop User Experience (HUE) issue or a broken setup in my Kerberos environment. I have a thread open on the HUE users list, but haven’t had any followup. I’ve just fired up HUE for the first time to talk to a Kerberos-enabled […]
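
For context, these are roughly the hue.ini settings in play when pointing HUE at a Kerberized cluster; the keytab path, principal, and kinit location below are placeholders:

    [desktop]
      [[kerberos]]
        # Keytab and principal HUE authenticates with (values are examples)
        hue_keytab=/etc/hue/hue.keytab
        hue_principal=hue/hue-host.example.com@EXAMPLE.COM
        kinit_path=/usr/bin/kinit
        # Seconds between kt_renewer re-kinits from the keytab
        reinit_frequency=3600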

Mass-gzip files inside HDFS using the power of Hadoop

I have a bunch of text files sitting in HDFS that I need to compress. It’s on the order of several hundred files comprising several hundred gigabytes of data. There are several ways to do this. I could individually copy down each file, compress it, and re-upload it to HDFS. This takes an excessively long […]
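
A sketch of the cluster-powered route: one Hadoop streaming job with an identity mapper, zero reducers, and gzip output compression. The jar path varies by install, the HDFS paths are placeholders, and the output lands as gzipped part-* files rather than keeping the original filenames:

    # Let MapReduce compress everything in parallel across the cluster.
    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
        -Dmapred.output.compress=true \
        -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
        -Dmapred.reduce.tasks=0 \
        -input /data/raw \
        -output /data/raw-gzipped \
        -mapper /bin/cat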

Using cobbler with a fast filesystem creation snippet for Kickstart %post installs of Hadoop nodes

I run Hadoop servers with twelve 2TB hard drives in each. One of the bottlenecks occurs during kickstart, when anaconda creates the filesystems. Previously, I just pulled a specific partition configuration in during %pre, but this caused the filesystem formatting section of kickstart to take several hours […]
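
The shape of the snippet, assuming data disks /dev/sdb through /dev/sdm that each already carry a single partition: kick off every mkfs in the background from %post and wait once, so twelve formats run in parallel instead of back to back.

    # Kickstart %post: format all data drives concurrently.
    for dev in /dev/sd[b-m]; do
        mkfs.ext4 -q "${dev}1" &    # one background mkfs per drive
    done
    wait    # don't leave %post until every format has finished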

Hadoop DataNode logs filling with clienttrace messages

So, you’re probably like me. You have a shiny new Cloudera Hadoop cluster. Everything is zooming along smoothly, until you find that your DataNode logs under /var/log/hadoop are growing at a rate of a bazillion gigabytes per day. What do you do, hot shot? WHAT DO YOU DO? Actually, it’s pretty simple. We were getting alerts […]
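
For anyone hitting the same flood, the knob generally cited is the DataNode’s clienttrace logger in log4j.properties; the config path below is the common CDH default and may differ on your install:

    # Raise the clienttrace logger above INFO so per-block transfer lines
    # stop landing in the DataNode log; restart the DataNode afterwards.
    echo 'log4j.logger.org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace=WARN' \
        >> /etc/hadoop/conf/log4j.properties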

Hadoop, facter, and the puppet marionette

I’ve been working with puppet a lot lately. A lot. It’s part of my job. We’ve been setting up a new Hadoop cluster in our Xen environment. Nothing big. It started out with 4 nodes, all configured the same way (3 drives each). We then added 2 more nodes with 2 drives each. This, of […]
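
Where this is headed, in sketch form: a custom facter fact (facter’s native Ruby; the fact name and drive pattern are hypothetical) that reports each node’s data drives, so one puppet template can build dfs.data.dir for the 3-drive and 2-drive nodes alike:

    # data_drives.rb -- ships in the module's lib/facter directory.
    # Reports a comma-separated list of candidate data disks on this node.
    Facter.add(:data_drives) do
      setcode do
        Dir.glob('/dev/sd[b-z]').sort.join(',')  # crude pattern; adjust per hardware
      end
    end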