I run Hadoop servers with twelve 2TB hard drives in each of them. One bottleneck with this setup occurs during kickstart, when anaconda creates the filesystems. Previously, I had a fixed partition configuration that was brought in during %pre, but this caused the filesystem formatting stage of kickstart to take several hours to complete. After some additional changes that required us to hand-sign the puppet certificates created during %post, the entire process became too unwieldy. I got tired of waiting hours for systems to reach %post, just so I could turn around and sign things.
Instead, I moved the partitioning and filesystem creation into a %post snippet for cobbler and tweaked a few filesystem creation options (which you can’t do with the partition directive in ks.cfg). This brought creation time down from several hours to about 10 minutes.
The snippet is below.
I’d like to point a few things out with our configuration.
- We have 12 2TB drives, as I stated above. Each drive is dedicated to HDFS. Each drive is mounted to /hdfs/XX where XX is 01 through 12.
- All HDFS drives are given an e2label of the form /hdfs/XX. We use this as a signal within a puppet fact to do configuration of the data node. If we lose a drive, we can just re-run puppet and have the configuration fixed temporarily while we wait for a replacement.
- We use ext4 on these systems since we’re running a sufficiently new kernel in the 2.6 line. We chose this based on various recommendations now coming out of places like Cloudera, AMD, and Intel.
- We disable atime and diratime on all HDFS drives to prevent useless writes that occur with atime updates.
- We use an FS profile called “largefile” during mkfs.ext4 creation, which reduces the number of inodes that get created on the filesystem. Since we’re generally dealing with large files, this is acceptable to us.
- We use several ext4 features:
  - dir_index – speeds up directory lookups in large directories
  - extent – uses extent-based block allocation, which benefits large files by allowing them to be laid out more contiguously on disk
  - sparse_super – creates fewer superblock backups, which aren’t needed on large filesystems
- We have a RAID1 OS drive on /dev/sda. This was done because we wanted to dedicate as much space as possible to HDFS and to prevent the first HDFS drive from taking an I/O hit due to logging or other non-HDFS activities. The RAID1 volume is presented to the OS with a model of “Virtual disk”, which allows us to detect it and skip operating on it.
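The drive-to-label numbering and the size check that separates HDFS drives from the OS volume can be tried out in isolation. The following is a minimal sketch, not the snippet itself: `hdfs_label_for` and `sectors_to_gib` are hypothetical helper names, it uses `printf` rather than `od` to get the ASCII code of the drive letter, and the sector count in the example is what a typical 2TB drive reports.

```shell
#!/bin/sh
# Sketch only: hdfs_label_for and sectors_to_gib are hypothetical helper
# names used for illustration; they are not part of the cobbler snippet.

# Map a device name such as "sdb" to its HDFS label.
# sda is the OS volume, so sdb -> /hdfs/01, sdc -> /hdfs/02, ... sdm -> /hdfs/12.
hdfs_label_for() {
    letters=${1#sd}                       # strip the "sd" prefix, e.g. "b"
    first=${letters%"${letters#?}"}       # keep only the first character
    ascii=$(printf '%d' "'$first")        # ASCII code of that character
    printf '/hdfs/%02d\n' $((ascii - 97)) # 97 is the ASCII code of 'a'
}

# Convert a sector count from /sys/block/<dev>/size to whole GiB.
# The kernel reports this value in 512-byte sectors, and
# 2097152 (2^21) sectors * 512 bytes = 1 GiB.
sectors_to_gib() {
    echo $(( $1 / 2097152 ))
}

hdfs_label_for sdb          # prints /hdfs/01
hdfs_label_for sdm          # prints /hdfs/12
sectors_to_gib 3907029168   # prints 1863
```

A 2TB drive comes out at roughly 1863, comfortably above a threshold of 1000, while a small OS volume would fall below it.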
Finally, because Cobbler uses Cheetah as its backend templating system, I want to point out that there are some additional escaped dollar signs (\$) in the snippet to prevent cobbler from choking on them. Without those escapes, you could use this straight in a shell script.
    DIR="/sys/block"
    MINSIZE=1000    # minimum size (in GB) for a drive to be treated as an HDFS drive

    # list-harddrives doesn't exist in the chroot post install environment. bummer.
    for DEV in `cd $DIR ; ls -d sd*`; do
      if [ -d $DIR/$DEV ] ; then
        REMOVABLE=`cat $DIR/$DEV/removable`
        if (( $REMOVABLE == 0 )) ; then
          MODEL=`cat $DIR/$DEV/device/model`
          if [[ "$MODEL" =~ ^Virtual.* ]] ; then
            # the RAID1 OS volume shows up as "Virtual disk"
            echo "Found a virtual disk on /dev/$DEV, skipping"
          else
            echo "Found $DEV"
            # size is reported in 512-byte sectors; 2**21 sectors = 1 GB
            SIZE=`cat $DIR/$DEV/size`
            GB=$(($SIZE/2**21))
            if [ $GB -gt $MINSIZE ] ; then
              # we are a non-root drive
              echo "Found a rightsize drive on $DEV for hadoop"
              # parted's rm takes a partition number, not a device path
              for partition in `parted -s /dev/$DEV print | awk '/^ / {print $1}'`; do
                parted -s /dev/$DEV rm ${partition}
              done
              parted -s /dev/$DEV mklabel gpt
              parted -s /dev/$DEV mkpart -- primary ext4 1 -1
              partprobe
              # we are going to map /dev/sdX to /hdfs/YY with this
              HDFS_PART_ASCII=`echo $DEV | sed -e 's/sd//' | od -N 1 -i | head -1 | tr -s " " | cut -d" " -f 2`
              HDFS_PART_NUMBER=\$(($HDFS_PART_ASCII - 97))
              HDFS_LABEL=\$(printf "/hdfs/%02g" $HDFS_PART_NUMBER)
              mkfs.ext4 -T largefile -m 1 -O dir_index,extent,sparse_super -L $HDFS_LABEL /dev/${DEV}1
              # blkid -o export prints KEY=value pairs; eval sets LABEL for us
              eval `blkid -o export /dev/${DEV}1`
              if [ -n "${LABEL}" ] ; then
                echo "Creating $LABEL mountpoint"
                mkdir -p "${LABEL}"
                echo "Adding $LABEL to /etc/fstab"
                echo "LABEL=$LABEL $LABEL ext4 defaults,noatime,nodiratime 1 2" >> /etc/fstab
                tail -1 /etc/fstab
              fi
            fi
          fi
        fi
      fi
    done
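Once the snippet has run, it can be worth confirming that every expected label made it into /etc/fstab. The following is a rough sketch of such a check and is not part of our kickstart: `missing_hdfs_labels` is a hypothetical helper name, and the default of 12 drives simply matches the layout described above.

```shell
#!/bin/sh
# Sketch only: missing_hdfs_labels is a hypothetical helper name.
# Prints every /hdfs/XX label (01 through the given count, default 12)
# that has no LABEL= entry in the given fstab-style file.
missing_hdfs_labels() {
    fstab=$1
    count=${2:-12}
    i=1
    while [ "$i" -le "$count" ]; do
        label=$(printf '/hdfs/%02d' "$i")
        grep -q "^LABEL=$label[[:space:]]" "$fstab" || echo "$label"
        i=$((i + 1))
    done
}

# Demo against a throwaway file instead of the real /etc/fstab.
demo=$(mktemp)
echo "LABEL=/hdfs/01 /hdfs/01 ext4 defaults,noatime,nodiratime 1 2" >  "$demo"
echo "LABEL=/hdfs/03 /hdfs/03 ext4 defaults,noatime,nodiratime 1 2" >> "$demo"
missing_hdfs_labels "$demo" 3    # prints /hdfs/02
rm -f "$demo"
```

On a real node this would be called as `missing_hdfs_labels /etc/fstab`; empty output means all twelve entries are present.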