I have a bunch of text files sitting in HDFS that I need to compress. It’s on the order of several hundred files comprising several hundred gigabytes of data. There are several ways to do this:
- I could individually copy down each file, compress it, and re-upload it to HDFS. This would take an excessively long time.
- I could run a hadoop streaming job that turns on mapred.output.compress and mapred.compress.map.output, sets mapred.output.compression.codec to org.apache.hadoop.io.compress.GzipCodec, and then just cats the output.
- I could create a shell script that gets uploaded via a hadoop streaming job that copies each file down to the local data node where the task is executing, runs gzip, and re-uploads it.
Option 1 was a no-go from the start. It would have taken days to complete the operation and I would have lost the opportunity to learn something new.
Option 2 was attempted, but I found that the inputs were split along their block boundaries, causing each resulting file to come out as multiple gzipped parts. This was a no-go because I needed each file reassembled in one piece. Additionally, I still ended up having to run one mapreduce job per file, which was going to take about a day the way I had written it.
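For reference, a minimal sketch of that kind of pass-through job looks something like this (the input and output paths here are placeholders, and /bin/cat is just the simplest identity mapper):

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar \
    -Dmapred.reduce.tasks=0 \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -mapper /bin/cat \
    -input /tmp/filedir \
    -output /tmp/filedir-gz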
Option 3 was attempted because it could easily guarantee that the files would be compressed quickly, in parallel, and end up coming back out of the mapreduce job as one file per input. That’s exactly what I needed.
To do this, I needed to do several things.
First, create the input file list.
$ hadoop fs -ls /tmp/filedir/*.txt > gzipped.txt
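Depending on the Hadoop version, hadoop fs -ls prints permissions, owner, size, and timestamp columns in front of the path, and the streaming -input argument is an HDFS path, so the listing may need to be trimmed down to bare paths and copied up to HDFS before the job can use it. A sketch of that cleanup, assuming the path is the last field of each line and that a "Found N items" header may appear:

$ hadoop fs -ls /tmp/filedir/*.txt | grep -v '^Found' | awk '{print $NF}' > gzipped.txt
$ hadoop fs -copyFromLocal gzipped.txt /user/hcoyote/gzipped.txt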
Next, I create a simple shell script for the hadoop streaming job to invoke as the map task. The script looks like this:
#!/bin/sh -e

set -xv

while read dummy filename ; do
    echo "Reading $filename"
    hadoop fs -copyToLocal $filename .
    base=`basename $filename`
    gzip $base
    hadoop fs -copyFromLocal ${base}.gz /tmp/jobvis/${base}.gz
done
A short breakdown of what this is doing:
The input to the map task is fed to us by the hadoop streaming job. Each input line has two columns, the key and the filename. We don’t care about the key, so we just ignore it. Next, we copy the file from HDFS into the task’s current working directory on the datanode’s local mapreduce scratch space; in this case, it ended up somewhere under /hdfs/mapred/local/ttprivate/taskTracker. Since we’re operating on a full filepath, we take the basename so gzip can operate on the file once it’s in the local temporary directory. Once gzip is complete, we upload the result back to HDFS.
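For illustration, each record arriving on the script’s stdin looks roughly like this: the key (the line’s byte offset within gzipped.txt), a tab, then the path. The offset and filename below are made up:

0	/tmp/filedir/part-000.txt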
This particular cluster runs simple authentication, so the jobs actually run as the mapred user. Because of this, the files written into the local datanode temporary directory are owned by the mapred user, and the HDFS directory we upload the results back to needs to be writable by the mapred user.
Note: it’s probably best to run the shell script with -e so that if any operation fails, the task fails. Using set -xv also gives better output in the mapreduce task logs, so you can see what the script is doing during the run.
Next, we create the output directory on HDFS.
$ hadoop fs -mkdir /tmp/jobvis
$ hadoop fs -chmod 777 /tmp/jobvis
Once that’s done, we want to run the hadoop streaming job. I ran it like this. There’s a lot of output here. I include it only for reference.
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar \
    -Dmapred.reduce.tasks=0 \
    -mapper gzipit.sh \
    -input ./gzipped.txt \
    -output /user/hcoyote/gzipped.log \
    -verbose \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -file gzipit.sh
Important to note on this command line: org.apache.hadoop.mapred.lib.NLineInputFormat is the magic here. It tells the job to feed one line of the input, i.e. one filename, to each map task. This allows gzipit.sh to run once per file and gain the parallelism of running on all available map slots in the cluster. One thing I should have done was turn off speculative execution: since each task creates a specific output file, I saw some speculatively launched duplicate tasks fail because the output had already been produced.
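Turning speculative execution off is one more generic option on the same command line (this is the standard MapReduce 1 property; the jobconf dump below shows it defaulting to true):

    -Dmapred.map.tasks.speculative.execution=false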
Then it just looks like this:
STREAM: addTaskEnvironment= STREAM: shippedCanonFiles_=[/home/hcoyote/gzipit.sh] STREAM: shipped: true /home/hcoyote/gzipit.sh STREAM: cmd=gzipit.sh STREAM: cmd=null STREAM: cmd=null STREAM: Found runtime classes in: /tmp/hadoop-hcoyote/hadoop-unjar3549927095616185719/ packageJobJar: [gzipit.sh, /tmp/hadoop-hcoyote/hadoop-unjar3549927095616185719/] [] /tmp/streamjob6232490588444082861.jar tmpDir=null JarBuilder.addNamedStream gzipit.sh JarBuilder.addNamedStream org/apache/hadoop/streaming/DumpTypedBytes.class JarBuilder.addNamedStream org/apache/hadoop/streaming/UTF8ByteArrayUtils.class JarBuilder.addNamedStream org/apache/hadoop/streaming/Environment.class JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamKeyValUtil.class JarBuilder.addNamedStream org/apache/hadoop/streaming/PathFinder.class JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeReducer.class JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamBaseRecordReader.class JarBuilder.addNamedStream org/apache/hadoop/streaming/JarBuilder.class JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed.class JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TypedBytesOutputReader.class JarBuilder.addNamedStream org/apache/hadoop/streaming/io/IdentifierResolver.class JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TextOutputReader.class JarBuilder.addNamedStream org/apache/hadoop/streaming/io/RawBytesOutputReader.class JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TextInputWriter.class JarBuilder.addNamedStream org/apache/hadoop/streaming/io/InputWriter.class JarBuilder.addNamedStream org/apache/hadoop/streaming/io/OutputReader.class JarBuilder.addNamedStream org/apache/hadoop/streaming/io/TypedBytesInputWriter.class JarBuilder.addNamedStream org/apache/hadoop/streaming/io/RawBytesInputWriter.class JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$MROutputThread.class JarBuilder.addNamedStream org/apache/hadoop/streaming/AutoInputFormat.class JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil$TaskId.class JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeCombiner.class JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRunner.class JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil$StreamConsumer.class JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamUtil.class JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamXmlRecordReader.class JarBuilder.addNamedStream org/apache/hadoop/streaming/HadoopStreaming.class JarBuilder.addNamedStream org/apache/hadoop/streaming/LoadTypedBytes.class JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$MRErrorThread.class JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapper.class JarBuilder.addNamedStream org/apache/hadoop/streaming/PipeMapRed$1.class JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamInputFormat.class JarBuilder.addNamedStream org/apache/hadoop/streaming/StreamJob.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput$1.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput$TypedBytesIndex.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableOutput$1.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableOutput.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput$1.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput$2.class JarBuilder.addNamedStream 
org/apache/hadoop/typedbytes/TypedBytesRecordInput.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordOutput$1.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesOutput$1.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritable.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesOutput.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordInput$1.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesRecordOutput.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesWritableInput.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/TypedBytesInput.class JarBuilder.addNamedStream org/apache/hadoop/typedbytes/Type.class JarBuilder.addNamedStream META-INF/MANIFEST.MF STREAM: ==== JobConf properties: STREAM: dfs.access.time.precision=3600000 STREAM: dfs.balance.bandwidthPerSec=943718400 STREAM: dfs.block.access.key.update.interval=600 STREAM: dfs.block.access.token.enable=false STREAM: dfs.block.access.token.lifetime=600 STREAM: dfs.block.size=67108864 STREAM: dfs.blockreport.initialDelay=0 STREAM: dfs.blockreport.intervalMsec=3600000 STREAM: dfs.client.block.write.retries=3 STREAM: dfs.data.dir=${hadoop.tmp.dir}/dfs/data STREAM: dfs.datanode.address=0.0.0.0:50010 STREAM: dfs.datanode.data.dir.perm=700 STREAM: dfs.datanode.directoryscan.threads=1 STREAM: dfs.datanode.dns.interface=default STREAM: dfs.datanode.dns.nameserver=default STREAM: dfs.datanode.du.reserved=53687091200 STREAM: dfs.datanode.failed.volumes.tolerated=0 STREAM: dfs.datanode.handler.count=3 STREAM: dfs.datanode.http.address=0.0.0.0:50075 STREAM: dfs.datanode.https.address=0.0.0.0:50475 STREAM: dfs.datanode.ipc.address=0.0.0.0:50020 STREAM: dfs.datanode.max.xcievers=4096 STREAM: dfs.datanode.plugins=org.apache.hadoop.thriftfs.DatanodePlugin STREAM: dfs.default.chunk.view.size=32768 STREAM: dfs.df.interval=60000 STREAM: dfs.heartbeat.interval=3 STREAM: dfs.hosts=/etc/hadoop/conf/hosts.include STREAM: dfs.hosts.exclude=/etc/hadoop/conf/hosts.exclude STREAM: dfs.http.address=0.0.0.0:50070 STREAM: dfs.https.address=0.0.0.0:50470 STREAM: dfs.https.client.keystore.resource=ssl-client.xml STREAM: dfs.https.enable=false STREAM: dfs.https.need.client.auth=false STREAM: dfs.https.server.keystore.resource=ssl-server.xml STREAM: dfs.max-repl-streams=16 STREAM: dfs.max.objects=0 STREAM: dfs.name.dir=/hdfs/01/name,/hdfs/02/name,/mnt/remote_namenode_failsafe/name STREAM: dfs.name.edits.dir=${dfs.name.dir} STREAM: dfs.namenode.decommission.interval=30 STREAM: dfs.namenode.decommission.nodes.per.interval=5 STREAM: dfs.namenode.delegation.key.update-interval=86400000 STREAM: dfs.namenode.delegation.token.max-lifetime=604800000 STREAM: dfs.namenode.delegation.token.renew-interval=86400000 STREAM: dfs.namenode.handler.count=10 STREAM: dfs.namenode.logging.level=info STREAM: dfs.namenode.plugins=org.apache.hadoop.thriftfs.NamenodePlugin STREAM: dfs.permissions=true STREAM: dfs.permissions.supergroup=supergroup STREAM: dfs.replication=3 STREAM: dfs.replication.considerLoad=true STREAM: dfs.replication.interval=3 STREAM: dfs.replication.max=512 STREAM: dfs.replication.min=1 STREAM: dfs.safemode.extension=30000 STREAM: dfs.safemode.min.datanodes=0 STREAM: dfs.safemode.threshold.pct=0.999f STREAM: dfs.secondary.http.address=0.0.0.0:50090 STREAM: dfs.support.append=true STREAM: dfs.thrift.address=0.0.0.0:10090 STREAM: dfs.web.ugi=webuser,webgroup STREAM: fs.automatic.close=true STREAM: 
fs.checkpoint.dir=/hdfs/01/checkpoint,/hdfs/02/checkpoint STREAM: fs.checkpoint.edits.dir=${fs.checkpoint.dir} STREAM: fs.checkpoint.period=3600 STREAM: fs.checkpoint.size=67108864 STREAM: fs.default.name=hdfs://namenode.example.net:9000/ STREAM: fs.file.impl=org.apache.hadoop.fs.LocalFileSystem STREAM: fs.ftp.impl=org.apache.hadoop.fs.ftp.FTPFileSystem STREAM: fs.har.impl=org.apache.hadoop.fs.HarFileSystem STREAM: fs.har.impl.disable.cache=true STREAM: fs.hdfs.impl=org.apache.hadoop.hdfs.DistributedFileSystem STREAM: fs.hftp.impl=org.apache.hadoop.hdfs.HftpFileSystem STREAM: fs.hsftp.impl=org.apache.hadoop.hdfs.HsftpFileSystem STREAM: fs.inmemory.size.mb=192 STREAM: fs.kfs.impl=org.apache.hadoop.fs.kfs.KosmosFileSystem STREAM: fs.ramfs.impl=org.apache.hadoop.fs.InMemoryFileSystem STREAM: fs.s3.block.size=67108864 STREAM: fs.s3.buffer.dir=${hadoop.tmp.dir}/s3 STREAM: fs.s3.impl=org.apache.hadoop.fs.s3.S3FileSystem STREAM: fs.s3.maxRetries=4 STREAM: fs.s3.sleepTimeSeconds=10 STREAM: fs.s3n.block.size=67108864 STREAM: fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem STREAM: fs.trash.interval=0 STREAM: hadoop.http.authentication.kerberos.keytab=${user.home}/hadoop.keytab STREAM: hadoop.http.authentication.kerberos.principal=HTTP/localhost@LOCALHOST STREAM: hadoop.http.authentication.signature.secret.file=${user.home}/hadoop-http-auth-signature-secret STREAM: hadoop.http.authentication.simple.anonymous.allowed=true STREAM: hadoop.http.authentication.token.validity=36000 STREAM: hadoop.http.authentication.type=simple STREAM: hadoop.kerberos.kinit.command=kinit STREAM: hadoop.logfile.count=10 STREAM: hadoop.logfile.size=10000000 STREAM: hadoop.native.lib=true STREAM: hadoop.permitted.revisions= 03b655719d13929bd68bb2c2f9cee615b389cea9, 217a3767c48ad11d4632e19a22897677268c40c4 STREAM: hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.StandardSocketFactory STREAM: hadoop.security.authentication=simple STREAM: hadoop.security.authorization=false STREAM: hadoop.security.group.mapping=org.apache.hadoop.security.ShellBasedUnixGroupsMapping STREAM: hadoop.security.uid.cache.secs=14400 STREAM: hadoop.tmp.dir=/tmp/hadoop-${user.name} STREAM: hadoop.util.hash.type=murmur STREAM: hadoop.workaround.non.threadsafe.getpwuid=false STREAM: io.bytes.per.checksum=512 STREAM: io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec STREAM: io.file.buffer.size=131702 STREAM: io.map.index.skip=0 STREAM: io.mapfile.bloom.error.rate=0.005 STREAM: io.mapfile.bloom.size=1048576 STREAM: io.seqfile.compress.blocksize=1000000 STREAM: io.seqfile.lazydecompress=true STREAM: io.seqfile.sorter.recordlimit=1000000 STREAM: io.serializations=org.apache.hadoop.io.serializer.WritableSerialization STREAM: io.skip.checksum.errors=false STREAM: io.sort.factor=10 STREAM: io.sort.mb=100 STREAM: io.sort.record.percent=0.05 STREAM: io.sort.spill.percent=0.80 STREAM: ipc.client.connect.max.retries=10 STREAM: ipc.client.connection.maxidletime=10000 STREAM: ipc.client.idlethreshold=4000 STREAM: ipc.client.kill.max=10 STREAM: ipc.client.tcpnodelay=false STREAM: ipc.server.listen.queue.size=128 STREAM: ipc.server.tcpnodelay=false STREAM: job.end.retry.attempts=0 STREAM: job.end.retry.interval=30000 STREAM: jobclient.completion.poll.interval=5000 STREAM: jobclient.output.filter=FAILED STREAM: jobclient.progress.monitor.poll.interval=1000 STREAM: 
jobtracker.thrift.address=0.0.0.0:9290 STREAM: keep.failed.task.files=false STREAM: local.cache.size=10737418240 STREAM: map.sort.class=org.apache.hadoop.util.QuickSort STREAM: mapred.acls.enabled=false STREAM: mapred.child.java.opts=-Xmx512m -Xms512m STREAM: mapred.child.tmp=./tmp STREAM: mapred.cluster.map.memory.mb=-1 STREAM: mapred.cluster.max.map.memory.mb=-1 STREAM: mapred.cluster.max.reduce.memory.mb=-1 STREAM: mapred.cluster.reduce.memory.mb=-1 STREAM: mapred.compress.map.output=false STREAM: mapred.create.symlink=yes STREAM: mapred.disk.healthChecker.interval=60000 STREAM: mapred.fairscheduler.preemption=true STREAM: mapred.healthChecker.interval=60000 STREAM: mapred.healthChecker.script.timeout=600000 STREAM: mapred.heartbeats.in.second=100 STREAM: mapred.hosts=/etc/hadoop/conf/hosts.include STREAM: mapred.hosts.exclude=/etc/hadoop/conf/hosts.exclude STREAM: mapred.inmem.merge.threshold=1000 STREAM: mapred.input.dir=hdfs://namenode.example.net:9000/user/hcoyote/gzipped.txt STREAM: mapred.input.format.class=org.apache.hadoop.mapred.lib.NLineInputFormat STREAM: mapred.jar=/tmp/streamjob6232490588444082861.jar STREAM: mapred.job.map.memory.mb=-1 STREAM: mapred.job.queue.name=default STREAM: mapred.job.reduce.input.buffer.percent=0.0 STREAM: mapred.job.reduce.memory.mb=-1 STREAM: mapred.job.reuse.jvm.num.tasks=1 STREAM: mapred.job.shuffle.input.buffer.percent=0.70 STREAM: mapred.job.shuffle.merge.percent=0.66 STREAM: mapred.job.tracker=namenode.example.net:54311 STREAM: mapred.job.tracker.handler.count=10 STREAM: mapred.job.tracker.http.address=0.0.0.0:50030 STREAM: mapred.job.tracker.jobhistory.lru.cache.size=5 STREAM: mapred.job.tracker.persist.jobstatus.active=false STREAM: mapred.job.tracker.persist.jobstatus.dir=/jobtracker/jobsInfo STREAM: mapred.job.tracker.persist.jobstatus.hours=0 STREAM: mapred.job.tracker.retiredjobs.cache.size=1000 STREAM: mapred.jobtracker.completeuserjobs.maximum=20 STREAM: mapred.jobtracker.instrumentation=org.apache.hadoop.mapred.JobTrackerMetricsInst STREAM: mapred.jobtracker.job.history.block.size=3145728 STREAM: mapred.jobtracker.maxtasks.per.job=-1 STREAM: mapred.jobtracker.plugins=org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin STREAM: mapred.jobtracker.restart.recover=false STREAM: mapred.jobtracker.taskScheduler=org.apache.hadoop.mapred.FairScheduler STREAM: mapred.line.input.format.linespermap=1 STREAM: mapred.local.dir=${hadoop.tmp.dir}/mapred/local STREAM: mapred.local.dir.minspacekill=0 STREAM: mapred.local.dir.minspacestart=0 STREAM: mapred.map.max.attempts=4 STREAM: mapred.map.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec STREAM: mapred.map.runner.class=org.apache.hadoop.streaming.PipeMapRunner STREAM: mapred.map.tasks=2 STREAM: mapred.map.tasks.speculative.execution=true STREAM: mapred.mapoutput.key.class=org.apache.hadoop.io.Text STREAM: mapred.mapoutput.value.class=org.apache.hadoop.io.Text STREAM: mapred.mapper.class=org.apache.hadoop.streaming.PipeMapper STREAM: mapred.max.tracker.blacklists=4 STREAM: mapred.max.tracker.failures=4 STREAM: mapred.merge.recordsBeforeProgress=10000 STREAM: mapred.min.split.size=0 STREAM: mapred.output.compress=false STREAM: mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec STREAM: mapred.output.compression.type=RECORD STREAM: mapred.output.dir=hdfs://namenode.example.net:9000/user/hcoyote/gzipped.log STREAM: mapred.output.format.class=org.apache.hadoop.mapred.TextOutputFormat STREAM: mapred.output.key.class=org.apache.hadoop.io.Text STREAM: 
mapred.output.value.class=org.apache.hadoop.io.Text STREAM: mapred.queue.default.state=RUNNING STREAM: mapred.queue.names=default STREAM: mapred.reduce.max.attempts=4 STREAM: mapred.reduce.parallel.copies=33 STREAM: mapred.reduce.slowstart.completed.maps=0.75 STREAM: mapred.reduce.tasks=0 STREAM: mapred.reduce.tasks.speculative.execution=false STREAM: mapred.skip.attempts.to.start.skipping=2 STREAM: mapred.skip.map.auto.incr.proc.count=true STREAM: mapred.skip.map.max.skip.records=0 STREAM: mapred.skip.reduce.auto.incr.proc.count=true STREAM: mapred.skip.reduce.max.skip.groups=0 STREAM: mapred.submit.replication=10 STREAM: mapred.system.dir=/mapred/system STREAM: mapred.task.cache.levels=2 STREAM: mapred.task.profile=false STREAM: mapred.task.profile.maps=0-2 STREAM: mapred.task.profile.reduces=0-2 STREAM: mapred.task.timeout=600000 STREAM: mapred.task.tracker.http.address=0.0.0.0:50060 STREAM: mapred.task.tracker.report.address=127.0.0.1:0 STREAM: mapred.task.tracker.task-controller=org.apache.hadoop.mapred.DefaultTaskController STREAM: mapred.tasktracker.dns.interface=default STREAM: mapred.tasktracker.dns.nameserver=default STREAM: mapred.tasktracker.expiry.interval=600000 STREAM: mapred.tasktracker.indexcache.mb=10 STREAM: mapred.tasktracker.instrumentation=org.apache.hadoop.mapred.TaskTrackerMetricsInst STREAM: mapred.tasktracker.map.tasks.maximum=8 STREAM: mapred.tasktracker.reduce.tasks.maximum=8 STREAM: mapred.tasktracker.taskmemorymanager.monitoring-interval=5000 STREAM: mapred.tasktracker.tasks.sleeptime-before-sigkill=5000 STREAM: mapred.temp.dir=${hadoop.tmp.dir}/mapred/temp STREAM: mapred.used.genericoptionsparser=true STREAM: mapred.user.jobconf.limit=5242880 STREAM: mapred.userlog.limit.kb=128 STREAM: mapred.userlog.retain.hours=24 STREAM: mapred.working.dir=hdfs://namenode.example.net:9000/user/hcoyote STREAM: mapreduce.job.acl-modify-job= STREAM: mapreduce.job.acl-view-job= STREAM: mapreduce.job.complete.cancel.delegation.tokens=true STREAM: mapreduce.job.counters.limit=120 STREAM: mapreduce.job.jar.unpack.pattern=(?:classes/|lib/).*|(?:\Qgzipit.sh\E) STREAM: mapreduce.jobtracker.split.metainfo.maxsize=10000000 STREAM: mapreduce.jobtracker.staging.root.dir=${hadoop.tmp.dir}/mapred/staging STREAM: mapreduce.reduce.input.limit=-1 STREAM: mapreduce.reduce.shuffle.connect.timeout=180000 STREAM: mapreduce.reduce.shuffle.maxfetchfailures=10 STREAM: mapreduce.reduce.shuffle.read.timeout=180000 STREAM: mapreduce.tasktracker.cache.local.numberdirectories=10000 STREAM: mapreduce.tasktracker.outofband.heartbeat=false STREAM: stream.addenvironment= STREAM: stream.map.input.writer.class=org.apache.hadoop.streaming.io.TextInputWriter STREAM: stream.map.output.reader.class=org.apache.hadoop.streaming.io.TextOutputReader STREAM: stream.map.streamprocessor=gzipit.sh STREAM: stream.numinputspecs=1 STREAM: stream.reduce.input.writer.class=org.apache.hadoop.streaming.io.TextInputWriter STREAM: stream.reduce.output.reader.class=org.apache.hadoop.streaming.io.TextOutputReader STREAM: tasktracker.http.threads=40 STREAM: topology.node.switch.mapping.impl=org.apache.hadoop.net.ScriptBasedMapping STREAM: topology.script.file.name=/etc/hadoop/conf/rack-topology.sh STREAM: topology.script.number.args=100 STREAM: webinterface.private.actions=false STREAM: ==== STREAM: submitting to jobconf: namenode.example.net:54311 13/09/24 17:10:27 INFO mapred.FileInputFormat: Total input paths to process : 1 13/09/24 17:10:27 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-hcoyote/mapred/local] 13/09/24 
17:10:27 INFO streaming.StreamJob: Running job: job_201307061907_108574 13/09/24 17:10:27 INFO streaming.StreamJob: To kill this job, run: 13/09/24 17:10:27 INFO streaming.StreamJob: /usr/lib/hadoop-0.20/bin/hadoop job -Dmapred.job.tracker=namenode.example.net:54311 -kill job_201307061907_108574 13/09/24 17:10:27 INFO streaming.StreamJob: Tracking URL: http://namenode.example.net:50030/jobdetails.jsp?jobid=job_201307061907_108574 13/09/24 17:10:28 INFO streaming.StreamJob: map 0% reduce 0% 13/09/24 17:10:41 INFO streaming.StreamJob: map 4% reduce 0% 13/09/24 17:10:42 INFO streaming.StreamJob: map 37% reduce 0% 13/09/24 17:10:43 INFO streaming.StreamJob: map 55% reduce 0% 13/09/24 17:10:44 INFO streaming.StreamJob: map 61% reduce 0% 13/09/24 17:10:45 INFO streaming.StreamJob: map 66% reduce 0% 13/09/24 17:10:46 INFO streaming.StreamJob: map 70% reduce 0% 13/09/24 17:10:47 INFO streaming.StreamJob: map 72% reduce 0% 13/09/24 17:10:48 INFO streaming.StreamJob: map 73% reduce 0% 13/09/24 17:10:49 INFO streaming.StreamJob: map 74% reduce 0% 13/09/24 17:11:18 INFO streaming.StreamJob: map 75% reduce 0% 13/09/24 17:11:27 INFO streaming.StreamJob: map 76% reduce 0% 13/09/24 17:11:29 INFO streaming.StreamJob: map 77% reduce 0% 13/09/24 17:11:32 INFO streaming.StreamJob: map 78% reduce 0% 13/09/24 17:11:34 INFO streaming.StreamJob: map 79% reduce 0% 13/09/24 17:11:37 INFO streaming.StreamJob: map 80% reduce 0% 13/09/24 17:11:38 INFO streaming.StreamJob: map 82% reduce 0% 13/09/24 17:11:40 INFO streaming.StreamJob: map 83% reduce 0% 13/09/24 17:11:41 INFO streaming.StreamJob: map 84% reduce 0% 13/09/24 17:11:42 INFO streaming.StreamJob: map 86% reduce 0% 13/09/24 17:11:43 INFO streaming.StreamJob: map 88% reduce 0% 13/09/24 17:11:44 INFO streaming.StreamJob: map 89% reduce 0% 13/09/24 17:11:45 INFO streaming.StreamJob: map 90% reduce 0% 13/09/24 17:11:46 INFO streaming.StreamJob: map 91% reduce 0% 13/09/24 17:11:47 INFO streaming.StreamJob: map 92% reduce 0% 13/09/24 17:11:48 INFO streaming.StreamJob: map 94% reduce 0% 13/09/24 17:11:49 INFO streaming.StreamJob: map 95% reduce 0% 13/09/24 17:11:50 INFO streaming.StreamJob: map 96% reduce 0% 13/09/24 17:11:51 INFO streaming.StreamJob: map 97% reduce 0% 13/09/24 17:11:52 INFO streaming.StreamJob: map 99% reduce 0% 13/09/24 17:11:53 INFO streaming.StreamJob: map 100% reduce 0% 13/09/24 17:14:14 INFO streaming.StreamJob: map 100% reduce 100% 13/09/24 17:14:14 INFO streaming.StreamJob: Job complete: job_201307061907_108574 13/09/24 17:14:14 INFO streaming.StreamJob: Output: /user/hcoyote/gzipped.log