[hadoop@master test]$ hadoop jar /home/hadoop/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar -info
Warning: $HADOOP_HOME is deprecated.
14/12/15 14:06:32 ERROR streaming.StreamJob: Missing required options: input, output
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <cmd|JavaClassName> The streaming command to run
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-io <identifier> Optional.
Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|jobtracker:port> specify a job tracker
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.
The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
In -input: globbing on <path> is supported and can have multiple -input
Default Map input format: a line is a record in UTF-8
the key part ends at first TAB, the rest of the line is the value
Custom input format: -inputformat package.MyInputFormat
Map output format, reduce input/output format:
Format defined by what the mapper command outputs. Line-oriented
The files named in the -file argument[s] end up in the
working directory when the mapper and reducer are run.
The location of this working directory is unspecified.
To set the number of reduce tasks (num. of output files):
-D mapred.reduce.tasks=10
To skip the sort/combine/shuffle/sort/reduce step:
Use -numReduceTasks 0
A Task's Map output then becomes a 'side-effect output' rather than a reduce input
This speeds up processing, This also feels more like "in-place" processing
because the input filename and the map input order are preserved
This equivalent -reducer NONE
To speed up the last maps:
-D mapred.map.tasks.speculative.execution=true
To speed up the last reduces:
-D mapred.reduce.tasks.speculative.execution=true
To name the job (appears in the JobTracker Web UI):
-D mapred.job.name='My Job'
To change the local temp directory:
-D dfs.data.dir=/tmp/dfs
-D stream.tmpdir=/tmp/streaming
Additional local temp directories with -cluster local:
-D mapred.local.dir=/tmp/local
-D mapred.system.dir=/tmp/system
-D mapred.temp.dir=/tmp/temp
To treat tasks with non-zero exit status as SUCCEDED:
-D stream.non.zero.exit.is.failure=false
Use a custom hadoopStreaming build along a standard hadoop install:
$HADOOP_HOME/bin/hadoop jar /path/my-hadoop-streaming.jar [...]\
[...] -D stream.shipped.hadoopstreaming=/path/my-hadoop-streaming.jar
For more details about jobconf parameters see:
To set an environement variable in a streaming command:
-cmdenv EXAMPLE_DIR=/home/example/dictionaries/
setenv HSTREAMING "$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar"
Example: $HSTREAMING -mapper "/usr/local/bin/perl5 filter.pl"
-file /local/filter.pl -input "/logs/0604*/*" [...]
Ships a script, invokes the non-shipped perl interpreter
Shipped files go to the working directory so filter.pl is found by perl
Input files are all the daily logs for days in month 2006-04
Streaming Command Failed!
[hadoop@master test]$ hadoop jar /home/hadoop/hadoop-1.2.1/contrib/streaming/hadoop-streaming-1.2.1.jar -Dmapred.reduce.tasks=1 -input /test/t1* -output /testoutput -mapper cat -reducer cat
Warning: $HADOOP_HOME is deprecated.
To keep everything on the grid, use Hadoop streaming with a single reducer and cat as both the mapper and the reducer (effectively a no-op). Compression can be added with MapReduce flags.
hadoop jar \
$HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming.jar \
-Dmapred.reduce.tasks=1 \
-Dmapred.job.queue.name=$QUEUE \
-input "$INPUT" \
-output "$OUTPUT" \
-mapper cat \
-reducer cat
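If you want compression, add these two flags next to the other -D options (generic options have to come before -input, -output, -mapper and -reducer):
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \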
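Putting the pieces together, here is a minimal sketch of the single-reducer merge with gzip output. The function name merge_cmd, the RUN switch and the /opt/hadoop fallback are made up for illustration. The jar path is the one quoted above and may carry a version suffix on a real install. By default the script only prints the command; it calls hadoop only when RUN=1.

```shell
#!/bin/sh
# Sketch only: merge_cmd, RUN and the /opt/hadoop fallback are illustrative
# assumptions, not part of Hadoop. Adjust the jar path to your installation.
STREAMING_JAR="${STREAMING_JAR:-${HADOOP_PREFIX:-/opt/hadoop}/share/hadoop/tools/lib/hadoop-streaming.jar}"

# merge_cmd INPUT OUTPUT
#   INPUT  - HDFS path or glob; pass it quoted so the local shell does not expand it
#   OUTPUT - HDFS output directory (must not exist yet)
merge_cmd() {
  # Generic -D options go first, streaming options (-input, ...) after them.
  set -- hadoop jar "$STREAMING_JAR" \
    -Dmapred.reduce.tasks=1 \
    -Dmapred.output.compress=true \
    -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
    -input "$1" \
    -output "$2" \
    -mapper cat \
    -reducer cat

  if [ "${RUN:-0}" = 1 ]; then
    "$@"                   # actually submit the job
  else
    printf '%s\n' "$*"     # dry run: just show the command line
  fi
}

# Dry run using the paths from the transcript above.
merge_cmd '/test/t1*' /testoutput
```

Quoting the glob matters: the usage text says globbing on -input is handled by Hadoop, so the pattern should reach the job unexpanded. Because the records pass through the shuffle and sort on their way to the single reducer, the merged output is sorted by key and does not keep the original line order.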