Getting Started with Hadoop Streaming



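Hadoop Streaming is a programming tool shipped with Hadoop. It offers a very flexible interface that lets users write MapReduce jobs in any language, and it is a common way to write MapReduce jobs without the Java API.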


$ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -input <input directory> \ # multiple input paths may be given, e.g. -input '/user/foo/dir1' -input '/user/foo/dir2'
    -inputformat <input format JavaClassName> \
    -output <output directory> \
    -outputformat <output format JavaClassName> \
    -mapper  \
    -reducer  \
    -combiner  \
    -partitioner  \
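    -cmdenv <name=value> \ # passes an environment variable to the tasks, usable as a parameter; may be given several times
    -file <dependency file> \ # config files, dictionaries and other dependencies
    -D <name=value> \ # job property settings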
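As a rough sketch of how a value passed with -cmdenv reaches a task: each name=value pair ends up in the environment of the mapper/reducer process, so a script can read it through os.environ. The variable name MIN_WORD_LENGTH and both helpers below are hypothetical, chosen only for illustration (for example, `-cmdenv MIN_WORD_LENGTH=3`).

```python
import os


def min_word_length(environ=None, default=1):
    """Read the hypothetical MIN_WORD_LENGTH setting passed via -cmdenv.

    Falls back to `default` when the variable is missing or not an integer.
    """
    environ = os.environ if environ is None else environ
    try:
        return int(environ.get('MIN_WORD_LENGTH', default))
    except ValueError:
        return default


def keep_words(words, threshold):
    """Keep only the words that have at least `threshold` characters."""
    return [w for w in words if len(w) >= threshold]
```

Passing the environment in as an argument keeps the helper easy to test locally without submitting a job.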



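| Property | New name | Meaning | Notes |
| --- | --- | --- | --- |
| mapred.job.name | mapreduce.job.name | Job name | |
| mapred.map.tasks | mapreduce.job.maps | Number of map tasks per job | The number of map tasks actually launched cannot be fully controlled |
| mapred.reduce.tasks | mapreduce.job.reduces | Number of reduce tasks per job | |
| mapred.job.priority | mapreduce.job.priority | Job priority | VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH |
| stream.map.input.field.separator | | Field separator for map input data | Default: \t |
| stream.reduce.input.field.separator | | Field separator for reduce input data | Default: \t |
| stream.map.output.field.separator | | Field separator for map output data | Default: \t |
| stream.reduce.output.field.separator | | Field separator for reduce output data | |
| stream.num.map.output.key.fields | | Number of fields that make up the key in each map task output record | |
| stream.num.reduce.output.key.fields | | Number of fields that make up the key in each reduce task output record | |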

Note: the 2.6.0 Streaming documentation mentions only stream.num.reduce.output.fields and says nothing about stream.num.reduce.output.key.fields; how the two relate still needs to be checked.
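To make the separator and key-field properties in the table above more concrete, here is a minimal sketch of how one output line is divided into key and value. It follows the rule described in the Streaming documentation (the prefix up to the N-th separator is the key; a line with fewer separators is all key with an empty value). The function name split_key_value is an assumption for illustration; it is not part of Hadoop.

```python
def split_key_value(line, separator='\t', num_key_fields=1):
    """Split one record into (key, value).

    separator      plays the role of stream.map.output.field.separator
    num_key_fields plays the role of stream.num.map.output.key.fields
    """
    record = line.rstrip('\n')
    fields = record.split(separator)
    if len(fields) <= num_key_fields:
        # Fewer than num_key_fields separators: the whole line is the key.
        return record, ''
    key = separator.join(fields[:num_key_fields])
    value = separator.join(fields[num_key_fields:])
    return key, value


print(split_key_value('hadoop\t1\n'))        # ('hadoop', '1')
print(split_key_value('a.b.c.d.e', '.', 4))  # ('a.b.c.d', 'e')
print(split_key_value('a.b.c', '.', 4))      # ('a.b.c', '')
```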



Hadoop Streaming requires the user-written Mapper/Reducer to read data from standard input (stdin) and write results to standard output (stdout), much like a Linux pipeline.


$ cat <input_file> | <mapper> | sort | <reducer>

# Streaming example in Python
$ cat <input_file> | python mapper.py | sort | python reducer.py
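The same map, sort, reduce flow can be imitated inside a single Python process, which is sometimes handy for unit tests. This is only a sketch of the idea under simple assumptions (everything fits in memory, one reducer); the names run_pipeline, word_mapper and sum_reducer are made up for this example.

```python
import itertools
import re

WORD_SPLIT = re.compile(r'[^a-zA-Z0-9]+')


def word_mapper(lines):
    """Map step: emit one 'word<TAB>1' line per word."""
    for line in lines:
        for word in WORD_SPLIT.split(line):
            if word:
                yield '%s\t%d' % (word.lower(), 1)


def sum_reducer(sorted_lines):
    """Reduce step: sum the values of consecutive lines sharing a key."""
    key_of = lambda l: l.split('\t')[0]
    for key, group in itertools.groupby(sorted_lines, key=key_of):
        yield '%s\t%d' % (key, sum(int(l.split('\t')[1]) for l in group))


def run_pipeline(lines, mapper, reducer):
    """Equivalent of: cat input | mapper | sort | reducer."""
    return list(reducer(sorted(mapper(lines))))


result = run_pipeline(['Hadoop is the Elephant King!',
                       'A wonderful king is Hadoop.'],
                      word_mapper, sum_reducer)
print(result)
```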




$ cat input/input_0.txt
Hadoop is the Elephant King!
A yellow and elegant thing.
He never forgets
Useful data, or lets
An extraneous element cling!

$ cat input/input_1.txt  
A wonderful king is Hadoop.
The elephant plays well with Sqoop.
But what helps him to thrive
Are Impala, and Hive,
And HDFS in the group.

$ cat input/input_2.txt  
Hadoop is an elegant fellow.
An elephant gentle and mellow.
He never gets mad,
Or does anything bad,
Because, at his core, he is yellow.

$ ${HADOOP_HOME}/bin/hadoop fs -mkdir -p /user//wordcount

$ ${HADOOP_HOME}/bin/hadoop fs -put input/ /user//wordcount


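#!/usr/bin/env python
# encoding: utf-8
# mapper.py: emit "<word>\t1" for every word read from stdin

import re
import sys

# Any run of non-alphanumeric characters separates two words
separator_pattern = re.compile(r'[^a-zA-Z0-9]+')

for line in sys.stdin:
    for word in separator_pattern.split(line):
        if word:  # split() can yield empty strings at the line edges
            print('%s\t%d' % (word.lower(), 1))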


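#!/usr/bin/env python
# encoding: utf-8
# reducer.py: sum the counts of each word; the input must be sorted by key

import sys

last_key = None
last_sum = 0

for line in sys.stdin:
    key, value = line.rstrip('\n').split('\t')
    if last_key is None:
        # First record: start the first group
        last_key = key
        last_sum = int(value)
    elif last_key == key:
        # Same key as before: keep accumulating
        last_sum += int(value)
    else:
        # Key changed: emit the finished group and start a new one
        print('%s\t%d' % (last_key, last_sum))
        last_sum = int(value)
        last_key = key

# Emit the total of the final group
if last_key is not None:
    print('%s\t%d' % (last_key, last_sum))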


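#!/usr/bin/env python
# encoding: utf-8
# reducer.py, alternative version based on itertools.groupby

import itertools
import sys

# Skip blank lines
stdin_generator = (line for line in sys.stdin if line.strip())

# groupby only merges consecutive lines with the same key, so the input must be sorted
for key, values in itertools.groupby(stdin_generator, key=lambda x: x.split('\t')[0]):
    value_sum = sum(int(i.split('\t')[1]) for i in values)
    print('%s\t%d' % (key, value_sum))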
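Both reducer variants rely on the input arriving sorted by key. The short sketch below shows why: itertools.groupby only merges adjacent items, so unsorted input produces several partial totals for the same key. The helper name sum_by_key is an assumption used only in this example.

```python
import itertools


def sum_by_key(lines):
    """Sum the values of consecutive 'key<TAB>value' lines that share a key."""
    key_of = lambda line: line.split('\t')[0]
    return [(key, sum(int(l.split('\t')[1]) for l in group))
            for key, group in itertools.groupby(lines, key=key_of)]


unsorted_lines = ['a\t1', 'b\t1', 'a\t1']

# Without sorting, key 'a' is split into two groups.
print(sum_by_key(unsorted_lines))          # [('a', 1), ('b', 1), ('a', 1)]

# After sorting, each key yields exactly one total.
print(sum_by_key(sorted(unsorted_lines)))  # [('a', 2), ('b', 1)]
```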




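As noted earlier, Streaming works much like a Linux pipeline, so a simple test can be run locally first. Such a test only checks that the program logic behaves roughly as expected; job property settings cannot be verified this way.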

$ cat input/* | python mapper.py  | sort | python reducer.py
a       2
an      3
and     4
anything        1
are     1
at      1
bad     1
because 1
but     1
cling   1
core    1
data    1
does    1
elegant 2
element 1
elephant        3
extraneous      1
fellow  1
forgets 1
gentle  1
gets    1
group   1
hadoop  3
hdfs    1
he      3
helps   1
him     1
his     1
hive    1
impala  1
in      1
is      4
king    2
lets    1
mad     1
mellow  1
never   2
or      2
plays   1
sqoop   1
the     3
thing   1
thrive  1
to      1
useful  1
well    1
what    1
with    1
wonderful       1
yellow  2



#!/usr/bin/env python
# encoding: utf-8
# mapper.py with a custom counter

import re
import sys

separator_pattern = re.compile(r'[^a-zA-Z0-9]+')

def print_counter(group, counter, amount):
    # Streaming picks up counter updates written to stderr in this format
    sys.stderr.write('reporter:counter:{g},{c},{a}\n'.format(g=group, c=counter, a=amount))

for line in sys.stdin:
    for word in separator_pattern.split(line):
        if word:
            print('%s\t%d' % (word.lower(), 1))
        else:
            # Count the empty tokens that split() produces
            print_counter('wc', 'empty-word', 1)


How do I update counters in streaming applications?

A streaming process can use the stderr to emit counter information. reporter:counter:<group>,<counter>,<amount> should be sent to stderr to update the counter.
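A small sketch of helpers that build these stderr lines may make the format easier to reuse. Besides the counter line quoted above, the Streaming documentation also describes a status line of the form reporter:status:<message>. The function names here are assumptions for illustration, and the sketch does not guard against group or counter names that themselves contain commas.

```python
import sys


def counter_line(group, counter, amount):
    """Build the line that increments <group>/<counter> by <amount>."""
    return 'reporter:counter:%s,%s,%d\n' % (group, counter, amount)


def status_line(message):
    """Build the line that sets the task status message."""
    return 'reporter:status:%s\n' % message


def report(line, stream=None):
    """Write a reporter line to stderr (or another stream, for testing)."""
    stream = sys.stderr if stream is None else stream
    stream.write(line)
    stream.flush()


# Example: report(counter_line('wc', 'empty-word', 1))
```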



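# Ship the scripts with -files. Note: generic options such as -D and -files must come before the streaming options (-input, -mapper, ...); placing them later makes the command fail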
$ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.job.name="streaming_wordcount" \
    -D mapred.map.tasks=3 \
    -D mapred.reduce.tasks=3 \
    -D mapred.job.priority=HIGH \
    -files "mapper.py,reducer.py" \
    -input /user//wordcount/input \
    -output /user//wordcount/output \
    -mapper "python mapper.py" \
    -reducer "python reducer.py"

# Output (may differ between versions). The -D options here use the old property names, so some warnings are printed first; they are not shown here
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-unjar707084306300214621/] [] /tmp/streamjob5287904745550112970.jar tmpDir=null
15/09/29 10:35:14 INFO client.RMProxy: Connecting to ResourceManager at xxxxx/x.x.x.x:y
15/09/29 10:35:14 INFO client.RMProxy: Connecting to ResourceManager at xxxxx/x.x.x.x:y
15/09/29 10:35:15 INFO mapred.FileInputFormat: Total input paths to process : 3
15/09/29 10:35:15 INFO mapreduce.JobSubmitter: number of splits:3
15/09/29 10:35:15 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/09/29 10:35:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1440570785607_1597
15/09/29 10:35:15 INFO impl.YarnClientImpl: Submitted application application_1440570785607_1597
15/09/29 10:35:15 INFO mapreduce.Job: The url to track the job: http://xxxxx:yyy/proxy/application_1440570785607_1597/
15/09/29 10:35:15 INFO mapreduce.Job: Running job: job_1440570785607_1597
15/09/29 10:37:15 INFO mapreduce.Job: Job job_1440570785607_1597 running in uber mode : false
15/09/29 10:37:15 INFO mapreduce.Job:  map 0% reduce 0%
15/09/29 10:42:17 INFO mapreduce.Job:  map 33% reduce 0%
15/09/29 10:42:18 INFO mapreduce.Job:  map 100% reduce 0%
15/09/29 10:42:23 INFO mapreduce.Job:  map 100% reduce 100%
15/09/29 10:42:24 INFO mapreduce.Job: Job job_1440570785607_1597 completed successfully
15/09/29 10:42:24 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=689
                FILE: Number of bytes written=661855
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=822
                HDFS: Number of bytes written=379
                HDFS: Number of read operations=18
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=6
        Job Counters
                Launched map tasks=3
                Launched reduce tasks=3
                Rack-local map tasks=3
                Total time spent by all maps in occupied slots (ms)=10657
                Total time spent by all reduces in occupied slots (ms)=21644
                Total time spent by all map tasks (ms)=10657
                Total time spent by all reduce tasks (ms)=10822
                Total vcore-seconds taken by all map tasks=10657
                Total vcore-seconds taken by all reduce tasks=10822
                Total megabyte-seconds taken by all map tasks=43651072
                Total megabyte-seconds taken by all reduce tasks=88653824
        Map-Reduce Framework
                Map input records=15
                Map output records=72
                Map output bytes=527
                Map output materialized bytes=725
                Input split bytes=423
                Combine input records=0
                Combine output records=0
                Reduce input groups=50
                Reduce shuffle bytes=725
                Reduce input records=72
                Reduce output records=50
                Spilled Records=144
                Shuffled Maps =9
                Failed Shuffles=0
                Merged Map outputs=9
                GC time elapsed (ms)=72
                CPU time spent (ms)=7870
                Physical memory (bytes) snapshot=3582062592
                Virtual memory (bytes) snapshot=29715922944
                Total committed heap usage (bytes)=10709630976
        Shuffle Errors
        File Input Format Counters
                Bytes Read=399
        File Output Format Counters
                Bytes Written=379
15/09/29 10:42:24 INFO streaming.StreamJob: Output directory: /user//wordcount/output


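  1. The url to track the job: http://xxxxx:yyy/proxy/application_1440570785607_1597/ Open this URL to check the job status in the web UI
  2. map 0% reduce 0% shows the progress of the map and reduce phases
  3. The Counters section at the end lists the built-in counters; custom counters can be defined to track job-specific statistics
  4. Output directory: /user//wordcount/output is the directory where the results are written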
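If the client output needs to be inspected by a script, the tracking URL and progress lines listed above can be pulled out with two regular expressions. This is only a sketch based on the log format shown in this article; as noted, the output may differ between versions, and parse_client_log is a made-up helper name.

```python
import re

TRACK_URL = re.compile(r'The url to track the job: (\S+)')
PROGRESS = re.compile(r'map (\d+)% reduce (\d+)%')


def parse_client_log(lines):
    """Return (tracking_url, [(map_percent, reduce_percent), ...])."""
    url = None
    progress = []
    for line in lines:
        match = TRACK_URL.search(line)
        if match:
            url = match.group(1)
            continue
        match = PROGRESS.search(line)
        if match:
            progress.append((int(match.group(1)), int(match.group(2))))
    return url, progress


sample = [
    '15/09/29 10:35:15 INFO mapreduce.Job: The url to track the job: '
    'http://xxxxx:yyy/proxy/application_1440570785607_1597/',
    '15/09/29 10:37:15 INFO mapreduce.Job:  map 0% reduce 0%',
    '15/09/29 10:42:17 INFO mapreduce.Job:  map 33% reduce 0%',
    '15/09/29 10:42:23 INFO mapreduce.Job:  map 100% reduce 100%',
]
url, progress = parse_client_log(sample)
print(url)
print(progress)
```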




$ wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
$ tar xzf Python-2.7.10.tgz
$ cd Python-2.7.10

# compile
$ ./configure --prefix=/home//wordcount/python27

$ make -j

$ make install

# Package the build as python27.tar.gz
$ cd /home//wordcount/
$ tar czf python27.tar.gz python27/

# Upload it to HDFS
$ ${HADOOP_HOME}/bin/hadoop fs -mkdir -p /tools/
$ ${HADOOP_HOME}/bin/hadoop fs -put python27.tar.gz /tools

# Launch the job using the Python build uploaded above
$ ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar \
    -D mapred.reduce.tasks=3 \
    -files "mapper.py,reducer.py" \
    -archives "hdfs://xxxxx:9000/tools/python27.tar.gz#py" \
    -input /user//wordcount/input \
    -output /user//wordcount/output \
    -mapper "py/python27/bin/python mapper.py" \
    -reducer "py/python27/bin/python reducer.py"






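When several -input options are given, the job reads from multiple inputs. In practice, different inputs may need different handling, so the task has to find out the input path in order to tell which input path or file a record came from. Streaming exposes a set of configured parameters (Configured Parameters) that provide this kind of runtime information.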

| Name | Type | Description |
| --- | --- | --- |
| mapreduce.job.id | String | The job id |
| mapreduce.job.jar | String | job.jar location in job directory |
| mapreduce.job.local.dir | String | The job specific shared scratch space |
| mapreduce.task.id | String | The task id |
| mapreduce.task.attempt.id | String | The task attempt id |
| mapreduce.task.is.map | boolean | Is this a map task |
| mapreduce.task.partition | int | The id of the task within the job |
| mapreduce.map.input.file | String | The filename that the map is reading from |
| mapreduce.map.input.start | long | The offset of the start of the map input split |
| mapreduce.map.input.length | long | The number of bytes in the map input split |
| mapreduce.task.output.dir | String | The task's temporary output directory |

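While a Streaming job is running, the names of these mapreduce parameters are transformed: every dot (.) becomes an underscore (_). For example, mapreduce.job.id becomes mapreduce_job_id. All of the parameters can be read from environment variables.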


import os

input_file = os.environ['mapreduce_map_input_file']
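Building on the lookup above, a mapper could tag each record with the input it came from. The sketch below is one possible approach, not a prescribed one: the tagging rule (use the name of the file's parent directory) and the helper names conf_env_name, current_input_file and source_tag are all assumptions made for illustration.

```python
import os
import posixpath


def conf_env_name(property_name):
    """Turn a property name into its environment form: dots become underscores."""
    return property_name.replace('.', '_')


def current_input_file(environ=None):
    """Return the file the map task is reading, or '' when it is not set."""
    environ = os.environ if environ is None else environ
    return environ.get(conf_env_name('mapreduce.map.input.file'), '')


def source_tag(path):
    """Hypothetical rule: tag a record with the parent directory of its file."""
    return posixpath.basename(posixpath.dirname(path)) or 'unknown'


print(conf_env_name('mapreduce.job.id'))                      # mapreduce_job_id
print(source_tag('hdfs://xxxxx:9000/user/foo/dir1/a.txt'))    # dir1
```

Using .get() with a default avoids a KeyError when the script is tested locally, where the variable is not defined.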



  1. mrjob


  1. snakebite: an HDFS client implemented in pure Python


  1. Apache Hadoop MapReduce Streaming
  2. Hadoop Streaming Programming (Hadoop Streaming 编程) - 董西成
  3. Deprecated Properties: mapping between old and new parameter names

