MapReduce - Hadoop Streaming - Exercises

A few beginner-level MapReduce exercises.
Versions used: Python 2.6.6, Hadoop 2.6.1

Table of Contents

  • 1. WordCount: Counting Word Frequency in a Text
    • 1.1 Prepare the text data
    • 1.2 Map phase
      • 1.2.1 Write a map function
      • 1.2.2 Testing
      • 1.2.3 Improve the code
    • 1.3 Reduce phase
      • 1.3.1 Coding
      • 1.3.2 Local test
    • 1.4 Test on Hadoop
      • 1.4.1 Put the data on HDFS
      • 1.4.2 Find hadoop-streaming
      • 1.4.3 Launch script
      • 1.4.4 View the results
  • 2. WhiteList: Counting Only Whitelisted Words
    • 2.1 White List
    • 2.2 Mapper
      • 2.2.1 Rewrite the mapper
      • 2.2.2 Test the mapper
    • 2.3 Reducer
    • 2.4 Local Testing
    • 2.5 Test on Hadoop
      • 2.5.1 Modify the launch script
      • 2.5.2 Launch
      • 2.5.3 View the results
  • 3. Collecting Each User's Order Amounts
    • 3.1 Prepare the data
    • 3.2 Mapper
    • 3.3 Reducer
    • 3.4 Run
  • 4. MapReduce join
    • 4.1 Prepare the data
    • 4.2 Mapper
      • 4.2.1 map A
      • 4.2.2 map B
    • 4.3 Reducer
    • 4.4 Upload the source data
    • 4.5 Launch script
    • 4.6 Run it
    • 4.7 View the results
  • Appendix

Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

1. WordCount: Counting Word Frequency in a Text

  • Idea: words in the text are separated by spaces, so we split each line on spaces and then count how many times each word appears.
  • Programming model: as the flow above shows, the whole MR computation is driven by key-value pairs (a tiny local sketch follows these bullets).
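
A minimal, Hadoop-free sketch of this key-value flow in plain Python (the sample sentence below is made up for illustration): the map step emits (word, 1) pairs, the shuffle/sort step groups them by key, and the reduce step sums each group.

sentence = "the man of property the man"
pairs = [(w, 1) for w in sentence.split()]           # map: emit (k2, v2) pairs
groups = {}
for word, one in sorted(pairs):                      # shuffle/sort: group by key
        groups.setdefault(word, []).append(one)
for word in sorted(groups):                          # reduce: emit (k3, v3)
        print '%s\t%d' % (word, sum(groups[word]))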

1.1 Prepare the text data

Any plain-text document will do; I used The_Man_of_Property.txt.

1.2 Map phase

1.2.1 Write a map function

vim map.py
import sys

# read lines from stdin, split on spaces, and emit a <word, 1> pair per word
for line in sys.stdin:
        ss = line.strip().split(' ')
        for s in ss:
                if s.strip() != "":
                        print '%s\t%s' % (s, 1)

1.2.2 Testing

  • Test with the first two lines of the data
head -n 2 The_Man_of_Property.txt |python map.py
  • Result
...
to	1
this	1
day,	1
for	1
all	1
the	1
recent	1
efforts	1
to	1
“talk	1
them	1
out.”	1

Some punctuation from the text is being counted, which is not what we want, so the code needs a small improvement.

1.2.3 Improve the code

Add a regular expression that extracts word characters, so punctuation no longer affects the counts.

import sys
import re

# \w+ extracts word characters only, so punctuation is dropped
p = re.compile(r'\w+')
for line in sys.stdin:
        ss = line.strip().split(' ')
        for s in ss:
                array_s = p.findall(s)
                for word in array_s:
                        if word.strip() != "":
                                print '%s\t%s' % (word.lower(), 1)
  • Re-test result
...
property	1
counted	1
as	1
they	1
do	1
to	1
this	1
day	1
for	1
all	1
the	1
recent	1
efforts	1
to	1
talk	1
them	1
out	1
  • The punctuation no longer interferes.
  • The map output now matches our programming model.

1.3 Reduce phase

In this phase we only need to add up the counts for identical words.

1.3.1 Coding

import sys

current_word = None
count = 0

for line in sys.stdin:
        word, val = line.strip().split('\t')

        if current_word is None:
                current_word = word

        if current_word != word:
                # the input is sorted, so a new word means the previous one is finished
                print "%s\t%s" % (current_word, count)
                current_word = word
                count = 0
        count += int(val)

print "%s\t%s" % (current_word, count)

1.3.2 Local test

cat The_Man_of_Property.txt|python map.py|sort -k1 |python red.py

The sort in this pipeline stands in for the shuffle/sort step that the MapReduce framework performs automatically between the map and reduce phases.

  • Result
...
yielding	2
yields	1
you	750
young	238
younger	11
youngest	3
youngling	1
your	149
yours	8
yourself	22
yourselves	1
youth	11
z	1
zealand	1
zelle	1
zermatt	1
zoo	9

1.4 Test on Hadoop

The Streaming approach requires the hadoop-streaming jar.

1.4.1 Put the data on HDFS

[root@node1 test]# hdfs dfs -put The_Man_of_Property.txt /user/hadoop 
[root@node1 test]# hdfs dfs -ls /user/hadoop
Found 3 items
-rw-r--r--   2 root supergroup     632207 2020-12-15 04:19 /user/hadoop/The_Man_of_Property.txt
-rw-r--r--   1 root supergroup         12 2020-12-06 15:03 /user/hadoop/result.txt
-rw-r--r--   2 root supergroup          0 2020-12-06 15:33 /user/hadoop/touchfile.txt

1.4.2 Find hadoop-streaming

find / -name 'hadoop-streaming*.jar'
/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar
/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/sources/hadoop-streaming-2.6.1-sources.jar
/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/sources/hadoop-streaming-2.6.1-test-sources.jar

1.4.3 Launch script

To make the job easier to edit and rerun, put it in a launch script.

STREAM_JAR_PATH="/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar"

INPUT_FILE_PATH="/user/hadoop/The_Man_of_Property.txt"
OUTPUT_PATH="/out/wc"

hadoop jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH \
        -output $OUTPUT_PATH \
        -mapper "python map.py" \
        -reducer "python red.py" \
        -file /root/test/map.py \
        -file /root/test/red.py
  • Kick it off
[root@node1 test]# sh -x run.sh 
+ STREAM_JAR_PATH=/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar
+ INPUT_FILE_PATH=/user/hadoop/The_Man_of_Property.txt
+ OUTPUT_PATH=/out/wc
+ hadoop jar /usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar -input /user/hadoop/The_Man_of_Property.txt -output /out/wc -mapper 'python map.py' -reducer 'python red.py' -file /root/test/map.py -file /root/test/red.py
20/12/15 05:25:49 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/root/test/map.py, /root/test/red.py, /tmp/hadoop-unjar7281784309232586653/] [] /tmp/streamjob5660556035539970318.jar tmpDir=null
20/12/15 05:25:50 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 05:25:50 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 05:25:51 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/15 05:25:51 INFO mapreduce.JobSubmitter: number of splits:2
20/12/15 05:25:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0009
20/12/15 05:25:51 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0009
20/12/15 05:25:51 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0009/
20/12/15 05:25:51 INFO mapreduce.Job: Running job: job_1608034597178_0009
20/12/15 05:25:57 INFO mapreduce.Job: Job job_1608034597178_0009 running in uber mode : false
20/12/15 05:25:57 INFO mapreduce.Job:  map 0% reduce 0%
20/12/15 05:26:08 INFO mapreduce.Job:  map 100% reduce 0%
20/12/15 05:26:14 INFO mapreduce.Job:  map 100% reduce 100%
20/12/15 05:26:14 INFO mapreduce.Job: Job job_1608034597178_0009 completed successfully
20/12/15 05:26:14 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=1045598
		FILE: Number of bytes written=2418591
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=635802
		HDFS: Number of bytes written=93748
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=14535
		Total time spent by all reduces in occupied slots (ms)=3954
		Total time spent by all map tasks (ms)=14535
		Total time spent by all reduce tasks (ms)=3954
		Total vcore-seconds taken by all map tasks=14535
		Total vcore-seconds taken by all reduce tasks=3954
		Total megabyte-seconds taken by all map tasks=14883840
		Total megabyte-seconds taken by all reduce tasks=4048896
	Map-Reduce Framework
		Map input records=2866
		Map output records=113132
		Map output bytes=819328
		Map output materialized bytes=1045604
		Input split bytes=210
		Combine input records=0
		Combine output records=0
		Reduce input groups=9114
		Reduce shuffle bytes=1045604
		Reduce input records=113132
		Reduce output records=9114
		Spilled Records=226264
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=319
		CPU time spent (ms)=2580
		Physical memory (bytes) snapshot=486555648
		Virtual memory (bytes) snapshot=6174294016
		Total committed heap usage (bytes)=258678784
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=635592
	File Output Format Counters 
		Bytes Written=93748
20/12/15 05:26:14 INFO streaming.StreamJob: Output directory: /out/wc
  • If something goes wrong, check the script for typos first.
  • The job has finished; let's look at the results.

1.4.4 View the results

  • Look under the specified output path
hdfs dfs -ls /out/wc
Found 2 items
-rw-r--r--   2 root supergroup          0 2020-12-15 05:26 /out/wc/_SUCCESS
-rw-r--r--   2 root supergroup      93748 2020-12-15 05:26 /out/wc/part-00000
  • cat the result directly
hdfs dfs -cat /out/wc/part-00000
...
yields	1
you	750
young	238
younger	11
youngest	3
youngling	1
your	149
yours	8
yourself	22
yourselves	1
youth	11
z	1
zealand	1
zelle	1
zermatt	1
zoo	9

2. WhiteList: Counting Only Whitelisted Words

Now, instead of counting every word, we are given a whitelist and only count the frequencies of the words it contains.

2.1 White List

[root@node1 test]# vim whiteList.txt               
[root@node1 test]# cat whiteList.txt 
you
against
recent

2.2 Mapper

2.2.1 Rewrite the mapper

Filter against the whitelist inside the mapper. The whiteList.txt file is shipped to each task's working directory via the -file option, so the mapper can open it by its plain file name.

import sys
import re

def read_local_file_func(f):
        # load the whitelist (one word per line) into a set
        word_set = set()
        file_in = open(f, 'r')
        for line in file_in:
                word = line.strip()
                word_set.add(word)
        return word_set

def mapper_func(white_list_fd):
        word_set = read_local_file_func(white_list_fd)
        p = re.compile(r'\w+')
        for line in sys.stdin:
                ss = line.strip().split(' ')
                for s in ss:
                        array_s = p.findall(s)
                        for word in array_s:
                                # note: membership is checked on the raw token,
                                # so capitalized forms such as "You" are skipped
                                if word.strip() != "" and (word in word_set):
                                        print '%s\t%s' % (word.lower(), 1)

if __name__ == "__main__":
        # dispatch: the first argument names the function to run,
        # the remaining arguments are passed to it
        module = sys.modules[__name__]
        func = getattr(module, sys.argv[1])
        args = None
        if len(sys.argv) > 1:
                args = sys.argv[2:]
        func(*args)

2.2.2 Test the mapper

[root@node1 test]# cat The_Man_of_Property.txt |python map.py mapper_func whiteList.txt | head
against	1
recent	1
against	1
against	1
against	1
against	1
against	1
against	1
you	1
against	1
close failed in file object destructor:
Error in sys.excepthook:

Original exception was:
  • This exception appears only because head cuts the pipe before the Python program finishes; it is harmless and disappears if you drop head. If you want to silence it, a small sketch follows.
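
A minimal sketch (not part of the original map.py, and assuming a Unix-like system): restore the default SIGPIPE handler at the top of the mapper so the interpreter exits quietly when a downstream consumer such as head closes the pipe early.

import signal
# exit silently instead of printing a traceback when the output pipe is closed early
signal.signal(signal.SIGPIPE, signal.SIG_DFL)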

2.3 Reducer

The WordCount reducer can be reused as is.

2.4 Local Testing

cat The_Man_of_Property.txt |python map.py mapper_func whiteList.txt |sort -k1|python red.py
  • Result (you is 613 here rather than 750 as in section 1, because the whitelist check is applied to the raw token before lowercasing, so capitalized occurrences such as "You" are skipped):
against	93
recent	2
you	613

2.5 Test on Hadoop

2.5.1 Modify the launch script

STREAM_JAR_PATH="/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar"

INPUT_FILE_PATH="/user/hadoop/The_Man_of_Property.txt"
OUTPUT_PATH="/out/wl"

hadoop jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH \
        -output $OUTPUT_PATH \
        -mapper "python map.py mapper_func whiteList.txt" \
        -reducer "python red.py" \
        -file ./map.py \
        -file ./red.py \
        -file ./whiteList.txt

2.5.2 Launch

[root@node1 test]# sh -x run.sh 
+ STREAM_JAR_PATH=/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar
+ INPUT_FILE_PATH=/user/hadoop/The_Man_of_Property.txt
+ OUTPUT_PATH=/out/wl
+ hadoop jar /usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar -input /user/hadoop/The_Man_of_Property.txt -output /out/wl -mapper 'python map.py mapper_func whiteList.txt' -reducer 'python red.py' -file ./map.py -file ./red.py -file ./whiteList.txt
20/12/15 06:16:13 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./map.py, ./red.py, ./whiteList.txt, /tmp/hadoop-unjar7041032493610463978/] [] /tmp/streamjob4902933525116380783.jar tmpDir=null
20/12/15 06:16:14 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 06:16:14 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 06:16:15 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/15 06:16:15 INFO mapreduce.JobSubmitter: number of splits:2
20/12/15 06:16:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0013
20/12/15 06:16:15 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0013
20/12/15 06:16:15 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0013/
20/12/15 06:16:15 INFO mapreduce.Job: Running job: job_1608034597178_0013
20/12/15 06:16:23 INFO mapreduce.Job: Job job_1608034597178_0013 running in uber mode : false
20/12/15 06:16:23 INFO mapreduce.Job:  map 0% reduce 0%
20/12/15 06:16:31 INFO mapreduce.Job:  map 100% reduce 0%
20/12/15 06:16:37 INFO mapreduce.Job:  map 100% reduce 100%
20/12/15 06:16:38 INFO mapreduce.Job: Job job_1608034597178_0013 completed successfully
20/12/15 06:16:38 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=6048
		FILE: Number of bytes written=340427
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=635802
		HDFS: Number of bytes written=28
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=12845
		Total time spent by all reduces in occupied slots (ms)=3526
		Total time spent by all map tasks (ms)=12845
		Total time spent by all reduce tasks (ms)=3526
		Total vcore-seconds taken by all map tasks=12845
		Total vcore-seconds taken by all reduce tasks=3526
		Total megabyte-seconds taken by all map tasks=13153280
		Total megabyte-seconds taken by all reduce tasks=3610624
	Map-Reduce Framework
		Map input records=2866
		Map output records=708
		Map output bytes=4626
		Map output materialized bytes=6054
		Input split bytes=210
		Combine input records=0
		Combine output records=0
		Reduce input groups=3
		Reduce shuffle bytes=6054
		Reduce input records=708
		Reduce output records=3
		Spilled Records=1416
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=299
		CPU time spent (ms)=1620
		Physical memory (bytes) snapshot=484392960
		Virtual memory (bytes) snapshot=6174113792
		Total committed heap usage (bytes)=259112960
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=635592
	File Output Format Counters 
		Bytes Written=28
20/12/15 06:16:38 INFO streaming.StreamJob: Output directory: /out/wl

2.5.3 View the results

[root@node1 test]# hdfs dfs -ls /out/wl
Found 2 items
-rw-r--r--   2 root supergroup          0 2020-12-15 06:16 /out/wl/_SUCCESS
-rw-r--r--   2 root supergroup         28 2020-12-15 06:16 /out/wl/part-00000
[root@node1 test]# hdfs dfs -cat /out/wl/part-00000
against	93
recent	2
you	613

3. Collecting Each User's Order Amounts

3.1 Prepare the data

  • Columns: uId,sell (user id, order amount)
user1,3
user2,5
user1,4
user6,3
user2,9
user2,5
user9,1
user8,3
user2,4
user2,4
user5,5
user6,8
user7,2
user2,7
  • Expected result (a quick local cross-check is sketched right after this list)
user1	3,4
user2	4,4,5,5,7,9
user5	5
user6	3,8
user7	2
user8	3
user9	1
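
A quick local cross-check of this expected result in plain Python, without MapReduce (a sketch, assuming the data above is saved as ub.data, the file name used in section 3.4):

from itertools import groupby

# group the (user, amount) pairs by user id, mirroring what sort + reduce do
pairs = [line.strip().split(',') for line in open('ub.data')]
pairs.sort()
for user, grp in groupby(pairs, key=lambda kv: kv[0]):
        print '%s\t%s' % (user, ','.join(v for _, v in grp))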

3.2 Mapper

  • The mapper does little real work in this example; it just turns each comma-separated record into key<TAB>value.
import sys

# turn "user,amount" into "user<TAB>amount" so downstream sorting groups by user
for line in sys.stdin:
        fields = line.strip().split(',')
        print '\t'.join(fields)

3.3 Reducer

import sys

cur = None          # user id currently being accumulated
cur_list = []       # that user's order amounts
for line in sys.stdin:
        ss = line.strip().split('\t')
        key = ss[0]
        val = ss[1]
        if cur is None:
                cur = key
        elif cur != key:
                # key changed: flush the previous user's amounts
                print '%s\t%s' % (cur, ','.join(cur_list))
                cur = key
                cur_list = []
        cur_list.append(val)
print '%s\t%s' % (cur, ','.join(cur_list))

3.4 Run

[root@node1 test]# cat ub.data |python ubMap.py | sort -k1 | python ubRed.py 
user1	3,4
user2	4,4,5,5,7,9
user5	5
user6	3,8
user7	2
user8	3
user9	1

4. MapReduce join

4.1 Prepare the data

  • Dataset a: user, amount spent
user1,42
user2,55
user3,66
user7,2
user9,38
  • Dataset b: user, purchased item
user2,Hadoop
user3,Spark
user5,Trump
user7,Cap
user88,Laptop
  • Expected result after the join
user2	55	Hadoop
user3	66	Spark
user7	2	Cap
  • Overall approach:

Use a mapper to normalize each dataset and write the results back to HDFS, then run one more MR job over both outputs to merge them. A plain-Python sketch of the whole pipeline follows.
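
A minimal local sketch of the whole join (assuming a.txt and b.txt from section 4.1 sit in the current directory): it tags each record with a flag, sorts by (user, flag), and streams the merged records through the same join logic the reducer below will use.

def tag(path, flag):
        # normalize one dataset into (user, flag, value) records
        records = []
        for line in open(path):
                user, val = line.strip().split(',')
                records.append((user, flag, val))
        return records

merged = sorted(tag('a.txt', '1') + tag('b.txt', '2'))

cur_key, amount = None, ""
for key, flag, val in merged:
        if key != cur_key:              # new user: forget any unmatched amount
                cur_key, amount = key, ""
        if flag == '1':
                amount = val            # remember the amount record
        elif flag == '2' and amount != "":
                print '%s\t%s\t%s' % (key, amount, val)
                amount = ""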

4.2 Mapper

4.2.1 map A

  • mapper
import sys

# tag every record of dataset a with flag 1: "user<TAB>1<TAB>amount"
for line in sys.stdin:
	ss = line.strip().split(',')
	print "%s\t1\t%s" % (ss[0], ss[1])
  • Output
[root@node1 test]# cat a.txt |python mapA.py 
user1	1	42
user2	1	55
user3	1	66
user7	1	2
user9	1	38

4.2.2 map B

  • mapper
import sys

# tag every record of dataset b with flag 2: "user<TAB>2<TAB>item"
for line in sys.stdin:
	ss = line.strip().split(',')
	print "%s\t2\t%s" % (ss[0], ss[1])
  • Output
[root@node1 test]# cat b.txt |python mapB.py 
user2	2	Hadoop
user3	2	Spark
user5	2	Trump
user7	2	Cap
user88	2	Laptop

As you can see, the records from a and b are tagged with a flag (1, 2). The flag identifies which dataset a record came from, and because the launch script makes the first two tab-separated fields the sort key, a user's amount record (flag 1) always reaches the reducer before the matching product record (flag 2).

4.3 Reducer

This stage merges the two tagged datasets produced above.


import sys

cur_key = None
val_1 = ""
for line in sys.stdin:
        key, flag, val = line.strip().split('\t')

        # new user id: drop any amount that never found a matching product,
        # so it cannot be joined with the wrong key
        if key != cur_key:
                cur_key = key
                val_1 = ""

        if flag == '1':
                val_1 = val                      # remember the amount record
        elif flag == '2' and val_1 != "":
                print "%s\t%s\t%s" % (key, val_1, val)
                val_1 = ""

4.4 Upload the source data

[root@node1 test]# hdfs dfs -put a.txt /user/hadoop
[root@node1 test]# hdfs dfs -put b.txt /user/hadoop

4.5 Launch script

STREAM_JAR_PATH="/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar"

INPUT_FILE_PATH_A="/user/hadoop/a.txt"
INPUT_FILE_PATH_B="/user/hadoop/b.txt"

OUTPUT_PATH_A="/out/a"
OUTPUT_PATH_B="/out/b"
OUTPUT_PATH_JOIN="/out/JOIN"

echo ========================= step 1. 

# streaming jobs fail if the output path already exists, so remove it first
hdfs dfs -ls $OUTPUT_PATH_A > /dev/null
res=$?
if [ "$res" -eq "0" ];then
        echo need to delete $OUTPUT_PATH_A
        hdfs dfs -rmr $OUTPUT_PATH_A
else
        echo no need to delete $OUTPUT_PATH_A
fi

hadoop jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH_A \
        -output $OUTPUT_PATH_A \
        -mapper "python mapA.py" \
        -file ./mapA.py

echo ========================= step 2.

hdfs dfs -ls $OUTPUT_PATH_B > /dev/null
res=$?
if [ "$res" -eq "0" ];then
        echo need to delete $OUTPUT_PATH_B
        hdfs dfs -rmr $OUTPUT_PATH_B
else
        echo no need to delete $OUTPUT_PATH_B
fi

hadoop jar $STREAM_JAR_PATH \
        -input $INPUT_FILE_PATH_B \
        -output $OUTPUT_PATH_B \
        -mapper "python mapB.py" \
        -file ./mapB.py

echo ========================= step 3.

hdfs dfs -ls $OUTPUT_PATH_JOIN > /dev/null
res=$?
if [ "$res" -eq "0" ];then
        echo need to delete $OUTPUT_PATH_JOIN
        hdfs dfs -rmr $OUTPUT_PATH_JOIN
else
        echo no need to delete $OUTPUT_PATH_JOIN
fi

# stream.num.map.output.key.fields=2: the first two tab-separated fields
# (user id + flag) form the map output key, so records sort by user and,
# within a user, the flag-1 (amount) record comes before the flag-2 record.
# num.key.fields.for.partition=1: partition on the user id only, so both
# records for a user end up in the same reducer.
hadoop jar $STREAM_JAR_PATH \
        -input $OUTPUT_PATH_A,$OUTPUT_PATH_B \
        -output $OUTPUT_PATH_JOIN \
        -mapper "cat" \
        -reducer "python redJoin.py" \
        -file ./redJoin.py \
        -jobconf stream.num.map.output.key.fields=2 \
        -jobconf num.key.fields.for.partition=1

4.6 Run it

  • You will see three MR jobs run
[root@node1 test]# sh runJoin.sh 
========================= step 1.
need to delete /out/a
rmr: DEPRECATED: Please use 'rm -r' instead.
20/12/15 09:23:04 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /out/a
20/12/15 09:23:05 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapA.py, /tmp/hadoop-unjar2503113385762558629/] [] /tmp/streamjob9067169500243307079.jar tmpDir=null
20/12/15 09:23:06 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:23:06 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:23:07 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/15 09:23:07 INFO mapreduce.JobSubmitter: number of splits:2
20/12/15 09:23:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0031
20/12/15 09:23:07 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0031
20/12/15 09:23:07 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0031/
20/12/15 09:23:07 INFO mapreduce.Job: Running job: job_1608034597178_0031
20/12/15 09:23:13 INFO mapreduce.Job: Job job_1608034597178_0031 running in uber mode : false
20/12/15 09:23:13 INFO mapreduce.Job:  map 0% reduce 0%
20/12/15 09:23:22 INFO mapreduce.Job:  map 50% reduce 0%
20/12/15 09:23:23 INFO mapreduce.Job:  map 100% reduce 0%
20/12/15 09:23:28 INFO mapreduce.Job:  map 100% reduce 100%
20/12/15 09:23:29 INFO mapreduce.Job: Job job_1608034597178_0031 completed successfully
20/12/15 09:23:29 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=70
		FILE: Number of bytes written=324793
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=240
		HDFS: Number of bytes written=54
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=12251
		Total time spent by all reduces in occupied slots (ms)=3152
		Total time spent by all map tasks (ms)=12251
		Total time spent by all reduce tasks (ms)=3152
		Total vcore-seconds taken by all map tasks=12251
		Total vcore-seconds taken by all reduce tasks=3152
		Total megabyte-seconds taken by all map tasks=12545024
		Total megabyte-seconds taken by all reduce tasks=3227648
	Map-Reduce Framework
		Map input records=5
		Map output records=5
		Map output bytes=54
		Map output materialized bytes=76
		Input split bytes=174
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=76
		Reduce input records=5
		Reduce output records=5
		Spilled Records=10
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=297
		CPU time spent (ms)=1320
		Physical memory (bytes) snapshot=487895040
		Virtual memory (bytes) snapshot=6174117888
		Total committed heap usage (bytes)=258678784
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=66
	File Output Format Counters 
		Bytes Written=54
20/12/15 09:23:29 INFO streaming.StreamJob: Output directory: /out/a
========================= step 2.
need to delete /out/b
rmr: DEPRECATED: Please use 'rm -r' instead.
20/12/15 09:23:33 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /out/b
20/12/15 09:23:34 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapB.py, /tmp/hadoop-unjar76200206261107368/] [] /tmp/streamjob7206024831264575031.jar tmpDir=null
20/12/15 09:23:35 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:23:35 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:23:36 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/15 09:23:36 INFO mapreduce.JobSubmitter: number of splits:2
20/12/15 09:23:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0032
20/12/15 09:23:37 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0032
20/12/15 09:23:37 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0032/
20/12/15 09:23:37 INFO mapreduce.Job: Running job: job_1608034597178_0032
20/12/15 09:23:43 INFO mapreduce.Job: Job job_1608034597178_0032 running in uber mode : false
20/12/15 09:23:43 INFO mapreduce.Job:  map 0% reduce 0%
20/12/15 09:23:51 INFO mapreduce.Job:  map 100% reduce 0%
20/12/15 09:23:56 INFO mapreduce.Job:  map 100% reduce 100%
20/12/15 09:23:56 INFO mapreduce.Job: Job job_1608034597178_0032 completed successfully
20/12/15 09:23:56 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=87
		FILE: Number of bytes written=324827
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=266
		HDFS: Number of bytes written=71
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=11561
		Total time spent by all reduces in occupied slots (ms)=3155
		Total time spent by all map tasks (ms)=11561
		Total time spent by all reduce tasks (ms)=3155
		Total vcore-seconds taken by all map tasks=11561
		Total vcore-seconds taken by all reduce tasks=3155
		Total megabyte-seconds taken by all map tasks=11838464
		Total megabyte-seconds taken by all reduce tasks=3230720
	Map-Reduce Framework
		Map input records=5
		Map output records=5
		Map output bytes=71
		Map output materialized bytes=93
		Input split bytes=174
		Combine input records=0
		Combine output records=0
		Reduce input groups=5
		Reduce shuffle bytes=93
		Reduce input records=5
		Reduce output records=5
		Spilled Records=10
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=292
		CPU time spent (ms)=1300
		Physical memory (bytes) snapshot=483713024
		Virtual memory (bytes) snapshot=6174130176
		Total committed heap usage (bytes)=259031040
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=92
	File Output Format Counters 
		Bytes Written=71
20/12/15 09:23:56 INFO streaming.StreamJob: Output directory: /out/b
========================= step 3.
need to delete /out/JOIN
rmr: DEPRECATED: Please use 'rm -r' instead.
20/12/15 09:24:00 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /out/JOIN
20/12/15 09:24:02 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
20/12/15 09:24:02 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [./redJoin.py, /tmp/hadoop-unjar7560819781671361131/] [] /tmp/streamjob3781723006571695879.jar tmpDir=null
20/12/15 09:24:02 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:24:03 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:24:04 INFO mapred.FileInputFormat: Total input paths to process : 2
20/12/15 09:24:04 INFO mapreduce.JobSubmitter: number of splits:3
20/12/15 09:24:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0033
20/12/15 09:24:04 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0033
20/12/15 09:24:04 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0033/
20/12/15 09:24:04 INFO mapreduce.Job: Running job: job_1608034597178_0033
20/12/15 09:24:10 INFO mapreduce.Job: Job job_1608034597178_0033 running in uber mode : false
20/12/15 09:24:10 INFO mapreduce.Job:  map 0% reduce 0%
20/12/15 09:24:22 INFO mapreduce.Job:  map 100% reduce 0%
20/12/15 09:24:29 INFO mapreduce.Job:  map 100% reduce 100%
20/12/15 09:24:29 INFO mapreduce.Job: Job job_1608034597178_0033 completed successfully
20/12/15 09:24:30 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=151
		FILE: Number of bytes written=436951
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=392
		HDFS: Number of bytes written=43
		HDFS: Number of read operations=12
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=3
		Launched reduce tasks=1
		Data-local map tasks=3
		Total time spent by all maps in occupied slots (ms)=29703
		Total time spent by all reduces in occupied slots (ms)=3315
		Total time spent by all map tasks (ms)=29703
		Total time spent by all reduce tasks (ms)=3315
		Total vcore-seconds taken by all map tasks=29703
		Total vcore-seconds taken by all reduce tasks=3315
		Total megabyte-seconds taken by all map tasks=30415872
		Total megabyte-seconds taken by all reduce tasks=3394560
	Map-Reduce Framework
		Map input records=10
		Map output records=10
		Map output bytes=125
		Map output materialized bytes=163
		Input split bytes=258
		Combine input records=0
		Combine output records=0
		Reduce input groups=10
		Reduce shuffle bytes=163
		Reduce input records=10
		Reduce output records=3
		Spilled Records=20
		Shuffled Maps =3
		Failed Shuffles=0
		Merged Map outputs=3
		GC time elapsed (ms)=579
		CPU time spent (ms)=2060
		Physical memory (bytes) snapshot=640995328
		Virtual memory (bytes) snapshot=8230125568
		Total committed heap usage (bytes)=379858944
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=134
	File Output Format Counters 
		Bytes Written=43
20/12/15 09:24:30 INFO streaming.StreamJob: Output directory: /out/JOIN

4.7 View the results

[root@node1 test]# hdfs dfs -ls /out/JOIN
Found 2 items
-rw-r--r--   2 root supergroup          0 2020-12-15 09:24 /out/JOIN/_SUCCESS
-rw-r--r--   2 root supergroup         43 2020-12-15 09:24 /out/JOIN/part-00000
[root@node1 test]# hdfs dfs -cat /out/JOIN/part-00000
user2	55	Hadoop
user3	66	Spark
user7	2	Cap

Appendix

  • A frequently seen exception is PipeMapRed.waitOutputThreads(): subprocess failed with code N. The code N can usually be matched against the error codes listed below; a small lookup sketch follows the table.
"OS error code 1: Operation not permitted"
 "OS error code 2: No such file or directory"
 "OS error code 3: No such process"
 "OS error code 4: Interrupted system call"
 "OS error code 5: Input/output error"
 "OS error code 6: No such device or address"
 "OS error code 7: Argument list too long"
 "OS error code 8: Exec format error"
 "OS error code 9: Bad file descriptor"
 "OS error code 10: No child processes"
 "OS error code 11: Resource temporarily unavailable"
 "OS error code 12: Cannot allocate memory"
 "OS error code 13: Permission denied"
 "OS error code 14: Bad address"
 "OS error code 15: Block device required"
 "OS error code 16: Device or resource busy"
 "OS error code 17: File exists"
 "OS error code 18: Invalid cross-device link"
 "OS error code 19: No such device"
 "OS error code 20: Not a directory"
 "OS error code 21: Is a directory"
 "OS error code 22: Invalid argument"
 "OS error code 23: Too many open files in system"
 "OS error code 24: Too many open files"
 "OS error code 25: Inappropriate ioctl for device"
 "OS error code 26: Text file busy"
 "OS error code 27: File too large"
 "OS error code 28: No space left on device"
 "OS error code 29: Illegal seek"
 "OS error code 30: Read-only file system"
 "OS error code 31: Too many links"
 "OS error code 32: Broken pipe"
 "OS error code 33: Numerical argument out of domain"
 "OS error code 34: Numerical result out of range"
 "OS error code 35: Resource deadlock avoided"
 "OS error code 36: File name too long"
 "OS error code 37: No locks available"
 "OS error code 38: Function not implemented"
 "OS error code 39: Directory not empty"
 "OS error code 40: Too many levels of symbolic links"
 "OS error code 42: No message of desired type"
 "OS error code 43: Identifier removed"
 "OS error code 44: Channel number out of range"
 "OS error code 45: Level 2 not synchronized"
 "OS error code 46: Level 3 halted"
 "OS error code 47: Level 3 reset"
 "OS error code 48: Link number out of range"
 "OS error code 49: Protocol driver not attached"
 "OS error code 50: No CSI structure available"
 "OS error code 51: Level 2 halted"
 "OS error code 52: Invalid exchange"
 "OS error code 53: Invalid request descriptor"
 "OS error code 54: Exchange full"
 "OS error code 55: No anode"
 "OS error code 56: Invalid request code"
 "OS error code 57: Invalid slot"
 "OS error code 59: Bad font file format"
 "OS error code 60: Device not a stream"
 "OS error code 61: No data available"
 "OS error code 62: Timer expired"
 "OS error code 63: Out of streams resources"
 "OS error code 64: Machine is not on the network"
 "OS error code 65: Package not installed"
 "OS error code 66: Object is remote"
 "OS error code 67: Link has been severed"
 "OS error code 68: Advertise error"
 "OS error code 69: Srmount error"
 "OS error code 70: Communication error on send"
 "OS error code 71: Protocol error"
 "OS error code 72: Multihop attempted"
 "OS error code 73: RFS specific error"
 "OS error code 74: Bad message"
 "OS error code 75: Value too large for defined data type"
 "OS error code 76: Name not unique on network"
 "OS error code 77: File descriptor in bad state"
 "OS error code 78: Remote address changed"
 "OS error code 79: Can not access a needed shared library"
 "OS error code 80: Accessing a corrupted shared library"
 "OS error code 81: .lib section in a.out corrupted"
 "OS error code 82: Attempting to link in too many shared libraries"
 "OS error code 83: Cannot exec a shared library directly"
 "OS error code 84: Invalid or incomplete multibyte or wide character"
 "OS error code 85: Interrupted system call should be restarted"
 "OS error code 86: Streams pipe error"
 "OS error code 87: Too many users"
 "OS error code 88: Socket operation on non-socket"
 "OS error code 89: Destination address required"
 "OS error code 90: Message too long"
 "OS error code 91: Protocol wrong type for socket"
 "OS error code 92: Protocol not available"
 "OS error code 93: Protocol not supported"
 "OS error code 94: Socket type not supported"
 "OS error code 95: Operation not supported"
 "OS error code 96: Protocol family not supported"
 "OS error code 97: Address family not supported by protocol"
 "OS error code 98: Address already in use"
 "OS error code 99: Cannot assign requested address"
 "OS error code 100: Network is down"
 "OS error code 101: Network is unreachable"
 "OS error code 102: Network dropped connection on reset"
 "OS error code 103: Software caused connection abort"
 "OS error code 104: Connection reset by peer"
 "OS error code 105: No buffer space available"
 "OS error code 106: Transport endpoint is already connected"
 "OS error code 107: Transport endpoint is not connected"
 "OS error code 108: Cannot send after transport endpoint shutdown"
 "OS error code 109: Too many references: cannot splice"
 "OS error code 110: Connection timed out"
 "OS error code 111: Connection refused"
 "OS error code 112: Host is down"
 "OS error code 113: No route to host"
 "OS error code 114: Operation already in progress"
 "OS error code 115: Operation now in progress"
 "OS error code 116: Stale NFS file handle"
 "OS error code 117: Structure needs cleaning"
 "OS error code 118: Not a XENIX named type file"
 "OS error code 119: No XENIX semaphores available"
 "OS error code 120: Is a named type file"
 "OS error code 121: Remote I/O error"
 "OS error code 122: Disk quota exceeded"
 "OS error code 123: No medium found"
 "OS error code 124: Wrong medium type"
 "OS error code 125: Operation canceled"
 "OS error code 126: Required key not available"
 "OS error code 127: Key has expired"
 "OS error code 128: Key has been revoked"
 "OS error code 129: Key was rejected by service"
 "OS error code 130: Owner died"
 "OS error code 131: State not recoverable"
 "MySQL error code 132: Old database file"
 "MySQL error code 133: No record read before update"
 "MySQL error code 134: Record was already deleted (or record file crashed)"
 "MySQL error code 135: No more room in record file"
 "MySQL error code 136: No more room in index file"
 "MySQL error code 137: No more records (read after end of file)"
 "MySQL error code 138: Unsupported extension used for table"
 "MySQL error code 139: Too big row"
 "MySQL error code 140: Wrong create options"
 "MySQL error code 141: Duplicate unique key or constraint on write or update"
 "MySQL error code 142: Unknown character set used"
 "MySQL error code 143: Conflicting table definitions in sub-tables of MERGE table"
 "MySQL error code 144: Table is crashed and last repair failed"
 "MySQL error code 145: Table was marked as crashed and should be repaired"
 "MySQL error code 146: Lock timed out; Retry transaction"
 "MySQL error code 147: Lock table is full; Restart program with a larger locktable"
 "MySQL error code 148: Updates are not allowed under a read only transactions"
 "MySQL error code 149: Lock deadlock; Retry transaction"
 "MySQL error code 150: Foreign key constraint is incorrectly formed"
 "MySQL error code 151: Cannot add a child row"
 "MySQL error code 152: Cannot delete a parent row"
