Several entry-level MapReduce exercises
Environment: Python 2.6.6, Hadoop 2.6.5
Input and Output types of a MapReduce job:
(input) &lt;k1, v1&gt; -> map -> &lt;k2, v2&gt; -> combine -> &lt;k2, v2&gt; -> reduce -> &lt;k3, v3&gt; (output)
The first exercise is the classic WordCount. The input can be any text file; I'm using The_Man_of_Property.txt.
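Before writing the real scripts, the key/value flow above can be sketched in a few lines of plain Python. This is a hypothetical in-process simulation of WordCount (no Hadoop involved), only to make the stages concrete:

# Hypothetical local simulation of (input) -> map -> sort/group -> reduce.
lines = ["the man of property", "the man"]        # stand-in for the input split

# map: every line is turned into (word, 1) pairs
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# shuffle/sort: the framework sorts by key so equal words sit next to each other
mapped.sort()

# reduce: sum the counts of each word
counts = {}
for word, one in mapped:
    counts[word] = counts.get(word, 0) + one

for word in sorted(counts):
    print '%s\t%s' % (word, counts[word])

The real exercises do exactly this, except that map and reduce live in two separate scripts which the framework (or a shell pipeline) connects.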
vim map.py
import sys

for line in sys.stdin:
    ss = line.strip().split(' ')
    for s in ss:
        if s.strip() != "":
            print '%s\t%s' % (s, 1)
head -n 2 The_Man_of_Property.txt |python map.py
...
to 1
this 1
day, 1
for 1
all 1
the 1
recent 1
efforts 1
to 1
“talk 1
them 1
out.” 1
Some of the punctuation from the text is being counted as part of the words, which is not what we want, so the mapper needs a small improvement: add a regular expression that extracts only word characters, so the punctuation is skipped.
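A quick check of what \w+ does with the tokens that polluted the first run (a hypothetical interactive snippet, not part of map.py):

import re

p = re.compile(r'\w+')
print p.findall('day,')     # ['day']       -- trailing comma dropped
print p.findall('out.')     # ['out']       -- trailing period dropped
print p.findall("don't")    # ['don', 't']  -- note: apostrophes split a word in two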
The improved map.py:

import sys
import re

p = re.compile(r'\w+')

for line in sys.stdin:
    ss = line.strip().split(' ')
    for s in ss:
        array_s = p.findall(s)
        for word in array_s:
            if word.strip() != "":
                print '%s\t%s' % (word.lower(), 1)
...
property 1
counted 1
as 1
they 1
do 1
to 1
this 1
day 1
for 1
all 1
the 1
recent 1
efforts 1
to 1
talk 1
them 1
out 1
In the reduce stage (red.py) we only need to add up the counts of identical words; since the input is sorted, equal words arrive consecutively.
import sys

current_word = None
sum = 0

for line in sys.stdin:
    word, val = line.strip().split('\t')
    if current_word == None:
        current_word = word
    if current_word != word:
        # the key changed: emit the finished word and start a new count
        print "%s\t%s" % (current_word, sum)
        current_word = word
        sum = 0
    sum += int(val)

# emit the last word
print "%s\t%s" % (current_word, str(sum))
cat The_Man_of_Property.txt|python map.py|sort -k1 |python red.py
The sort in this pipeline stands in for the sorting that the shuffle/sort phase of a real MapReduce job performs for us automatically between map and reduce (a small check of why the reducer needs sorted input follows the output below).
...
yielding 2
yields 1
you 750
young 238
younger 11
youngest 3
youngling 1
your 149
yours 8
yourself 22
yourselves 1
youth 11
z 1
zealand 1
zelle 1
zermatt 1
zoo 9
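As promised above, here is a small hypothetical check (pure Python, no Hadoop) that red.py's single-pass logic really does depend on identical words arriving next to each other:

# The same accumulate-until-the-key-changes logic as red.py, as a function.
def reduce_pairs(pairs):
    result = []
    current_word = None
    total = 0
    for word, val in pairs:
        if current_word is None:
            current_word = word
        if current_word != word:
            result.append((current_word, total))
            current_word = word
            total = 0
        total += val
    result.append((current_word, total))
    return result

pairs = [('to', 1), ('the', 1), ('to', 1)]
print reduce_pairs(pairs)          # [('to', 1), ('the', 1), ('to', 1)] -- wrong, "to" is split apart
print reduce_pairs(sorted(pairs))  # [('the', 1), ('to', 2)]            -- correct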
To run the same job on the cluster we use Hadoop Streaming: the mapper and reducer run as external processes that read lines from stdin and write tab-separated key/value pairs to stdout, so map.py and red.py can be reused unchanged. All that is needed is the hadoop-streaming jar.
[root@node1 test]# hdfs dfs -put The_Man_of_Property.txt /user/hadoop
[root@node1 test]# hdfs dfs -ls /user/hadoop
Found 3 items
-rw-r--r-- 2 root supergroup 632207 2020-12-15 04:19 /user/hadoop/The_Man_of_Property.txt
-rw-r--r-- 1 root supergroup 12 2020-12-06 15:03 /user/hadoop/result.txt
-rw-r--r-- 2 root supergroup 0 2020-12-06 15:33 /user/hadoop/touchfile.txt
find / -name 'hadoop-streaming*.jar'
/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar
/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/sources/hadoop-streaming-2.6.1-sources.jar
/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/sources/hadoop-streaming-2.6.1-test-sources.jar
To make editing and re-running easier, wrap the job submission in a small launch script, run.sh:
STREAM_JAR_PATH="/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar"
INPUT_FILE_PATH="/user/hadoop/The_Man_of_Property.txt"
OUTPUT_PATH="/out/wc"
hadoop jar $STREAM_JAR_PATH \
-input $INPUT_FILE_PATH \
-output $OUTPUT_PATH \
-mapper "python map.py" \
-reducer "python red.py" \
-file /root/test/map.py \
-file /root/test/red.py
[root@node1 test]# sh -x run.sh
+ STREAM_JAR_PATH=/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar
+ INPUT_FILE_PATH=/user/hadoop/The_Man_of_Property.txt
+ OUTPUT_PATH=/out/wc
+ hadoop jar /usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar -input /user/hadoop/The_Man_of_Property.txt -output /out/wc -mapper 'python map.py' -reducer 'python red.py' -file /root/test/map.py -file /root/test/red.py
20/12/15 05:25:49 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [/root/test/map.py, /root/test/red.py, /tmp/hadoop-unjar7281784309232586653/] [] /tmp/streamjob5660556035539970318.jar tmpDir=null
20/12/15 05:25:50 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 05:25:50 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 05:25:51 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/15 05:25:51 INFO mapreduce.JobSubmitter: number of splits:2
20/12/15 05:25:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0009
20/12/15 05:25:51 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0009
20/12/15 05:25:51 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0009/
20/12/15 05:25:51 INFO mapreduce.Job: Running job: job_1608034597178_0009
20/12/15 05:25:57 INFO mapreduce.Job: Job job_1608034597178_0009 running in uber mode : false
20/12/15 05:25:57 INFO mapreduce.Job: map 0% reduce 0%
20/12/15 05:26:08 INFO mapreduce.Job: map 100% reduce 0%
20/12/15 05:26:14 INFO mapreduce.Job: map 100% reduce 100%
20/12/15 05:26:14 INFO mapreduce.Job: Job job_1608034597178_0009 completed successfully
20/12/15 05:26:14 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=1045598
FILE: Number of bytes written=2418591
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=635802
HDFS: Number of bytes written=93748
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=14535
Total time spent by all reduces in occupied slots (ms)=3954
Total time spent by all map tasks (ms)=14535
Total time spent by all reduce tasks (ms)=3954
Total vcore-seconds taken by all map tasks=14535
Total vcore-seconds taken by all reduce tasks=3954
Total megabyte-seconds taken by all map tasks=14883840
Total megabyte-seconds taken by all reduce tasks=4048896
Map-Reduce Framework
Map input records=2866
Map output records=113132
Map output bytes=819328
Map output materialized bytes=1045604
Input split bytes=210
Combine input records=0
Combine output records=0
Reduce input groups=9114
Reduce shuffle bytes=1045604
Reduce input records=113132
Reduce output records=9114
Spilled Records=226264
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=319
CPU time spent (ms)=2580
Physical memory (bytes) snapshot=486555648
Virtual memory (bytes) snapshot=6174294016
Total committed heap usage (bytes)=258678784
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=635592
File Output Format Counters
Bytes Written=93748
20/12/15 05:26:14 INFO streaming.StreamJob: Output directory: /out/wc
hdfs dfs -ls /out/wc
Found 2 items
-rw-r--r-- 2 root supergroup 0 2020-12-15 05:26 /out/wc/_SUCCESS
-rw-r--r-- 2 root supergroup 93748 2020-12-15 05:26 /out/wc/part-00000
hdfs dfs -cat /out/wc/part-00000
...
yields 1
you 750
young 238
younger 11
youngest 3
youngling 1
your 149
yours 8
yourself 22
yourselves 1
youth 11
z 1
zealand 1
zelle 1
zermatt 1
zoo 9
Now suppose we no longer want the frequency of every word: given a whitelist, count only the words it contains.
[root@node1 test]# vim whiteList.txt
[root@node1 test]# cat whiteList.txt
you
against
recent
In the mapper, filter against the whitelist:
import sys
import re

def read_local_file_func(f):
    # Load the whitelist (one word per line) into a set.
    word_set = set()
    file_in = open(f, 'r')
    for line in file_in:
        word = line.strip()
        word_set.add(word)
    return word_set

def mapper_func(white_list_fd):
    word_set = read_local_file_func(white_list_fd)
    p = re.compile(r'\w+')
    for line in sys.stdin:
        ss = line.strip().split(' ')
        for s in ss:
            array_s = p.findall(s)
            for word in array_s:
                # Note: the membership test is case-sensitive, so "You" does not match "you".
                if word.strip() != "" and (word in word_set):
                    print '%s\t%s' % (word.lower(), 1)

if __name__ == "__main__":
    # sys.argv[1] names the function to run; the remaining arguments are passed to it,
    # e.g. "python map.py mapper_func whiteList.txt".
    module = sys.modules[__name__]
    func = getattr(module, sys.argv[1])
    args = None
    if len(sys.argv) > 1:
        args = sys.argv[2:]
    func(*args)
[root@node1 test]# cat The_Man_of_Property.txt |python map.py mapper_func whiteList.txt | head
against 1
recent 1
against 1
against 1
against 1
against 1
against 1
against 1
you 1
against 1
close failed in file object destructor:
Error in sys.excepthook:
Original exception was:
exception
This happens because head cuts the pipe off before the Python process has finished writing (a broken pipe), so it can be ignored; the message does not appear if you drop the head.
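If the message bothers you during local testing, one common workaround (a sketch, not part of the original map.py) is to restore the default SIGPIPE behaviour at the top of the script:

import signal

# Let a broken pipe (e.g. a downstream head exiting early) end the process
# quietly instead of raising IOError while Python shuts down.
if hasattr(signal, 'SIGPIPE'):
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

Either way, nothing new is needed on the reduce side: the WordCount reducer red.py is reused unchanged below.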
cat The_Man_of_Property.txt |python map.py mapper_func whiteList.txt |sort -k1|python red.py
against 93
recent 2
you 613
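Note that you is counted 613 times here versus 750 in the unfiltered run. The reason is in the mapper: the whitelist membership test uses the raw token and only the printed word is lower-cased, so capitalized occurrences such as "You" never match. A tiny hypothetical check of that behaviour:

word_set = set(['you', 'against', 'recent'])
for token in ['you', 'You', 'YOU']:
    # mirrors map.py: the raw token is tested against the whitelist first
    print token, token in word_set
# you True / You False / YOU False

Testing word.lower() instead would make the filtered count agree with the unfiltered one.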
Update run.sh: pass the whitelist file name to the mapper and ship whiteList.txt to every task with -file, so the mapper can open it from its working directory:

STREAM_JAR_PATH="/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar"
INPUT_FILE_PATH="/user/hadoop/The_Man_of_Property.txt"
OUTPUT_PATH="/out/wl"
hadoop jar $STREAM_JAR_PATH \
-input $INPUT_FILE_PATH \
-output $OUTPUT_PATH \
-mapper "python map.py mapper_func whiteList.txt" \
-reducer "python red.py" \
-file ./map.py \
-file ./red.py \
-file ./whiteList.txt
[root@node1 test]# sh -x run.sh
+ STREAM_JAR_PATH=/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar
+ INPUT_FILE_PATH=/user/hadoop/The_Man_of_Property.txt
+ OUTPUT_PATH=/out/wl
+ hadoop jar /usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar -input /user/hadoop/The_Man_of_Property.txt -output /out/wl -mapper 'python map.py mapper_func whiteList.txt' -reducer 'python red.py' -file ./map.py -file ./red.py -file ./whiteList.txt
20/12/15 06:16:13 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./map.py, ./red.py, ./whiteList.txt, /tmp/hadoop-unjar7041032493610463978/] [] /tmp/streamjob4902933525116380783.jar tmpDir=null
20/12/15 06:16:14 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 06:16:14 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 06:16:15 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/15 06:16:15 INFO mapreduce.JobSubmitter: number of splits:2
20/12/15 06:16:15 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0013
20/12/15 06:16:15 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0013
20/12/15 06:16:15 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0013/
20/12/15 06:16:15 INFO mapreduce.Job: Running job: job_1608034597178_0013
20/12/15 06:16:23 INFO mapreduce.Job: Job job_1608034597178_0013 running in uber mode : false
20/12/15 06:16:23 INFO mapreduce.Job: map 0% reduce 0%
20/12/15 06:16:31 INFO mapreduce.Job: map 100% reduce 0%
20/12/15 06:16:37 INFO mapreduce.Job: map 100% reduce 100%
20/12/15 06:16:38 INFO mapreduce.Job: Job job_1608034597178_0013 completed successfully
20/12/15 06:16:38 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=6048
FILE: Number of bytes written=340427
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=635802
HDFS: Number of bytes written=28
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=12845
Total time spent by all reduces in occupied slots (ms)=3526
Total time spent by all map tasks (ms)=12845
Total time spent by all reduce tasks (ms)=3526
Total vcore-seconds taken by all map tasks=12845
Total vcore-seconds taken by all reduce tasks=3526
Total megabyte-seconds taken by all map tasks=13153280
Total megabyte-seconds taken by all reduce tasks=3610624
Map-Reduce Framework
Map input records=2866
Map output records=708
Map output bytes=4626
Map output materialized bytes=6054
Input split bytes=210
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=6054
Reduce input records=708
Reduce output records=3
Spilled Records=1416
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=299
CPU time spent (ms)=1620
Physical memory (bytes) snapshot=484392960
Virtual memory (bytes) snapshot=6174113792
Total committed heap usage (bytes)=259112960
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=635592
File Output Format Counters
Bytes Written=28
20/12/15 06:16:38 INFO streaming.StreamJob: Output directory: /out/wl
[root@node1 test]# hdfs dfs -ls /out/wl
Found 2 items
-rw-r--r-- 2 root supergroup 0 2020-12-15 06:16 /out/wl/_SUCCESS
-rw-r--r-- 2 root supergroup 28 2020-12-15 06:16 /out/wl/part-00000
[root@node1 test]# hdfs dfs -cat /out/wl/part-00000
against 93
recent 2
you 613
Next exercise: ub.data holds one user,value record per line, and the goal is to gather all of a user's values onto a single line. The input:

user1,3
user2,5
user1,4
user6,3
user2,9
user2,5
user9,1
user8,3
user2,4
user2,4
user5,5
user6,8
user7,2
user2,7
and the expected output is:

user1 3,4
user2 4,4,5,5,7,9
user5 5
user6 3,8
user7 2
user8 3
user9 1
ubMap.py just turns the comma into a tab so that the user id becomes the key:

import sys

for line in sys.stdin:
    key = line.strip().split(',')
    print '\t'.join(key)
ubRed.py collects the values of each user; as before, it relies on the input being sorted by user:

import sys

cur = None
cur_list = []

for line in sys.stdin:
    ss = line.strip().split('\t')
    key = ss[0]
    val = ss[1]
    if cur == None:
        cur = key
    elif cur != key:
        # the user changed: emit the previous user's values
        print '%s\t%s' % (cur, ','.join(cur_list))
        cur = key
        cur_list = []
    cur_list.append(val)

# emit the last user
print '%s\t%s' % (cur, ','.join(cur_list))
[root@node1 test]# cat ub.data |python ubMap.py | sort -k1 | python ubRed.py
user1 3,4
user2 4,4,5,5,7,9
user5 5
user6 3,8
user7 2
user8 3
user9 1
The last exercise is a join of two datasets on the user id. a.txt:

user1,42
user2,55
user3,66
user7,2
user9,38
b.txt:

user2,Hadoop
user3,Spark
user5,Trump
user7,Cap
user88,Laptop
The expected result keeps only the users that appear in both files:

user2 55 Hadoop
user3 66 Spark
user7 2 Cap
The plan: run a mapper over each dataset to reformat and tag it, store both results on HDFS, then start one more MR job that merges the two sets.
mapA.py tags every record from a.txt with the flag 1:

import sys

for line in sys.stdin:
    ss = line.strip().split(',')
    print "%s\t1\t%s" % (ss[0], ss[1])
[root@node1 test]# cat a.txt |python mapA.py
user1 1 42
user2 1 55
user3 1 66
user7 1 2
user9 1 38
mapB.py does the same for b.txt with the flag 2:

import sys

for line in sys.stdin:
    ss = line.strip().split(',')
    print "%s\t2\t%s" % (ss[0], ss[1])
[root@node1 test]# cat b.txt |python mapB.py
user2 2 Hadoop
user3 2 Spark
user5 2 Trump
user7 2 Cap
user88 2 Laptop
As you can see, the records of sets a and b are tagged with a flag (1 for a.txt, 2 for b.txt); in the join the flag tells the reducer which dataset a value came from, and therefore when both sides of a key have been received. Because the records are sorted on (user, flag), a user's a.txt record always arrives just before the matching b.txt record. The job of this stage is to merge the two tagged sets:
redJoin.py:

import sys

val_1 = ""

for line in sys.stdin:
    key, flag, val = line.strip().split('\t')
    if flag == '1':
        # remember the value from a.txt and wait for the matching b.txt record
        val_1 = val
    elif flag == '2' and val_1 != "":
        val_2 = val
        print "%s\t%s\t%s" % (key, val_1, val_2)
        val_1 = ""
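Before going to the cluster, the join can be sanity-checked locally. This is a hypothetical pure-Python simulation in which sorting the tagged records plays the role of the shuffle (sort by user, then by flag), followed by the same merge logic as redJoin.py:

# Hypothetical local check of the join, with the two input files inlined.
a_lines = ["user1,42", "user2,55", "user3,66", "user7,2", "user9,38"]
b_lines = ["user2,Hadoop", "user3,Spark", "user5,Trump", "user7,Cap", "user88,Laptop"]

records = []
for line in a_lines:
    user, val = line.split(',')
    records.append((user, '1', val))          # the three fields mapA.py emits
for line in b_lines:
    user, val = line.split(',')
    records.append((user, '2', val))          # the three fields mapB.py emits

# On the cluster this ordering comes from sorting on the first two fields (user, flag).
records.sort()

val_1 = ""
for key, flag, val in records:
    if flag == '1':
        val_1 = val
    elif flag == '2' and val_1 != "":
        print "%s\t%s\t%s" % (key, val_1, val)
        val_1 = ""
# prints: user2 55 Hadoop / user3 66 Spark / user7 2 Cap

On the cluster, the equivalent sorting and partitioning is requested through the two -jobconf options in runJoin.sh below.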
[root@node1 test]# hdfs dfs -put a.txt /user/hadoop
[root@node1 test]# hdfs dfs -put b.txt /user/hadoop
The whole flow is driven by a single script, runJoin.sh:

STREAM_JAR_PATH="/usr/local/hadoop/hadoop-2.6.1/share/hadoop/tools/lib/hadoop-streaming-2.6.1.jar"
INPUT_FILE_PATH_A="/user/hadoop/a.txt"
INPUT_FILE_PATH_B="/user/hadoop/b.txt"
OUTPUT_PATH_A="/out/a"
OUTPUT_PATH_B="/out/b"
OUTPUT_PATH_JOIN="/out/JOIN"

echo ========================= step 1.
# remove the output directory if it already exists
hdfs dfs -ls $OUTPUT_PATH_A > /dev/null
res=$?
if [ "$res" -eq "0" ]; then
    echo need to delete $OUTPUT_PATH_A
    hdfs dfs -rmr $OUTPUT_PATH_A
else
    echo no need to delete $OUTPUT_PATH_A
fi
hadoop jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_A \
    -output $OUTPUT_PATH_A \
    -mapper "python mapA.py" \
    -file ./mapA.py

echo ========================= step 2.
hdfs dfs -ls $OUTPUT_PATH_B > /dev/null
res=$?
if [ "$res" -eq "0" ]; then
    echo need to delete $OUTPUT_PATH_B
    hdfs dfs -rmr $OUTPUT_PATH_B
else
    echo no need to delete $OUTPUT_PATH_B
fi
hadoop jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH_B \
    -output $OUTPUT_PATH_B \
    -mapper "python mapB.py" \
    -file ./mapB.py

echo ========================= step 3.
hdfs dfs -ls $OUTPUT_PATH_JOIN > /dev/null
res=$?
if [ "$res" -eq "0" ]; then
    echo need to delete $OUTPUT_PATH_JOIN
    hdfs dfs -rmr $OUTPUT_PATH_JOIN
else
    echo no need to delete $OUTPUT_PATH_JOIN
fi
# "cat" as the mapper passes the already-tagged records through unchanged.
# stream.num.map.output.key.fields=2: treat the first two fields (user, flag) as the key,
#   so records are sorted by user and, within a user, the a.txt record (flag 1) comes first.
# num.key.fields.for.partition=1: ask the framework to partition on the first field (the
#   user id) only, so a user's records from both files go to the same reducer.
hadoop jar $STREAM_JAR_PATH \
    -input $OUTPUT_PATH_A,$OUTPUT_PATH_B \
    -output $OUTPUT_PATH_JOIN \
    -mapper "cat" \
    -reducer "python redJoin.py" \
    -file ./redJoin.py \
    -jobconf stream.num.map.output.key.fields=2 \
    -jobconf num.key.fields.for.partition=1
[root@node1 test]# sh runJoin.sh
========================= step 1.
need to delete /out/a
rmr: DEPRECATED: Please use 'rm -r' instead.
20/12/15 09:23:04 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /out/a
20/12/15 09:23:05 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapA.py, /tmp/hadoop-unjar2503113385762558629/] [] /tmp/streamjob9067169500243307079.jar tmpDir=null
20/12/15 09:23:06 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:23:06 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:23:07 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/15 09:23:07 INFO mapreduce.JobSubmitter: number of splits:2
20/12/15 09:23:07 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0031
20/12/15 09:23:07 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0031
20/12/15 09:23:07 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0031/
20/12/15 09:23:07 INFO mapreduce.Job: Running job: job_1608034597178_0031
20/12/15 09:23:13 INFO mapreduce.Job: Job job_1608034597178_0031 running in uber mode : false
20/12/15 09:23:13 INFO mapreduce.Job: map 0% reduce 0%
20/12/15 09:23:22 INFO mapreduce.Job: map 50% reduce 0%
20/12/15 09:23:23 INFO mapreduce.Job: map 100% reduce 0%
20/12/15 09:23:28 INFO mapreduce.Job: map 100% reduce 100%
20/12/15 09:23:29 INFO mapreduce.Job: Job job_1608034597178_0031 completed successfully
20/12/15 09:23:29 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=70
FILE: Number of bytes written=324793
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=240
HDFS: Number of bytes written=54
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=12251
Total time spent by all reduces in occupied slots (ms)=3152
Total time spent by all map tasks (ms)=12251
Total time spent by all reduce tasks (ms)=3152
Total vcore-seconds taken by all map tasks=12251
Total vcore-seconds taken by all reduce tasks=3152
Total megabyte-seconds taken by all map tasks=12545024
Total megabyte-seconds taken by all reduce tasks=3227648
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=54
Map output materialized bytes=76
Input split bytes=174
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=76
Reduce input records=5
Reduce output records=5
Spilled Records=10
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=297
CPU time spent (ms)=1320
Physical memory (bytes) snapshot=487895040
Virtual memory (bytes) snapshot=6174117888
Total committed heap usage (bytes)=258678784
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=66
File Output Format Counters
Bytes Written=54
20/12/15 09:23:29 INFO streaming.StreamJob: Output directory: /out/a
========================= step 2.
need to delete /out/b
rmr: DEPRECATED: Please use 'rm -r' instead.
20/12/15 09:23:33 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /out/b
20/12/15 09:23:34 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [./mapB.py, /tmp/hadoop-unjar76200206261107368/] [] /tmp/streamjob7206024831264575031.jar tmpDir=null
20/12/15 09:23:35 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:23:35 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:23:36 INFO mapred.FileInputFormat: Total input paths to process : 1
20/12/15 09:23:36 INFO mapreduce.JobSubmitter: number of splits:2
20/12/15 09:23:36 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0032
20/12/15 09:23:37 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0032
20/12/15 09:23:37 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0032/
20/12/15 09:23:37 INFO mapreduce.Job: Running job: job_1608034597178_0032
20/12/15 09:23:43 INFO mapreduce.Job: Job job_1608034597178_0032 running in uber mode : false
20/12/15 09:23:43 INFO mapreduce.Job: map 0% reduce 0%
20/12/15 09:23:51 INFO mapreduce.Job: map 100% reduce 0%
20/12/15 09:23:56 INFO mapreduce.Job: map 100% reduce 100%
20/12/15 09:23:56 INFO mapreduce.Job: Job job_1608034597178_0032 completed successfully
20/12/15 09:23:56 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=87
FILE: Number of bytes written=324827
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=266
HDFS: Number of bytes written=71
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=11561
Total time spent by all reduces in occupied slots (ms)=3155
Total time spent by all map tasks (ms)=11561
Total time spent by all reduce tasks (ms)=3155
Total vcore-seconds taken by all map tasks=11561
Total vcore-seconds taken by all reduce tasks=3155
Total megabyte-seconds taken by all map tasks=11838464
Total megabyte-seconds taken by all reduce tasks=3230720
Map-Reduce Framework
Map input records=5
Map output records=5
Map output bytes=71
Map output materialized bytes=93
Input split bytes=174
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=93
Reduce input records=5
Reduce output records=5
Spilled Records=10
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=292
CPU time spent (ms)=1300
Physical memory (bytes) snapshot=483713024
Virtual memory (bytes) snapshot=6174130176
Total committed heap usage (bytes)=259031040
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=92
File Output Format Counters
Bytes Written=71
20/12/15 09:23:56 INFO streaming.StreamJob: Output directory: /out/b
========================= step 3.
need to delete /out/JOIN
rmr: DEPRECATED: Please use 'rm -r' instead.
20/12/15 09:24:00 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /out/JOIN
20/12/15 09:24:02 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
20/12/15 09:24:02 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [./redJoin.py, /tmp/hadoop-unjar7560819781671361131/] [] /tmp/streamjob3781723006571695879.jar tmpDir=null
20/12/15 09:24:02 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:24:03 INFO client.RMProxy: Connecting to ResourceManager at node1/192.168.126.118:8032
20/12/15 09:24:04 INFO mapred.FileInputFormat: Total input paths to process : 2
20/12/15 09:24:04 INFO mapreduce.JobSubmitter: number of splits:3
20/12/15 09:24:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1608034597178_0033
20/12/15 09:24:04 INFO impl.YarnClientImpl: Submitted application application_1608034597178_0033
20/12/15 09:24:04 INFO mapreduce.Job: The url to track the job: http://node1:8088/proxy/application_1608034597178_0033/
20/12/15 09:24:04 INFO mapreduce.Job: Running job: job_1608034597178_0033
20/12/15 09:24:10 INFO mapreduce.Job: Job job_1608034597178_0033 running in uber mode : false
20/12/15 09:24:10 INFO mapreduce.Job: map 0% reduce 0%
20/12/15 09:24:22 INFO mapreduce.Job: map 100% reduce 0%
20/12/15 09:24:29 INFO mapreduce.Job: map 100% reduce 100%
20/12/15 09:24:29 INFO mapreduce.Job: Job job_1608034597178_0033 completed successfully
20/12/15 09:24:30 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=151
FILE: Number of bytes written=436951
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=392
HDFS: Number of bytes written=43
HDFS: Number of read operations=12
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=3
Launched reduce tasks=1
Data-local map tasks=3
Total time spent by all maps in occupied slots (ms)=29703
Total time spent by all reduces in occupied slots (ms)=3315
Total time spent by all map tasks (ms)=29703
Total time spent by all reduce tasks (ms)=3315
Total vcore-seconds taken by all map tasks=29703
Total vcore-seconds taken by all reduce tasks=3315
Total megabyte-seconds taken by all map tasks=30415872
Total megabyte-seconds taken by all reduce tasks=3394560
Map-Reduce Framework
Map input records=10
Map output records=10
Map output bytes=125
Map output materialized bytes=163
Input split bytes=258
Combine input records=0
Combine output records=0
Reduce input groups=10
Reduce shuffle bytes=163
Reduce input records=10
Reduce output records=3
Spilled Records=20
Shuffled Maps =3
Failed Shuffles=0
Merged Map outputs=3
GC time elapsed (ms)=579
CPU time spent (ms)=2060
Physical memory (bytes) snapshot=640995328
Virtual memory (bytes) snapshot=8230125568
Total committed heap usage (bytes)=379858944
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=134
File Output Format Counters
Bytes Written=43
20/12/15 09:24:30 INFO streaming.StreamJob: Output directory: /out/JOIN
[root@node1 test]# hdfs dfs -ls /out/JOIN
Found 2 items
-rw-r--r-- 2 root supergroup 0 2020-12-15 09:24 /out/JOIN/_SUCCESS
-rw-r--r-- 2 root supergroup 43 2020-12-15 09:24 /out/JOIN/part-00000
[root@node1 test]# hdfs dfs -cat /out/JOIN/part-00000
user2 55 Hadoop
user3 66 Spark
user7 2 Cap
"OS error code 1: Operation not permitted"
"OS error code 2: No such file or directory"
"OS error code 3: No such process"
"OS error code 4: Interrupted system call"
"OS error code 5: Input/output error"
"OS error code 6: No such device or address"
"OS error code 7: Argument list too long"
"OS error code 8: Exec format error"
"OS error code 9: Bad file descriptor"
"OS error code 10: No child processes"
"OS error code 11: Resource temporarily unavailable"
"OS error code 12: Cannot allocate memory"
"OS error code 13: Permission denied"
"OS error code 14: Bad address"
"OS error code 15: Block device required"
"OS error code 16: Device or resource busy"
"OS error code 17: File exists"
"OS error code 18: Invalid cross-device link"
"OS error code 19: No such device"
"OS error code 20: Not a directory"
"OS error code 21: Is a directory"
"OS error code 22: Invalid argument"
"OS error code 23: Too many open files in system"
"OS error code 24: Too many open files"
"OS error code 25: Inappropriate ioctl for device"
"OS error code 26: Text file busy"
"OS error code 27: File too large"
"OS error code 28: No space left on device"
"OS error code 29: Illegal seek"
"OS error code 30: Read-only file system"
"OS error code 31: Too many links"
"OS error code 32: Broken pipe"
"OS error code 33: Numerical argument out of domain"
"OS error code 34: Numerical result out of range"
"OS error code 35: Resource deadlock avoided"
"OS error code 36: File name too long"
"OS error code 37: No locks available"
"OS error code 38: Function not implemented"
"OS error code 39: Directory not empty"
"OS error code 40: Too many levels of symbolic links"
"OS error code 42: No message of desired type"
"OS error code 43: Identifier removed"
"OS error code 44: Channel number out of range"
"OS error code 45: Level 2 not synchronized"
"OS error code 46: Level 3 halted"
"OS error code 47: Level 3 reset"
"OS error code 48: Link number out of range"
"OS error code 49: Protocol driver not attached"
"OS error code 50: No CSI structure available"
"OS error code 51: Level 2 halted"
"OS error code 52: Invalid exchange"
"OS error code 53: Invalid request descriptor"
"OS error code 54: Exchange full"
"OS error code 55: No anode"
"OS error code 56: Invalid request code"
"OS error code 57: Invalid slot"
"OS error code 59: Bad font file format"
"OS error code 60: Device not a stream"
"OS error code 61: No data available"
"OS error code 62: Timer expired"
"OS error code 63: Out of streams resources"
"OS error code 64: Machine is not on the network"
"OS error code 65: Package not installed"
"OS error code 66: Object is remote"
"OS error code 67: Link has been severed"
"OS error code 68: Advertise error"
"OS error code 69: Srmount error"
"OS error code 70: Communication error on send"
"OS error code 71: Protocol error"
"OS error code 72: Multihop attempted"
"OS error code 73: RFS specific error"
"OS error code 74: Bad message"
"OS error code 75: Value too large for defined data type"
"OS error code 76: Name not unique on network"
"OS error code 77: File descriptor in bad state"
"OS error code 78: Remote address changed"
"OS error code 79: Can not access a needed shared library"
"OS error code 80: Accessing a corrupted shared library"
"OS error code 81: .lib section in a.out corrupted"
"OS error code 82: Attempting to link in too many shared libraries"
"OS error code 83: Cannot exec a shared library directly"
"OS error code 84: Invalid or incomplete multibyte or wide character"
"OS error code 85: Interrupted system call should be restarted"
"OS error code 86: Streams pipe error"
"OS error code 87: Too many users"
"OS error code 88: Socket operation on non-socket"
"OS error code 89: Destination address required"
"OS error code 90: Message too long"
"OS error code 91: Protocol wrong type for socket"
"OS error code 92: Protocol not available"
"OS error code 93: Protocol not supported"
"OS error code 94: Socket type not supported"
"OS error code 95: Operation not supported"
"OS error code 96: Protocol family not supported"
"OS error code 97: Address family not supported by protocol"
"OS error code 98: Address already in use"
"OS error code 99: Cannot assign requested address"
"OS error code 100: Network is down"
"OS error code 101: Network is unreachable"
"OS error code 102: Network dropped connection on reset"
"OS error code 103: Software caused connection abort"
"OS error code 104: Connection reset by peer"
"OS error code 105: No buffer space available"
"OS error code 106: Transport endpoint is already connected"
"OS error code 107: Transport endpoint is not connected"
"OS error code 108: Cannot send after transport endpoint shutdown"
"OS error code 109: Too many references: cannot splice"
"OS error code 110: Connection timed out"
"OS error code 111: Connection refused"
"OS error code 112: Host is down"
"OS error code 113: No route to host"
"OS error code 114: Operation already in progress"
"OS error code 115: Operation now in progress"
"OS error code 116: Stale NFS file handle"
"OS error code 117: Structure needs cleaning"
"OS error code 118: Not a XENIX named type file"
"OS error code 119: No XENIX semaphores available"
"OS error code 120: Is a named type file"
"OS error code 121: Remote I/O error"
"OS error code 122: Disk quota exceeded"
"OS error code 123: No medium found"
"OS error code 124: Wrong medium type"
"OS error code 125: Operation canceled"
"OS error code 126: Required key not available"
"OS error code 127: Key has expired"
"OS error code 128: Key has been revoked"
"OS error code 129: Key was rejected by service"
"OS error code 130: Owner died"
"OS error code 131: State not recoverable"
"MySQL error code 132: Old database file"
"MySQL error code 133: No record read before update"
"MySQL error code 134: Record was already deleted (or record file crashed)"
"MySQL error code 135: No more room in record file"
"MySQL error code 136: No more room in index file"
"MySQL error code 137: No more records (read after end of file)"
"MySQL error code 138: Unsupported extension used for table"
"MySQL error code 139: Too big row"
"MySQL error code 140: Wrong create options"
"MySQL error code 141: Duplicate unique key or constraint on write or update"
"MySQL error code 142: Unknown character set used"
"MySQL error code 143: Conflicting table definitions in sub-tables of MERGE table"
"MySQL error code 144: Table is crashed and last repair failed"
"MySQL error code 145: Table was marked as crashed and should be repaired"
"MySQL error code 146: Lock timed out; Retry transaction"
"MySQL error code 147: Lock table is full; Restart program with a larger locktable"
"MySQL error code 148: Updates are not allowed under a read only transactions"
"MySQL error code 149: Lock deadlock; Retry transaction"
"MySQL error code 150: Foreign key constraint is incorrectly formed"
"MySQL error code 151: Cannot add a child row"
"MySQL error code 152: Cannot delete a parent row"