Hadoop from Beginner to Expert 41: Developing MapReduce Programs in Python, Part 2

In the previous article we developed a Python MapReduce WordCount program that counts the occurrences of every word. This article shows how to add a whitelist to WordCount, so that only whitelisted words are counted.

Case study: develop a WordCount program with a whitelist in Python and submit it to Hadoop.

1. The standalone Python WordCount program

(1) The Mapper stage:

# mapper.py
import sys
import re

def load_white_list(path):
    # Read the whitelist file, one word per line. Entries are lowercased
    # so they match the lowercased words the mapper emits (whitelist.data
    # contains "Beijing"/"China" while the stream carries "beijing"/"china").
    whitelist = set()
    with open(path, 'r') as f:
        for line in f:
            word = line.strip().lower()
            if word:
                whitelist.add(word)
    return whitelist

def mapper(path):
    whitelist = load_white_list(path)
    p = re.compile(r'\w+')
    for line in sys.stdin:
        words = line.strip().split(' ')
        for word in words:
            # Keep only the word characters (drops surrounding punctuation).
            w = p.findall(word)
            if len(w) < 1:
                continue
            s = w[0].strip().lower()
            # Emit "word<TAB>1" only for whitelisted words.
            if s != "" and (s in whitelist):
                print("%s\t%s" % (s, 1))

if __name__ == "__main__":
    # argv[1] names the function to run ("mapper");
    # argv[2] is the path to the whitelist file.
    module_name = sys.modules[__name__]
    function_name = sys.argv[1]
    whitelist_path = sys.argv[2]
    mapper_func = getattr(module_name, function_name)
    mapper_func(whitelist_path)
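
Before wiring up the full pipeline, the mapper can be smoke-tested on its own: it reads text from stdin and prints one tab-separated "word 1" pair per whitelisted word:

# echo "I love Beijing" | python3 mapper.py mapper whitelist.data
beijing	1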

(2) The Reducer stage:

# reducer.py
import sys

def reducer():
    # Accumulate counts per word; input lines look like "word<TAB>1",
    # exactly as the mapper emits them.
    res = dict()
    for word_one in sys.stdin:
        word, one = word_one.strip().split('\t')
        res[word] = res.get(word, 0) + int(one)
    # Print the whole dict on one line, e.g. {'beijing': 2, 'china': 2}
    print(res)

if __name__ == "__main__":
    # argv[1] names the function to run ("reducer").
    module_name = sys.modules[__name__]
    function_name = sys.argv[1]
    reducer_func = getattr(module_name, function_name)
    reducer_func()
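
Printing a Python dict works for a demo, but conventional Hadoop Streaming output is one "word<TAB>count" line per key. Because the framework feeds the reducer its input sorted by key, the reducer can also stream counts without holding a dict in memory. A minimal sketch of that variant (the file and function name reducer_stream are ours, not part of the original program):

# reducer_stream.py -- streaming-idiomatic variant (illustrative sketch)
import sys

def reducer_stream():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, one = line.strip().split('\t')
        if word == current_word:
            current_count += int(one)
        else:
            # Keys arrive sorted, so a new key means the previous one is done.
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(one)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

if __name__ == "__main__":
    reducer_stream()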

(3) Run the program
The test data are as follows:

# cat a.data
I love Beijing
I love China
Beijing is the capital of China

# cat whitelist.data
Beijing
China

Run the pipeline (simulating the MapReduce flow locally; the sort -k1 step stands in for Hadoop's shuffle-and-sort phase):

# cat a.data | python3 mapper.py mapper whitelist.data | sort -k1 | python3 reducer.py reducer
{'beijing': 2, 'china': 2}

2. Submitting the Python program to Hadoop

(1) Upload the test data to HDFS

# hdfs dfs -put a.data /data
# hdfs dfs -put whitelist.data /data

# hdfs dfs -cat /data/a.data
I love Beijing
I love China
Beijing is the capital of China

# hdfs dfs -cat /data/whitelist.data
Beijing
China

(2) The Hadoop Streaming tool
To submit a Python program to Hadoop, we use the hadoop-streaming JAR that ships with Hadoop:

# ls $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*
/root/trainings/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar

(3) Submit the job

# hadoop jar /root/trainings/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-input /data/a.data \
-output /output/wc2 \
-mapper "python3 mapper.py mapper whitelist.data" \
-reducer "python3 reducer.py reducer" \
-jobconf "mapred.reduce.tasks=2" \
-file ./mapper.py \
-file ./reducer.py \
-file ./whitelist.data
......
18/12/30 22:38:29 INFO mapreduce.Job: map 0% reduce 0%
18/12/30 22:38:38 INFO mapreduce.Job: map 100% reduce 0%
18/12/30 22:38:42 INFO mapreduce.Job: map 100% reduce 50%
18/12/30 22:38:43 INFO mapreduce.Job: map 100% reduce 100%
18/12/30 22:38:44 INFO mapreduce.Job: Job job_1546139257431_0003 completed successfully

Notes on the main hadoop-streaming options:

  • input: the input data file or directory
  • output: the output directory; it must not exist before the job runs
  • mapper: the command that runs the Mapper
  • reducer: the command that runs the Reducer
  • file: a file to ship with the job; it is distributed to every compute node, which is why mapper.py can open whitelist.data from its working directory
  • jobconf: job configuration parameters; here the job runs with 2 reduce tasks, so the result is split across two output files (see the note below)
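
Aside: on Hadoop 2.x, -file and -jobconf still work but are deprecated; the generic options -files and -D are their replacements, and mapreduce.job.reduces is the current name of mapred.reduce.tasks. An equivalent submission would look roughly like this (generic options must come before the streaming-specific ones):

# hadoop jar /root/trainings/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-D mapreduce.job.reduces=2 \
-files ./mapper.py,./reducer.py,./whitelist.data \
-input /data/a.data \
-output /output/wc2 \
-mapper "python3 mapper.py mapper whitelist.data" \
-reducer "python3 reducer.py reducer"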

(4) View the results

# hdfs dfs -ls /output/wc2
Found 3 items
-rw-r--r-- 1 root supergroup 0 2018-12-30 22:38 /output/wc2/_SUCCESS
-rw-r--r-- 1 root supergroup 14 2018-12-30 22:38 /output/wc2/part-00000
-rw-r--r-- 1 root supergroup 16 2018-12-30 22:38 /output/wc2/part-00001
# hdfs dfs -cat /output/wc2/part-00000
{'china': 2}
# hdfs dfs -cat /output/wc2/part-00001
{'beijing': 2}
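
Why did 'china' and 'beijing' land in different part files? With two reduce tasks, the shuffle assigns each key to a reducer by hashing it modulo the number of reducers (Hadoop's default HashPartitioner, which uses the key's Java hashCode). A toy Python illustration of the idea, not a reproduction of Hadoop's exact bucket assignment:

# partition_demo.py -- toy hash partitioner, for illustration only
def partition(key, num_reducers=2):
    # Simple deterministic string hash; Hadoop uses Java's hashCode.
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0x7fffffff
    return h % num_reducers

for word in ["beijing", "china"]:
    print("%s -> part-%05d" % (word, partition(word)))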

This concludes developing a WordCount program with a whitelist in Python and running it on a Hadoop cluster. Have fun!
