Hadoop from Beginner to Expert 41: Developing MapReduce Programs in Python, Part 2

In the previous article we developed a Python MapReduce WordCount program that counts the occurrences of every word. This article shows how to add a whitelist to WordCount, so that only whitelisted words are counted.

Case study: develop a WordCount program with a whitelist in Python and submit it to Hadoop.

1. The standalone Python WordCount program

(1) The Mapper stage:

# mapper.py
import sys
import re

def load_white_list(path):
    # Read the whitelist file, one word per line. Entries are lowercased
    # so they match the lowercased words the mapper emits (whitelist.data
    # contains "Beijing"/"China" while the stream carries "beijing"/"china").
    whitelist = set()
    with open(path, 'r') as f:
        for line in f:
            word = line.strip().lower()
            if word:
                whitelist.add(word)
    return whitelist

def mapper(path):
    whitelist = load_white_list(path)
    p = re.compile(r'\w+')
    for line in sys.stdin:
        words = line.strip().split(' ')
        for word in words:
            # Keep only the word characters (drops surrounding punctuation).
            w = p.findall(word)
            if len(w) < 1:
                continue
            s = w[0].strip().lower()
            # Emit "word<TAB>1" only for whitelisted words.
            if s != "" and (s in whitelist):
                print("%s\t%s" % (s, 1))

if __name__ == "__main__":
    # argv[1] names the function to run ("mapper");
    # argv[2] is the path to the whitelist file.
    module_name = sys.modules[__name__]
    function_name = sys.argv[1]
    whitelist_path = sys.argv[2]
    mapper_func = getattr(module_name, function_name)
    mapper_func(whitelist_path)
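
Before wiring up the full pipeline, the mapper can be smoke-tested on its own: it reads text from stdin and prints one tab-separated "word 1" pair per whitelisted word:

# echo "I love Beijing" | python3 mapper.py mapper whitelist.data
beijing	1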

(2) The Reducer stage:

# reducer.py
import sys

def reducer():
    # Accumulate counts per word; input lines look like "word<TAB>1",
    # exactly as the mapper emits them.
    res = dict()
    for word_one in sys.stdin:
        word, one = word_one.strip().split('\t')
        res[word] = res.get(word, 0) + int(one)
    # Print the whole dict on one line, e.g. {'beijing': 2, 'china': 2}
    print(res)

if __name__ == "__main__":
    # argv[1] names the function to run ("reducer").
    module_name = sys.modules[__name__]
    function_name = sys.argv[1]
    reducer_func = getattr(module_name, function_name)
    reducer_func()
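
Printing a Python dict works for a demo, but conventional Hadoop Streaming output is one "word<TAB>count" line per key. Because the framework feeds the reducer its input sorted by key, the reducer can also stream counts without holding a dict in memory. A minimal sketch of that variant (the file and function name reducer_stream are ours, not part of the original program):

# reducer_stream.py -- streaming-idiomatic variant (illustrative sketch)
import sys

def reducer_stream():
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, one = line.strip().split('\t')
        if word == current_word:
            current_count += int(one)
        else:
            # Keys arrive sorted, so a new key means the previous one is done.
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(one)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

if __name__ == "__main__":
    reducer_stream()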

(3) Run the program
The test data are as follows:

# cat a.data
I love Beijing
I love China
Beijing is the capital of China

# cat whitelist.data
Beijing
China

Run the pipeline (simulating the MapReduce flow locally; the sort -k1 step stands in for Hadoop's shuffle-and-sort phase):

# cat a.data | python3 mapper.py mapper whitelist.data | sort -k1 | python3 reducer.py reducer
{'beijing': 2, 'china': 2}

2. Submitting the Python program to Hadoop

(1) Upload the test data to HDFS

# hdfs dfs -put a.data /data
# hdfs dfs -put whitelist.data /data

# hdfs dfs -cat /data/a.data
I love Beijing
I love China
Beijing is the capital of China

# hdfs dfs -cat /data/whitelist.data
Beijing
China

(2) The Hadoop Streaming tool
To submit a Python program to Hadoop, we use the hadoop-streaming JAR that ships with Hadoop:

# ls $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*
/root/trainings/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar

(3) Submit the job

# hadoop jar /root/trainings/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-input /data/a.data \
-output /output/wc2 \
-mapper "python3 mapper.py mapper whitelist.data" \
-reducer "python3 reducer.py reducer" \
-jobconf "mapred.reduce.tasks=2" \
-file ./mapper.py \
-file ./reducer.py \
-file ./whitelist.data
......
18/12/30 22:38:29 INFO mapreduce.Job: map 0% reduce 0%
18/12/30 22:38:38 INFO mapreduce.Job: map 100% reduce 0%
18/12/30 22:38:42 INFO mapreduce.Job: map 100% reduce 50%
18/12/30 22:38:43 INFO mapreduce.Job: map 100% reduce 100%
18/12/30 22:38:44 INFO mapreduce.Job: Job job_1546139257431_0003 completed successfully

Notes on the main hadoop-streaming options:

  • input: the input data file or directory
  • output: the output directory; it must not exist before the job runs
  • mapper: the command that runs the Mapper
  • reducer: the command that runs the Reducer
  • file: a file to ship with the job; it is distributed to every compute node, which is why mapper.py can open whitelist.data from its working directory
  • jobconf: job configuration parameters; here the job runs with 2 reduce tasks, so the result is split across two output files (see the note below)
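
Aside: on Hadoop 2.x, -file and -jobconf still work but are deprecated; the generic options -files and -D are their replacements, and mapreduce.job.reduces is the current name of mapred.reduce.tasks. An equivalent submission would look roughly like this (generic options must come before the streaming-specific ones):

# hadoop jar /root/trainings/hadoop-2.7.3/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar \
-D mapreduce.job.reduces=2 \
-files ./mapper.py,./reducer.py,./whitelist.data \
-input /data/a.data \
-output /output/wc2 \
-mapper "python3 mapper.py mapper whitelist.data" \
-reducer "python3 reducer.py reducer"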

(4) View the results

# hdfs dfs -ls /output/wc2
Found 3 items
-rw-r--r-- 1 root supergroup 0 2018-12-30 22:38 /output/wc2/_SUCCESS
-rw-r--r-- 1 root supergroup 14 2018-12-30 22:38 /output/wc2/part-00000
-rw-r--r-- 1 root supergroup 16 2018-12-30 22:38 /output/wc2/part-00001
# hdfs dfs -cat /output/wc2/part-00000
{'china': 2}
# hdfs dfs -cat /output/wc2/part-00001
{'beijing': 2}
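
Why did 'china' and 'beijing' land in different part files? With two reduce tasks, the shuffle assigns each key to a reducer by hashing it modulo the number of reducers (Hadoop's default HashPartitioner, which uses the key's Java hashCode). A toy Python illustration of the idea, not a reproduction of Hadoop's exact bucket assignment:

# partition_demo.py -- toy hash partitioner, for illustration only
def partition(key, num_reducers=2):
    # Simple deterministic string hash; Hadoop uses Java's hashCode.
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0x7fffffff
    return h % num_reducers

for word in ["beijing", "china"]:
    print("%s -> part-%05d" % (word, partition(word)))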

This concludes developing a WordCount program with a whitelist in Python and running it on a Hadoop cluster. Have fun!
