This article is based on the Hadoop cluster that has already been set up in our lab.
Reference: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
1. Write mapper.py
#!/usr/bin/python2.6
import sys

# Read lines from standard input, split each line into words,
# and emit a tab-separated <word, 1> pair for every word.
# Hadoop streaming uses the tab as its default key/value separator,
# so the key must be followed by '\t' (a plain space here would make
# the entire line the key).
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
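The mapper can be sanity-checked locally before touching the cluster; a minimal sketch (the sample sentence is made up):

echo "foo foo quux labs foo bar quux" | python mapper.py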
2. Write reducer.py
#!/usr/bin/python
import sys

current_word = None
current_count = 0
word = None

# Input arrives from Hadoop streaming sorted by key, so all counts
# for the same word are adjacent; sum them and print the total
# whenever the word changes.
for line in sys.stdin:
    line = line.strip()
    # split on the first tab only, matching the mapper's '%s\t%s' output
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # skip lines whose count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_word = word
        current_count = count

# flush the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
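The whole pipeline can likewise be tested locally; sort stands in for Hadoop's shuffle phase, which delivers mapper output to the reducer grouped by key (sample input made up):

echo "foo foo quux labs foo bar quux" | python mapper.py | sort -k1,1 | python reducer.py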
3. Upload mapper.py and reducer.py to the /home/hduser directory on HadoopMaster
Make sure both files are executable:
chmod +x /home/hduser/mapper.py
chmod +x /home/hduser/reducer.py
Note: if the following error appears at run time: /usr/bin/python^M: bad interpreter: No such file or directory
The cause is that files written on Windows use the DOS format (CRLF line endings); after uploading them to HadoopMaster (a Linux system), the file format must be changed to unix:
vi filename      # open the file
:set ff          # check the current file format
:set ff=unix     # change the format to unix
:wq              # save and quit
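If you would rather not open every file in vi, the conversion can also be done non-interactively; a sketch assuming dos2unix or GNU sed is available on HadoopMaster:

dos2unix /home/hduser/mapper.py /home/hduser/reducer.py
# or, with GNU sed, strip the trailing carriage returns:
sed -i 's/\r$//' /home/hduser/mapper.py /home/hduser/reducer.py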
4. Upload the test file to HDFS via Hue
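The same upload can also be done from the command line instead of Hue; a sketch in which the local test file /home/hduser/test.txt and the HDFS directory /user/hduser/input are placeholder names:

hadoop fs -mkdir -p /user/hduser/input
hadoop fs -put /home/hduser/test.txt /user/hduser/input/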
5. Switch to the hdfs user and run the hadoop jar command
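A sketch of the full streaming command; the streaming jar path varies by distribution (e.g. $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar in vanilla Hadoop), and the HDFS paths follow the placeholder names above:

su - hdfs
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -file /home/hduser/mapper.py  -mapper mapper.py \
    -file /home/hduser/reducer.py -reducer reducer.py \
    -input /user/hduser/input \
    -output /user/hduser/output   # the output directory must not already exist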
6. Execution results
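Once the job finishes, the output can be inspected directly from HDFS; a sketch using the placeholder paths above (with a single reducer the result lands in one part-00000 file):

hadoop fs -ls /user/hduser/output
hadoop fs -cat /user/hduser/output/part-00000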