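What is mrjob
mrjob is a Python programming framework built on top of the MapReduce programming interface (streaming) of Hadoop and EMR.
First install Python 2.5 or later (corresponding to mrjob 0.4).
Version currently in production: Python 2.6.8
mrjob only needs to be installed on the scheduling machine: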
http://pythonhosted.org/mrjob/guides/quickstart.html
Installation steps:
Change into the directory where the mrjob package was unpacked
Install with: python setup.py install
from mrjob.job import MRJob


class MRWordCounter(MRJob):
    # Emit (word, 1) for every word in the input line.
    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    # Sum the counts collected for each word.
    def reducer(self, word, occurrences):
        yield word, sum(occurrences)


if __name__ == '__main__':
    MRWordCounter.run()
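
To see what the word counter computes without a cluster, the map, shuffle and reduce phases can be imitated in plain Python. This is only a rough sketch: run_local is a hypothetical helper written for illustration and is not part of mrjob, and the sort-then-group step merely approximates how the framework groups the mapper output by key before calling the reducer.

```python
from itertools import groupby
from operator import itemgetter


def mapper(line):
    # Same logic as MRWordCounter.mapper, without the mrjob class.
    for word in line.split():
        yield word, 1


def reducer(word, occurrences):
    # Same logic as MRWordCounter.reducer.
    yield word, sum(occurrences)


def run_local(lines):
    # Hypothetical helper: map every line, sort by key, then reduce per key.
    pairs = [pair for line in lines for pair in mapper(line)]
    pairs.sort(key=itemgetter(0))
    results = []
    for word, group in groupby(pairs, key=itemgetter(0)):
        occurrences = (count for _, count in group)
        results.extend(reducer(word, occurrences))
    return results


if __name__ == '__main__':
    for word, total in run_local(["a b a", "b c"]):
        print("%s\t%d" % (word, total))
```

Note that the reducer receives occurrences as a lazy stream of values for one key, which is why it can be passed directly to sum().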
Notes:
Generator usage:
occurrences:
for each in occurrences:
    # TODO: operate on every value
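
Because occurrences is a generator, it can be iterated only once. The sketch below shows the pitfall with plain Python generators; the function names are made up for illustration. Copying the values into a list is one possible workaround, assuming the values for a single key fit in memory.

```python
def naive_count_and_sum(occurrences):
    # The first pass consumes the generator ...
    count = sum(1 for _ in occurrences)
    # ... so the second pass sees nothing and sums to 0.
    total = sum(occurrences)
    return count, total


def count_and_sum(occurrences):
    # Materialize the values once, then reuse the list freely.
    values = list(occurrences)
    return len(values), sum(values)


if __name__ == '__main__':
    print(naive_count_and_sum(n for n in [1, 2, 3]))
    print(count_and_sum(n for n in [1, 2, 3]))
```

When only a single aggregate is needed, looping over occurrences once and accumulating the result avoids the extra list.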
python pyfile.py infilename.file
Write the output to the file outputfilename.file:
python pyfile.py infilename.file > outputfilename.file
python pyfile.py infilename.file -r hadoop
Write the output to the file outputfilename.file:
python pyfile.py infilename.file -r hadoop > outputfilename.file
python pyfile.py infilename.file -r hadoop --mapper --step-num=0
Solution:
--file filename.file (uploads filename.file along with the job)
Website: http://stackoverflow.com/
Official website
How to control the number of map and reduce tasks?
Data-flow switching issues?