Big Data: Implementing Hadoop MapReduce with mrjob

MapReduce implementation

  • WordCount
  • top-N
  • Running in inline mode
  • Running in local mode
  • Submitting to a cluster
  • hadoop-streaming

WordCount

from mrjob.job import MRJob

class MRWordCounter(MRJob):

    def mapper(self, _, line):    # the key is None for plain text input, so it is ignored
        for word in line.split():
            yield word, 1    # emit (word, 1) for every word on the line

    def reducer(self, word, occurrences):
        yield word, sum(occurrences)    # occurrences is a stream of 1s for this word

if __name__ == '__main__':
    MRWordCounter.run()
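
A quick sanity check with the default inline runner (demo.txt and its contents are made up here; mrjob JSON-encodes the keys and values it prints):

$ echo "hello world hello" > demo.txt
$ python3 word_count.py demo.txt
"hello"	2
"world"	1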

top-N

from mrjob.job import MRJob, MRStep
import heapq

class TopNWords(MRJob):

    def mapper(self, _, line):
        if line.strip() != "":
            for word in line.strip().split():
                yield word, 1

    def combiner(self, word, counts):    # runs between mapper and reducer; pre-aggregates each mapper's output locally
        yield word, sum(counts)

    def reducer_sum(self, word, counts):
        yield None, (sum(counts), word)    # one shared key funnels every (count, word) pair into a single reducer

    def top_n_reducer(self, _, word_cnts):    # use a heap to pull out the 2 words with the largest counts
        for cnt, word in heapq.nlargest(2, word_cnts):
            yield word, cnt

    def steps(self):    # override steps() to wire up the custom mapper, combiner, and reducers
        return [
            MRStep(mapper=self.mapper,
                   combiner=self.combiner,
                   reducer=self.reducer_sum),
            MRStep(reducer=self.top_n_reducer)
        ]

if __name__ == '__main__':
    TopNWords.run()
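
The top-N count is hardcoded to 2 above. With mrjob 0.6+ (an assumption about the installed version) it can be exposed as a command-line option via a passthrough argument. A minimal sketch, where the --top-n option name is made up:

from mrjob.job import MRJob, MRStep
import heapq

class TopNWordsArg(MRJob):    # hypothetical variant of TopNWords with a configurable N

    def configure_args(self):
        super(TopNWordsArg, self).configure_args()
        # passthru args are forwarded to every mapper/reducer task
        self.add_passthru_arg('--top-n', type=int, default=2,
                              help='how many of the most frequent words to keep')

    def mapper(self, _, line):
        for word in line.split():
            yield word, 1

    def reducer_sum(self, word, counts):
        yield None, (sum(counts), word)

    def top_n_reducer(self, _, word_cnts):
        for cnt, word in heapq.nlargest(self.options.top_n, word_cnts):
            yield word, cnt

    def steps(self):
        return [
            MRStep(mapper=self.mapper, reducer=self.reducer_sum),
            MRStep(reducer=self.top_n_reducer)
        ]

if __name__ == '__main__':
    TopNWordsArg.run()

Invoked as, e.g., python3 top_n.py --top-n 5 input.txt (the file name top_n.py is assumed).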

Running in inline mode

  • Convenient for debugging: the job runs in a single process that simulates task execution and results
python3 word_count.py input.txt > output.txt
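
Inline is also what job.make_runner() uses by default, so a job can be driven from another script or a test. A minimal sketch, assuming the WordCount class above lives in word_count.py and mrjob ≥ 0.6 (for cat_output/parse_output):

from word_count import MRWordCounter

job = MRWordCounter(args=['input.txt'])
with job.make_runner() as runner:    # defaults to the inline runner
    runner.run()
    # decode the runner's raw output back into (key, value) pairs
    for word, count in job.parse_output(runner.cat_output()):
        print(word, count)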

Running in local mode

  • Differs from inline in that each task runs in its own subprocess, which is closer to how a real cluster behaves
python3 word_count.py -r local input.txt > output1.txt

Submitting to a cluster

  • Set the Hadoop job scheduling priority (VERY_HIGH | HIGH):
    --jobconf mapreduce.job.priority=VERY_HIGH
  • Limit the number of map and reduce tasks:
    --jobconf mapreduce.map.tasks=2
    --jobconf mapreduce.reduce.tasks=5
python3 word_count.py -r hadoop --python-bin /usr/bin/python3 hdfs:///test.txt -o hdfs:///output
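
Putting these together, one full submission command (using the example values above) might look like:

python3 word_count.py -r hadoop \
    --python-bin /usr/bin/python3 \
    --jobconf mapreduce.job.priority=VERY_HIGH \
    --jobconf mapreduce.map.tasks=2 \
    --jobconf mapreduce.reduce.tasks=5 \
    hdfs:///test.txt -o hdfs:///output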

hadoop-streaming

An mrjob script can also be submitted through the hadoop-streaming jar directly; the same script then serves as both the mapper and the reducer.

run.sh

HADOOP_CMD="/root/bigdata/hadoop-2.6.0-cdh5.7.0/bin/hadoop"
STREAM_JAR_PATH="/root/bigdata/hadoop-2.6.0-cdh5.7.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.7.0.jar"

INPUT_FILE_PATH="/wd/words.txt"
OUTPUT_PATH="/wd/output_hadoop_tmp"

# clear any previous output directory (the job fails if it already exists)
$HADOOP_CMD fs -rm -R -skipTrash $OUTPUT_PATH

# word_count.py is an mrjob script, so tell it which phase to run
$HADOOP_CMD jar $STREAM_JAR_PATH \
-input $INPUT_FILE_PATH \
-output $OUTPUT_PATH \
-mapper "/root/bigdata/python3/bin/python3 word_count.py --mapper" \
-reducer "/root/bigdata/python3/bin/python3 word_count.py --reducer" \
-file ./word_count.py
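
For reference, this is the contract Hadoop Streaming itself imposes: the mapper and reducer are ordinary programs that read stdin and write tab-separated key/value lines to stdout, with the reducer receiving its input already sorted by key. A standalone (non-mrjob) sketch; the file names mapper.py and reducer.py are hypothetical:

# mapper.py
import sys

for line in sys.stdin:
    for word in line.split():
        print('%s\t%d' % (word, 1))    # tab-separated key/value pair

# reducer.py
import sys

cur_word, cur_count = None, 0
for line in sys.stdin:    # input arrives sorted by key
    word, count = line.rstrip('\n').split('\t', 1)
    if word != cur_word:
        if cur_word is not None:
            print('%s\t%d' % (cur_word, cur_count))    # flush the finished key
        cur_word, cur_count = word, 0
    cur_count += int(count)
if cur_word is not None:
    print('%s\t%d' % (cur_word, cur_count))    # flush the last key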
