Hadoop MapReduce

		

1. The Problem MapReduce Solves

1) Data problem: processing a 10 GB TXT file
2) Everyday analogy: counting and classifying the books in all of Shanghai's libraries
		

2. What MapReduce Is

MapReduce is a distributed, offline computing framework and a programming model for parallel processing of large data sets (over 1 TB); it runs your program on a distributed system. Its core concepts are Map and Reduce: you specify a Map function that transforms a set of key/value pairs into a new set of key/value pairs, and a concurrent Reduce function that processes all values sharing the same key. It can be applied to large-scale graph processing, text processing, and similar workloads.
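The Map/Reduce model described above can be sketched in a few lines of plain Python, with no framework at all (a toy illustration of the programming model, not Hadoop's API):

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Toy MapReduce: map each record to (key, value) pairs,
    group ('shuffle') the values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # Map: record -> key/value pairs
            groups[key].append(value)       # Shuffle: group values by key
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count expressed in this model
result = map_reduce(
    ["hello world", "hello mapreduce"],
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, values: sum(values),
)
print(result)  # {'hello': 2, 'world': 1, 'mapreduce': 1}
```

The real framework adds what this sketch omits: partitioning across machines, sorting, and fault tolerance.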
		

3. The Design Philosophy of MapReduce

1) Distributed computing
   1. Distributed computing breaks an application into many small parts and assigns them to multiple compute nodes. This shortens the overall computation time and greatly improves efficiency.
2) Move the computation, not the data
   1. The computing program is moved to the cluster nodes that already hold the data, and the computation runs there.
		

4-1. Components of the MapReduce Framework

1) The Mapper in detail (split)
   1. The Mapper is responsible for "splitting": it decomposes the complex task it receives into a number of simple tasks and executes them.
   2. "Simple task" means several things:
      1. the data size or computation is much smaller than in the original task;
      2. computation happens close to the data, i.e. a task is assigned to a node that stores the data it needs;
      3. the small tasks can run in parallel, with almost no dependencies among them.
		

4-2. Components of the MapReduce Framework

1) The Reducer in detail (merge)
   1. The Reducer's job is to aggregate the results of the map phase and emit the output.
   2. The number of Reducers is determined by the mapred.reduce.tasks property in mapred-site.xml. The default is 1, and users may override it.
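For example, to raise the reducer count from the default of 1 to 4, set the property in mapred-site.xml (4 is just an illustrative value):

```xml
<property>
    <name>mapred.reduce.tasks</name>
    <value>4</value>
</property>
```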
		

4-3. Components of the MapReduce Framework

1) The Shuffle in detail
   1. The shuffle is the step between the mapper and the reducer, and is counted as part of the Reduce phase.
   2. It groups and sorts the map output by key, which simplifies the reducer's work.
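The effect of the shuffle can be mimicked in a few lines of Python: sort the mapper output by key, then group it so each reducer call sees one key together with all of its values (a conceptual sketch, not Hadoop's implementation):

```python
from itertools import groupby
from operator import itemgetter

# Simulated mapper output: unordered (key, value) pairs
map_output = [("hello", 1), ("world", 1), ("hello", 1), ("hive", 1)]

# Shuffle: sort by key, then group, so each key's values arrive together
shuffled = []
for key, pairs in groupby(sorted(map_output, key=itemgetter(0)), key=itemgetter(0)):
    shuffled.append((key, [v for _, v in pairs]))

print(shuffled)  # [('hello', [1, 1]), ('hive', [1]), ('world', [1])]
```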
		

5. MR 架构

1) 一主多从架构 1. 主 JobTracker:(RM) 1. 负责调度分配每一个子任务 task 运行于 TaskTracker 上,如果发现有失败的 task 就重新分配其任务到其他节点。每一个 hadoop 集群中只一个 JobTracker, 一般它运行在 Master 节点上。 2. 从 TaskTracker:(NM) 1. TaskTracker 主动与 JobTracker 通信,接收作业,并负责直接执行 每一个任务,为了减少网络带宽 TaskTracker 最好运行在 HDFS 的DataNode 上。
		

6. MapReduce Installation

1) mapred-site.xml:

<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>

2) yarn-site.xml:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
</property>
<property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>mr_shsxt</value>
</property>
<property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>node03</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>node04</value>
</property>
<property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>node01:2181,node02:2181,node03:2181</value>
</property>
		

7. WordCount Project

'''
Input data:
	The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
	The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
	The project includes these modules:
	Hadoop Common: The common utilities that support the other Hadoop modules.
	Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
	Hadoop YARN: A framework for job scheduling and cluster resource management.

# Count the number of occurrences of each word
from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # mrjob feeds the mapper one line at a time; the input key is unused
        for word in line.strip().split():
            yield word, 1

    def reducer(self, word, occurrences):
        # occurrences is a generator of the 1s emitted for this word
        yield word, sum(occurrences)

if __name__ == '__main__':
    # Run locally with: python word_count.py input.txt
    WordCount.run()

'''
		

8. QQ Friend Recommendation System

'''
Input data (the first column is a user, the remaining columns are that user's direct friends):
	tom	cat	hadoop	hello
	hello	mr	tom	world	hive
	cat	tom	hive
	hive	cat	hadoop	world	hello	mr
	mr	hive	hello
	hadoop	tom	hive	world
	world	hadoop	hive	hello

# Recommend friends-of-friends
from mrjob.job import MRJob

def fof(friend01, friend02):
    # Canonicalize the pair so (a, b) and (b, a) produce the same key
    if friend01 > friend02:
        return friend02 + ":" + friend01
    return friend01 + ":" + friend02


class Friend_Friend(MRJob):
    def mapper(self, _, line):
        names = line.strip().split('\t')
        for i in range(1, len(names)):
            # 0 marks a direct friendship between the user and names[i]
            yield fof(names[0], names[i]), 0

            for j in range(i + 1, len(names)):
                # 1 marks a potential friendship: names[i] and names[j]
                # share names[0] as a common friend
                yield fof(names[i], names[j]), 1

    def reducer(self, key, values):
        sums = 0
        flag = 0

        for hot in values:
            if hot == 0:
                flag = 1  # the pair is already directly friends
            sums += hot

        if flag == 0:
            # Not yet friends: emit the pair with its common-friend count
            yield key, sums

if __name__ == '__main__':
    Friend_Friend.run()

'''
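The job hinges on two tricks: fof() canonicalizes the pair order, and the 0/1 markers let the reducer filter out pairs that are already friends. Both can be checked outside Hadoop with a few assertions (same fof function as above; recommend() is a standalone restatement of the reducer's filter):

```python
def fof(friend01, friend02):
    # Same canonicalization as in the job above
    if friend01 > friend02:
        return friend02 + ":" + friend01
    return friend01 + ":" + friend02

# (a, b) and (b, a) map to the same key, so all markers for a given
# pair meet in a single reducer call
assert fof("tom", "cat") == fof("cat", "tom") == "cat:tom"

def recommend(values):
    # A 0 among the values means the pair is already friends: filter it out.
    # Otherwise the sum of 1s is the number of common friends.
    return None if 0 in values else sum(values)

assert recommend([0, 1, 1]) is None   # direct friends: no recommendation
assert recommend([1, 1, 1]) == 3      # three common friends
```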
		

9. Weather Statistics

Find the two hottest days of each month.
'''
Input data (date#time#temperature; note that one record is missing its temperature):
	1949-10-01#14:21:02#34c
	1949-10-01#19:21:02
	1949-10-01#19:21:02#39c
	1949-10-02#14:01:02#36c
	1950-01-01#11:21:02#32c
	1950-10-01#12:21:02#37c
	1951-12-01#12:21:02#23c
	1950-10-02#12:21:02#41c
	1950-10-03#12:21:02#27c
	1951-07-01#12:21:02#45c
	1951-07-01#12:21:02#46c
	1951-07-01#12:21:02#45c
	1951-07-01#12:21:02#46c
	1951-07-03#12:21:03#47c

# Stage 1: normalize records to "date-time-temperature" keys; identical
# records collapse because they share the same reducer key.
from mrjob.job import MRJob

class FinalTQ(MRJob):
    def mapper(self, _, value):
        fields = value.split('#')
        if len(fields) < 3:
            return  # skip records with a missing temperature
        yield fields[0] + '-' + fields[1] + '-' + fields[2], 0

    def reducer(self, key, values):
        with open('new_file.txt', 'a') as f:
            f.write(key + '\n')

if __name__ == '__main__':
    FinalTQ.run()


# Stage 2: re-key the records so string sorting puts the hottest first.
# mrjob sorts keys as strings, so we store (100 - temperature): a higher
# temperature gives a smaller number and therefore sorts earlier.
from mrjob.job import MRJob
import re

class FinalTQ2(MRJob):
    def mapper(self, _, value):
        fields = value.strip().split('-')
        year, month = fields[0], fields[1]
        wd = int(re.sub(r'\D', '', fields[4]))  # drop the trailing 'c'
        yield year + '-' + month + ' ' + str(100 - wd), str(100 - wd)

    def reducer(self, key, values):
        with open('new_file2.txt', 'a') as f:
            for i in values:
                # restore the real temperature before writing it out
                f.write(key + ' ' + str(100 - int(i)) + '\n')

if __name__ == '__main__':
    FinalTQ2.run()


# Stage 3: within each month the lines now arrive hottest-first;
# keep the first two records per month.
from mrjob.job import MRJob

class Final(MRJob):
    def mapper(self, _, value):
        fields = value.split(' ')
        yield fields[0], fields[2]  # key: year-month, value: temperature

    def reducer(self, key, values):
        count = 0
        for wd in values:
            count += 1
            if count < 3:
                print(key, wd)

if __name__ == '__main__':
    Final.run()
'''
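The reason stage 2 stores 100 - temperature rather than the temperature itself is that keys are compared as strings; subtracting from 100 makes hotter records sort first. A quick check of the trick (assuming, as in the data, that temperatures stay below 100°C):

```python
temps = [34, 39, 36, 47]

# On 100 - t, the hottest day gets the smallest key,
# so a plain ascending string sort yields hottest-first order
keys = sorted(str(100 - t) for t in temps)
hottest_first = [100 - int(k) for k in keys]
print(hottest_first)  # [47, 39, 36, 34]
```

Note the trick relies on all transformed keys having the same number of digits; with temperatures from 0 to 99, the values 100 - t span 1 to 100, so a zero-padded format would be more robust.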
		

10. Collaborative Filtering Algorithm

from mrjob.job import MRJob
'''
https://www.cnblogs.com/one--way/p/5648165.html
https://blog.csdn.net/xiaokang123456kao/article/details/74735992
'''

class UserGoodsScore(MRJob):
    """Collect each user's purchased goods and scores."""
    def mapper(self, _, line):
        # Parse the line: user, goods, score
        user, goods, score = line.split(',')
        # Output string: goods:score
        output = '{goods}:{score}'.format(goods=goods,
                                          score=score)
        yield user, output

    def reducer(self, key, values):
        yield key, ','.join(values)

def main():
    UserGoodsScore.run()

if __name__ == '__main__':
    main()








from mrjob.job import MRJob


class GoodsBoughtCountMatrix(MRJob):
    """Build the co-occurrence matrix of goods purchase counts."""
    def mapper(self, _, line):
        # Parse the line
        tokens = line.split('"')
        user = tokens[1]  # the user, e.g. user1
        score_matrix = tokens[3]  # the score list, e.g. 101:5.0,102:3.0,103:2.5
        # Split into goods:score items
        goods_score_list = score_matrix.split(',')
        # Keep only the goods IDs
        goods_list = [goods_score.split(':')[0] for goods_score in goods_score_list]

        # Emit every ordered pair of goods bought by this user
        for goods1 in goods_list:
            for goods2 in goods_list:
                matrix_item = '{goods1}:{goods2}'.format(goods1=goods1,
                                                         goods2=goods2)
                yield matrix_item, 1

    def reducer(self, key, values):
        yield key, sum(values)


def main():
    GoodsBoughtCountMatrix.run()


if __name__ == '__main__':
    main()
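Stripped of the mrjob plumbing, the co-occurrence matrix above is just a pair count over each user's basket of goods. A small in-memory check of that logic (toy baskets, not the job's real input format):

```python
from collections import Counter

# Each user's list of purchased goods (toy data)
baskets = [["101", "102", "103"], ["101", "103"]]

cooccur = Counter()
for goods_list in baskets:
    for g1 in goods_list:
        for g2 in goods_list:
            cooccur[f"{g1}:{g2}"] += 1   # ordered pairs, diagonal included

print(cooccur["101:103"])  # 2: both users bought 101 and 103 together
print(cooccur["101:101"])  # 2: the diagonal counts how often 101 was bought
```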






from mrjob.job import MRJob


class GoodsUserScoreMatrix(MRJob):
    """Goods-to-user score matrix.
    Run with:
    python3 03goodsuserscorematrix.py -r local data/input/01_user_goods_score.data >data/output/03goodsuserscorematrix
    """

    def mapper(self, _, line):
        # Parse the line: user, goods, score
        user, goods, score = line.split(',')
        # Output string: user:score
        output = '{user}:{score}'.format(user=user, score=score)
        yield goods, output

    def reducer(self, key, values):
        for value in values:
            yield key, value


def main():
    GoodsUserScoreMatrix.run()


if __name__ == '__main__':
    main()











from mrjob.job import MRJob

import os


class Matrix(MRJob):
    """Multiply the co-occurrence matrix (A) by the goods-user score
    matrix (B); each input element is distributed to every output cell
    it contributes to, tagged 'a' or 'b' to mark which matrix it came from."""
    goods_ids = [101, 102, 103, 104, 105, 106, 107]  # goods IDs of matrix A
    user_ids = ['user1', 'user2', 'user3', 'user4', 'user5']  # user IDs of matrix B

    def mapper(self, _, line):
        file_name = os.environ['mapreduce_map_input_file']

        if file_name.endswith('02goodsmatrix'):  # co-occurrence matrix; the suffix must match the input path
            item_i, item_j, item_value = self._get_matrix_item_1(line)

            for user_id in self.user_ids:
                # A[i][j] contributes to every output cell (user_id, i)
                key = '{user_id},{item_i}'.format(item_i=item_i,
                                                  user_id=user_id)
                value = ['a', item_j, item_value]
                yield key, value

        elif file_name.endswith('03goodsuserscorematrix'):  # goods-user score matrix; the suffix must match the input path
            item_i, item_j, item_value = self._get_matrix_item_2(line)

            for goods_id in self.goods_ids:
                # B[i][j] contributes to every output cell (j, goods_id)
                key = '{item_j},{goods_id}'.format(goods_id=goods_id,
                                                   item_j=item_j)
                value = ['b', item_i, item_value]
                yield key, value


    def reducer(self, key, values):
        goods_score = {}
        user_score = {}
        for flag, goods_id, value in values:
            if flag == 'a':
                goods_score[goods_id] = value
            elif flag == 'b':
                user_score[goods_id] = value
        # Sum the products over the goods_ids present in both matrices
        total = 0
        for goods_id in set(goods_score) & set(user_score):
            total += float(goods_score[goods_id]) * float(user_score[goods_id])

        yield key, total

    def _get_matrix_item_1(self, line):
        """Parse a line of 02goodsmatrix to get one element of the
           goods co-occurrence matrix.
           Args:
               line: "101:102"       3
           Return:
               item_i: 101
               item_j: 102
               value: 3
        """
        items = line.split('"')
        item_i, item_j = items[1].split(':')
        value = items[2].strip()
        return item_i, item_j, value

    def _get_matrix_item_2(self, line):
        """Parse a line of 03goodsuserscorematrix to get one element of
           the goods-user score matrix (bought goods only).
           Args:
               line: "107"   "user3:5.0"
           Return:
               item_i: 107
               item_j: user3
               value: 5.0
        """
        items = line.split('"')
        item_i = items[1]
        item_j, value = items[3].split(':')
        return item_i, item_j, value


def main():
    Matrix.run()


if __name__ == '__main__':
    main()
