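Hadoop Streaming is a utility that ships with Hadoop. It lets users create and run a special kind of map/reduce job.
In these jobs, ordinary executables or scripts act as the mapper and the reducer. For example, we can write two Python scripts, mapper.py and reducer.py, and submit them like this: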
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper mapper.py \
-reducer reducer.py
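The command above reflects how Streaming works: the mapper and the reducer are both executables that read their input from standard input, line by line, and write their results to standard output.
The Streaming utility creates a Map/Reduce job, submits it to the appropriate cluster, and monitors the job until it finishes.
So for a concrete task, the real question is how to write the Python scripts. I take two points from the description above. First, the code must read from standard input and write to standard output. Second, because the scripts are shipped to the cluster and run there, some Python libraries may not be available on the worker nodes, so keep an eye on dependencies.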
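To make the standard input/output contract concrete, here is a minimal sketch of a streaming-style mapper. The function name run_mapper and the in-memory demo input are assumptions made for illustration; they are not part of Hadoop or of the scripts in this article.

```python
import io
import sys


def run_mapper(lines, out):
    """Write one tab-separated key/value pair per whitespace-separated token."""
    for line in lines:
        for token in line.split():
            out.write("%s\t1\n" % token)


# In a real streaming script you would call run_mapper(sys.stdin, sys.stdout).
# Here an in-memory stream stands in for stdin so the sketch runs anywhere.
demo_out = io.StringIO()
run_mapper(io.StringIO("a b\nb c\n"), demo_out)
sys.stdout.write(demo_out.getvalue())
```

Keeping the logic in a function that takes the input and output streams as arguments makes the script easy to test without a cluster, and it uses only the standard library.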
Enough preamble. The four MapReduce tasks below make this concrete.
A B C D E F
B A H C D E I
C B E G A J
D A B E
E H A B C D G
F A J G
G C E F I
H B J E
I G B
J H C F
B,C A
D,A E
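Write the mapper:
import sys

# Read each input line: the first token is a person, the rest are that person's friends
for line in sys.stdin:
    line = line.strip()
    # Split the line into the person and the friend list
    friends_l = line.split(" ")
    # Emit one key-value pair per friend, in the form friend<TAB>person
    for i in range(len(friends_l) - 1):
        results = [friends_l[i + 1], friends_l[0]]
        print("\t".join(results))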
Write the reducer:
import sys
import itertools

# Initialize state
current_friend = None
common_friend_l = []

# Read the sorted mapper output line by line
for line in sys.stdin:
    line = line.strip()
    common_friend, friend = line.split("\t")
    # First iteration: start tracking the first key
    if not current_friend:
        current_friend = common_friend
    # Still on the same common friend: collect this person
    if current_friend == common_friend:
        # append() keeps multi-character names intact; += would split the string into characters
        common_friend_l.append(friend)
    # The key changed, so all data for the previous common friend has been read
    else:
        # Deduplicate
        common_friend_l = list(set(common_friend_l))
        # Generate every pair
        for j in itertools.combinations(common_friend_l, 2):
            # Output in the form "A,B C", where C is the common friend of A and B
            print(",".join(j) + " " + current_friend)
        # Reset state for the new key
        current_friend = common_friend
        common_friend_l = [friend]

# Flush the last group
common_friend_l = list(set(common_friend_l))
for j in itertools.combinations(common_friend_l, 2):
    print(",".join(j) + " " + current_friend)
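Everything that runs on the cluster can be simulated locally with a pipe (type is the Windows command; use cat on Linux or macOS):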
type data.txt |python mapper.py |sort |python reducer.py
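The same map, sort, reduce pipeline can also be mimicked inside a single Python process, which is handy for quick checks. The sketch below is an illustration under assumptions: the function names map_phase and reduce_phase and the three-line sample are invented here, and sorted() merely stands in for Hadoop's shuffle/sort step. Pairs are sorted so the output order is reproducible.

```python
import itertools


def map_phase(lines):
    """Yield (friend, person) for every friend listed on a line."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        person, friends = fields[0], fields[1:]
        for friend in friends:
            yield friend, person


def reduce_phase(pairs):
    """Group by friend and yield 'A,B friend' for each pair sharing that friend."""
    # sorted() plays the role of the shuffle/sort between map and reduce.
    for friend, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        people = sorted({person for _, person in group})
        for a, b in itertools.combinations(people, 2):
            yield "%s,%s %s" % (a, b, friend)


# Hypothetical sample: each line is a person followed by that person's friends.
sample = ["A B C", "B A C", "C A B"]
result = list(reduce_phase(map_phase(sample)))
for row in result:
    print(row)
```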
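Running on the cluster only takes a handful of Hadoop commands, the most important being the hadoop jar ... command shown at the beginning. The points to watch out for are collected at the end of the article.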
Columns: user phone number, location, time of appearance, length of stay
111111111 2 2014-02-18 19:03:56.123445 133
222222222 1 2013-03-14 03:18:45.263536 241
333333333 3 2014-10-23 17:14:23.176345 68
222222222 1 2013-03-14 03:20:47.123445 145
333333333 3 2014-09-15 15:24:56.222222 345
222222222 2 2011-08-30 18:13:58.111111 145
222222222 2 2011-08-30 18:18:24.222222 130
222222222 2 2011-08-30 18:13:58.111111 145
222222222 2 2011-08-30 18:18:24.222222 130
222222222 1 2013-03-14 03:18:45.263536 24
111111111~~~~
333333333~~~
Write the mapper:
import sys

# Pass each line through unchanged
for line in sys.stdin:
    line = line.strip()
    print(line)
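Result of the local pipe simulation:
It is rather short, but I could not think of another approach. When the mapper alone can finish the task, a reducer is not required.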
Given logs in the following format:
id_a, id_b, id_c, id_d
id_a, id_a, id_f
id_b, id_b, id_d, id_f, id_a
id_m, id_n
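Write a map-reduce job that counts how often each letter appears as the last character of a line.
Write the mapper:
import sys

for line in sys.stdin:
    line = line.strip()
    # Skip blank lines, which have no last character
    if not line:
        continue
    letters = line.split(',')
    # Emit the final character of the last field with a count of 1
    print('%s\t1' % letters[-1][-1])
Write the reducer:
import sys

cur_letter = None
cur_count = 0
for line in sys.stdin:
    line = line.strip()
    letter, count = line.split()
    if letter == cur_letter:
        cur_count += int(count)
    else:
        # The key changed: emit the finished group before starting a new one
        if cur_letter is not None:
            print("%s\t%d" % (cur_letter, cur_count))
        cur_letter = letter
        cur_count = int(count)
# Emit the last group, unless the input was empty
if cur_letter is not None:
    print("%s\t%d" % (cur_letter, cur_count))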
Result of the local pipe simulation:
This is much like a word count, so no further explanation is needed.
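Because the reducer always sees its input sorted by key, the "track the current key" bookkeeping can also be expressed with itertools.groupby from the standard library. This is an alternative sketch rather than the approach used in this article; the function name count_by_key and the sample mapper output are assumptions for illustration.

```python
import itertools


def count_by_key(lines):
    """Sum the counts of consecutive lines that share the same key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for key, group in itertools.groupby(parsed, key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


# Hypothetical mapper output, already sorted by key as the reducer would receive it.
sorted_input = ["a\t1\n", "d\t1\n", "d\t1\n", "f\t1\n"]
counts = list(count_by_key(sorted_input))
for key, total in counts:
    print("%s\t%d" % (key, total))
```

In a real reducer script the call would be count_by_key(sys.stdin). Note that groupby only merges adjacent equal keys, so it relies on the sorted input just as the hand-written loop does.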
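Write a map-reduce job to analyze e-commerce data (user_session.data).
The file format is:
purchased (yes/no) data field (e.g. DAYOFWEEK):feature id (e.g. DAYOFWEEK4):feature value (e.g. 1)
Compute:
1) how many times each data field, and each feature id within a data field, appears in the data
2) for purchase and non-purchase behavior separately, the count and percentage of each feature within a data field
Here is a portion of the data in user_session.data: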
0 WEEK_COUPON:4.0:1 GEO5:GEO5whz2b:1 CATEGORYID:1:1 DEALLISTINFO:HAS_PROMOTION_TYPE:1 DISTANCE:7.0:1 COMMENTS:0.0:1 USER_ORDER_CNT_MONTH:0.0:1 THREESCORE_COUNT:0.0:1 GCOMMENT_RATIO:0.0:1 OPEN_ALL_DAY:1.0:1 NUM_CATEGORY:4.0:1 SLOTID:50011:1 SHOW_TYPE:SHOW_TYPE_FOOD:1 HISTORY_COUPON:8.0:1 AVG_SCORE:5.0:1 DEALLISTINFO:HAS_GROUP:1 UA_WEEK_CLICK:0.0:1 DEALLISTINFO:HAS_GROUP_TYPE:1 DAYOFWEEK:DAYOFWEEK4:1 POS_RANK:1.0:1 BUYERCOUNT:0.0:1 POICATEGORYID:201:1 WIFI:1.0:1 POIDEALLIST_CPR:5.0:1 NUM_DEAL:3.0:1 PICCLASS8ID:7:1 CLICKNET:1.0:1 LAUNCHID:20266411:1 CHANNEL:SIEVELIST:1 VIEW_ORDER_RATE:16.0:1 PICCLASS64ID:24:1 BIAS:1.0:1 PICSCORE:12.0:1 AVG_PRICE2:4.0:1 TWOSCORE_COUNT:0.0:1 PICMEMSCORE:15.0:1 POIDEALLIST_SHOWCOUNT:10.0:1 PICTURE_RATIO:29.0:1 PICQSCORE:11:1 POIBAREAID:28462:1 POICITYID:334:1 USER_CLICK_CNT_TEN_MIN:0.0:1 DEALLISTINFO:HAS_DEAL:1 FIVESCORE_COUNT:0.0:1 DETAILINFO:HAS_LANDMARK:1 PICSIZE:13.0:1 POIDEALLIST_CTR:10.0:1 POIID:95155846:1 POSITION:1500111:1 ORDERCOUNT:0.0:1 CATE_ID3:20557.0:1 CATE_ID2:20632.0:1 CATE_ID1:1.0:1 USER_CLICK_CNT_THREE_MIN:0.0:1 PICCLASS256ID:11:1 DINNER:DINNER_2:1 DEALLISTINFO:HAS_MENU:1 UA_WEEK_SHOW:0.0:1 POIAREAID:21114:1 ALGORITHM:SelectBaselineVertical:1 TARGETID:1293560:1 PHONE:PHONE_FIXED:1 REQCATEGORYID:1:1 HOUROFDAY:HOUROFDAY0:1 ONESCORE_COUNT:0.0:1 PICID:i5543769329:1 RELAY_RATIO:0.0:1 PICCLASS16ID:7:1 FOURSCORE_COUNT:0.0:1 UA_MONTH_SHOW:0.0:1 USER_CLICK_CNT_MONTH:0.0:1 USER_CLICK_CNT_WEEK:0.0:1 POICLASSID:226:1 NDVICE:NDVICEANDROID:1 PICCLASSID:11:1 UUID_ORDER_COUNT:3.0:1 PICRATIO:7.0:1 UUID_VIEW_COUNT:7.0:1 PICCLASS128ID:24:1 DISCOUNT:0.0:1 POIBRANDID:-95155846:1 POITYPEID:1930:1 PICISDEFAULT:2:1 DETAILINFO:HAS_TAG:1 POIDEALLIST_CVR:5.0:1 NUM_TAG:2.0:1 SLOTCATEID:500111:1 UA_MONTH_CLICK:0.0:1 LOWEST_PRICE2:0.0:1 USERUUID:F22C49C6CBF145D817A66820C69EA0000D05A9B6D17F83F39AE777F334AA0420:1 PICCLASS32ID:24:1 MARK_NUMBER:7.0:1 https://p1.meituan.net/shaitu/37007020d615ddcbcf6bf256c26a962097773.jpg
0 WEEK_COUPON:6.0:1 CATEGORYID:112:1 DETAILINFO:HAS_PARK:1 CHANNEL:HYBRID_SIEVELIST:1 DISTANCE:9.0:1 SLOTCATEID:50020112:1 COMMENTS:0.0:1 USER_ORDER_CNT_MONTH:0.0:1 THREESCORE_COUNT:0.0:1 USER_FIVEGEO_SCORE:1:1 GCOMMENT_RATIO:6.0:1 NUM_CATEGORY:3.0:1 SLOTID:50020:1 USERUUID:ADD6B6C42A30DFF8D0996B1AAD71949812EB89D192024797B9FEF2F9B6465AA0:1 HISTORY_COUPON:8.0:1 AVG_SCORE:5.0:1 UA_WEEK_CLICK:0.0:1 DEALLISTINFO:HAS_GROUP:1 SHOW_TYPE:SHOW_TYPE_NORMAL:1 DEALLISTINFO:HAS_GROUP_TYPE:1 DAYOFWEEK:DAYOFWEEK4:1 PICCLASS64ID:44:1 POS_RANK:1.0:1 BUYERCOUNT:0.0:1 PICCLASS256ID:88:1 POICATEGORYID:275:1 PICCLASS8ID:3:1 WIFI:1.0:1 POIDEALLIST_CPR:2.0:1 NUM_DEAL:4.0:1 PICID:267073589:1 LAUNCHID:20621120:1 VIEW_ORDER_RATE:3.0:1 USER_SIXGEO_SCORE:1:1 BIAS:1.0:1 AVG_PRICE2:4.0:1 TWOSCORE_COUNT:0.0:1 PICCLASSID:88:1 POIDEALLIST_SHOWCOUNT:10.0:1 PICTURE_RATIO:12.0:1 ALGORITHM:FFMSelectHybridUpdate:1 PICQSCORE:11:1 REQCATEGORYID:112:1 POIBAREAID:5934:1 POICITYID:151:1 USER_CLICK_CNT_TEN_MIN:0.0:1 DEALLISTINFO:HAS_DEAL:1 PICRATIO:15.0:1 FIVESCORE_COUNT:0.0:1 DETAILINFO:HAS_LANDMARK:1 PICCLASS128ID:44:1 POIDEALLIST_CTR:7.0:1 POIID:1188525:1 PICSCORE:65.0:1 CATE_ID3:20611.0:1 ORDERCOUNT:0.0:1 CATE_ID2:20426.0:1 CATE_ID1:2.0:1 USERID:79341734:1 USER_CLICK_CNT_THREE_MIN:2.0:1 UA_WEEK_SHOW:1.0:1 POIAREAID:0:1 TARGETID:2311420:1 HOUROFDAY:HOUROFDAY0:1 ONESCORE_COUNT:0.0:1 PHONE:PHONE_MOBILE:1 POSITION:150020112:1 RELAY_RATIO:4.0:1 PICCLASS16ID:3:1 FOURSCORE_COUNT:0.0:1 POI_VIEWED:1.0:1 UA_MONTH_SHOW:1.0:1 USER_CLICK_CNT_MONTH:0.0:1 USER_CLICK_CNT_WEEK:0.0:1 PICSIZE:15.0:1 POICLASSID:3:1 NDVICE:NDVICEANDROID:1 GEO5:GEO5wxrd8:1 UUID_ORDER_COUNT:2.0:1 UUID_VIEW_COUNT:8.0:1 POIBRANDID:-1188525:1 DISCOUNT:8.0:1 POITYPEID:277:1 PICISDEFAULT:1:1 DETAILINFO:HAS_TAG:1 POIDEALLIST_CVR:2.0:1 NUM_TAG:1.0:1 UA_MONTH_CLICK:0.0:1 LOWEST_PRICE2:3.0:1 PICCLASS32ID:23:1 PICMEMSCORE:14.0:1 MARK_NUMBER:6.0:1 http://p0.meituan.net/joymerchant/7823629573661548001.jpg
1 WEEK_COUPON:3.0:1 CATEGORYID:52:1 POSITION:35002052,112:1 DETAILINFO:HAS_PARK:1 CHANNEL:HYBRID_SIEVELIST:1 REQCATEGORYID:52,112:1 DISTANCE:13.0:1 COMMENTS:0.0:1 USER_ORDER_CNT_MONTH:0.0:1 THREESCORE_COUNT:0.0:1 GCOMMENT_RATIO:0.0:1 NUM_CATEGORY:3.0:1 SLOTID:50020:1 ALGORITHM:SelectHybridPicSelect:1 PICID:9625230:1 HISTORY_COUPON:7.0:1 AVG_SCORE:3.4:1 UA_WEEK_CLICK:0.0:1 DEALLISTINFO:HAS_GROUP:1 SHOW_TYPE:SHOW_TYPE_NORMAL:1 DEALLISTINFO:HAS_GROUP_TYPE:1 DAYOFWEEK:DAYOFWEEK4:1 DEALLISTINFO:HAS_INTRO:1 POS_RANK:3.0:1 BUYERCOUNT:0.0:1 POICATEGORYID:275:1 PICCLASS8ID:3:1 WIFI:1.0:1 POIDEALLIST_CPR:0.0:1 NUM_DEAL:1.0:1 DETAILINFO:FREE_PARK:1 LAUNCHID:22021805:1 PICRATIO:14.0:1 VIEW_ORDER_RATE:0.0:1 PICCLASS256ID:142:1 BIAS:1.0:1 AVG_PRICE2:5.0:1 TWOSCORE_COUNT:0.0:1 POIDEALLIST_SHOWCOUNT:10.0:1 PICTURE_RATIO:17.0:1 PICCLASS128ID:111:1 NDVICE:NDVICEIPHONE:1 PICQSCORE:11:1 POIBAREAID:28217:1 POICITYID:120:1 USER_CLICK_CNT_TEN_MIN:0.0:1 DEALLISTINFO:HAS_DEAL:1 FIVESCORE_COUNT:0.0:1 DETAILINFO:HAS_LANDMARK:1 POIDEALLIST_CTR:13.0:1 POIID:4869763:1 CATE_ID3:20611.0:1 ORDERCOUNT:0.0:1 CATE_ID2:20426.0:1 CATE_ID1:2.0:1 USERID:116913365:1 USER_CLICK_CNT_THREE_MIN:0.0:1 SLOTCATEID:5002052,112:1 UA_WEEK_SHOW:0.0:1 POIAREAID:0:1 TARGETID:4059172:1 PHONE:PHONE_FIXED:1 GEO5:GEO5wx4f9:1 HOUROFDAY:HOUROFDAY0:1 ONESCORE_COUNT:0.0:1 USERUUID:A88AC9FC0057D2B7B73DD59CD92B689B60C846AE764D580B6F7870369CF31024:1 PHONE:PHONE_MOBILE:1 RELAY_RATIO:45.0:1 PICCLASS16ID:3:1 FOURSCORE_COUNT:0.0:1 UA_MONTH_SHOW:0.0:1 USER_CLICK_CNT_MONTH:0.0:1 USER_CLICK_CNT_WEEK:0.0:1 POICLASSID:3:1 PICSCORE:69.0:1 PICCLASS64ID:9:1 PICCLASS32ID:9:1 PICCLASSID:142:1 UUID_ORDER_COUNT:0.0:1 PICMEMSCORE:15.0:1 UUID_VIEW_COUNT:7.0:1 DISCOUNT:7.0:1 POIBRANDID:171761:1 POITYPEID:48:1 PICISDEFAULT:1:1 PHONE:PHONE_FIXED_MOBILE:1 DETAILINFO:HAS_TAG:1 POIDEALLIST_CVR:0.0:1 NUM_TAG:2.0:1 UA_MONTH_CLICK:0.0:1 LOWEST_PRICE2:4.0:1 PICSIZE:12.0:1 MARK_NUMBER:5.0:1 http://p0.meituan.net/deal/__26385508__9466023.jpg
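There are two sub-questions here, solved separately.
First sub-question. Write mapper1.py:
import sys

for line in sys.stdin:
    # strip() returns a new string, so the result must be assigned
    line = line.strip()
    features = line.split()
    for feature in features:
        feature_list = feature.split(":")
        # Only field:feature_id:value tokens have three parts; the label and the trailing URL do not
        if len(feature_list) == 3:
            print("%s\t1" % feature_list[0])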
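First sub-question. Write reducer1.py:
import sys

cur_feature = None
cur_count = 0
for line in sys.stdin:
    line = line.strip()
    feature, count = line.split()
    if cur_feature == feature:
        cur_count += int(count)
    else:
        # The key changed: emit the finished group before starting a new one
        if cur_feature is not None:
            print("%s\t%d" % (cur_feature, cur_count))
        cur_feature = feature
        cur_count = int(count)
# Emit the last group, unless the input was empty
if cur_feature is not None:
    print("%s\t%d" % (cur_feature, cur_count))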
Result of the local pipe simulation:
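mapper1.py emits only the data field as the key. To also count each feature id within a data field, as the first sub-question asks, one possible variation is to emit a second key of the form field:feature_id for every token; the same reducer can then count both kinds of key. The helper name emit_keys is an assumption for illustration, and the sample line reuses tokens from the data excerpt above.

```python
def emit_keys(line):
    """Return the keys to emit for one line: the field, then field:feature_id."""
    keys = []
    for token in line.split():
        parts = token.split(":")
        # Only field:feature_id:value tokens have exactly three parts;
        # the leading label and the trailing image URL are skipped.
        if len(parts) == 3:
            field, feature_id, _value = parts
            keys.append(field)
            keys.append(field + ":" + feature_id)
    return keys


sample = ("0 WEEK_COUPON:4.0:1 DAYOFWEEK:DAYOFWEEK4:1 "
          "https://p1.meituan.net/shaitu/37007020d615ddcbcf6bf256c26a962097773.jpg")
keys = emit_keys(sample)
for key in keys:
    print("%s\t1" % key)
```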
Second sub-question. Write mapper2.py:
import sys

for line in sys.stdin:
    line = line.strip()
    # The first character is the purchase label (0 or 1); skip lines without one
    if not line or line[0] not in ('0', '1'):
        continue
    label = line[0]
    features = line.split()
    for feature in features:
        feature_list = feature.split(":")
        if len(feature_list) == 3:
            # Prefix the data field with the label so the two groups are counted separately
            print("%s:%s\t1" % (label, feature_list[0]))
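Second sub-question. Write reducer2.py:
import sys

cur_feature = None
cur_count = 0
for line in sys.stdin:
    line = line.strip()
    feature, count = line.split()
    if cur_feature == feature:
        cur_count += int(count)
    else:
        # The key changed: emit the finished group before starting a new one
        if cur_feature is not None:
            print("%s\t%d" % (cur_feature, cur_count))
        cur_feature = feature
        cur_count = int(count)
# Emit the last group, unless the input was empty
if cur_feature is not None:
    print("%s\t%d" % (cur_feature, cur_count))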
Result of the local pipe simulation:
Only the counts are computed here; the percentages are not implemented.
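One way the missing percentages could be added is a small post-processing step over the reducer2.py output. The sketch below assumes that "percentage" means a key's share of all counts carrying the same purchase label; the task statement does not pin this down, so treat that definition, the function name add_percentages, and the sample counts as assumptions.

```python
from collections import defaultdict


def add_percentages(lines):
    """Map each 'label:field' key to (count, percent of that label's total)."""
    counts = {}
    totals = defaultdict(int)
    for line in lines:
        key, count = line.split("\t")
        label = key.split(":", 1)[0]
        counts[key] = int(count)
        totals[label] += int(count)
    return {
        key: (n, 100.0 * n / totals[key.split(":", 1)[0]])
        for key, n in counts.items()
    }


# Hypothetical reducer2.py output lines in the form label:field<TAB>count.
sample = ["0:DAYOFWEEK\t3", "0:WIFI\t1", "1:DAYOFWEEK\t1", "1:WIFI\t1"]
report = add_percentages(sample)
for key in sorted(report):
    print("%s\t%d\t%.1f%%" % ((key,) + report[key]))
```

Because the totals are only known after all keys have been read, this step keeps every key in memory, which is acceptable when the number of distinct data fields is small.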
Try these examples yourself for practice.
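Commands used
# Remove existing directories (-rmr is deprecated in newer Hadoop releases in favor of -rm -r)
hadoop fs -rmr /sxydata/input/example_1
hadoop fs -rmr /sxydata/output/example_1
# Create the input directory
hadoop fs -mkdir /sxydata/input/example_1
# Upload the input files
hadoop fs -put text* /sxydata/input/example_1
# Check that the files are in place
hadoop fs -ls /sxydata/input/example_1
# Test the mapper and reducer locally
head -20 text1.txt | python count_mapper.py | sort | python count_reducer.py
# Run the job on the cluster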
hadoop jar /home/ds/hadoop/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file count_mapper.py -mapper count_mapper.py -file count_reducer.py -reducer count_reducer.py -input /sxydata/input/example_1 -output /sxydata/output/example_1
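The whole workflow is: create the input directory on HDFS (do not create the output directory yourself; just name it in the streaming command, because the job fails if it already exists), upload the local files, then run the MapReduce job through the Streaming utility. Since the scripts live on the local machine, the streaming command needs the -file option to ship them to the cluster.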