I recently stumbled upon an article I wrote back in 2016, in which I used the PageRank algorithm to solve a user classification problem on top of Hadoop and Spark. Time really does fly: 2016 was the year I first got into machine learning and big data, and in the years since I have kept learning and improving in this field. Although the technology keeps evolving, some classic algorithms can still play an important role in our work. So I am republishing that earlier article on this blog as a memento of that stage of my learning journey.
This article describes how to solve a user classification problem by applying the PageRank algorithm on a distributed Spark and Hadoop environment. The data source is the movie service log data provided on the Cloudera website.
The task is to decide, based on the movie access logs provided by Cloudera, whether a given user ID belongs to an adult or a kid. This is a classification problem to be solved with machine learning.
Cloudera provides about 200 MB of log data in JSON format (spread across 20 files, each corresponding to one day of logs); the data can be found at the link:
Taking one line from a file as an example, the structure of the data looks like this:
{
"created_at": "2013-05-08T08:00:00Z",
"payload": {
"item_id": "11086",
"marker": 3540
},
"session_id": "b549de69-a0dc-4b8a-8ee1-01f1a1f5a66e",
"type": "Play",
"user": 81729334,
"user_agent": "Mozilla/5.0 (iPad; CPU OS 5_0_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko) Mobile/9A405"
}
These logs record user activity on the movie service provided by the Cloudera website: created_at is the timestamp at which the log record was generated; payload is the detail of what the user accessed (item_id is the ID of a title and marker is the playback position); session_id is the user's session ID; type is the kind of user action (for example Play, Account, Login and so on); user is the user's ID; and user_agent describes the user's client device.
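As a quick illustration (a minimal sketch in the same Python 2 style as the rest of this post; the file name sample.json is only a placeholder for one of the log files), the snippet below parses the first record of a file and prints the fields described above:
#!/usr/bin/python
#coding=utf-8
import json

# Open a local log file (placeholder name) that contains one JSON record per line.
with open('sample.json') as f:
    for line in f:
        record = json.loads(line)
        print record['type']             # e.g. "Play"
        print record['user']             # the numeric user id
        print record.get('payload', {})  # e.g. {"item_id": "11086", "marker": 3540}
        print record['user_agent']       # the client device information
        break                            # only look at the first record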
1. Load the data into HDFS:
$ hadoop fs -copyFromLocal data
2. Inspect the log files. Each line of a file is one record; explore the record structure, that is, which fields and sub-fields each record contains, the type of each field's values, and statistics such as minimum, maximum, average and occurrence count, and summarize this information. This data exploration can be done with a map-reduce job:
Map task code:
#!/usr/bin/python
#coding=utf-8
import json
import sys

# Read every line from stdin
for line in sys.stdin:
    try:
        # Some lines contain malformed JSON (e.g. a doubled double quote) and need fixing
        data = json.loads(line.replace('""', '"'))
        for field in data.keys():
            if field == 'type':
                # Emit the value of the type field
                print "%s" % (data[field])
            else:
                # Normalize the other field names, since some fields have inconsistent spellings
                real = field
                if real == 'user_agent':
                    real = 'userAgent'
                elif real == 'session_id':
                    real = 'sessionID'
                elif real == 'created_at' or real == 'craetedAt':
                    real = 'createdAt'
                # If the field has sub-fields, normalize their names and emit their values
                if type(data[field]) is dict:
                    print "%s:%s" % (data['type'], real)
                    # Normalize the sub-field names
                    for subfield in data[field]:
                        subreal = subfield
                        if subreal == 'item_id':
                            subreal = 'itemId'
                        print "%s:%s:%s\t%s" % (data['type'], real, subreal, data[field][subfield])
                else:
                    # Emit the field's value
                    print "%s:%s\t%s" % (data['type'], real, data[field])
    except ValueError:
        # Log the offending line and abort
        sys.stderr.write("%s\n" % line)
        exit(1)
Reduce task code:
#!/usr/bin/python
#coding=utf-8
import dateutil.parser
import re
import sys

# print_summary determines the type of a field's values and prints its statistics: for numeric
# fields the minimum, maximum, average and count; for date fields the minimum, maximum and count.
def print_summary():
    if is_heading:
        print "%s - %d" % (last, count)
    elif is_date:
        print "%s - min: %s, max %s, count: %d" % (last, min, max, count)
    elif is_number:
        print "%s - min: %d, max %d, average: %.2f, count: %d" % (last, min, max, float(sum)/count, count)
    elif is_value:
        print "%s - %s, count: %d" % (last, list(values), count)
    else:
        print "%s - identifier, count: %d" % (last, count)

last = None
values = set()
is_date = True
is_number = False
is_value = False
is_heading = True
min = None
max = None
sum = 0
count = 0
# Regular expression used to recognize date values
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(Z|[+-]\d{2}:\d{2})')
# Read every line from stdin
for line in sys.stdin:
    # Split on tab: the first part is the key, the second part is the value
    parts = line.strip().split('\t')
    if parts[0] != last:
        # If the key differs from the previous one, print the previous key's statistics
        if last != None:
            print_summary()
        # Reset all counters and flags
        last = parts[0]
        values = set()
        is_date = True
        is_number = False
        is_value = False
        is_heading = True
        min = None
        max = None
        sum = 0
        count = 0
    # Count how many times this key appears
    count += 1
    # Only process the value if it is non-empty
    if len(parts) > 1 and len(parts[1]) > 0:
        is_heading = False
        # Check whether the value looks like a date
        if is_date:
            if date_pattern.match(parts[1]):
                try:
                    tstamp = dateutil.parser.parse(parts[1])
                    # If it parses, update the statistics
                    if min == None or tstamp < min:
                        min = tstamp
                    if max == None or tstamp > max:
                        max = tstamp
                except (TypeError, ValueError):
                    # If it cannot be parsed, assume a numeric type instead
                    is_date = False
                    is_number = True
                    min = None
                    max = None
            else:
                # If the regular expression does not match, assume a numeric type
                is_date = False
                is_number = True
                min = None
                max = None
        # If we think the value is numeric, try to parse it as a number
        if is_number:
            try:
                num = int(parts[1])
                sum += num
                # If it parses, update the statistics
                if min == None or num < min:
                    min = num
                if max == None or num > max:
                    max = num
            except ValueError:
                # If parsing fails, assume a categorical type
                is_number = False
                is_value = True
        # For categorical values, add the value to the values set
        if is_value:
            values.add(parts[1])
            # If the set grows beyond 10 distinct values, the field has too many categories
            # to treat as categorical, so treat it as an identifier instead
            if len(values) > 10:
                is_value = False
                values = None
# Print the statistics for the last key
print_summary()
Run the following command to execute the job (for convenience, a STREAMING variable is set to point to Hadoop's hadoop-streaming jar):
$ export STREAMING=/software/hadoop-2.7.1-src/hadoop-dist/target/hadoop-2.7.1/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar
$ hadoop jar $STREAMING -input data/heckle/ -input data/jeckle/ -output summary -mapper summary_map.py -file summary_map.py -reducer summary_reduce.py -file summary_reduce.py
The output is shown below. One important piece of information stands out: under Account:payload there are 177 records containing a subAction field, subAction has three possible values, and one of them is parentalControls, which can help us identify which users are kids.
Account - 177
Account:auth - identifier, count: 99
Account:createdAt - min: 2013-05-06 08:02:56+00:00, max 2013-05-12 23:06:11-08:00, count: 177
Account:payload - 177
Account:payload:new - ['kid'], count: 134
Account:payload:old - ['adult'], count: 134
Account:payload:subAction - ['updatePassword', 'parentalControls', 'updatePaymentInfo'], count: 177
Account:refId - identifier, count: 99
Account:sessionID - identifier, count: 177
Account:user - min: 1634306, max 99314647, average: 47623573.86, count: 177
Account:userAgent - identifier, count: 177
AddToQueue - 5091
AddToQueue:auth - identifier, count: 2919
AddToQueue:createdAt - min: 2013-05-06 08:00:32+00:00, max 2013-05-12 23:57:46-08:00, count: 5091
AddToQueue:payload - 5091
AddToQueue:payload:itemId - identifier, count: 5091
AddToQueue:refId - identifier, count: 2919
AddToQueue:sessionID - identifier, count: 5091
AddToQueue:user - min: 1169676, max 99985450, average: 50609784.80, count: 5091
AddToQueue:userAgent - identifier, count: 5091
Advance - 3062
Advance:createdAt - min: 2013-05-06 08:02:10+00:00, max 2013-05-09 07:27:37+00:00, count: 3062
Advance:payload - 3062
Advance:payload:itemId - identifier, count: 3062
Advance:payload:marker - min: 0, max 8491, average: 2085.52, count: 3062
Advance:sessionID - identifier, count: 3062
Advance:user - min: 1091145, max 99856025, average: 51177856.67, count: 3062
Advance:userAgent - identifier, count: 3062
Home - 5425
Home:auth - identifier, count: 3109
Home:createdAt - min: 2013-05-06 08:00:08+00:00, max 2013-05-12 23:57:18-08:00, count: 5425
Home:payload - 5425
Home:payload:popular - min: 1094, max 39475, average: 21398.60, count: 27125
Home:payload:recent - identifier, count: 5222
Home:payload:recommended - identifier, count: 27125
Home:refId - identifier, count: 3109
Home:sessionID - identifier, count: 5425
Home:user - min: 1091145, max 99985450, average: 49984786.82, count: 5425
Home:userAgent - identifier, count: 5425
Hover - 19617
Hover:auth - identifier, count: 11376
Hover:createdAt - min: 2013-05-06 08:00:18+00:00, max 2013-05-12 23:58:08-08:00, count: 19617
Hover:payload - 19617
Hover:payload:itemId - identifier, count: 19617
Hover:refId - identifier, count: 11376
Hover:sessionID - identifier, count: 19617
Hover:user - min: 1091145, max 99985450, average: 50158191.38, count: 19617
Hover:userAgent - identifier, count: 19617
ItemPage - 274
ItemPage:auth - identifier, count: 154
ItemPage:createdAt - min: 2013-05-06 08:02:15+00:00, max 2013-05-12 23:34:12-08:00, count: 274
ItemPage:payload - 274
ItemPage:payload:itemId - identifier, count: 274
ItemPage:refId - identifier, count: 154
ItemPage:sessionID - identifier, count: 274
ItemPage:user - min: 1263067, max 99270605, average: 49770339.91, count: 274
ItemPage:userAgent - identifier, count: 274
Login - 1057
Login:auth - identifier, count: 603
Login:createdAt - min: 2013-05-06 08:01:42+00:00, max 2013-05-12 23:36:07-08:00, count: 1057
Login:refId - identifier, count: 603
Login:sessionID - identifier, count: 1057
Login:user - min: 1091145, max 99856025, average: 49325068.51, count: 1057
Login:userAgent - identifier, count: 1057
Logout - 1018
Logout:auth - identifier, count: 571
Logout:createdAt - min: 2013-05-06 08:13:44+00:00, max 2013-05-12 23:59:15-08:00, count: 1018
Logout:refId - identifier, count: 571
Logout:sessionID - identifier, count: 1018
Logout:user - min: 1091145, max 99985450, average: 48915219.96, count: 1018
Logout:userAgent - identifier, count: 1018
Pause - 4424
Pause:auth - identifier, count: 2543
Pause:createdAt - min: 2013-05-06 08:00:49+00:00, max 2013-05-12 23:57:22-08:00, count: 4424
Pause:payload - 4424
Pause:payload:itemId - identifier, count: 4424
Pause:payload:marker - min: 1, max 7215, average: 2207.71, count: 4424
Pause:refId - identifier, count: 2543
Pause:sessionID - identifier, count: 4424
Pause:user - min: 1091145, max 99985450, average: 50103317.24, count: 4424
Pause:userAgent - identifier, count: 4424
Play - 558568
Play:auth - identifier, count: 323244
Play:createdAt - min: 2013-05-06 08:00:01+00:00, max 2013-05-12 23:59:59-08:00, count: 558568
Play:payload - 558568
Play:payload:itemId - identifier, count: 543129
Play:payload:marker - min: 0, max 8525, average: 2138.50, count: 543129
Play:refId - identifier, count: 323244
Play:sessionID - identifier, count: 558568
Play:user - min: 1091145, max 99985450, average: 50192151.13, count: 558568
Play:userAgent - identifier, count: 558568
Position - 164
Position:createdAt - min: 2013-05-06 08:25:34+00:00, max 2013-05-09 07:02:48+00:00, count: 164
Position:payload - 164
Position:payload:itemId - identifier, count: 164
Position:payload:marker - min: 0, max 6690, average: 2358.18, count: 164
Position:sessionID - identifier, count: 164
Position:user - min: 1091145, max 99413523, average: 49317538.82, count: 164
Position:userAgent - identifier, count: 164
Queue - 1313
Queue:auth - identifier, count: 735
Queue:createdAt - min: 2013-05-06 08:01:21+00:00, max 2013-05-12 23:36:31-08:00, count: 1313
Queue:refId - identifier, count: 735
Queue:sessionID - identifier, count: 1313
Queue:user - min: 1091145, max 99806989, average: 50424708.80, count: 1313
Queue:userAgent - identifier, count: 1313
Rate - 652
Rate:auth - identifier, count: 387
Rate:createdAt - min: 2013-05-06 08:03:32+00:00, max 2013-05-12 23:36:08-08:00, count: 652
Rate:payload - 652
Rate:payload:itemId - identifier, count: 652
Rate:payload:rating - min: 1, max 5, average: 3.54, count: 652
Rate:refId - identifier, count: 387
Rate:sessionID - identifier, count: 652
Rate:user - min: 1091145, max 99314647, average: 49635732.38, count: 652
Rate:userAgent - identifier, count: 652
Recommendations - 1344
Recommendations:auth - identifier, count: 784
Recommendations:createdAt - min: 2013-05-06 08:00:56+00:00, max 2013-05-12 23:58:00-08:00, count: 1344
Recommendations:payload - 1344
Recommendations:payload:recs - identifier, count: 33600
Recommendations:refId - identifier, count: 784
Recommendations:sessionID - identifier, count: 1344
Recommendations:user - min: 1091145, max 99985450, average: 50165065.09, count: 1344
Recommendations:userAgent - identifier, count: 1344
Resume - 1774
Resume:createdAt - min: 2013-05-06 08:02:04+00:00, max 2013-05-09 07:31:48+00:00, count: 1774
Resume:payload - 1774
Resume:payload:itemId - identifier, count: 1774
Resume:payload:marker - min: 0, max 6917, average: 2250.60, count: 1774
Resume:sessionID - identifier, count: 1774
Resume:user - min: 1091145, max 99985450, average: 51027539.16, count: 1774
Resume:userAgent - identifier, count: 1774
Search - 1328
Search:auth - identifier, count: 769
Search:createdAt - min: 2013-05-06 08:02:11+00:00, max 2013-05-12 23:36:56-08:00, count: 1328
Search:payload - 1328
Search:payload:results - identifier, count: 26560
Search:refId - identifier, count: 769
Search:sessionID - identifier, count: 1328
Search:user - min: 1170207, max 99976229, average: 50523812.45, count: 1328
Search:userAgent - identifier, count: 1328
Stop - 7178
Stop:auth - identifier, count: 4187
Stop:createdAt - min: 2013-05-06 08:04:10+00:00, max 2013-05-12 23:55:49-08:00, count: 7178
Stop:payload - 7178
Stop:payload:itemId - identifier, count: 7178
Stop:payload:marker - min: 172, max 8233, average: 2692.93, count: 7178
Stop:refId - identifier, count: 4187
Stop:sessionID - identifier, count: 7178
Stop:user - min: 1091145, max 99976229, average: 49769162.34, count: 7178
Stop:userAgent - identifier, count: 7178
VerifyPassword - 133
VerifyPassword:auth - identifier, count: 78
VerifyPassword:createdAt - min: 2013-05-06 10:02:24+00:00, max 2013-05-12 23:02:33-08:00, count: 133
VerifyPassword:refId - identifier, count: 78
VerifyPassword:sessionID - identifier, count: 133
VerifyPassword:user - min: 1634306, max 99314647, average: 47262951.69, count: 133
VerifyPassword:userAgent - identifier, count: 133
WriteReview - 274
WriteReview:auth - identifier, count: 154
WriteReview:createdAt - min: 2013-05-06 08:11:46+00:00, max 2013-05-12 23:38:58-08:00, count: 274
WriteReview:payload - 274
WriteReview:payload:itemId - identifier, count: 274
WriteReview:payload:length - min: 52, max 1192, average: 627.63, count: 274
WriteReview:payload:rating - min: 1, max 5, average: 4.03, count: 274
WriteReview:refId - identifier, count: 154
WriteReview:sessionID - identifier, count: 274
WriteReview:user - min: 1263067, max 99270605, average: 49770339.91, count: 274
WriteReview:userAgent - identifier, count: 274
From the data exploration stage above we learned the structure of the log files: which fields and sub-fields exist, the ranges of their values, and how often they occur. To make the next machine-learning step easier, we can tidy up these log files. The auth and refId fields do not appear in every record and carry no signal for classifying users, so they can simply be dropped. The playback-related event types, such as Play, Pause and Advance, can be merged together. For the Account events, if subAction is parentalControls it means an adult has enabled parental controls on the account, so from that point on the account is used by a kid, while before the parentalControls action it was used by an adult. In other words, within that user session, the parentalControls event splits the activity of this same user ID into an adult part and a kid part. In addition, users who have Account events but never trigger parentalControls are labeled as adults (only adults generate Account events). Following this idea, I wrote two programs, clean_map.py and clean_reduce.py, to tidy up the data.
clean_map.py
#!/usr/bin/python
#coding=utf-8
import dateutil.parser
import json
import sys
from datetime import tzinfo, timedelta, datetime
def main():
for line in sys.stdin:
# Correct for double quotes
data = json.loads(line.replace('""', '"'))
# Correct for variance in field names
item_id = 'item_id'
session_id = 'session_id'
created_at = 'created_at'
if 'sessionID' in data:
session_id = 'sessionID'
if 'createdAt' in data:
created_at = 'createdAt'
elif 'craetedAt' in data:
created_at = 'craetedAt'
if 'payload' in data and 'itemId' in data['payload']:
item_id = 'itemId'
# Prepare the key
userid = data['user']
sessionid = data[session_id]
timestamp = total_seconds(dateutil.parser.parse(data[created_at]) - EPOCH)
key = '%s,%10d,%s' % (userid, timestamp, sessionid)
# Write out the value
if data['type'] == "Account" and data['payload']['subAction'] == "parentalControls":
print "%s\tx:%s" % (key, data['payload']['new'])
elif data['type'] == "Account":
print "%s\tc:%s" % (key, data['payload']['subAction'])
elif data['type'] == "AddToQueue":
print "%s\ta:%s" % (key, data['payload'][item_id])
elif data['type'] == "Home":
print "%s\tP:%s" % (key, ",".join(data['payload']['popular']))
print "%s\tR:%s" % (key, ",".join(data['payload']['recommended']))
print "%s\tr:%s" % (key, ",".join(data['payload']['recent']))
elif data['type'] == "Hover":
print "%s\th:%s" % (key, data['payload'][item_id])
elif data['type'] == "ItemPage":
print "%s\ti:%s" % (key, data['payload'][item_id])
elif data['type'] == "Login":
print "%s\tL:" % key
elif data['type'] == "Logout":
print "%s\tl:" % key
elif data['type'] == "Play" or \
data['type'] == "Pause" or \
data['type'] == "Position" or \
data['type'] == "Stop" or \
data['type'] == "Advance" or \
data['type'] == "Resume":
if len(data['payload']) > 0:
print "%s\tp:%s,%s" % (key, data['payload']['marker'], data['payload'][item_id])
elif data['type'] == "Queue":
print "%s\tq:" % key
elif data['type'] == "Rate":
print "%s\tt:%s,%s" % (key, data['payload'][item_id], data['payload']['rating'])
elif data['type'] == "Recommendations":
print "%s\tC:%s" % (key, ",".join(data['payload']['recs']))
elif data['type'] == "Search":
print "%s\tS:%s" % (key, ",".join(data['payload']['results']))
elif data['type'] == "VerifyPassword":
print "%s\tv:" % key
elif data['type'] == "WriteReview":
print "%s\tw:%s,%s,%s" % (key, data['payload'][item_id], data['payload']['rating'], data['payload']['length'])
"""
Return the number of seconds since the epoch, calculated the hard way.
"""
def total_seconds(td):
return (td.microseconds + (td.seconds + td.days * 24 * 3600) * 10**6) / 10**6
"""
A constant for 0 time difference
"""
ZERO = timedelta(0)
"""
A Timezone class for UTC
"""
class UTC(tzinfo):
def utcoffset(self, dt):
return ZERO
def tzname(self, dt):
return "UTC"
def dst(self, dt):
return ZERO
"""
A constant for the beginning of the epoch
"""
EPOCH = datetime(1970,1,1,tzinfo=UTC())
if __name__ == '__main__':
main()
clean_reduce.py:
#!/usr/bin/python
#coding=utf-8
import json
import sys
def main():
currentSession = None
lastTime = None
data = {}
for line in sys.stdin:
key, value = line.strip().split('\t')
userid, timestr, sessionid = key.split(',')
flag, payload = value.split(':')
if sessionid != currentSession:
currentSession = sessionid;
kid = None
if data:
data['end'] = lastTime
print json.dumps(data)
data = {"popular": [], "recommended": [], "searched": [], "hover": [], "queued": [],
"browsed": [], "recommendations": [], "recent": [], "played": {}, "rated": {}, "reviewed": {},
"actions": [], "kid": kid, "user": userid, "session": sessionid, "start": timestr}
if flag == "C":
data['recommendations'].extend(payload.split(","))
elif flag == "L":
data['actions'].append('login')
elif flag == "P":
data['popular'].extend(payload.split(","))
elif flag == "R":
data['recommended'].extend(payload.split(","))
elif flag == "S":
data['searched'].extend(payload.split(","))
elif flag == "a":
data['queued'].append(payload)
elif flag == "c":
data['actions'].append(payload)
data['kid'] = False
elif flag == "h":
data['hover'].append(payload)
elif flag == "i":
data['browsed'].append(payload)
elif flag == "l":
data['actions'].append('logout')
elif flag == "p":
(marker, itemid) = payload.split(",")
data['played'][itemid] = marker
elif flag == "q":
data['actions'].append('reviewedQueue')
elif flag == "r":
data['recent'].extend(payload.split(","))
elif flag == "t":
(itemid, rating) = payload.split(",")
data['rated'][itemid] = rating
elif flag == "v":
data['actions'].append('verifiedPassword')
data['kid'] = False
elif flag == "w":
(itemid, rating, length) = payload.split(",")
data['reviewed'][itemid] = {}
data['reviewed'][itemid]["rating"] = rating
data['reviewed'][itemid]["length"] = length
elif flag == "x":
# If we see a parental controls event, assume this session was the opposite and start a new session
data['kid'] = payload != "kid"
data['end'] = lastTime
print json.dumps(data)
data = {"popular": [], "recommended": [], "searched": [], "hover": [], "queued": [],
"browsed": [], "recommendations": [], "recent": [], "played": {}, "rated": {}, "reviewed": {},
"actions": [], "kid": payload == "kid", "user": userid, "session": sessionid, "start": timestr}
lastTime = timestr
data['end'] = lastTime
print json.dumps(data, sort_keys=True)
if __name__ == '__main__':
main()
Run the following command to execute the job:
$ hadoop jar $STREAMING -input data/heckle/ -input data/jeckle/ -output clean -mapper clean_map.py -file clean_map.py -reducer clean_reduce.py -file clean_reduce.py
Once it has finished, view the result with the following command:
$ hadoop fs -cat clean/part-00000 | head -1
The result looks like this:
{
"session": "2b5846cb-9cbf-4f92-a1e7-b5349ff08662",
"hover": [
"16177",
"10286",
"8565",
"10596",
"29609",
"13338"
],
"end": "1368189995",
"played": {
"16316": "4990"
},
"browsed": [
],
"recommendations": [
"13338",
"10759",
"39122",
"26996",
"10002",
"25224",
"6891",
"16361",
"7489",
"16316",
"12023",
"25803",
"4286e89",
"1565",
"20435",
"10596",
"29609",
"14528",
"6723",
"35792e23",
"25450",
"10143e155",
"10286",
"25668",
"37307"
],
"actions": [
"login"
],
"reviewed": {
},
"start": "1368189205",
"recommended": [
"8565",
"10759",
"10002",
"25803",
"10286"
],
"rated": {
},
"user": "10108881",
"searched": [
],
"popular": [
"16177",
"26365",
"14969",
"38420",
"7097"
],
"kid": null,
"queued": [
"10286",
"13338"
],
"recent": [
"18392e39"
]
}
In the data cleaning stage above we simplified and consolidated the data and, based on the Account events, labeled some of the users as adults or kids. The next step is to extract suitable features in order to classify the users that are still unlabeled. From the perspective of a movie site's business, adults and kids watch different content, so we can classify users by the items they access. Collecting, for every user, the items that user accessed, and for every item, the users that accessed it, gives us a standard bipartite graph, to which the PageRank algorithm can be applied.
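To make the bipartite structure concrete, here is a tiny hand-written sketch (all ids below are made up purely for illustration) of the two adjacency views that the next two map-reduce jobs will build, one keyed by user and one keyed by item:
#!/usr/bin/python
#coding=utf-8
# A toy bipartite graph with made-up ids: user -> items accessed, plus the
# inverted view item -> users derived from it.
user_items = {
    "1001a": ["501", "502"],  # a user already labelled as an adult
    "1002": ["502", "503"],   # an unlabelled user
}
item_users = {}
for user, items in user_items.items():
    for item in items:
        item_users.setdefault(item, []).append(user)
print item_users  # {'501': ['1001a'], '502': ['1001a', '1002'], '503': ['1002']}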
1. A map-reduce job extracts the items each user has accessed; see the programs kid_map.py and kid_reduce.py below:
#!/usr/bin/python
#coding=utf-8
#kid_map.py
import json
import sys
def main():
# Read all lines from stdin
for line in sys.stdin:
data = json.loads(line)
# Collect all items touched
items = set()
items.update(data['played'].keys())
items.update(data['rated'].keys())
items.update(data['reviewed'].keys())
# Generate a comma-separated list
if items:
itemstr = ','.join(items)
else:
itemstr = ','
# Emit a compound key and compound value
print "%s,%010d,%010d\t%s,%s" % (data['user'], long(data['start']), long(data['end']), data['kid'], itemstr)
if __name__ == '__main__':
main()
#!/usr/bin/python
#coding=utf-8
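#kid_reduce.py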
import sys
def main():
current = None
# Read all lines from stdin
for line in sys.stdin:
# Decompose the compound key and value
key, value = line.strip().split('\t')
user, start, end = key.split(',')
kid, itemstr = value.split(',', 1)
# Create a data record for this user
data = {}
data['user'] = user
data['kid'] = str2bool(kid)
data['items'] = set(itemstr.split(","))
if not current:
current = data
else:
if current['user'] != user or \
(data['kid'] != None and current['kid'] != None and data['kid'] != current['kid']):
# If this is a new user or we have new information about whether the user is a kid
# that conflicts with what we knew before, then print the current record and start
# a new record.
dump(current)
current = data
else:
if data['kid'] != None and current['kid'] == None:
# If we just found out whether the user is a kid, store that
current['kid'] = data['kid']
# Store the items
current['items'].update(data['items'])
# Print the record for the last user.
dump(current)
"""
Emit the data record
"""
def dump(data):
# Remove any empty items
try:
data['items'].remove('')
except KeyError:
pass
# If there are still items in the record, emit it
if len(data['items']) > 0:
# Annotate the session ID if we know the user is an adult or child
if data['kid'] == True:
data['user'] += 'k'
elif data['kid'] == False:
data['user'] += 'a'
print "%s\t%s" % (data['user'], ",".join(data['items']))
"""
Translate a string into a boolean, but return None if the string doesn't parse.
"""
def str2bool(str):
b = None
if str.lower() == 'true':
b = True
elif str.lower() == 'false':
b = False
return b
if __name__ == '__main__':
main()
Run the following command to execute the job:
$ hadoop jar $STREAMING -mapper kid_map.py -file kid_map.py -reducer kid_reduce.py -file kid_reduce.py -input clean -output kid
The generated data looks like this (the first column is the user name, where a suffix of a marks an adult and k marks a kid; the second column is the list of items that user has accessed):
$ hadoop fs -cat kid/part-00000 | head -4
10108881 9107,16316
10142325 9614
10151338a 34645
10151338k 38467,33449,26266
2. From the users whose identity has already been labeled, prepare user lists to serve as teleport sets, one for adults and one for kids. Also split the users into a training set and a test set in an 80/20 ratio. This is done with the following commands:
$ hadoop fs -cat kid/part-\* | cut -f1 | grep a > adults
$ expr `wc -l adults | awk '{ print $1 }'` / 5
20
$ tail -n +21 adults | hadoop fs -put - adults_train
$ head -20 adults | hadoop fs -put - adults_test
$ hadoop fs -cat kid/part-\* | cut -f1 | grep k > kids
$ expr `wc -l kids | awk '{ print $1 }'` / 5
24
$ tail -n +25 kids | hadoop fs -put - kids_train
$ head -24 kids | hadoop fs -put - kids_test
3. Build the adjacency lists. We already have the data mapping each user to the items they accessed; we also need a file mapping each item to the users that accessed it. This is done with a map-reduce job; see item_map.py and item_reduce.py below:
#!/usr/bin/python
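#item_map.py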
import sys
def main():
# Read all lines from stdin
for line in sys.stdin:
key, value = line.strip().split('\t')
items = value.split(',')
# Emit every item in the set paired with the user ID
for item in items:
print "%s\t%s" % (item, key)
if __name__ == '__main__':
main()
#!/usr/bin/python
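#item_reduce.py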
import sys
def main():
last = None
# Read all lines from stdin
for line in sys.stdin:
item, user = line.strip().split('\t')
if item != last:
if last != None:
# Emit the previous key
print "%s\t%s" % (last, ','.join(users))
last = item
users = set()
users.add(user)
# Emit the last key
print "%s\t%s" % (last, ','.join(users))
if __name__ == '__main__':
main()
Run the following command to execute the job:
$ hadoop jar $STREAMING -mapper item_map.py -file item_map.py -reducer item_reduce.py -file item_reduce.py -input kid -output item
The generated data looks like this (the first column is the item name, the second column is the list of users that have accessed that item):
$ hadoop fs -cat item/part-\* | head
10081e1 85861225,78127887,83817844,67863534,79043502
10081e10 10399917
10081e11 58912004
10081e2 58912004
10081e3 10399917
10081e4 10399917
10081e5 10399917
10081e7 10399917
10081e8 58912004
10081e9 58912004
The idea of the algorithm is as follows. Starting from the adult users in the training set, each adult user is assigned an initial probability. From each adult user we compute the probability of reaching every item that user accessed, and for each item we propagate its probability onward through the users that accessed that item. This process is repeated until the sum of the differences between the probabilities computed in the previous iteration and those of the current iteration falls below a preset threshold, at which point the computation is considered converged. The resulting values represent the probability that a user is an adult.
Likewise, starting from the kid users in the training set and iterating in the same way, we obtain the probability that each user is a kid.
The two scores are then compared for every user: whichever is larger, adult or kid, determines how the user is classified.
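Before the Spark version, the loop below is a minimal single-machine sketch of this propagation (the graph and its ids are made up for illustration; BETA and the 0.01 convergence threshold mirror the Spark program that follows):
#!/usr/bin/python
#coding=utf-8
# A toy personalized-PageRank loop over a tiny bipartite graph (made-up ids).
BETA = 0.8
graph = {
    "1001a": ["501", "502"], "1002": ["502", "503"],              # user -> items
    "501": ["1001a"], "502": ["1001a", "1002"], "503": ["1002"],  # item -> users
}
seeds = ["1001a"]                               # labelled training users (the teleport set)
v = dict((s, 1.0 / len(seeds)) for s in seeds)  # initial probability mass on the seeds
while True:
    nxt = {}
    # Spread each node's mass evenly over its neighbours
    for node, prob in v.items():
        for nbr in graph.get(node, []):
            nxt[nbr] = nxt.get(nbr, 0.0) + prob / len(graph[node])
    # Damp by BETA, then hand the teleport mass back to the seeds only
    nxt = dict((n, BETA * p) for n, p in nxt.items())
    for s in seeds:
        nxt[s] = nxt.get(s, 0.0) + (1.0 - BETA) / len(seeds)
    # Stop when the total change between iterations falls below the threshold
    diff = sum(abs(nxt.get(n, 0.0) - v.get(n, 0.0)) for n in set(v) | set(nxt))
    v = nxt
    if diff < 0.01:
        break
print v  # nodes reachable from the seeds, with their final scores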
Since Spark's in-memory computation model is particularly well suited to distributed algorithms that require many iterations, the algorithm is implemented on Spark. The program is as follows:
#coding=utf-8
from pyspark import SparkConf, SparkContext
import re
import json
import sys
conf = SparkConf().setMaster("local").setAppName("MyAPP")
sc = SparkContext(conf = conf)
adults_train = sc.textFile("adults_train")  # read the adults_train training data; each line is one user, and the suffix a marks the user as an adult
init_prob = 1.0 / adults_train.count()  # initial probability, i.e. the probability of picking any single user at random
v = adults_train.map(lambda x: (x.split("\t")[0], init_prob)).cache()  # build an RDD named v whose keys are the user names from the training data and whose values are the initial probability
v.count()  # run count() to force the RDD to be materialized
BETA = 0.8  # damping parameter controlling the transition probability
teleport = init_prob * (1 - BETA)  # teleport probability, i.e. the probability of jumping from the current node to another one, modelling the random page jumps in PageRank
sourceRDD = adults_train.map(lambda x: (x.split("\t")[0], teleport)).cache()  # build an RDD named sourceRDD whose keys are the user names from the training data and whose values are the teleport probability
sourceRDD.count()  # run count() to force the RDD to be materialized
limit = 0.01  # convergence threshold: when the total difference between this iteration's result and the previous one is below this value, the computation is considered converged
# mapFunc recomputes the probability values of the RDD.
# For example, take a tuple (12345, ('11144,12155', 0.1)): 12345 is the name of a user or item,
# '11144,12155' lists the items it accessed (or the users that accessed it), and 0.1 is the
# probability of reaching 12345 after the iterations so far. The probability of following the edge
# from 12345 to 11144 or to 12155 is therefore 0.1 * (1/2) = 0.05, so both 11144 and 12155
# receive 0.05, and that is what the function returns.
def mapFunc(inputTuple):
    result = []
    items = inputTuple[1][0].split(",")
    prob = 1.0 / len(items)
    value = prob * inputTuple[1][1]
    for item in items:
        result.append((item, value))
    return result
# reduceFunc folds in the random-jump probability, which avoids the spider-trap problem
def reduceFunc(inputTuple):
    if inputTuple[1] != None:  # inputTuple[1] != None means this key is one of the keys in sourceRDD, i.e. a seed node, so the teleport probability has to be added
        value = BETA * inputTuple[0] + inputTuple[1]
    else:  # otherwise this is an intermediate node, so just scale the probability by BETA
        value = BETA * inputTuple[0]
    return value
kid = sc.textFile("kid/part*")  # file of items accessed per user: the first column of each line is the user name (suffix k for kid, a for adult), the second column is a comma-separated list of item names the user has accessed
item = sc.textFile("item/part*")  # file of users per item: the first column of each line is the item name, the second column is a comma-separated list of the user names that accessed the item
inputRDD = kid.union(item).map(lambda x: (x.split("\t")[0], x.split("\t")[1])).cache()  # combine the two files into one RDD, with the first column as key and the second column as value
inputRDD.count()
i = 0  # iteration counter
prevResultRDD = None  # holds the result of the previous iteration
resultRDD = None  # holds the result of the current iteration
while True:
    temp = inputRDD.leftOuterJoin(v)  # leftOuterJoin inputRDD with v; the key is an item or user name, and the value is a tuple whose first element is the list of items the key accessed (for user keys) or the list of users that accessed it (for item keys), and whose second element is the probability stored for the key in v
    temp = temp.filter(lambda (key, value): value[-1]!=None)  # drop the keys whose probability is None
    temp = temp.map(mapFunc)  # update the probabilities: the result is the probability of moving from the key to each item or user it connects to
    temp = temp.flatMap(lambda x: x)  # flatten the result
    temp = temp.reduceByKey(lambda x,y: x+y)  # sum the probabilities arriving at the same key
    temp = temp.leftOuterJoin(sourceRDD)  # bring in the teleport probability as well
    temp = temp.mapValues(reduceFunc)  # combine the two probabilities
    temp = temp.coalesce(5)  # reduce the number of partitions (and hence tasks)
    temp1 = sourceRDD.subtractByKey(temp)  # keep only the sourceRDD entries whose keys do not appear in temp
    resultRDD = temp1.union(temp)  # union temp1 and temp into resultRDD
    v = resultRDD  # use resultRDD as v for the next iteration
    if i > 0:  # if this is not the first iteration
        diff = resultRDD.union(prevResultRDD).reduceByKey(lambda x,y: abs(x-y)).values().sum()  # sum the differences between the values of matching keys in resultRDD and prevResultRDD
        print "Iteration %i diff: %f" % (i,diff)  # print the iteration number and the total difference for this iteration
        if diff < limit:  # if the difference is below limit, the computation has converged and the loop can stop
            break
    prevResultRDD = resultRDD  # keep this iteration's result so it can be compared with the next one
    i = i + 1
resultRDD.saveAsTextFile("adults_spark_result")  # save the result to adults_spark_result
kids_train = sc.textFile("kids_train")  # read the kids_train training data; each line is one user, and the suffix k marks the user as a kid
init_prob = 1.0 / kids_train.count()  # initial probability, i.e. the probability of picking any single user at random
v = kids_train.map(lambda x: (x.split("\t")[0], init_prob)).cache()  # build an RDD named v whose keys are the user names from the training data and whose values are the initial probability
v.count()  # run count() to force the RDD to be materialized
BETA = 0.8  # damping parameter controlling the transition probability
teleport = init_prob * (1 - BETA)  # teleport probability, i.e. the probability of jumping from the current node to another one, modelling the random page jumps in PageRank
sourceRDD = kids_train.map(lambda x: (x.split("\t")[0], teleport)).cache()  # build an RDD named sourceRDD whose keys are the user names from the training data and whose values are the teleport probability
sourceRDD.count()  # run count() to force the RDD to be materialized
limit = 0.01  # convergence threshold: when the total difference between this iteration's result and the previous one is below this value, the computation is considered converged
# mapFunc recomputes the probability values of the RDD.
# For example, take a tuple (12345, ('11144,12155', 0.1)): 12345 is the name of a user or item,
# '11144,12155' lists the items it accessed (or the users that accessed it), and 0.1 is the
# probability of reaching 12345 after the iterations so far. The probability of following the edge
# from 12345 to 11144 or to 12155 is therefore 0.1 * (1/2) = 0.05, so both 11144 and 12155
# receive 0.05, and that is what the function returns.
def mapFunc(inputTuple):
    result = []
    items = inputTuple[1][0].split(",")
    prob = 1.0 / len(items)
    value = prob * inputTuple[1][1]
    for item in items:
        result.append((item, value))
    return result
# reduceFunc folds in the random-jump probability, which avoids the spider-trap problem
def reduceFunc(inputTuple):
    if inputTuple[1] != None:  # inputTuple[1] != None means this key is one of the keys in sourceRDD, i.e. a seed node, so the teleport probability has to be added
        value = BETA * inputTuple[0] + inputTuple[1]
    else:  # otherwise this is an intermediate node, so just scale the probability by BETA
        value = BETA * inputTuple[0]
    return value
kid = sc.textFile("kid/part*")  # file of items accessed per user: the first column of each line is the user name (suffix k for kid, a for adult), the second column is a comma-separated list of item names the user has accessed
item = sc.textFile("item/part*")  # file of users per item: the first column of each line is the item name, the second column is a comma-separated list of the user names that accessed the item
inputRDD = kid.union(item).map(lambda x: (x.split("\t")[0], x.split("\t")[1])).cache()  # combine the two files into one RDD, with the first column as key and the second column as value
inputRDD.count()
i = 0  # iteration counter
prevResultRDD = None  # holds the result of the previous iteration
resultRDD = None  # holds the result of the current iteration
while True:
    temp = inputRDD.leftOuterJoin(v)  # leftOuterJoin inputRDD with v; the key is an item or user name, and the value is a tuple whose first element is the list of items the key accessed (for user keys) or the list of users that accessed it (for item keys), and whose second element is the probability stored for the key in v
    temp = temp.filter(lambda (key, value): value[-1]!=None)  # drop the keys whose probability is None
    temp = temp.map(mapFunc)  # update the probabilities: the result is the probability of moving from the key to each item or user it connects to
    temp = temp.flatMap(lambda x: x)  # flatten the result
    temp = temp.reduceByKey(lambda x,y: x+y)  # sum the probabilities arriving at the same key
    temp = temp.leftOuterJoin(sourceRDD)  # bring in the teleport probability as well
    temp = temp.mapValues(reduceFunc)  # combine the two probabilities
    temp = temp.coalesce(5)  # reduce the number of partitions (and hence tasks)
    temp1 = sourceRDD.subtractByKey(temp)  # keep only the sourceRDD entries whose keys do not appear in temp
    resultRDD = temp1.union(temp)  # union temp1 and temp into resultRDD
    v = resultRDD  # use resultRDD as v for the next iteration
    if i > 0:  # if this is not the first iteration
        diff = resultRDD.union(prevResultRDD).reduceByKey(lambda x,y: abs(x-y)).values().sum()  # sum the differences between the values of matching keys in resultRDD and prevResultRDD
        print "Iteration %i diff: %f" % (i,diff)  # print the iteration number and the total difference for this iteration
        if diff < limit:  # if the difference is below limit, the computation has converged and the loop can stop
            break
    prevResultRDD = resultRDD  # keep this iteration's result so it can be compared with the next one
    i = i + 1
resultRDD.saveAsTextFile("kids_spark_result")  # save the result to kids_spark_result
The results are as follows; only the scores of the first 10 users are listed for each run.
Probability that a user is a kid:
(u'8411e9', 2.871297291176899e-05)
(u'35138e2', 0.00012247778794087868)
(u'21393e18', 4.5223254296937176e-05)
(u'27596', 0.00070710092703777001)
(u'33455e4', 0.00013915494161768303)
(u'7088e2', 5.793089136754104e-05)
(u'36894585k', 0.0027620469037547314)
(u'13697e4', 0.00011937506683255453)
(u'68317773', 0.00027186506000864545)
(u'14187e7', 7.5962565791353471e-06)
Probability that a user is an adult:
(u'18529e1', 4.3669091207240875e-06)
(u'31739', 2.9099292474953138e-05)
(u'23197e4', 8.4660175805440307e-05)
(u'62946607', 6.4222417209513122e-05)
(u'27404e3', 1.5859663143835915e-06)
(u'5279e4', 1.3262905188722944e-05)
(u'32109', 0.00011087543073187501)
(u'39715', 3.5655696054560281e-06)
(u'90719334', 0.0001079840537305259)
(u'32718', 1.2653791421612406e-05)
Comparing these two sets of scores, whichever probability is larger decides whether a user is classified as an adult or a kid. When comparing the scores we also need to account for the different numbers of users in the adult and kid training sets.
First we write a map task that rescales each user's adult score by the ratio of adult to kid training users and negates it. The adjusted adult score can then simply be added to the kid score: if the sum is greater than 0 the user is a kid, otherwise an adult. The program is as follows:
#!/usr/bin/python
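#adult_map.py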
import sys
def main():
if len(sys.argv) < 3:
sys.stderr.write("Missing args: %s\n" % sys.argv)
# Calculate conversion factor
num_adults = float(sys.argv[1])
num_kids = float(sys.argv[2])
factor = -num_adults / num_kids
# Apply the conversion to every record and emit it
for line in sys.stdin:
key, value = line.strip().split('\t')
print "%s\t%.20f" % (key, float(value) * factor)
if __name__ == "__main__":
main()
First run the following two commands to count the users in the adult training set and in the kid training set:
$ hadoop fs -cat adults_train | wc -l
84
$ hadoop fs -cat kids_train | wc -l
96
Using these two counts as arguments, run the following command:
$ hadoop jar $STREAMING -D mapred.reduce.tasks=0 -input adult_final -output adult_mod -mapper "adult_map.py 84 96" -file adult_map.py
The output adult_mod contains the adjusted adult scores.
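For example (with made-up scores): with 84 adult and 96 kid training users the conversion factor is -84/96 ≈ -0.875, so an adult score of 0.0002 becomes about -0.000175. If the same user's kid score is 0.0003, the reducer below sums the two to roughly +0.000125; the sum is greater than 0, so the user is labelled a kid.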
Next we create a reduce task; the program is as follows:
#!/usr/bin/python
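#combine_reduce.py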
import re
import sys
def main():
last = None
last_root = None
sum = 0.0
p = re.compile(r'^(\d{7,8})[ak]?$')
# Read all the lines from stdin
for line in sys.stdin:
key, value = line.strip().split('\t')
m = p.match(key)
# Ignore anything that's not a user ID
if m:
if key != last:
# Dump the previous user ID's label if it's not
# the same real user as the current user.
if last != None and last_root != m.group(1):
dump(last, sum)
last = key
last_root = m.group(1)
sum = 0.0
sum += float(value)
dump(last, sum)
def dump(key, sum):
# sum is between -1 and 1, so adding one and truncating gets us 0 or 1.
# We want ties to go to adults, though.
if sum != 0:
print "%s\t%d" % (key, int(sum + 1))
else:
print "%s\t0" % key
if __name__ == "__main__":
main()
Run the following command:
$ hadoop jar $STREAMING -D mapred.textoutputformat.separator=, -input adult_mod -input kid_final -output final -reducer combine_reduce.py -file combine_reduce.py -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat
The output final contains the classification result for each user: 0 means adult and 1 means kid.
Run the following commands to check whether our classification is correct:
$ hadoop fs -cat final/part\* | grep a, | grep -v ,0
$ hadoop fs -cat final/part\* | grep k, | grep -v ,1
These commands look up the users whose suffix is a and check whether any of them have a classification other than 0, and the users whose suffix is k and check whether any have a classification other than 1. Any output would mean that a user was misclassified. Judging from the results of these commands, every user was classified correctly.
To evaluate on the test sets, repeat the steps above, changing the files read in pagerank.py from adults_train and kids_train to adults_test and kids_test; the other steps are the same. The resulting accuracy is around 99%, so I will not go through the details again.
The PageRank algorithm turns out to be a very good fit for problems with this kind of bipartite graph structure.