Apache Spark is the smartphone of BigData
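The backend is a three-node Spark cluster. The Python version is 3.5.4 and the Spark build is spark-2.3.0-bin-hadoop2.7; the script is run on Windows 10.
Part of the data to be processed is listed below. Fields are separated by TAB characters.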
121508281810000000 http://www.yhd.com/?union_ref=7&cp=0 3 PR4E9HWE38DMN4Z6HUG667SCJNZXMHSPJRER VFA5QRQ1N4UJNS9P6MH6HPA76SXZ737P 10977119545124.65.159.122 unionKey:10977119545 2015-08-28 18:10:00 50116447 http://image.yihaodianimg.com/virtual-web_static/virtual_yhd_iframe_index_widthscreen.html?randid=2015828 6 1000 Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0 Win32 lunbo_tab_3 北京
市 2 北京市 1 1 1 1 1440
*900 1440756285639
121508281810000001 http://my.yhd.com/order/finishOrder.do?orderCode=5435446505152 http://buy.yhd.com/checkoutV3/index.do 3 YJ25S3QAVPAS31PHSB3HFGZ1E5AYMKX9XUTX 6W26QM41DM6HHND3R4FP42YYXXE1NKGA 222.73.202.251 2015-08-28 18:10:00 85133152 http://www.haosou.com/s?src=new_isearch&q=1%E5%8F%B7%E5%BA%97 25 0 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 Win32 MY_ORDERCOMP
LETION_EDITADDRESS 上海市 1 上海市 2058 0 2058 0 1366*768 1440
756699916
121508281810000002 http://list.yhd.com/p/c5072-b-a-s1-v0-p1-price-d0-pid-pt1086211-pl1171565-m0-k?tp=44.1086211.0.0.0.Kxnn54p-11-FFJKr http://list.yhd.com/p/pt1086211-pl1171565?tp=44.1086211.1508.0.1.Kxn
myye-11-FFJKr 3 JRBWWU6ECXN15Q2Z5QT4TETNHKY7QHE3Y8B3 44.1086211.0.0.0.Kxnn54p-11-FFJKr 5Z5JZMYUGK9TP3QWHDDTU6G5T6PHEQRZ 4734 111.193.165.158 msessionid:DW6SB2FGG84ZZ2WD77DAZHFBXNV8D5776RQ4,uname:gaochentongxue,unionKey:4734,websiteId:A100215249 2015-08-28 18:10:00 116262550 http://www.yhd.com/?tracker_u=1624169&t=1440753050503 1071000 Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 Win32 107 1 search_navi_
cat_4 北京市 2 北京市 47 44 KxnnKjs-11-FFJKr 1086211 44 Kxnn54p-11-FFJKr 1086211 1366*768 0 0 0 1 1440756878359
121508281810000003 http://list.yhd.com/p/c5996-b-a-s1-v0-p1-price-d0-pid21496-pt1074467-pl1157690-m0-k?tp=44.1074467.0.0.0.KxnlcrD-11-EnNUs http://list.yhd.com/p/c0-b-a-s1-v0-p1-price-d0-pid21496-pt10
74467-pl1157690-m0?ref=1_1_51_search&tc=3.1.5.994560.48&tp=51.%E5%84%BF%E7%AB%A5%E6%B2%90%E6%B5%B4%E9%9C%B2.124.48.4.KxnlGug-11-EnNUs 2 37G1MDD68UF8K9XYGVCUA9WFNNR7C1133W9S 44.1074467.0
.0.0.KxnlcrD-11-EnNUs 5TMZXMUKJWK76FNZMVE2TCM4UQW7ZNJH 8363 1 180.162.8.13 msessionid:D3DNC2F91D4VNF49RQG3RDG5J2SQ2JD9,uname:梁静,unionKey:8363,unionType:1,uid:B000ph4
54itr 2015-08-28 18:10:00 117902468 http://fun.fanli.com/goshop/go/?id=633&lc=shopdetail_goumai&v=1440755590146 58 20 1 Mozi
lla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 Win32 493 1 search_navi_cat_0 上海市 1上海市 78 44 KxnmEa2-11-EnNUs 1074467 44 KxnlcrD-11-EnNUs 1074467 1280
*1024 0 0 0 1 1440756591009
121508281810000004 3 2EC97A32-7C27-4F53-A122-B60AA9A987F7 PAU63A8H6A21F81NHTG2X4O9M08Y6148 10680917 49.65.71.158 2015-08-28 18:10:00 156854179 5 37 iPhone 江苏省 5南京市 1200002 MTQ0MDc1NjU1OTA1NQ== 300000 50568587 yhdapp 4.1.1 8.1.1 iPho
ne 750*1334 cmcc 4g 1 8366231 118.777052 32.005723 江苏省-南京
-雨花台区-雨花路-203号
121508281810000005 http://t.yhd.com/detailBrand/21782?tp=4.174850.m3022912.0.2.KxnmxAS-11-8nDA8 http://cms.yhd.com/sale/174850?tc=ad.0.0.15114-19945723.1&tp=1.1.708.0.1.Kxnms^7-11-F4YGp 3 X491CWDNMC1YTEK7WRVUTQMZMXF4X63U54SC 4.174850.m3022912.0.2.KxnmxAS-11-8nDA8 25UYKQWJ13GB2E23SC1HH64HXV12TX3E 118.112.161.79 msessionid:ZNTMUVQ5VTTVHJ376
3DFQVZM955ZWTNE,uname:大肚子 2015-08-28 18:10:00 163477423 http://www.yhd.com/ 4 12 111 Mozilla/5.0 (Windows NT 6.1; WOW64)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.152 Safari/537.36 Win32 四川省 12 成都市 9 2019 KxnmyW3-11-8nDA8 21782_1_-100_1 4 KxnmxAS-11-8nDA8 174850 1920*1080 m3022912 1440756787216
121508281810000006 3 7878E64C-B06B-4188-9F77-C4C77C38FDBD 823JXQ1J6BH4L0C68I78R83TK9ZQ6VND 10680917 117.94.64.207 unionKey:10680917 2015-08-28 18:10:00 137779971 5 48 iPhone江苏省 5 泰州市 PhoneMallMoreProductsVC MTQ0MDc1NjU0NzExMA== 5028 36400 yhda
pp 4.1.1 8.0.2 iPhone 750*1334 cmcc 2g 1 8366231 119.865009 32.992094
121508281810000007 3 E5FC9449-3777-4103-BF0E-F402F74E2C00 EW935DJP6RWJUCDR8REOQ1YA808F6LBF 112.82.93.34 2015-08-28 18:10:00 140741829 5 45 iPhone 1914 江苏省 5盐城市 200000 MTQ0MDc1NjQyNzY0OQ== 洗发水男士 200000 洗发水男士 yhdapp 4.1.1 8.1.1 iPho
ne 750*1334 cmcc 2g 3 3 2 8366231 0.000000 0.000000
121508281810000008 3 00000000-4049-1cca-e842-1f5049b8efdf W2T8GEGK1JGHNGVCQRKP4A9S681VNBBF 221.220.248.11 2015-08-28 18:10:00 63573725 162371 2 1000 4.1.1 android 北京市 2北京市 51283808 46000 dvb`WiU 2040913 5005 dvb`C8O 175319 43943800 yhdapp 4.1.1 4.1.2 ardphone 720*
1280 cmcc wifi m3032384 2 1 1019323363 116.481765 40.00163 北京市-北京
-东城区-灯市口大街-17号
121508281810000009 http://item.m.yhd.com/item/73725?tp=5006.0.1756.0.11.Kxnlg48-11-4lXQc http://m.yhd.com/1?tracker_u=10525888234 3 3CUMUY385J8ZH1Y72EA8C1W3WT9PT3C2PN4C undefined.un
defined.0.0.0.undefined 5006.0.1756.0.11.Kxnlg48-11-4lXQc HEZ1RNMMBV1D4RQ38CB27J9KN95ZFNF4 10525888234 180.154.150.29 msessionid:2EKNRYZB1K5YWGCHXK6W2RCPKW1Q88SRE
KBP,uname:13862976878@phone,unionKey:10525888234 2015-08-28 18:10:00 202561027 http://m.yhd.com/1?tracker_u=10525888234 6 9928 1 1 1 Mozilla/5.0 (Linux; U; Android 4.4.2; zh-cn; Che2-TL00 Build/HonorChe2-TL00) AppleWebKit/537.36 (KHTML, like Gecko)Version/4.0 MQQBrowser/6.0 Mobile Safari/537.36 AndroidSystem pms_apphome_umaylike_intent_profile_yhd_6_0_9928 上海市 1 上海市 73725 5048 KxnmGkk-11-4lXQc 73725 5006 Kxnlg48-11-4
lXQc 0 br_qq 6.0 4.4.2 720*1280 wifi 1756 0 11 1440756599518
121508281810000010
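On CentOS, the cat -T command makes the structure easier to see: ^I stands for one TAB. (The M- sequences and the trailing $ in the output below are what cat -A adds on top of that: non-ASCII bytes and the end of the line.)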
121508281810000000^Ihttp://www.yhd.com/?union_ref=7&cp=0^I^I^I3^IPR4E9HWE38DMN4Z6HUG667SCJNZXMHSPJRER^I^I^I^I^IVFA5QRQ1N4UJNS9P6MH6HPA76SXZ737P^I10977119545^I^I124.65.159.122^I^IunionKey:10977119545^I^I2015-08-28
18:10:00^I50116447^Ihttp://image.yihaodianimg.com/virtual-web_static/virtual_yhd_iframe_index_widthscreen.html?randid=2015828^I6^I^I^I^I1000^I^I^I^I^IMozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/
40.0^IWin32^I^I^I^I^Ilunbo_tab_3^I^IM-eM-^LM-^WM-dM-:M-,M-eM-8M-^B^I2^I^I^IM-eM-^LM-^WM-dM-:M-,M-eM-8M-^B^I^I^I^I^I^I1^I^I1^I1^I^I1^I^I^I^I^I^I^I^I^I^I^I1440*900^I^I^I^I^I^I^I^I^I^I^I^I^I^I^I^I^I^I^I^I^I^I1440756
285639$
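The same inspection can be done in Python when cat is not at hand, for instance on Windows. The following is a minimal sketch, not part of the original job: show_tabs and indexed_fields are hypothetical helper names, and SAMPLE contains only the first 18 fields of the first record shown above.

```python
def show_tabs(line):
    """Render each TAB as ^I, the way cat -T does."""
    return line.replace("\t", "^I")


def indexed_fields(line):
    """Return (index, value) pairs for the TAB-separated fields of a line."""
    return list(enumerate(line.rstrip("\n").split("\t")))


# First 18 fields of the first sample record; the rest is omitted here.
SAMPLE = "\t".join([
    "121508281810000000",
    "http://www.yhd.com/?union_ref=7&cp=0",
    "", "",
    "3",
    "PR4E9HWE38DMN4Z6HUG667SCJNZXMHSPJRER",
    "", "", "", "",
    "VFA5QRQ1N4UJNS9P6MH6HPA76SXZ737P",
    "10977119545",
    "",
    "124.65.159.122",
    "",
    "unionKey:10977119545",
    "",
    "2015-08-28 18:10:00",
])

print(show_tabs(SAMPLE))
# Print only the non-empty fields together with their 0-based index.
for index, value in indexed_fields(SAMPLE):
    if value:
        print(index, value)
```

Counting from zero, the URL sits at index 1, the GUID at index 5, and the tracking time at index 17, which corresponds to columns 2, 6, and 18 when counting from one.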
Extract the time, URL, and GUID from the log file above and compute statistics on them.
import os
import time

from pyspark import SparkConf, SparkContext

if __name__ == '__main__':
    os.environ['SPARK_HOME'] = "C:/spark-2.3.0-bin-hadoop2.7"
    os.environ['HADOOP_HOME'] = 'C:/Users/test/PycharmProjects/pyshark-project/winutils'

    # Create SparkConf (local[2]: run locally with two worker threads)
    sparkConf = SparkConf().setAppName("analyzer_yhd_log").setMaster("local[2]")
    # Create SparkContext
    sc = SparkContext(conf=sparkConf)
    sc.setLogLevel('WARN')

    """
    Step 1:
        read data: SparkContext reads the log from HDFS
    """
    # HDFS directory of the log files
    track_log = "hdfs://NAMENODE-IP:9000/datas/yhd_log"
    # read data
    track_rdd = sc.textFile(track_log)
    # test
    # print("Count = " + str(track_rdd.count()))
    # print(track_rdd.first())

    """
    Step 2: process data
        RDD transformations
        PV / UV
        PV          url        url.length > 0       col2
        UV          guid       distinct guid        col6
        date&time   tracktime  2015-08-28 18:10:00  col18
    """
    # Extract (date, url, guid) from one line; indices are 0-based
    def split_data_func(line):
        words = line.split("\t")
        date_str = words[17][0:10]
        return date_str, words[1], words[5]

    # Filter out blank and short lines, then transform
    filtered_rdd = track_rdd \
        .filter(lambda line: (len(line.strip()) > 0) and (len(line.split("\t")) > 20)) \
        .map(split_data_func)
    # test
    # print(str(filtered_rdd.first()))
    # print(filtered_rdd.count())
    filtered_rdd.cache()

    ######################################
    # Records per date
    pv_rdd = filtered_rdd \
        .map(lambda t: (t[0], 1)) \
        .reduceByKey(lambda x, y: x + y)
    print(pv_rdd.first())

    #######################################
    # Note: distinct() runs on (guid, 1) pairs, so one pair per guid
    # remains and every count printed below is 1.
    uv_rdd = filtered_rdd \
        .map(lambda t: (t[2], 1)) \
        .distinct() \
        .reduceByKey(lambda x, y: x + y)
    # print(uv_rdd.first())
    # Print with a for loop (items 1 to 9; the first item is skipped)
    print(type(uv_rdd.collect()))
    for uv_item in uv_rdd.collect()[1:10]:
        print(str(uv_item[0]) + " = " + str(uv_item[1]))

    #######################################
    filtered_rdd.unpersist()
    time.sleep(10)
    sc.stop()
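The filter and the field extraction can also be checked without a cluster. The sketch below mirrors the two steps in plain Python under stated assumptions: is_valid, parse, and make_line are hypothetical names introduced here, and the records are synthetic rather than taken from the log.

```python
MIN_FIELDS = 21  # the job keeps lines with more than 20 TAB-separated fields


def is_valid(line):
    """Same test as the job's filter: non-blank and more than 20 fields."""
    return len(line.strip()) > 0 and len(line.split("\t")) >= MIN_FIELDS


def parse(line):
    """Return (date, url, guid) from the 0-based fields 17, 1 and 5."""
    words = line.split("\t")
    return words[17][:10], words[1], words[5]


def make_line(url, guid, tracktime, total_fields=25):
    """Build a synthetic record (hypothetical helper, for testing only)."""
    fields = [""] * total_fields
    fields[0] = "121508281810000000"
    fields[1] = url
    fields[5] = guid
    fields[17] = tracktime
    return "\t".join(fields)


lines = [
    make_line("http://www.yhd.com/", "GUID-A", "2015-08-28 18:10:00"),
    "",                    # blank line: dropped by the filter
    "121508281810000010",  # truncated record with one field: dropped
]
records = [parse(line) for line in lines if is_valid(line)]
print(records)
```

Because the filter guarantees at least 21 fields, the lookup at index 17 cannot raise an IndexError on any line that reaches parse.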
Execution result of the Spark job
('2015-08-28', 126134)
75443B45-B929-4D30-B58D-D6C7E0F200B0 = 1
3B7A3E81-DEC1-49F2-B55D-2B290EB83B59 = 1
906D430A-48D4-436A-A7A5-73DFD913E5F6 = 1
284508 = 1
DC1K31VCXJPF4CAKYC79TZCHNM8TJTX7TWZ4 = 1
W8368CQW7MJZY8XYE833YSX8UPPJP87M1XQA = 1
3Y78Y8EUFTS3XTD76ZT3HF8WVW9JYD5UD51F = 1
CF3174F2-3528-4C70-942C-04131CDAB275 = 1
00000000-2627-6aa4-ffff-ffff88480ee0 = 1
Process finished with exit code 0
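Every GUID in the listing above maps to 1 because distinct() is applied to (guid, 1) pairs, which leaves exactly one pair per GUID. If the intent is PV as the number of records with a non-empty URL per day and UV as the number of distinct GUIDs per day, as the Step 2 notes in the script suggest, one possible approach is sketched below in plain Python. The function names and the sample tuples are made up for illustration, and the RDD chains in the comments are an assumption about how the same logic could be written against filtered_rdd; they were not run against the original data.

```python
from collections import defaultdict


def daily_pv(records):
    """Count records per date, keeping only those with a non-empty URL."""
    counts = defaultdict(int)
    for date, url, _guid in records:
        if len(url) > 0:
            counts[date] += 1
    return dict(counts)


def daily_uv(records):
    """Count distinct GUIDs per date."""
    guids = defaultdict(set)
    for date, _url, guid in records:
        guids[date].add(guid)
    return {date: len(seen) for date, seen in guids.items()}


# Possible RDD equivalents (assumed, not run against the original data):
#   pv = filtered_rdd.filter(lambda t: len(t[1]) > 0) \
#                    .map(lambda t: (t[0], 1)) \
#                    .reduceByKey(lambda x, y: x + y)
#   uv = filtered_rdd.map(lambda t: (t[0], t[2])) \
#                    .distinct() \
#                    .map(lambda t: (t[0], 1)) \
#                    .reduceByKey(lambda x, y: x + y)

# Made-up (date, url, guid) tuples, not taken from the log.
records = [
    ("2015-08-28", "http://www.yhd.com/", "GUID-A"),
    ("2015-08-28", "http://list.yhd.com/", "GUID-A"),
    ("2015-08-28", "", "GUID-B"),
    ("2015-08-29", "http://www.yhd.com/", "GUID-A"),
]
print(daily_pv(records))
print(daily_uv(records))
```

In the RDD version, distinct() is applied to (date, guid) pairs before the pairs are reduced to counts, so a visitor who appears many times on one day is counted once for that day.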