A Spark SQL optimization case study: a real-time ranking of stock clicks. The requirements: a daily ranking of stock clicks, an hourly ranking, and the day-over-day and hour-over-hour changes for each.
1. Timestamps are taken from the Linux shell. They could also be computed in SQL or with a UDF, but fetching them in the shell is noticeably faster.
#!/bin/sh
# upload logs to hdfs
source /etc/profile
yesterday=`date --date='1 days ago' +%Y%m%d`
yesterday2=`date --date='1 days ago' +%Y-%m-%d`
today=`date --date='0 days ago' +%Y%m%d`
today2=`date --date='0 days ago' +%Y-%m-%d`
nowtime=`date --date='0 days ago' +%H:%M:%S`
onehourage=`date --date='1 hours ago' "+%Y-%m-%d %H:%M:%S"`
twohourage=`date --date='2 hours ago' "+%Y-%m-%d %H:%M:%S"`
This captures yesterday's and today's dates, plus the current time and the timestamps one and two hours ago.
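One detail worth noting: because these are zero-padded "YYYY-MM-DD HH:MM:SS" strings, plain string comparison orders them chronologically, which is exactly what the SUBSTR(DATETIME, ...) and DATETIME >= '${onehourage}' predicates in the SQL below rely on. A minimal standalone sketch:

```shell
#!/bin/sh
# Zero-padded ISO-style timestamps sort the same way lexicographically
# as chronologically, so the SQL can compare DATETIME values as strings.
onehourage=`date --date='1 hours ago' "+%Y-%m-%d %H:%M:%S"`
now=`date "+%Y-%m-%d %H:%M:%S"`
# Sort the two strings; the lexicographically smallest must be the older one.
earliest=`printf '%s\n%s\n' "$onehourage" "$now" | sort | head -n 1`
[ "$earliest" = "$onehourage" ] && echo "lexicographic order == chronological order"
```

This also holds across midnight, since the date part is compared first.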
Invoke spark-sql from the shell (the statement to run follows as the argument to -e):
/opt/modules/spark/bin/spark-sql \
  --master spark://10.130.2.20:7077 \
  --executor-memory 10g \
  --total-executor-cores 120 \
  --conf spark.ui.port=54689 \
  --driver-memory 5g \
  -e
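Since -e takes the query text inline, the usual pattern is to let the shell expand the date variables into a double-quoted SQL string before spark-sql ever sees it. A minimal sketch of that wiring (with echo standing in for the real spark-sql binary, so it runs anywhere):

```shell
#!/bin/sh
today2=`date +%Y-%m-%d`
nowtime=`date +%H:%M:%S`
# Double quotes make the shell substitute ${today2}/${nowtime} into the
# SQL text; single quotes would pass the literal placeholders through.
sql="SELECT '${today2}' analysis_date, '${nowtime}' analysis_time"
# Real job: /opt/modules/spark/bin/spark-sql --master ... -e "$sql"
echo "$sql"
```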
2. Daily ranking of stock clicks, with the day-over-day change
INSERT OVERWRITE TABLE st.stock_realtime_analysis PARTITION (DTYPE='01')
SELECT t1.stockId AS stockId,
       t1.url AS url,
       t1.clickcnt AS clickcnt,
       0,
       ROUND((t1.clickcnt /
              (CASE WHEN t2.clickcntyesday IS NULL THEN 0
                    ELSE t2.clickcntyesday END) - 1) * 100, 2) AS LPcnt,
       '01' AS type,
       '${today2}' analysis_date,
       '${nowtime}' analysis_time
  FROM (SELECT stock_code stockId,
               concat('http://stockdata.stock.hexun.com/', stock_code, '.shtml') url,
               COUNT(1) clickcnt
          FROM dms.tracklog_5min
         WHERE stock_type = 'STOCK'
           AND DAY = '${today}'
         GROUP BY stock_code
         ORDER BY clickcnt DESC
         LIMIT 20) t1
  LEFT JOIN (SELECT stock_code stockId,
                    COUNT(1) clickcntyesday
               FROM dms.tracklog_5min a
              WHERE stock_type = 'STOCK'
                AND SUBSTR(DATETIME, 1, 10) = '${yesterday2}'
                AND SUBSTR(DATETIME, 12, 5) < '${nowtime}'
                AND DAY = '${yesterday}'
              GROUP BY stock_code) t2
    ON t1.stockId = t2.stockId;
When querying, be careful to constrain DAY, i.e. the partition range, so Hive can prune partitions instead of scanning the whole table.
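The LPcnt column is the percentage change, ROUND((today / yesterday - 1) * 100, 2). Note that when a stock has no clicks yesterday the CASE branch substitutes 0, and in Hive division by zero evaluates to NULL rather than raising an error, so newly appearing stocks get a NULL LPcnt. The arithmetic itself can be checked outside Hive (a sketch with hypothetical counts, not real data):

```shell
#!/bin/sh
# Hypothetical counts: 150 clicks so far today vs 100 by this time yesterday.
clickcnt=150
clickcntyesday=100
# Same arithmetic as the LPcnt column: ROUND((cur / prev - 1) * 100, 2)
lpcnt=`awk -v c="$clickcnt" -v y="$clickcntyesday" \
    'BEGIN { printf "%.2f", (c / y - 1) * 100 }'`
echo "$lpcnt"   # (150/100 - 1) * 100 = 50.00
```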
3. Hourly ranking of stock clicks, with the hour-over-hour change
INSERT OVERWRITE TABLE st.stock_realtime_analysis PARTITION (DTYPE='02')
SELECT t1.stockId stockId,
       t1.url url,
       t1.clickcnt clickcnt,
       0,
       ROUND((t1.clickcnt /
              (CASE WHEN t2.clickcnt IS NULL THEN 0
                    ELSE t2.clickcnt END) - 1) * 100, 2) LPcnt,
       '02' type,
       '${today2}' analysis_date,
       '${nowtime}' analysis_time
  FROM (SELECT stock_code stockId,
               concat('http://stockdata.stock.hexun.com/', stock_code, '.shtml') url,
               COUNT(*) clickcnt
          FROM dms.tracklog_5min
         WHERE DAY = '${today}'
           AND stock_type = 'STOCK'
           AND DATETIME >= '${onehourage}'
         GROUP BY stock_code
         ORDER BY clickcnt DESC
         LIMIT 20) t1
  LEFT JOIN (SELECT stock_code stockId,
                    COUNT(*) clickcnt
               FROM dms.tracklog_5min
              WHERE DAY = '${today}'
                AND stock_type = 'STOCK'
                AND DATETIME <= '${onehourage}'
                AND DATETIME >= '${twohourage}'
              GROUP BY stock_code) t2
    ON t1.stockId = t2.stockId
 ORDER BY clickcnt DESC
 LIMIT 20;
We are running Hive 1.1 and Spark 1.4.1; upgrading to newer versions may resolve this issue.
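The two subqueries slice the log on DATETIME string comparisons: t1 covers the most recent hour and t2 the hour before it. The same windowing can be sketched over a few hypothetical rows with awk (made-up timestamps and stock codes, in the same format the tracklog uses):

```shell
#!/bin/sh
# Hypothetical window boundaries, same format as the real variables.
onehourage='2016-01-15 10:00:00'
twohourage='2016-01-15 09:00:00'
# Three hypothetical rows: datetime<TAB>stock_code
printf '2016-01-15 08:30:00\t600000\n2016-01-15 09:30:00\t600000\n2016-01-15 10:30:00\t600036\n' \
    > /tmp/tracklog_sample
# t2's window: twohourage <= DATETIME <= onehourage (string comparison).
t2rows=`awk -F'\t' -v lo="$twohourage" -v hi="$onehourage" \
    '$1 >= lo && $1 <= hi' /tmp/tracklog_sample | wc -l`
echo "$t2rows"   # only the 09:30 row falls in the previous-hour window
```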
4. Export to the relational database with sqoop
sqoop export \
  --connect jdbc:mysql://10.130.3.211:3306/charts \
  --username dbcharts \
  --password Abcd1234 \
  --table stock_realtime_analysis \
  --fields-terminated-by '\001' \
  --columns "stockid,url,clickcnt,splycnt,lpcnt,type" \
  --export-dir /dw/st/stock_realtime_analysis/dtype=01;

sqoop export \
  --connect jdbc:mysql://10.130.3.211:3306/charts \
  --username dbcharts \
  --password Abcd1234 \
  --table stock_realtime_analysis \
  --fields-terminated-by '\001' \
  --columns "stockid,url,clickcnt,splycnt,lpcnt,type" \
  --export-dir /dw/st/stock_realtime_analysis/dtype=02;
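The --fields-terminated-by '\001' flag matches Hive's default field delimiter (Ctrl-A, byte 0x01), so sqoop can split the files that the INSERT OVERWRITE wrote without any re-encoding. A quick sketch of what such a row looks like on disk (hypothetical values):

```shell
#!/bin/sh
# Hive's default delimiter is \001 (Ctrl-A); build one sample row and
# split it the way sqoop's --fields-terminated-by '\001' does.
printf '600000\001http://stockdata.stock.hexun.com/600000.shtml\001150\0010\00150.00\00101\n' \
    > /tmp/stock_row
fields=`awk -F'\001' '{ print NF }' /tmp/stock_row`
echo "$fields"   # six \001-separated columns, matching the --columns list
```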
5. Summary:
After optimization, the daily job finishes in about one minute and the hourly job in under half a minute; both previously took more than ten minutes.