A Spark SQL optimization case: a real-time stock-click ranking. The requirements are a daily ranked count of stock clicks, an hourly ranked count, and the day-over-day and hour-over-hour change rates.
1. The date and time values are computed in the Linux shell. They could also be derived in SQL or through a UDF, but evaluating them once in the shell is clearly faster.
#!/bin/sh
# Compute every date/time value once in the shell; the SQL below only
# substitutes these variables.
source /etc/profile
yesterday=$(date --date='1 days ago' +%Y%m%d)      # yyyymmdd, matches the day partition
yesterday2=$(date --date='1 days ago' +%Y-%m-%d)   # yyyy-mm-dd, matches the datetime column
today=$(date +%Y%m%d)
today2=$(date +%Y-%m-%d)
nowtime=$(date +%H:%M:%S)
onehourage=$(date --date='1 hours ago' '+%Y-%m-%d %H:%M:%S')   # timestamp one hour ago
twohourage=$(date --date='2 hours ago' '+%Y-%m-%d %H:%M:%S')   # timestamp two hours ago
This yields yesterday's and today's dates, the current time, and the points in time one and two hours ago.
The SQL is then submitted through the spark-sql shell command:
/opt/modules/spark/bin/spark-sql --master spark://10.130.2.20:7077 --executor-memory 10g --total-executor-cores 120 --conf spark.ui.port=54689 --driver-memory 5g -e "..."
The statement from section 2 or 3 below, with the shell variables already expanded, is passed as the quoted argument to -e.
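Putting the pieces together, here is a minimal driver-script sketch. The SQL variable and the trivial count query are stand-ins of mine; in the real job the full INSERT statement from section 2 or 3 goes inside the heredoc.

#!/bin/sh
source /etc/profile
today=$(date +%Y%m%d)

# Unquoted heredoc, so ${today} is expanded by the shell before spark-sql sees it.
SQL=$(cat <<EOF
SELECT COUNT(1) FROM dms.tracklog_5min WHERE day = '${today}';
EOF
)

/opt/modules/spark/bin/spark-sql --master spark://10.130.2.20:7077 \
  --executor-memory 10g --total-executor-cores 120 \
  --conf spark.ui.port=54689 --driver-memory 5g \
  -e "$SQL"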
2. Daily ranked stock-click statistics and the day-over-day change
INSERT OVERWRITE TABLE st.stock_realtime_analysis PARTITION (DTYPE = '01')
SELECT
  t1.stockId   AS stockId,
  t1.url       AS url,
  t1.clickcnt  AS clickcnt,
  0            AS splycnt,  -- placeholder for the splycnt column exported in section 4
  -- When there was no traffic yesterday the divisor is 0 and the division
  -- returns NULL in Hive/Spark SQL, so LPcnt ends up NULL for newly appearing stocks.
  ROUND((t1.clickcnt / (CASE WHEN t2.clickcntyesday IS NULL THEN 0 ELSE t2.clickcntyesday END) - 1) * 100, 2) AS LPcnt,
  '01'         AS type,
  '${today2}'  AS analysis_date,
  '${nowtime}' AS analysis_time
FROM
  ( -- today's top 20 stocks by click count
    SELECT
      stock_code stockId,
      concat('http://stockdata.stock.hexun.com/', stock_code, '.shtml') url,
      COUNT(1) clickcnt
    FROM dms.tracklog_5min
    WHERE stock_type = 'STOCK'
      AND day = '${today}'
    GROUP BY stock_code
    ORDER BY clickcnt DESC
    LIMIT 20
  ) t1
LEFT JOIN
  ( -- yesterday's clicks up to the same time of day
    SELECT
      stock_code stockId,
      COUNT(1) clickcntyesday
    FROM dms.tracklog_5min
    WHERE stock_type = 'STOCK'
      AND SUBSTR(datetime, 1, 10) = '${yesterday2}'
      AND SUBSTR(datetime, 12, 5) < '${nowtime}'
      AND day = '${yesterday}'
    GROUP BY stock_code
  ) t2
ON t1.stockId = t2.stockId;
When querying, take care to pin down the range of day, i.e. the partition, so that only the partitions you need are scanned.
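A quick way to confirm that the day filter actually prunes the scan to a single partition, rather than reading the whole table, is to inspect the query plan. A sketch (the exact EXPLAIN output differs between Hive and Spark versions):

/opt/modules/spark/bin/spark-sql --master spark://10.130.2.20:7077 -e "
EXPLAIN EXTENDED
SELECT COUNT(1)
FROM dms.tracklog_5min
WHERE stock_type = 'STOCK'
  AND day = '${today}';"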
3. Hourly ranked stock-click statistics and the hour-over-hour change
INSERT OVERWRITE TABLE st.stock_realtime_analysis PARTITION (DTYPE = '02')
SELECT
  t1.stockId   stockId,
  t1.url       url,
  t1.clickcnt  clickcnt,
  0            splycnt,  -- placeholder for the splycnt column exported in section 4
  ROUND((t1.clickcnt / (CASE WHEN t2.clickcnt IS NULL THEN 0 ELSE t2.clickcnt END) - 1) * 100, 2) LPcnt,
  '02'         type,
  '${today2}'  analysis_date,
  '${nowtime}' analysis_time
FROM
  ( -- top 20 stocks over the last hour
    SELECT
      stock_code stockId,
      concat('http://stockdata.stock.hexun.com/', stock_code, '.shtml') url,
      COUNT(*) clickcnt
    FROM dms.tracklog_5min
    WHERE day = '${today}'
      AND stock_type = 'STOCK'
      AND datetime >= '${onehourage}'
    GROUP BY stock_code
    ORDER BY clickcnt DESC
    LIMIT 20
  ) t1
LEFT JOIN
  ( -- the hour before that, as the comparison baseline
    -- Note: both subqueries filter day = '${today}', so windows that
    -- straddle midnight lose the part that falls in yesterday's partition.
    SELECT
      stock_code stockId,
      COUNT(*) clickcnt
    FROM dms.tracklog_5min
    WHERE day = '${today}'
      AND stock_type = 'STOCK'
      AND datetime <= '${onehourage}'
      AND datetime >= '${twohourage}'
    GROUP BY stock_code
  ) t2
ON t1.stockId = t2.stockId
ORDER BY clickcnt DESC
LIMIT 20;
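As noted in the comment above, both subqueries restrict day = '${today}', so runs in the first two hours of the day compare against a truncated window. A hedged fix, using the same date arithmetic as in section 1 (the *_day variable names are mine), is to derive the partition for each window boundary:

# Derive the day partition that each window boundary falls in
onehourago_day=$(date --date='1 hours ago' +%Y%m%d)
twohourago_day=$(date --date='2 hours ago' +%Y%m%d)
# ...and filter in the SQL with, e.g.:
#   day IN ('${twohourago_day}', '${onehourago_day}', '${today}')
# so a window straddling midnight still hits yesterday's partition.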
A version note: we are running Hive 1.1 and Spark 1.4.1; this issue may be resolved after upgrading.
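For reference, the two jobs map naturally onto cron; the script paths and log locations below are hypothetical placeholders, not from the original deployment:

# m h dom mon dow  command
10 0 * * * /opt/scripts/stock_click_daily.sh  >> /tmp/stock_click_daily.log  2>&1
5  * * * * /opt/scripts/stock_click_hourly.sh >> /tmp/stock_click_hourly.log 2>&1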
4. Exporting to the relational database with Sqoop
sqoop export --connect jdbc:mysql://10.130.3.211:3306/charts \
  --username dbcharts --password Abcd1234 \
  --table stock_realtime_analysis \
  --fields-terminated-by '\001' \
  --columns "stockid,url,clickcnt,splycnt,lpcnt,type" \
  --export-dir /dw/st/stock_realtime_analysis/dtype=01

sqoop export --connect jdbc:mysql://10.130.3.211:3306/charts \
  --username dbcharts --password Abcd1234 \
  --table stock_realtime_analysis \
  --fields-terminated-by '\001' \
  --columns "stockid,url,clickcnt,splycnt,lpcnt,type" \
  --export-dir /dw/st/stock_realtime_analysis/dtype=02
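Since the export feeds the charts front end directly, it is worth failing loudly when Sqoop errors out, and keeping the password off the command line. A minimal sketch; the --password-file path is an assumption about the environment, not part of the original command:

# Hypothetical hardened wrapper around the dtype=01 export
sqoop export \
  --connect jdbc:mysql://10.130.3.211:3306/charts \
  --username dbcharts \
  --password-file /user/hive/.sqoop_pwd \
  --table stock_realtime_analysis \
  --fields-terminated-by '\001' \
  --columns "stockid,url,clickcnt,splycnt,lpcnt,type" \
  --export-dir /dw/st/stock_realtime_analysis/dtype=01
if [ $? -ne 0 ]; then
  echo "sqoop export for dtype=01 failed" >&2
  exit 1
fi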
5. Summary:
After the optimization the daily job finishes in about one minute and the hourly job in under half a minute; previously both took more than ten minutes.