Spark-SQL优化案例---股票点击实时排行

Spark-SQL优化案例---股票点击实时排行,需求每天股票点击统计排列统计、每小时股票统计排列统计,及天与小时的环比变化。


1、时间的取值采用Linux中的shell来取值,虽然sql中也能取到时间或者UDF来实现,但是shell来取明显更快。

#!/bin/sh
# upload logs to hdfs

source /etc/profile

yesterday=`date --date='1 days ago' +%Y%m%d`
yesterday2=`date --date='1 days ago' +%Y-%m-%d`
today=`date --date='0 days ago' +%Y%m%d`
today2=`date --date='0 days ago' +%Y-%m-%d`
nowtime=`date --date='0 days ago' +%H:%M:%S`
onehourage=`date --date='1 hours ago' "+%Y-%m-%d %H:%M:%S"`
twohourage=`date --date='2 hours ago' "+%Y-%m-%d %H:%M:%S"`

取出昨天与今天的日期,及现在的时间与过去1小时及2小时的时间点


调用spark-sql的shell命令

/opt/modules/spark/bin/spark-sql --master spark://10.130.2.20:7077 --executor-memory 10g --total-executor-cores 120 --conf spark.ui.port=54689 --driver-memory 5g  -e

这里类似 hive -e "select  day,count(*) from test group by day;" 操作


2、每天股票点击统计排列统计及天的环比变化

INSERT
    overwrite TABLE st.stock_realtime_analysis PARTITION
    (
        DTYPE='01'
    )
SELECT
    t1.stockId  AS stockId,
    t1.url      AS url,
    t1.clickcnt AS clickcnt,
    0,
    ROUND((t1.clickcnt / (
        CASE
            WHEN t2.clickcntyesday IS NULL
            THEN 0
            ELSE t2.clickcntyesday
        END) - 1) * 100, 2) AS LPcnt,
    '01'                    AS type,
    '${today2}'                analysis_date,
    '${nowtime}'               analysis_time
FROM
    (
        SELECT
            stock_code                                                       stockId,
            concat('http://stockdata.stock.hexun.com/', stock_code,'.shtml') url,
            COUNT(1)                                                         clickcnt
        FROM
            dms.tracklog_5min
        WHERE
            stock_type = 'STOCK'
        AND DAY ='${today}'
        GROUP BY
            stock_code
        ORDER BY
            clickcnt DESC limit 20) t1
LEFT JOIN
    (
        SELECT
            stock_code stockId,
            COUNT(1)   clickcntyesday
        FROM
            dms.tracklog_5min a
        WHERE
            stock_type = 'STOCK'
        AND SUBSTR(DATETIME, 1, 10) = '${yesterday2}'
        AND SUBSTR(DATETIME, 12, 5) < '${nowtime}'
        AND DAY = '${yesterday}'
        GROUP BY
            stock_code) t2
ON
    t1.stockId = t2.stockId;

查询的时候,注意确定好 day 即partion的范围。



3、每小时股票点击统计排列统计及小时的环比变化

INSERT
    overwrite TABLE st.stock_realtime_analysis PARTITION
    (
        DTYPE='02'
    )
SELECT
    t1.stockId  stockId,
    t1.url      url,
    t1.clickcnt clickcnt,
    0,
    ROUND( ( t1.clickcnt / (
        CASE
            WHEN t2.clickcnt IS NULL
            THEN 0
            ELSE t2.clickcnt
        END ) - 1 ) * 100, 2 ) LPcnt,
    '02'                       type,
    '${today2}'                analysis_date,
    '${nowtime}'               analysis_time
FROM
    (
        SELECT
            stock_code                                                        stockId,
            concat('http://stockdata.stock.hexun.com/', stock_code, '.shtml') url,
            COUNT(*)                                                          clickcnt
        FROM
            dms.tracklog_5min
        WHERE
            DAY = '${today}'
        AND stock_type = 'STOCK'
        AND DATETIME >= '${onehourage}'
        GROUP BY
            stock_code
        ORDER BY
            clickcnt DESC limit 20 ) t1
LEFT JOIN
    (
        SELECT
            stock_code stockId,
            COUNT(*)   clickcnt
        FROM
            dms.tracklog_5min
        WHERE
            DAY = '${today}'
        AND stock_type = 'STOCK'
        AND DATETIME <= '${onehourage}'
        AND DATETIME >= '${twohourage}'
        GROUP BY
            stock_code ) t2
ON
    t1.stockId = t2.stockId
ORDER BY
    clickcnt DESC limit 20;

这里有个坑,子查询里已有order by 了,在shell中查询是正常排序的,但是 insert到 结果表中的时候,顺序就乱了,必须要在外层再加一个 order by 。

我们用的是 hive 1.1 , spark1.4.1 可能在升级版本后,会解决这个问题


4、sqoop导入到关系库中

sqoop export  --connect jdbc:mysql://10.130.3.211:3306/charts   --username dbcharts  --password Abcd1234  --table stock_realtime_analysis  --fields-terminated-by '\001' --columns "stockid,url,clickcnt,splycnt,lpcnt,type" --export-dir /dw/st/stock_realtime_analysis/dtype=01;
sqoop export  --connect jdbc:mysql://10.130.3.211:3306/charts   --username dbcharts  --password Abcd1234  --table stock_realtime_analysis  --fields-terminated-by '\001' --columns "stockid,url,clickcnt,splycnt,lpcnt,type" --export-dir /dw/st/stock_realtime_analysis/dtype=02;


5、总结:

优化后每天的任务执行时间约1分钟,每小时的任务执行在半分钟内。原来都是10分钟以上的。


你可能感兴趣的:(Spark,spark-sql,排行,优化)