主机 | cpu | 内存 | 磁盘 |
---|---|---|---|
p1(master) | 16 core | 64GB | 100GB |
P2 | 8 core | 32GB | 100GB |
P2 | 8 core | 32GB | 100GB |
之前我基于TPC-DS 做了很多轮测试,数据量也分布了几个维度,从结果行看性能提升并不明显,只是单表查询上略有提升,多表基本无变化。结果懒得贴上去了,具体可以看我提的issue。
TPC-DS 典型的单表查询
select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk
select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk order by cnt desc,ss_sold_Date_sk limit 10
select ss_sold_Date_sk,ss_wholesale_cost,avg(ss_item_sk) as cnt,count(distinct(ss_sales_price)) as avg1 from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk,ss_wholesale_cost order by cnt desc,ss_sold_Date_sk limit 10
TPC-DS 典型的多表查询
select channel, col_name, d_year, d_qoy, i_category, COUNT(*) sales_cnt, SUM(ext_sales_price) sales_amt FROM (
SELECT 'store' as channel, 'ss_promo_sk' col_name, d_year, d_qoy, i_category, ss_ext_sales_price ext_sales_price
FROM store_sales, item, date_dim
WHERE ss_promo_sk IS NULL
AND ss_sold_date_sk=d_date_sk
AND ss_item_sk=i_item_sk
UNION ALL
SELECT 'web' as channel, 'ws_ship_customer_sk' col_name, d_year, d_qoy, i_category, ws_ext_sales_price ext_sales_price
FROM web_sales, item, date_dim
WHERE ws_ship_customer_sk IS NULL
AND ws_sold_date_sk=d_date_sk
AND ws_item_sk=i_item_sk
UNION ALL
SELECT 'catalog' as channel, 'cs_bill_hdemo_sk' col_name, d_year, d_qoy, i_category, cs_ext_sales_price ext_sales_price
FROM catalog_sales, item, date_dim
WHERE cs_bill_hdemo_sk IS NULL
AND cs_sold_date_sk=d_date_sk
AND cs_item_sk=i_item_sk) foo
GROUP BY channel, col_name, d_year, d_qoy, i_category
ORDER BY channel, col_name, d_year, d_qoy, i_category
limit 10
针对tpc-ds的sql,单表查询聚合函数多个联合使用,多表关联太多且又使用复杂的聚合函数。测试下来总是囫囵吞枣,很难侦查到有无alluxio时性能的变化,我觉得需要更细粒度的单表测试。
store_sale
,22个字段,1.4亿+条亿数据CREATE TABLE `store_sales`(
`ss_sold_time_sk` bigint,
`ss_item_sk` bigint,
`ss_customer_sk` bigint,
`ss_cdemo_sk` bigint,
`ss_hdemo_sk` bigint,
`ss_addr_sk` bigint,
`ss_store_sk` bigint,
`ss_promo_sk` bigint,
`ss_ticket_number` bigint,
`ss_quantity` int,
`ss_wholesale_cost` decimal(7,2),
`ss_list_price` decimal(7,2),
`ss_sales_price` decimal(7,2),
`ss_ext_discount_amt` decimal(7,2),
`ss_ext_sales_price` decimal(7,2),
`ss_ext_wholesale_cost` decimal(7,2),
`ss_ext_list_price` decimal(7,2),
`ss_ext_tax` decimal(7,2),
`ss_coupon_amt` decimal(7,2),
`ss_net_paid` decimal(7,2),
`ss_net_paid_inc_tax` decimal(7,2),
`ss_net_profit` decimal(7,2))
PARTITIONED BY (
`ss_sold_date_sk` bigint)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
'alluxio://p1:19998/user/hive/warehouse/tpcds_bin_partitioned_orc_50.db/store_sales'
TBLPROPERTIES (
'transient_lastDdlTime'='1574907136')
每个查询三次,最后一列为平均时间,单位毫秒。
20381 19798 19916 20031
19213 19564 19855 19544
20350 20671 19432 20151
6479 6711 6816 6668
6669 7759 6179 6869
6465 7507 7567 7179
19920 20434 19285 19879
19583 19369 19004 19318
19988 20062 20294 20114
6194 5739 5638 5857
6291 6482 5496 6089
6275 5651 5440 5788
21166 18758 20217 20047
20514 19281 20184 19993
20574 19703 19994 20090
6203 6268 5857 6109
6484 6443 5758 6228
5432 6173 5811 5805
22183 20706 21496 21461
21398 20386 20828 20870
20357 20878 21721 20985
14981 15045 14967 14997
14141 14594 14949 14561
14379 15013 14243 14545
21916 20552 19876 22114
19675 19231 20083 19663
20904 19721 20155 20260
6555 5928 6317 6266
6514 6161 6255 6310
5856 6365 6586 6269
单表常规聚合, 有alluxio性能大约2-3倍的提升,presto对于join的查询不擅长,join做分析性能没什么提升,具体也可以看我提交的issue。另外对于多个聚合函数的测试可能也会干扰测试结果,建议大家单独的测试,alluxio+presto 做adhoc对sql是有要求的,不是有缓存了,性能就提升了,需要仔细甄别。