presto+alluxio查询性能分析

Alluxio+Presto 查询

1、 总述

主机 cpu 内存 磁盘
p1(master) 16 core 64GB 100GB
P2 8 core 32GB 100GB
P2 8 core 32GB 100GB

之前我基于TPC-DS 做了很多轮测试,数据量也分布了几个维度,从结果行看性能提升并不明显,只是单表查询上略有提升,多表基本无变化。结果懒得贴上去了,具体可以看我提的issue。

TPC-DS 典型的单表查询

  • group by 、count
select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk
  • group by、 order by、count
select ss_sold_Date_sk,count(*) as cnt from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk order by cnt desc,ss_sold_Date_sk limit 10
  • group by 、order by、count 、avg
select ss_sold_Date_sk,ss_wholesale_cost,avg(ss_item_sk) as cnt,count(distinct(ss_sales_price)) as avg1 from hive.tpcds_bin_partitioned_orc_50.store_sales group by ss_sold_Date_sk,ss_wholesale_cost  order by cnt desc,ss_sold_Date_sk limit 10

TPC-DS 典型的多表查询

select channel, col_name, d_year, d_qoy, i_category, COUNT(*) sales_cnt, SUM(ext_sales_price) sales_amt FROM (
        SELECT 'store' as channel, 'ss_promo_sk' col_name, d_year, d_qoy, i_category, ss_ext_sales_price ext_sales_price
         FROM store_sales, item, date_dim
         WHERE ss_promo_sk IS NULL
           AND ss_sold_date_sk=d_date_sk
           AND ss_item_sk=i_item_sk
        UNION ALL
        SELECT 'web' as channel, 'ws_ship_customer_sk' col_name, d_year, d_qoy, i_category, ws_ext_sales_price ext_sales_price
         FROM web_sales, item, date_dim
         WHERE ws_ship_customer_sk IS NULL
           AND ws_sold_date_sk=d_date_sk
           AND ws_item_sk=i_item_sk
        UNION ALL
        SELECT 'catalog' as channel, 'cs_bill_hdemo_sk' col_name, d_year, d_qoy, i_category, cs_ext_sales_price ext_sales_price
         FROM catalog_sales, item, date_dim
         WHERE cs_bill_hdemo_sk IS NULL
           AND cs_sold_date_sk=d_date_sk
           AND cs_item_sk=i_item_sk) foo
GROUP BY channel, col_name, d_year, d_qoy, i_category
ORDER BY channel, col_name, d_year, d_qoy, i_category
limit 10

针对tpc-ds的sql,单表查询聚合函数多个联合使用,多表关联太多且又使用复杂的聚合函数。测试下来总是囫囵吞枣,很难侦查到有无alluxio时性能的变化,我觉得需要更细粒度的单表测试。

2、单表单聚合函数测试

表名:store_sale,22个字段,1.4亿+条亿数据

CREATE TABLE `store_sales`(
  `ss_sold_time_sk` bigint,
  `ss_item_sk` bigint,
  `ss_customer_sk` bigint,
  `ss_cdemo_sk` bigint,
  `ss_hdemo_sk` bigint,
  `ss_addr_sk` bigint,
  `ss_store_sk` bigint,
  `ss_promo_sk` bigint,
  `ss_ticket_number` bigint,
  `ss_quantity` int,
  `ss_wholesale_cost` decimal(7,2),
  `ss_list_price` decimal(7,2),
  `ss_sales_price` decimal(7,2),
  `ss_ext_discount_amt` decimal(7,2),
  `ss_ext_sales_price` decimal(7,2),
  `ss_ext_wholesale_cost` decimal(7,2),
  `ss_ext_list_price` decimal(7,2),
  `ss_ext_tax` decimal(7,2),
  `ss_coupon_amt` decimal(7,2),
  `ss_net_paid` decimal(7,2),
  `ss_net_paid_inc_tax` decimal(7,2),
  `ss_net_profit` decimal(7,2))
PARTITIONED BY (
  `ss_sold_date_sk` bigint)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'alluxio://p1:19998/user/hive/warehouse/tpcds_bin_partitioned_orc_50.db/store_sales'
TBLPROPERTIES (
  'transient_lastDdlTime'='1574907136')

聚合查询

每个查询三次,最后一列为平均时间,单位毫秒。

  • presto+hdfs (sum)
20381	19798	19916	20031
19213	19564	19855	19544
20350	20671	19432	20151
  • presto+hdfs+alluxio(sum)
 6479	6711	6816	6668
 6669	7759	6179	6869
 6465	7507	7567	7179
  • presto+hdfs (count)
19920	20434	19285	19879
19583	19369	19004	19318
19988	20062	20294	20114
  • presto+hdfs+alluxio(count)
6194	5739	5638	5857
6291	6482	5496	6089
6275	5651	5440	5788
  • presto+hdfs (count(xxx))
21166	18758	20217	20047
20514	19281	20184	19993
20574	19703	19994	20090
  • presto+hdfs+alluxio (count(xxx))
6203	6268	5857	6109
6484	6443	5758	6228
5432	6173	5811	5805
  • presto+hdfs (order by)
22183	20706	21496	21461
21398	20386	20828	20870
20357	20878	21721	20985
  • presto+hdfs+alluxio (order by )
14981	15045	14967	14997
14141	14594	14949	14561	
14379	15013	14243	14545
  • presto+hdfs (avg)
21916	20552	19876	22114
19675	19231	20083	19663
20904	19721	20155	20260
  • presto+hdfs+alluxio (avg)
6555	5928	6317	6266
6514	6161	6255	6310
5856	6365	6586	6269

3、总结

单表常规聚合, 有alluxio性能大约2-3倍的提升,presto对于join的查询不擅长,join做分析性能没什么提升,具体也可以看我提交的issue。另外对于多个聚合函数的测试可能也会干扰测试结果,建议大家单独的测试,alluxio+presto 做adhoc对sql是有要求的,不是有缓存了,性能就提升了,需要仔细甄别。

你可能感兴趣的:(大数据)