Hive on Spark Series 1: Configuring CDH 5.5 to Support Hive on Spark



The following content is translated from the "Configuring Hive on Spark" section of the Cloudera documentation: https://www.cloudera.com/documentation/enterprise/5-5-x/topics/admin_hos_config.html

As of this writing, CDH 5.7 and later already fully support Hive on Spark; for the specific configuration, refer to the official documentation.

We are currently running CDH 5.5.1, so I wanted to see how Hive on Spark works; if it proves usable, we may upgrade CDH later. The rest of this article uses CDH 5.5.

Important:

Hive on Spark was introduced in CDH 5.4, but it is not recommended for use in CDH 5.5.x. To try this feature, use a test environment until Cloudera resolves the current issues and limitations and the feature is ready for production.

Note:

Using Beeline against HiveServer2 is recommended, although the Hive CLI also works.

Contents:

Installation Considerations

Enabling Hive on Spark

Configuration Properties

Configuring Hive

Configuring Executor Memory Size

Installation Considerations:

For Hive to run on Spark, you must deploy the Spark gateway role on the host where HiveServer2 runs; otherwise, Hive on Spark cannot read Spark's configuration and cannot submit Spark jobs.

During use, you need to run the following command manually so that subsequent queries in the session use the Spark engine (see the example below):

set hive.execution.engine=spark;
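
For illustration, a minimal Beeline session might look like the following. The JDBC URL, username, and table name here are placeholders for your environment, not values from this article:

beeline -u jdbc:hive2://hiveserver2-host:10000 -n hive
set hive.execution.engine=spark;
select count(*) from sample_table;   -- this query is now submitted as a Spark job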


Enabling Hive on Spark

Hive on Spark is disabled by default and needs to be enabled in Cloudera Manager.

1. Log in to the Cloudera Manager console and go to the Hive service.

2. Click the Configuration tab and search for the Enable Hive on Spark property.

3. Check Enable Hive on Spark (Unsupported) and save the change.

4. Locate the Spark On YARN Service property, select the Spark service, and save.

5. After saving, redeploy the client configuration for the change to take effect.

Configuration Properties

Note: The official documentation lists the following two properties. I could not find the first one in Cloudera Manager; both can simply be left at their default values.

Property: hive.stats.collect.rawdatasize
Hive on Spark uses statistics to determine the threshold for converting a common join to a map join. There are two types of statistics about the size of data:
  • totalSize: the approximate size of the data on disk
  • rawDataSize: the approximate size of the data in memory
When both metrics are available, Hive chooses rawDataSize.
Default: true

Property: hive.auto.convert.join.noconditionaltask.size
The threshold for the sum of all the small table sizes (by default, rawDataSize) for map join conversion. You can increase the value if you want better performance by converting more common joins to map joins. However, if you set this value too high, tasks may fail because the data from the small tables uses too much memory.
Default: 20 MB
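
For example, if your small dimension tables are around 40 MB of rawDataSize, you could raise the threshold to 50 MB for the current session as follows (the property value is given in bytes; the number is purely illustrative):

set hive.auto.convert.join.noconditionaltask.size=52428800;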

Configuring Hive

To improve performance, Cloudera recommends setting the following additional Hive properties. In Cloudera Manager, add them to the HiveServer2 configuration:

hive.stats.fetch.column.stats=true

Whether to fetch column statistics from the metastore.

hive.optimize.index.filter=true

Automatically use an index if one is available; this is disabled (false) by default.
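
If you want to try these two settings in a single session before changing the HiveServer2 configuration, a minimal sketch in Beeline or the Hive CLI would be:

set hive.stats.fetch.column.stats=true;
set hive.optimize.index.filter=true;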

Configuring Executor Memory Size

Note: For the exact configuration of the following, refer to the official documentation.

For general Spark configuration recommendations, see Configuring Spark on YARN Applications.

Executor memory size can have a number of effects on Hive. Increasing executor memory increases the number of queries for which Hive can enable map join optimization. However, too much executor memory makes garbage collection take longer. Also, some experiments show that HDFS does not handle concurrent writers well, so it may face a race condition if there are too many executor cores.

Cloudera recommends that you set the value of spark.executor.cores to 5, 6, or 7, depending on what the host's core count is divisible by. For example, if yarn.nodemanager.resource.cpu-vcores is 19, set the value to 6; each host then runs three executors and only one core is left unused. All executors must have the same number of cores. With a value of 5, only three executors fit on each host and four cores are left unused; with a value of 7, only two executors fit and five cores are unused. If the number of cores is 20, set the value to 5 so that each host runs four executors and no cores are left unused.

Cloudera also recommends the following (a worked example follows this list):
  • Compute a memory size equal to yarn.nodemanager.resource.memory-mb * (spark.executor.cores / yarn.nodemanager.resource.cpu-vcores), and then split that between spark.executor.memory and spark.yarn.executor.memoryOverhead.
  • Set spark.yarn.executor.memoryOverhead to 15-20% of that total memory size.
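
As a worked example (the host sizes here are assumptions, not values from the article): with yarn.nodemanager.resource.cpu-vcores set to 19 and yarn.nodemanager.resource.memory-mb equivalent to 100 GB, choosing spark.executor.cores=6 gives three executors per host with one core spare, and a per-executor memory budget of 100 GB * (6 / 19) ≈ 31.5 GB. Taking roughly 17% of that as overhead yields spark.yarn.executor.memoryOverhead ≈ 5.4 GB and spark.executor.memory ≈ 26 GB.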


Summary:

With the configuration above, you can use Hive on Spark directly from the Hive CLI or through HiveServer2. Usage is no different from Hive on MapReduce; just run set hive.execution.engine=spark; before your queries, and Hive will run them on the Spark engine.

   


