Overview
Our Hive is the 1.2.1 release shipped by HortonWorks. This document records the problems we ran into while using it and how we solved them.
Problem
Under high concurrency, requests fail with the error: Timed out waiting for a free available connection.
Cause
Checking the hiveserver2 logs, we found that all of these errors end with the same Caused by:
MetaException(message:Unable to update transaction database java.sql.SQLException: Timed out waiting for a free available connection.
at com.jolbox.bonecp.DefaultConnectionStrategy.getConnectionInternal(DefaultConnectionStrategy.java:88)
at com.jolbox.bonecp.AbstractConnectionStrategy.getConnection(AbstractConnectionStrategy.java:90)
at com.jolbox.bonecp.BoneCP.getConnection(BoneCP.java:553)
at com.jolbox.bonecp.BoneCPDataSource.getConnection(BoneCPDataSource.java:131)
at org.apache.hadoop.hive.metastore.txn.TxnHandler.getDbConn(TxnHandler.java:1956)
at org.apache.hadoop.hive.metastore.txn.TxnHandler.enqueueLockWithRetry(TxnHandler.java:941)
at org.apache.hadoop.hive.metastore.txn.TxnHandler.lock(TxnHandler.java:882)
at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.lock(HiveMetaStore.java:5911)
at sun.reflect.GeneratedMethodAccessor38.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
at com.sun.proxy.$Proxy12.lock(Unknown Source)
at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.lock(HiveMetaStoreClient.java:1947)
at sun.reflect.GeneratedMethodAccessor37.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:178)
at com.sun.proxy.$Proxy13.lock(Unknown Source)
at org.apache.hadoop.hive.ql.lockmgr.DbLockManager.lock(DbLockManager.java:102)
at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocks(DbTxnManager.java:357)
at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocksWithHeartbeatDelay(DbTxnManager.java:373)
at org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.acquireLocks(DbTxnManager.java:182)
at org.apache.hadoop.hive.ql.Driver.acquireLocksAndOpenTxn(Driver.java:1079)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1281)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1158)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1153)
at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:197)
at org.apache.hive.service.cli.operation.SQLOperation.access$300(SQLOperation.java:76)
at org.apache.hive.service.cli.operation.SQLOperation$2$1.run(SQLOperation.java:253)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
at org.apache.hive.service.cli.operation.SQLOperation$2.run(SQLOperation.java:264)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
As the stack trace shows, the problem comes from acquiring locks at query time. When too many threads request locks at once, the BoneCP connection pool throws this exception. The BoneCP code that throws it is shown below.
The method lives in DefaultConnectionStrategy; the inline comments are ours.
@Override
protected Connection getConnectionInternal() throws SQLException {
    ConnectionHandle result = pollConnection();
    // we still didn't find an empty one, wait forever (or as per config) until our partition is free
    if (result == null) {
        int partition = (int) (Thread.currentThread().getId() % this.pool.partitionCount);
        ConnectionPartition connectionPartition = this.pool.partitions[partition];
        // Each partition stores its connection objects in its own LinkedBlockingQueue.
        // this.pool.nullOnConnectionTimeout below is a pool config option, false by default:
        // when no connection could be obtained, return null if it is true, otherwise throw.
        try {
            result = connectionPartition.getFreeConnections().poll(this.pool.connectionTimeoutInMs, TimeUnit.MILLISECONDS);
            if (result == null) {
                if (this.pool.nullOnConnectionTimeout) {
                    return null;
                }
                // 08001 = The application requester is unable to establish the connection.
                // This is where the "no free connection" exception is thrown.
                throw new SQLException("Timed out waiting for a free available connection.", "08001");
            }
        }
        catch (InterruptedException e) {
            if (this.pool.nullOnConnectionTimeout) {
                return null;
            }
            throw PoolUtil.generateSQLException(e.getMessage(), e);
        }
    }
    return result;
}
Solution
This locking feature is only enabled to support transactions, and transaction support in the current Hive version is of little practical use, so setting the concurrency option hive.support.concurrency to false resolves the problem.
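For reference, a minimal hive-site.xml sketch of the change (on HDP the same property is normally managed through Ambari):

<!-- disable Hive's lock manager; this is the hive.support.concurrency switch discussed above -->
<property>
  <name>hive.support.concurrency</name>
  <value>false</value>
</property>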
An unsupported-encoding error is reported when the SQL contains Chinese characters
Cause
Hive's CBO component optimizes our SQL for us, but it only supports ISO-8859-1, so Chinese characters in a query trigger the error.
Solution
Set hive.cbo.enable to false to disable CBO optimization.
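Again as a hive-site.xml sketch (the property can usually also be set per session with set hive.cbo.enable=false; if you only need to disable CBO for individual queries):

<!-- turn off the cost-based optimizer -->
<property>
  <name>hive.cbo.enable</name>
  <value>false</value>
</property>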
After running for a while, the metastore fails with OOM
Cause
We took a heap dump of the HiveMetastore and analyzed it with MAT: most of the objects were held by a ConcurrentHashMap inside the AggregateStatsCache class. As the name suggests, this class is a cache; a quick read of the source shows it caches partition statistics for tables. The cache is toggled by hive.metastore.aggregate.stats.cache.enabled, which defaults to true. The cache is also initialized with a set of default limits on how many objects it may hold, but those limits work out to roughly 10,000 cache nodes with up to 10,000 elements each, i.e. about 100 million cached elements in total, while our metastore was started with only 2 GB of heap. The class does have a method that starts a low-priority thread to evict entries, but by default that thread only kicks in once the cache reaches 90% of its maximum element count, so in our environment it never ran even though the heap was already exhausted.
Solution
Because our metadata is stored in a dedicated MySQL instance in our own data center, the MySQL host is well provisioned, and the internal network is 10 GbE, we simply disabled this feature.
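The corresponding hive-site.xml sketch (this property applies to the metastore service):

<!-- disable the aggregate stats cache in the metastore -->
<property>
  <name>hive.metastore.aggregate.stats.cache.enabled</name>
  <value>false</value>
</property>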
Follow-up findings
After disabling the feature above, the metastore's memory pressure eased, but after a while we still got alerts for heap usage reaching 75%. jstat -gcutil showed that the metastore had been running for a long time without a single Full GC, even though the old generation was above 80%. That prompted us to check the garbage collector: by default the metastore uses the Parallel Scavenge collector, which, when promoting objects to the old generation, compares the average size of past promotions with the space remaining in the old generation and only performs a Full GC when the old generation can no longer fit them. Switching the collector to G1 solved the problem for us.
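For reference, this is the kind of check we ran (replace <metastore-pid> with the metastore's actual process id):

# Print GC utilization every 5 seconds. E/O are eden/old-gen occupancy in
# percent; FGC is the cumulative Full GC count. In our case O stayed above
# 80 while FGC remained 0. <metastore-pid> is a placeholder.
jstat -gcutil <metastore-pid> 5000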
The file to modify is hive-env.sh:
export HADOOP_USER_CLASSPATH_FIRST=true #this prevents old metrics libs from mapreduce lib from bringing in old jar deps overriding HIVE_LIB
if [ "$SERVICE" = "cli" ]; then
if [ -z "$DEBUG" ]; then
export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:+UseNUMA -XX:+UseParallelGC -XX:-UseGCOverheadLimit"
else
export HADOOP_OPTS="$HADOOP_OPTS -XX:NewRatio=12 -XX:MaxHeapFreeRatio=40 -XX:MinHeapFreeRatio=15 -XX:-UseGCOverheadLimit"
fi
fi
# The heap size of the jvm started by hive shell script can be controlled via:
if [ "$SERVICE" = "metastore" ]; then
export HADOOP_HEAPSIZE={{hive_metastore_heapsize}} # Setting for HiveMetastore
else
export HADOOP_HEAPSIZE={{hive_heapsize}} # Setting for HiveServer2 and Client
fi
# Set JVM parameters -Xms4g -Xmx4g -XX:+UseG1GC when we start the metastore, to make sure it has sufficient memory and uses the G1 garbage collector
# Our change: when the service being started is the metastore, give it a 4g heap and switch to the G1 collector
if [ "$SERVICE" = "metastore" ]; then
  export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Xms4g -Xmx4g -XX:+UseG1GC"
else
  export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS -Xmx${HADOOP_HEAPSIZE}m"
fi
export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS{{heap_dump_opts}}"
# Larger heap size may be required when running queries over large number of files or partitions.
# By default hive shell scripts use a heap size of 256 (MB). Larger heap size would also be
# appropriate for hive server (hwi etc).
# Set HADOOP_HOME to point to a specific hadoop install directory
HADOOP_HOME=${HADOOP_HOME:-{{hadoop_home}}}
export HIVE_HOME=${HIVE_HOME:-{{hive_home_dir}}}
# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=${HIVE_CONF_DIR:-{{hive_config_dir}}}
# Folder containing extra libraries required for hive compilation/execution can be controlled by:
if [ "${HIVE_AUX_JARS_PATH}" != "" ]; then
if [ -f "${HIVE_AUX_JARS_PATH}" ]; then
export HIVE_AUX_JARS_PATH=${HIVE_AUX_JARS_PATH}
elif [ -d "/usr/hdp/current/hive-webhcat/share/hcatalog" ]; then
export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar
fi
elif [ -d "/usr/hdp/current/hive-webhcat/share/hcatalog" ]; then
export HIVE_AUX_JARS_PATH=/usr/hdp/current/hive-webhcat/share/hcatalog/hive-hcatalog-core.jar
fi
export METASTORE_PORT={{hive_metastore_port}}
{% if sqla_db_used or lib_dir_available %}
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:{{jdbc_libs_dir}}"
export JAVA_LIBRARY_PATH="$JAVA_LIBRARY_PATH:{{jdbc_libs_dir}}"
{% endif %}