Hive实现自增序列
在利用数据仓库进行数据处理时,通常有这样一个业务场景,为一个Hive表新增一列自增字段(比如事实表和维度表之间的"代理主键")。虽然Hive不像RDBMS如mysql一样本身提供自增主键的功能,但它本身可以通过函数来实现自增序列功能:利用row_number()窗口函数或者使用UDFRowSequence。
示例:table_src是我们经过业务需求处理得到的中间表数据,现在我们需要为table_src新增一列自增序列字段auto_increment_id,并将最终数据保存到table_dest中。
1. 利用row_number函数
场景1:table_dest中目前没有数据
insert into table table_dest
select row_number() over(order by 1) as auto_increment_id, table_src.* from table_src;场景2: table_dest中有数据,并且已经经过新增自增字段处理
insert into table table_dest
select (row_number() over(order by 1) + dest.max_id) auto_increment_id, src.* from table_src src cross join (select max(auto_increment_id) max_id from table_dest) dest;
2. 利用UDFRowSequence
首先Hive环境要有hive-contrib相关jar包,然后执行
create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';针对上述场景一,可通过以下语句实现:
insert into table table_dest
select row_sequence() auto_increment_id, table_src.* from table_src;场景2实现起来也很简单,这里不再赘述。
但是,需要注意二者的区别:
row_number函数是对整个数据集做处理,自增序列在当次排序中是连续的唯一的。
UDFRowSequence是按照任务排序,但是一个SQL可能并发执行的job不止一个,而每个job都会从1开始各自排序,所以不能保证序号全局唯一。可以考虑将UDFRowSequence扩展到一个第三方存储系统中,进行序号逻辑管理,来最终实现全局的连续自增唯一序号。
Hive元数据问题
以下基于hive-2.X版本说明。
Hive正常启动,但是执行show databases时报以下错误:
SemanticException org.apache.hadoop.hive.ql.metadata.HiveException:
java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient首先从异常信息分析可知,是元数据问题导致的异常。
Hive默认将元数据存储在derby,但因为用derby作为元数据存储服务弊端太多,我们通常会选择将Hive的元数据存在mysql中。所以我们要确保hive-site.xml中mysql的信息要配置正确,Hive要有mysql的相关连接驱动jar包,并且有mysql的权限。
首先在hive-site.xml中配置mysql信息:
javax.jdo.option.ConnectionURL
jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true
JDBC connect string for a JDBC metastore
javax.jdo.option.ConnectionDriverName
com.mysql.jdbc.Driver
Driver class name for a JDBC metastore
javax.jdo.option.ConnectionUserName
root
username to use against metastore database
javax.jdo.option.ConnectionPassword
root
password to use against metastore database
执行完上述操作后,如果配置了Hive metastore方式,还需要启动该服务:
nohup hive --service metastore &如果想启动hiveserver2,则执行:
nohup hive --service hiveserver2 &但是,此时可能由于你设置的mysql元数据存储库没有进行schema初始化,会报类似以下异常:
-- 异常1
Exception in thread "main" MetaException(message:Version information not found in metastore. )
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6896)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6891)
at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:7149)
at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:7076)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
Caused by: MetaException(message:Version information not found in metastore. )-- 异常2
MetaException(message:Required table missing : "DBS
" in Catalog "" Schema "". DataNucleus requires this table to perform its persistence operations. Either your MetaData is incorrect, or you need to enable "datanucleus.schema.autoCreateTables")
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.
at org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:92)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6896)
at org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:6891)
at org.apache.hadoop.hive.metastore.HiveMetaStore.startMetaStore(HiveMetaStore.java:7149)
at org.apache.hadoop.hive.metastore.HiveMetaStore.main(HiveMetaStore.java:7076)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)此时,还需要在hive-site.xml中配置以下信息:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/soft/apache-hive-2.3.7-bin/lib/log4j-slf4j-impl-2.6.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/soft/hadoop-2.7.7/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.ht... for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Metastore connection URL: jdbc:mysql://localhost:3306/hive_metadata?createDatabaseIfNotExist=true
Metastore Connection Driver : com.mysql.jdbc.Driver
Metastore connection User: root
Starting metastore schema initialization to 2.3.0
Initialization script hive-schema-2.3.0.mysql.sql
Initialization script completed
schemaTool completed最后,重新启动bin/hive,执行show databases、建表、查询等SQL语句进行测试,都能正常执行。