Environment
CDH-6.3.2
Flink-1.12.2
Usage with Flink-1.13.1 is covered at the end.
Preparation
Place the following jars in the flink-1.12.2/lib directory (a copy sketch follows the list):
flink-sql-connector-hive-2.2.0_2.11-1.12.0.jar
hadoop-common-3.0.0-cdh6.3.2.jar
hadoop-mapreduce-client-common-3.0.0-cdh6.3.2.jar
hadoop-mapreduce-client-core-3.0.0-cdh6.3.2.jar
hadoop-mapreduce-client-hs-3.0.0-cdh6.3.2.jar
hadoop-mapreduce-client-jobclient-3.0.0-cdh6.3.2.jar
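On CDH these Hadoop jars normally ship with the parcel. A minimal copy sketch, assuming the default parcel jar directory /opt/cloudera/parcels/CDH/jars (adjust both paths to your installation):
# assumed CDH parcel jar directory; adjust to your environment
CDH_JARS=/opt/cloudera/parcels/CDH/jars
cp $CDH_JARS/hadoop-common-3.0.0-cdh6.3.2.jar flink-1.12.2/lib/
cp $CDH_JARS/hadoop-mapreduce-client-{common,core,hs,jobclient}-3.0.0-cdh6.3.2.jar flink-1.12.2/lib/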
Start a Flink Session on YARN
export HADOOP_CLASSPATH=`hadoop classpath`
./bin/yarn-session.sh --detached
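The session can also be sized explicitly; yarn-session.sh accepts the usual resource options, for example (the values here are purely illustrative, not recommendations):
./bin/yarn-session.sh -nm flink-sql -jm 1024m -tm 2048m -s 2 --detached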
Configure sql-client-defaults.yaml
This file lives in the flink-1.12.2/conf directory; after editing it looks like this:
#==============================================================================
# Tables
#==============================================================================

# Define tables here such as sources, sinks, views, or temporal tables.

tables: [] # empty list

# A typical table source definition looks like:
# - name: ...
#   type: source-table
#   connector: ...
#   format: ...
#   schema: ...

# A typical view definition looks like:
# - name: ...
#   type: view
#   query: "SELECT ..."

# A typical temporal table definition looks like:
# - name: ...
#   type: temporal-table
#   history-table: ...
#   time-attribute: ...
#   primary-key: ...

#==============================================================================
# User-defined functions
#==============================================================================

# Define scalar, aggregate, or table functions here.

functions: [] # empty list

# A typical function definition looks like:
# - name: ...
#   from: class
#   class: ...
#   constructor: ...

#==============================================================================
# Catalogs
#==============================================================================

# Define catalogs here.

catalogs:
  - name: myhive
    type: hive
    hive-conf-dir: /etc/hive/conf
    hive-version: 2.1.1
    default-database: test

#==============================================================================
# Modules
#==============================================================================

# Define modules here.

#modules: # note the following modules will be of the order they are specified
#  - name: core
#    type: core

#==============================================================================
# Execution properties
#==============================================================================

# Properties that change the fundamental execution behavior of a table program.

execution:
  # select the implementation responsible for planning table programs
  # possible values are 'blink' (used by default) or 'old'
  planner: blink
  # 'batch' or 'streaming' execution
  type: batch
  # allow 'event-time' or only 'processing-time' in sources
  time-characteristic: processing-time
  # interval in ms for emitting periodic watermarks
  periodic-watermarks-interval: 200
  # 'changelog', 'table' or 'tableau' presentation of results
  result-mode: table
  # maximum number of maintained rows in 'table' presentation of results
  max-table-result-rows: 100
  # parallelism of the program
  # parallelism: 1
  # maximum parallelism
  max-parallelism: 128
  # minimum idle state retention in ms
  min-idle-state-retention: 0
  # maximum idle state retention in ms
  max-idle-state-retention: 0
  # current catalog ('default_catalog' by default)
  current-catalog: default_catalog
  # current database of the current catalog (default database of the catalog by default)
  current-database: default_database
  # controls how table programs are restarted in case of a failures
  # restart-strategy:
  #   # strategy type
  #   # possible values are "fixed-delay", "failure-rate", "none", or "fallback" (default)
  #   type: fallback

#==============================================================================
# Configuration options
#==============================================================================

# Configuration options for adjusting and tuning table programs.

# A full list of options and their default values can be found
# on the dedicated "Configuration" web page.

# A configuration can look like:
# configuration:
#   table.exec.spill-compression.enabled: true
#   table.exec.spill-compression.block-size: 128kb
#   table.optimizer.join-reorder-enabled: true

configuration:
  table.sql-dialect: hive

#==============================================================================
# Deployment properties
#==============================================================================

# Properties that describe the cluster to which table programs are submitted to.

deployment:
  # general cluster communication timeout in ms
  response-timeout: 5000
  # (optional) address from cluster to gateway
  gateway-address: ""
  # (optional) port from cluster to gateway
  gateway-port: 0
Start the SQL Client
./bin/sql-client.sh embedded
Welcome! Enter 'HELP;' to list all available commands. 'QUIT;' to exit.
Run a query
Flink SQL> use catalog myhive;
Flink SQL> use test;
Flink SQL> select * from u_data;
The query above reads the u_data table from the test database in Hive.
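Any Hive query can be run the same way. For example, assuming u_data is the MovieLens sample table from the Hive getting-started guide (columns userid, movieid, rating, unixtime; this schema is an assumption here), a simple aggregation looks like:
Flink SQL> select rating, count(*) as cnt from u_data group by rating;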
Special note
The steps above work for Hive tables in plain text format, but querying ORC-format Hive tables fails in this CDH-6.3.2 environment. Before running the steps above, the following preparation is therefore needed (a build sketch follows the list):
- Rebuild flink-sql-connector-hive-2.2.0_2.11 from the Flink source. Before building, edit the pom of the flink-sql-connector-hive-2.2.0 module and change the hive-exec version to 2.1.1-cdh6.3.2.
- Put the newly built jar into the flink-1.12.2/lib directory, replacing the original one.
- Copy libfb303-0.9.3.jar from the CDH environment into the flink-1.12.2/lib directory.
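A minimal sketch of that rebuild, assuming the flink-1.12.2 source tree, Maven, access to Cloudera's Maven repository, and FLINK_HOME pointing at the flink-1.12.2 installation (the module path and output jar name are what I'd expect for 1.12.2, so verify them against your checkout):
# 1. in flink-connectors/flink-sql-connector-hive-2.2.0/pom.xml,
#    change the hive-exec dependency version to 2.1.1-cdh6.3.2
cd flink-connectors/flink-sql-connector-hive-2.2.0
mvn clean package -DskipTests
# 2. replace the old connector jar and add libfb303 from the CDH parcel
cp target/flink-sql-connector-hive-2.2.0_2.11-1.12.2.jar $FLINK_HOME/lib/
rm $FLINK_HOME/lib/flink-sql-connector-hive-2.2.0_2.11-1.12.0.jar
cp /opt/cloudera/parcels/CDH/jars/libfb303-0.9.3.jar $FLINK_HOME/lib/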
Root-cause analysis
The error is:
java.lang.IllegalArgumentException: Unrecognized Hadoop major version number: 3.0.0-cdh6.3.2
at org.apache.hadoop.hive.shims.ShimLoader.getMajorVersion(ShimLoader.java:177) ~[flink-sql-connector-hive-2.2.0_2.11-1.12.0.jar:1.12.0]
at org.apache.hadoop.hive.shims.ShimLoader.loadShims(ShimLoader.java:144) ~[flink-sql-connector-hive-2.2.0_2.11-1.12.0.jar:1.12.0]
at org.apache.hadoop.hive.shims.ShimLoader.getHadoopShims(ShimLoader.java:99) ~[flink-sql-connector-hive-2.2.0_2.11-1.12.0.jar:1.12.0]
at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.<clinit>(OrcInputFormat.java:161) ~[flink-sql-connector-hive-2.2.0_2.11-1.12.0.jar:1.12.0]
at java.lang.Class.forName0(Native Method) ~[?:1.8.0_181]
at java.lang.Class.forName(Class.java:348) ~[?:1.8.0_181]
Inspecting the ShimLoader.java source shows that, in this code path, the open-source Hive 2.x releases do not support Hadoop 3.x. CDH's hive-exec 2.1.1-cdh6.3.2, however, diverges from the community version and does support Hadoop 3.x.
Community version:
public static String getMajorVersion() {
    String vers = VersionInfo.getVersion();

    String[] parts = vers.split("\\.");
    if (parts.length < 2) {
        throw new RuntimeException("Illegal Hadoop Version: " + vers +
            " (expected A.B.* format)");
    }

    switch (Integer.parseInt(parts[0])) {
        case 2:
            return HADOOP23VERSIONNAME;
        default:
            throw new IllegalArgumentException("Unrecognized Hadoop major version number: " + vers);
    }
}
CDH version (decompiled):
public static String getMajorVersion() {
    String vers = VersionInfo.getVersion();
    String[] parts = vers.split("\\.");
    if (parts.length < 2)
        throw new RuntimeException("Illegal Hadoop Version: " + vers + " (expected A.B.* format)");
    switch (Integer.parseInt(parts[0])) {
        case 2:
        case 3:
            return "0.23";
    }
    throw new IllegalArgumentException("Unrecognized Hadoop major version number: " + vers);
}
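If you want to confirm that the connector jar in lib actually shades Hive's shim classes (and hence whose ShimLoader ends up on the classpath), a quick spot check is possible; the jar name below matches the one used earlier:
unzip -l lib/flink-sql-connector-hive-2.2.0_2.11-1.12.0.jar | grep -i shims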
Using Flink-1.13.1
Starting with Flink 1.13.1, the SQL client can be bootstrapped with an initialization SQL script instead of the YAML file. For example, create sql-init.sql under conf with the following content:
CREATE CATALOG MyCatalog
WITH (
'type' = 'hive',
'hive-conf-dir'='/etc/hive/conf',
'hive-version'='2.1.1'
);
USE CATALOG MyCatalog;
SET 'execution.runtime-mode' = 'batch';
SET 'sql-client.execution.result-mode' = 'table';
SET 'table.sql-dialect'='hive';
Start the client with ./bin/sql-client.sh -i conf/sql-init.sql; you can then run SQL statements in the client as before.
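For a quick end-to-end check after startup, a session might look like this (database and table names follow the earlier example):
Flink SQL> use test;
Flink SQL> show tables;
Flink SQL> select * from u_data limit 10;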