doris : selectdb-doris-1.2.0-1-caeed14
spark : 2.4.6
Hadoop : 3.1.0
Steps
● Environment setup: refer to the official Doris Spark Load documentation.
● Configuration example
○ fe.conf
enable_spark_load=true
# Spark home directory (Doris currently uses Spark 2.4.6)
spark_home_default_dir=/usr/hdp/3.1.0.0-78/spark-2.4.6
# Path to spark-jars.zip (see the packaging sketch below)
spark_resource_path=/usr/hdp/3.1.0.0-78/spark-2.4.6/jars/spark-jars.zip
# Path to the yarn executable
yarn_client_path=/usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn
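spark-jars.zip is simply an archive of all the jars under the Spark client's jars directory. A minimal packaging sketch, assuming the paths from the config above:
# Package the Spark jars into the archive referenced by spark_resource_path
cd /usr/hdp/3.1.0.0-78/spark-2.4.6/jars
zip -q -r spark-jars.zip *.jar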
○ Creating the Spark resource
CREATE EXTERNAL RESOURCE "spark_0103_v1"
PROPERTIES
(
"type" = "spark",
"spark.master" = "yarn",
"spark.submit.deployMode" = "cluster",
"spark.executor.memory" = "8g",
"spark.yarn.queue" = "default",
"spark.hadoop.yarn.resourcemanager.address" = "hdfs://xxxx:8032",
"spark.hadoop.fs.defaultFS" = "hdfs://xxxx:8020",
"working_dir" = "hdfs://xxxx:8020/tmp/doris_1229",
"broker" = "broker_name",
"broker.username" = "hdfs",
"broker.password" = ""
);
○ Load script
-- Load from a Hive external table; for other load types see the official docs
LOAD LABEL test_1229.a_test_202112211_ext_10
(
DATA FROM TABLE a_test_202112211_ext
INTO TABLE a_test_202112211
)
WITH RESOURCE 'spark_0103_v1'
(
"spark.driver.memory"="8g",
"spark.shuffle.compress" = "true",
"spark.memory.offHeap.size"="8G",
"spark.executor.memory"="16G",
"spark.executor.cores" = "4",
"spark.driver.cores"="4"
)
PROPERTIES
(
"timeout" = "36000"
);
-- Check the load status
show load order by createtime desc limit 1\G
JobId: 63038
Label: a_test_202112211_ext_1
State: CANCELLED
Progress: ETL:N/A; LOAD:N/A
Type: SPARK
EtlInfo: NULL
TaskInfo: cluster:spark_0103_v1; timeout(s):36000; max_filter_ratio:0.0
ErrorMsg: type:ETL_SUBMIT_FAIL; msg:errCode = 2, detailMessage = start spark app failed. error: Waiting too much time to get appId from handle. spark app state: UNKNOWN, loadJobId:63038
CreateTime: 2023-01-03 18:24:40
EtlStartTime: NULL
EtlFinishTime: NULL
LoadStartTime: NULL
LoadFinishTime: 2023-01-03 18:27:10
URL: NULL
JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
TransactionId: 0
ErrorTablets: {}
10 rows in set (0.02 sec)
Debugging the relevant code yields the following information:
public void run() {
    ......
    BufferedReader outReader = null;
    String line = null;
    long startTime = System.currentTimeMillis();
    try {
        outReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
        while (!isStop && (line = outReader.readLine()) != null) {
            if (outputStream != null) {
                outputStream.write((line + "\n").getBytes());
            }
            SparkLoadAppHandle.State oldState = handle.getState();
            SparkLoadAppHandle.State newState = oldState;
            // parse state and appId
            if (line.contains(STATE)) {
                // 1. state
                String state = regexGetState(line);
                if (state != null) {
                    YarnApplicationState yarnState = YarnApplicationState.valueOf(state);
                    newState = fromYarnState(yarnState);
                    if (newState != oldState) {
                        handle.setState(newState);
                    }
                }
                // 2. appId
                String appId = regexGetAppId(line);
                if (appId != null) {
                    if (!appId.equals(handle.getAppId())) {
                        handle.setAppId(appId);
                    }
                }
                ......
The Doris FE tracks the Spark job's state in the run method of LogMonitor, an inner class of SparkLauncherMonitor in the org.apache.doris.load.loadv2 package. As the code shows, it parses the appId and the job state out of the log of the submitted Spark job (presumably the appId is the main target, since the final state is fetched later via the yarn command line). That log is presumably the corresponding SparkLoad log under the spark_launcher_log directory. Investigation showed that, due to an environment problem, the SparkLoad log output was incomplete, so Doris could not obtain the Spark job's state.
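To check this, the launcher output can be inspected directly. A minimal sketch, assuming the FE is installed under /usr/local/doris/fe (as in the FE log further below) and the launcher logs sit in the default spark_launcher_log directory under the FE log dir:
# Assumed paths: FE under /usr/local/doris/fe, launcher output under
# <fe_log_dir>/spark_launcher_log (the default location)
ls -lt /usr/local/doris/fe/log/spark_launcher_log/ | head
# A healthy launcher log reaches the yarn.Client "Application report" lines,
# which carry the appId and state that LogMonitor parses:
grep -E "Application report .*\(state:" /usr/local/doris/fe/log/spark_launcher_log/*.log | tail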
Incomplete log from the failing job:
23/01/03 14:04:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/03 14:04:59 WARN DependencyUtils: Skip remote jar hdfs://xxxx:8020/tmp/doris/1928832733/__spark_repository__spark0/__archive_1.0.0/__lib_ca97ae5fbe7d6bcbf334923633db4c9d_spark-dpp-1.0.0-jar-with-dependencies.jar.
Normal log after the problem was fixed; the appId can now be obtained and the job proceeds to the next stage:
2023-01-04 13:42:06,003 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-01-04 13:42:06,202 WARN deploy.DependencyUtils: Skip remote jar hdfs://xxxx:8020/tmp/doris_1229/1928832733/__spark_repository__spark_0103_v1/__archive_1.0.0/__lib_ca97ae5fbe7d6bcbf334923633db4c9d_spark-dpp-1.0.0-jar-with-dependencies.jar.
2023-01-04 13:42:06,658 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2023-01-04 13:42:07,101 INFO impl.TimelineClientImpl: Timeline service address: http://xxxx:8188/ws/v1/timeline/
2023-01-04 13:42:07,230 INFO yarn.Client: Requesting a new application from cluster with 30 NodeManagers
2023-01-04 13:42:07,294 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (63488 MB per container)
2023-01-04 13:42:07,295 INFO yarn.Client: Will allocate AM container, with 45056 MB memory including 4096 MB overhead
2023-01-04 13:42:07,295 INFO yarn.Client: Setting up container launch context for our AM
2023-01-04 13:42:07,297 INFO yarn.Client: Setting up the launch environment for our AM container
2023-01-04 13:42:07,302 INFO yarn.Client: Preparing resources for our AM container
2023-01-04 13:42:07,340 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://xxxx:8020/tmp/doris_1229/1928832733/__spark_repository__spark_0103_v1/__archive_1.0.0/__lib_9012439d621dab595f7134265922c458_spark-2x.zip
2023-01-04 13:42:07,387 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://xxxx:8020/tmp/doris_1229/1928832733/__spark_repository__spark_0103_v1/__archive_1.0.0/__lib_ca97ae5fbe7d6bcbf334923633db4c9d_spark-dpp-1.0.0-jar-with-dependencies.jar
2023-01-04 13:42:07,480 INFO yarn.Client: Uploading resource file:/tmp/spark-59166bba-7d28-47ca-83ab-34b8a5c800ef/__spark_conf__6036209221290820800.zip -> hdfs://xxxx:8020/user/root/.sparkStaging/application_1672723495864_1884/__spark_conf__.zip
2023-01-04 13:42:07,732 WARN yarn.Client: spark.yarn.am.extraJavaOptions will not take effect in cluster mode
2023-01-04 13:42:07,742 INFO spark.SecurityManager: Changing view acls to: root
2023-01-04 13:42:07,743 INFO spark.SecurityManager: Changing modify acls to: root
2023-01-04 13:42:07,743 INFO spark.SecurityManager: Changing view acls groups to:
2023-01-04 13:42:07,743 INFO spark.SecurityManager: Changing modify acls groups to:
2023-01-04 13:42:07,744 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls enabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2023-01-04 13:42:08,509 INFO yarn.Client: Submitting application application_1672723495864_1884 to ResourceManager
2023-01-04 13:42:08,732 INFO impl.YarnClientImpl: Submitted application application_1672723495864_1884
2023-01-04 13:42:09,737 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:09,741 INFO yarn.Client:
client token: N/A
diagnostics: AM container is launched, waiting for AM container to Register with RM
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1672810928526
final status: UNDEFINED
tracking URL: http://xxxx:8088/proxy/application_1672723495864_1884/
user: root
2023-01-04 13:42:10,744 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:11,746 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:12,749 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:13,752 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:14,755 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:15,758 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:16,761 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:17,764 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:18,766 INFO yarn.Client: Application report for application_1672723495864_1884 (state: RUNNING)
SparkLoad job status:
JobId: 70036
Label: a_test_202112211_ext_13
State: CANCELLED
Progress: ETL:N/A; LOAD:N/A
Type: SPARK
EtlInfo: NULL
TaskInfo: cluster:spark_0103_v1; timeout(s):36000; max_filter_ratio:0.0
ErrorMsg: type:ETL_RUN_FAIL; msg:errCode = 2, detailMessage = spark etl job failed. msg:
CreateTime: 2023-01-04 13:38:48
EtlStartTime: 2023-01-04 13:39:09
EtlFinishTime: NULL
LoadStartTime: NULL
LoadFinishTime: 2023-01-04 13:40:03
URL: NULL
JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
TransactionId: 1637032
ErrorTablets: {}
Inspecting the FE logs turns up the following:
2023-01-04 13:40:03,491 INFO (Load etl checker|41) [SparkEtlJobHandler.getEtlJobStatus():200] /usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn --config /usr/local/doris/fe/lib/yarn-config/spark_0103_v1 application -status application_1672723495864_1882
2023-01-04 13:40:03,491 INFO (Load etl checker|41) [SparkEtlJobHandler.getEtlJobStatus():213] getEtlJobStatus,appId:application_1672723495864_1882, loadJobId:70036, env:[LC_ALL=zh_CN.UTF-8],resource:{
"clazz": "SparkResource",
"sparkConfigs": {
"spark.executor.memory": "36G",
"spark.master": "yarn",
"spark.driver.memory": "40g",
"spark.yarn.stage.dir": "hdfs://xxxx:8020/tmp/doris_1229",
"spark.driver.cores": "4",
"spark.memory.offHeap.size": "8G",
"spark.yarn.archive": "hdfs://xxxx:8020/tmp/doris_1229/1928832733/__spark_repository__spark_0103_v1/__archive_1.0.0/__lib_9012439d621dab595f7134265922c458_spark-2x.zip",
"spark.executor.cores": "4",
"spark.submit.deployMode": "cluster",
"spark.shuffle.compress": "true",
"spark.hadoop.yarn.resourcemanager.address": "hdfs://xxxx:8032",
"spark.executor.extraJavaOptions": "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
"spark.hadoop.fs.defaultFS": "hdfs://xxxx:8020",
"spark.yarn.queue": "default"
},
"workingDir": "hdfs://xxxx:8020/tmp/doris_1229",
"broker": "broker_name",
"brokerProperties": {
"broker.password": "",
"broker.username": "hdfs"
},
"envConfigs": {},
"name": "spark_0103_v1",
"type": "SPARK",
"references": {}
}
2023-01-04 13:40:03,551 WARN (Load etl checker|41) [SparkEtlJobHandler.getEtlJobStatus():224] yarn application status failed. spark app id: application_1672723495864_1882, load job id: 70036, timeout: 30000, msg: WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
ERROR: JAVA_HOME is not set and could not be found.
2023-01-04 13:40:03,552 WARN (Load etl checker|41) [LoadManager.lambda$processEtlStateJobs$7():412] update load job etl status failed. job id: 70036
org.apache.doris.common.LoadException: errCode = 2, detailMessage = spark etl job failed. msg:
at org.apache.doris.load.loadv2.SparkLoadJob.updateEtlStatus(SparkLoadJob.java:310) ~[doris-fe.jar:1.0-SNAPSHOT]
at org.apache.doris.load.loadv2.LoadManager.lambda$processEtlStateJobs$7(LoadManager.java:406) ~[doris-fe.jar:1.0-SNAPSHOT]
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_241]
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_241]
at java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3566) ~[?:1.8.0_241]
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_241]
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_241]
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_241]
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_241]
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_241]
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_241]
at org.apache.doris.load.loadv2.LoadManager.processEtlStateJobs(LoadManager.java:404) ~[doris-fe.jar:1.0-SNAPSHOT]
at org.apache.doris.load.loadv2.LoadEtlChecker.runAfterCatalogReady(LoadEtlChecker.java:43) ~[doris-fe.jar:1.0-SNAPSHOT]
at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) ~[doris-fe.jar:1.0-SNAPSHOT]
at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.0-SNAPSHOT]
2023-01-04 13:40:03,552 WARN (Load etl checker|41) [LoadJob.unprotectedExecuteCancel():632] LOAD_JOB=70036, transaction_id={1637032}, error_msg={Failed to execute load with error: errCode = 2, detailMessage = spark etl job failed. msg: }
2023-01-04 13:40:03,555 INFO (Load etl checker|41) [TxnStateCallbackFactory.removeCallback():44] remove callback of txn state : 70036. current callback size: 0
2023-01-04 13:40:03,555 DEBUG (Load etl checker|41) [LoadJob.unprotectedExecuteCancel():662] LOAD_JOB=70036, transaction_id={1637032}, msg={begin to abort txn}
After the Spark job finishes, Doris runs /usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn --config /usr/local/doris/fe/lib/yarn-config/spark_0103_v1 application -status application_1672723495864_1882 to fetch the job's final state. The FE log shows this command failing with: ERROR: JAVA_HOME is not set and could not be found. Yet the same command runs fine when executed manually on the server. The source shows how Doris actually invokes it, in the executeCommand method of the Util class in the org.apache.doris.common.util package:
// cmd is the command line to execute
public static CommandResult executeCommand(String cmd, String[] envp, long timeoutMs) {
    CommandResult result = new CommandResult();
    List<String> cmdList = shellSplit(cmd);
    String[] cmds = cmdList.toArray(new String[0]);
    try {
        // a non-null envp becomes the child process's ENTIRE environment
        Process p = Runtime.getRuntime().exec(cmds, envp);
        CmdWorker cmdWorker = new CmdWorker(p);
        cmdWorker.start();
        ......
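Runtime.getRuntime().exec(cmds, envp) with a non-null envp hands the child process exactly the variables listed in envp and nothing else; the FE's own environment is not inherited. The FE log above shows env:[LC_ALL=zh_CN.UTF-8], so JAVA_HOME never reaches the yarn script. The failure can be reproduced outside Doris by stripping the environment the same way (a sketch, reusing the command from the FE log):
# env -i clears the environment, mimicking the FE's exec(cmds, envp) call
env -i LC_ALL=zh_CN.UTF-8 \
  /usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn \
  --config /usr/local/doris/fe/lib/yarn-config/spark_0103_v1 \
  application -status application_1672723495864_1882
# -> ERROR: JAVA_HOME is not set and could not be found.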
So although JAVA_HOME is configured on the server, it never reaches the script. Adding a JAVA_HOME export to /usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn resolves the problem, as follows:
#!/bin/bash
export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/3.1.0.0-78/hadoop}
export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-/usr/hdp/3.1.0.0-78/hadoop-mapreduce}
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-/usr/hdp/3.1.0.0-78/hadoop-yarn}
export HADOOP_LIBEXEC_DIR=${HADOOP_HOME}/libexec
export HDP_VERSION=${HDP_VERSION:-3.1.0.0-78}
export HADOOP_OPTS="${HADOOP_OPTS} -Dhdp.version=${HDP_VERSION}"
## Newly added JAVA_HOME setting
export JAVA_HOME="${JAVA_HOME:-/usr/local/jdk1.8.0_241}"
exec /usr/hdp/3.1.0.0-78//hadoop-yarn/bin/yarn.distro "$@"
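With the export in place the wrapper no longer depends on an inherited environment. A quick, hypothetical sanity check that mirrors the stripped-environment reproduction above:
# Should print version info instead of the JAVA_HOME error
env -i /usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn version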