Doris Spark Load Notes

Environment

doris : selectdb-doris-1.2.0-1-caeed14
spark : 2.4.6
Hadoop : 3.1.0

Steps
● Set up the environment: see the Doris Spark Load page in the official Doris documentation
● Example configuration
○ fe.conf

enable_spark_load=true
# Spark home directory (Doris currently uses Spark 2.4.6)
spark_home_default_dir=/usr/hdp/3.1.0.0-78/spark-2.4.6
# Path to spark-jars.zip
spark_resource_path=/usr/hdp/3.1.0.0-78/spark-2.4.6/jars/spark-jars.zip
# Path to the yarn client executable
yarn_client_path=/usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn

○ Creating the Spark resource

CREATE EXTERNAL RESOURCE "spark_0103_v1"
PROPERTIES
(
  "type" = "spark",
  "spark.master" = "yarn",
  "spark.submit.deployMode" = "cluster",
  "spark.executor.memory" = "8g",
  "spark.yarn.queue" = "default",
  "spark.hadoop.yarn.resourcemanager.address" = "hdfs://xxxx:8032",
  "spark.hadoop.fs.defaultFS" = "hdfs://xxxx:8020",
  "working_dir" = "hdfs://xxxx:8020/tmp/doris_1229",
  "broker" = "broker_name",
  "broker.username" = "hdfs",
  "broker.password" = ""
);

○ Load script

-- Load from a Hive external table; for other load types, see the official documentation
LOAD LABEL test_1229.a_test_202112211_ext_10
(
    DATA FROM TABLE a_test_202112211_ext
    INTO TABLE a_test_202112211
)
WITH RESOURCE 'spark_0103_v1'
(   
    "spark.driver.memory"="8g",
    "spark.shuffle.compress" = "true",
    "spark.memory.offHeap.size"="8G",
    "spark.executor.memory"="16G",
    "spark.executor.cores" = "4",
    "spark.driver.cores"="4"
)
PROPERTIES
(
    "timeout" = "36000"
);

-- Check the load job
show load order by createtime desc limit 1\G

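Issues encountered

  1. Symptom: the Spark job runs successfully on Yarn, but Doris cannot obtain the state of the Yarn application, so the Spark job is submitted again and again; after 3 attempts the Spark Load job is reported as failed.
    Job status: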
 JobId: 63038
         Label: a_test_202112211_ext_1
         State: CANCELLED
      Progress: ETL:N/A; LOAD:N/A
          Type: SPARK
       EtlInfo: NULL
      TaskInfo: cluster:spark_0103_v1; timeout(s):36000; max_filter_ratio:0.0
      ErrorMsg: type:ETL_SUBMIT_FAIL; msg:errCode = 2, detailMessage = start spark app failed. error: Waiting too much time to get appId from handle. spark app state: UNKNOWN, loadJobId:63038
    CreateTime: 2023-01-03 18:24:40
  EtlStartTime: NULL
 EtlFinishTime: NULL
 LoadStartTime: NULL
LoadFinishTime: 2023-01-03 18:27:10
           URL: NULL
    JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
 TransactionId: 0
  ErrorTablets: {}
10 rows in set (0.02 sec)

Debugging the relevant code turned up the following:

public void run() {
           ......
    
            BufferedReader outReader = null;
            String line = null;
            long startTime = System.currentTimeMillis();
            try {
                outReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
                while (!isStop && (line = outReader.readLine()) != null) {
                    if (outputStream != null) {
                        outputStream.write((line + "\n").getBytes());
                    }
                    SparkLoadAppHandle.State oldState = handle.getState();
                    SparkLoadAppHandle.State newState = oldState;
                    // parse state and appId
                    if (line.contains(STATE)) {
                        // 1. state
                        String state = regexGetState(line);
                        if (state != null) {
                            YarnApplicationState yarnState = YarnApplicationState.valueOf(state);
                            newState = fromYarnState(yarnState);
                            if (newState != oldState) {
                                handle.setState(newState);
                            }
                        }
                        // 2. appId
                        String appId = regexGetAppId(line);
                        if (appId != null) {
                            if (!appId.equals(handle.getAppId())) {
                                handle.setAppId(appId);
                            }
                        }

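In Doris FE, the state of the Spark job is tracked by the run method of LogMonitor, which lives in the SparkLauncherMonitor class of the org.apache.doris.load.loadv2 package. According to the code, it reads the log output of the submitted Spark job and extracts the appId and the state from it. My guess is that the main goal is the appId, and that the state is later queried through the Yarn command line. The log in question is presumably the per-job Spark Load log under the spark_launcher_log directory. Investigation showed that, because of an environment problem, the Spark Load log output was incomplete, so Doris could not obtain the state of the Spark job.

Incomplete log from the failing job: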

23/01/03 14:04:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/01/03 14:04:59 WARN DependencyUtils: Skip remote jar hdfs://xxxx:8020/tmp/doris/1928832733/__spark_repository__spark0/__archive_1.0.0/__lib_ca97ae5fbe7d6bcbf334923633db4c9d_spark-dpp-1.0.0-jar-with-dependencies.jar.

Normal log after the problem was fixed; the appId can now be extracted and the job proceeds to the next step:

2023-01-04 13:42:06,003 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2023-01-04 13:42:06,202 WARN deploy.DependencyUtils: Skip remote jar hdfs://xxxx:8020/tmp/doris_1229/1928832733/__spark_repository__spark_0103_v1/__archive_1.0.0/__lib_ca97ae5fbe7d6bcbf334923633db4c9d_spark-dpp-1.0.0-jar-with-dependencies.jar.
2023-01-04 13:42:06,658 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
2023-01-04 13:42:07,101 INFO impl.TimelineClientImpl: Timeline service address: http://xxxx:8188/ws/v1/timeline/
2023-01-04 13:42:07,230 INFO yarn.Client: Requesting a new application from cluster with 30 NodeManagers
2023-01-04 13:42:07,294 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (63488 MB per container)
2023-01-04 13:42:07,295 INFO yarn.Client: Will allocate AM container, with 45056 MB memory including 4096 MB overhead
2023-01-04 13:42:07,295 INFO yarn.Client: Setting up container launch context for our AM
2023-01-04 13:42:07,297 INFO yarn.Client: Setting up the launch environment for our AM container
2023-01-04 13:42:07,302 INFO yarn.Client: Preparing resources for our AM container
2023-01-04 13:42:07,340 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://xxxx:8020/tmp/doris_1229/1928832733/__spark_repository__spark_0103_v1/__archive_1.0.0/__lib_9012439d621dab595f7134265922c458_spark-2x.zip
2023-01-04 13:42:07,387 INFO yarn.Client: Source and destination file systems are the same. Not copying hdfs://xxxx:8020/tmp/doris_1229/1928832733/__spark_repository__spark_0103_v1/__archive_1.0.0/__lib_ca97ae5fbe7d6bcbf334923633db4c9d_spark-dpp-1.0.0-jar-with-dependencies.jar
2023-01-04 13:42:07,480 INFO yarn.Client: Uploading resource file:/tmp/spark-59166bba-7d28-47ca-83ab-34b8a5c800ef/__spark_conf__6036209221290820800.zip -> hdfs://xxxx:8020/user/root/.sparkStaging/application_1672723495864_1884/__spark_conf__.zip
2023-01-04 13:42:07,732 WARN yarn.Client: spark.yarn.am.extraJavaOptions will not take effect in cluster mode
2023-01-04 13:42:07,742 INFO spark.SecurityManager: Changing view acls to: root
2023-01-04 13:42:07,743 INFO spark.SecurityManager: Changing modify acls to: root
2023-01-04 13:42:07,743 INFO spark.SecurityManager: Changing view acls groups to:
2023-01-04 13:42:07,743 INFO spark.SecurityManager: Changing modify acls groups to:
2023-01-04 13:42:07,744 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls enabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
2023-01-04 13:42:08,509 INFO yarn.Client: Submitting application application_1672723495864_1884 to ResourceManager
2023-01-04 13:42:08,732 INFO impl.YarnClientImpl: Submitted application application_1672723495864_1884
2023-01-04 13:42:09,737 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:09,741 INFO yarn.Client:
         client token: N/A
         diagnostics: AM container is launched, waiting for AM container to Register with RM
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1672810928526
         final status: UNDEFINED
         tracking URL: http://xxxx:8088/proxy/application_1672723495864_1884/
         user: root
2023-01-04 13:42:10,744 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:11,746 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:12,749 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:13,752 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:14,755 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:15,758 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:16,761 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:17,764 INFO yarn.Client: Application report for application_1672723495864_1884 (state: ACCEPTED)
2023-01-04 13:42:18,766 INFO yarn.Client: Application report for application_1672723495864_1884 (state: RUNNING)
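
The extraction that LogMonitor performs on lines like the ones above can be sketched roughly as follows. This is only an illustration, not the Doris implementation: the class name YarnLogLineParser and both regular expressions are my own assumptions, written to match the yarn.Client lines in this log. It shows why an incomplete log is fatal: a log that never reaches an "Application report" line yields neither an appId nor a state.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Doris.
class YarnLogLineParser {
    // Assumed patterns, chosen to match the yarn.Client lines shown in the log above.
    private static final Pattern APP_ID = Pattern.compile("application_\\d+_\\d+");
    private static final Pattern STATE = Pattern.compile("\\(state: ([A-Z_]+)\\)");

    /** Returns the Yarn application id found in the line, if any. */
    static Optional<String> appId(String line) {
        Matcher m = APP_ID.matcher(line);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    /** Returns the state name inside "(state: ...)", if the line contains one. */
    static Optional<String> state(String line) {
        Matcher m = STATE.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String complete = "2023-01-04 13:42:09,737 INFO yarn.Client: Application report for "
                + "application_1672723495864_1884 (state: ACCEPTED)";
        String incomplete = "23/01/03 14:04:59 WARN NativeCodeLoader: Unable to load native-hadoop "
                + "library for your platform... using builtin-java classes where applicable";

        // prints: application_1672723495864_1884 ACCEPTED
        System.out.println(appId(complete).orElse("<none>") + " " + state(complete).orElse("<none>"));
        // prints: <none> <none>
        System.out.println(appId(incomplete).orElse("<none>") + " " + state(incomplete).orElse("<none>"));
    }
}
```

If every line of the launcher log looks like the second one, the handle never receives an appId, which is consistent with the "Waiting too much time to get appId from handle" error shown earlier.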
  2. Symptom: the Spark job runs successfully on Yarn, but the final state that Doris obtains for it is a failure, so the Spark Load job is reported as failed.

Yarn application status:
[Image 1: screenshot of the Yarn application status]

Spark Load job status:

JobId: 70036
         Label: a_test_202112211_ext_13
         State: CANCELLED
      Progress: ETL:N/A; LOAD:N/A
          Type: SPARK
       EtlInfo: NULL
      TaskInfo: cluster:spark_0103_v1; timeout(s):36000; max_filter_ratio:0.0
      ErrorMsg: type:ETL_RUN_FAIL; msg:errCode = 2, detailMessage = spark etl job failed. msg:
    CreateTime: 2023-01-04 13:38:48
  EtlStartTime: 2023-01-04 13:39:09
 EtlFinishTime: NULL
 LoadStartTime: NULL
LoadFinishTime: 2023-01-04 13:40:03
           URL: NULL
    JobDetails: {"Unfinished backends":{},"ScannedRows":0,"TaskNumber":0,"LoadBytes":0,"All backends":{},"FileNumber":0,"FileSize":0}
 TransactionId: 1637032
  ErrorTablets: {}

The FE log shows:

2023-01-04 13:40:03,491 INFO (Load etl checker|41) [SparkEtlJobHandler.getEtlJobStatus():200] /usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn --config /usr/local/doris/fe/lib/yarn-config/spark_0103_v1 application -status application_1672723495864_1882
2023-01-04 13:40:03,491 INFO (Load etl checker|41) [SparkEtlJobHandler.getEtlJobStatus():213] getEtlJobStatus,appId:application_1672723495864_1882, loadJobId:70036, env:[LC_ALL=zh_CN.UTF-8],resource:{
  "clazz": "SparkResource",
  "sparkConfigs": {
    "spark.executor.memory": "36G",
    "spark.master": "yarn",
    "spark.driver.memory": "40g",
    "spark.yarn.stage.dir": "hdfs://xxxx:8020/tmp/doris_1229",
    "spark.driver.cores": "4",
    "spark.memory.offHeap.size": "8G",
    "spark.yarn.archive": "hdfs://xxxx:8020/tmp/doris_1229/1928832733/__spark_repository__spark_0103_v1/__archive_1.0.0/__lib_9012439d621dab595f7134265922c458_spark-2x.zip",
    "spark.executor.cores": "4",
    "spark.submit.deployMode": "cluster",
    "spark.shuffle.compress": "true",
    "spark.hadoop.yarn.resourcemanager.address": "hdfs://xxxx:8032",
    "spark.executor.extraJavaOptions": "-XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
    "spark.hadoop.fs.defaultFS": "hdfs://xxxx:8020",
    "spark.yarn.queue": "default"
  },
  "workingDir": "hdfs://xxxx:8020/tmp/doris_1229",
  "broker": "broker_name",
  "brokerProperties": {
    "broker.password": "",
    "broker.username": "hdfs"
  },
  "envConfigs": {},
  "name": "spark_0103_v1",
  "type": "SPARK",
  "references": {}
}
2023-01-04 13:40:03,551 WARN (Load etl checker|41) [SparkEtlJobHandler.getEtlJobStatus():224] yarn application status failed. spark app id: application_1672723495864_1882, load job id: 70036, timeout: 30000, msg: WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
ERROR: JAVA_HOME is not set and could not be found.

2023-01-04 13:40:03,552 WARN (Load etl checker|41) [LoadManager.lambda$processEtlStateJobs$7():412] update load job etl status failed. job id: 70036
org.apache.doris.common.LoadException: errCode = 2, detailMessage = spark etl job failed. msg:
        at org.apache.doris.load.loadv2.SparkLoadJob.updateEtlStatus(SparkLoadJob.java:310) ~[doris-fe.jar:1.0-SNAPSHOT]
        at org.apache.doris.load.loadv2.LoadManager.lambda$processEtlStateJobs$7(LoadManager.java:406) ~[doris-fe.jar:1.0-SNAPSHOT]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) ~[?:1.8.0_241]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_241]
        at java.util.concurrent.ConcurrentHashMap$ValueSpliterator.forEachRemaining(ConcurrentHashMap.java:3566) ~[?:1.8.0_241]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482) ~[?:1.8.0_241]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472) ~[?:1.8.0_241]
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_241]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_241]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_241]
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_241]
        at org.apache.doris.load.loadv2.LoadManager.processEtlStateJobs(LoadManager.java:404) ~[doris-fe.jar:1.0-SNAPSHOT]
        at org.apache.doris.load.loadv2.LoadEtlChecker.runAfterCatalogReady(LoadEtlChecker.java:43) ~[doris-fe.jar:1.0-SNAPSHOT]
        at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) ~[doris-fe.jar:1.0-SNAPSHOT]
        at org.apache.doris.common.util.Daemon.run(Daemon.java:116) ~[doris-fe.jar:1.0-SNAPSHOT]
2023-01-04 13:40:03,552 WARN (Load etl checker|41) [LoadJob.unprotectedExecuteCancel():632] LOAD_JOB=70036, transaction_id={1637032}, error_msg={Failed to execute load with error: errCode = 2, detailMessage = spark etl job failed. msg: }
2023-01-04 13:40:03,555 INFO (Load etl checker|41) [TxnStateCallbackFactory.removeCallback():44] remove callback of txn state : 70036. current callback size: 0
2023-01-04 13:40:03,555 DEBUG (Load etl checker|41) [LoadJob.unprotectedExecuteCancel():662] LOAD_JOB=70036, transaction_id={1637032}, msg={begin to abort txn}

After the Spark job finishes, Doris runs the command /usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn --config /usr/local/doris/fe/lib/yarn-config/spark_0103_v1 application -status application_1672723495864_1882 to get the final state of the Spark job. The FE log shows that this command failed with: ERROR: JAVA_HOME is not set and could not be found. Running the same command by hand on the server works. According to the source code, Doris runs the command through the executeCommand method of the Util class in the org.apache.doris.common.util package:

// cmd is the command to execute
public static CommandResult executeCommand(String cmd, String[] envp, long timeoutMs) {
        CommandResult result = new CommandResult();
        List<String> cmdList = shellSplit(cmd);
        String[] cmds = cmdList.toArray(new String[0]);

        try {
            Process p = Runtime.getRuntime().exec(cmds, envp);
            CmdWorker cmdWorker = new CmdWorker(p);
            cmdWorker.start();

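JAVA_HOME is set on the server, so my guess is that the environment variable is not visible to the process that Doris launches (the FE log above shows the command being run with env:[LC_ALL=zh_CN.UTF-8] only). Adding a JAVA_HOME setting to /usr/hdp/3.1.0.0-78/hadoop-yarn/bin/yarn solved the problem: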

#!/bin/bash

export HADOOP_HOME=${HADOOP_HOME:-/usr/hdp/3.1.0.0-78/hadoop}
export HADOOP_MAPRED_HOME=${HADOOP_MAPRED_HOME:-/usr/hdp/3.1.0.0-78/hadoop-mapreduce}
export HADOOP_YARN_HOME=${HADOOP_YARN_HOME:-/usr/hdp/3.1.0.0-78/hadoop-yarn}
export HADOOP_LIBEXEC_DIR=${HADOOP_HOME}/libexec
export HDP_VERSION=${HDP_VERSION:-3.1.0.0-78}
export HADOOP_OPTS="${HADOOP_OPTS} -Dhdp.version=${HDP_VERSION}"
## Added: JAVA_HOME setting
export JAVA_HOME="${JAVA_HOME:-/usr/local/jdk1.8.0_241}"

exec /usr/hdp/3.1.0.0-78//hadoop-yarn/bin/yarn.distro "$@"
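
A likely explanation for why the wrapper fix is needed: when Runtime.exec(String[] cmdarray, String[] envp) is given a non-null envp, the child process receives only the variables listed there and does not inherit the parent's environment. The sketch below illustrates that behavior. The class ExecEnvDemo and its method are hypothetical names of my own, not Doris code, and the demo assumes a POSIX system with /bin/sh. It uses LC_ALL=C as the single unrelated variable, in place of the LC_ALL=zh_CN.UTF-8 seen in the FE log, so that it does not depend on installed locales.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Doris.
class ExecEnvDemo {
    /** Starts /bin/sh with exactly the given environment and returns the JAVA_HOME the child sees. */
    static String javaHomeSeenByChild(String[] envp) {
        String[] cmd = {"/bin/sh", "-c", "echo \"${JAVA_HOME:-unset}\""};
        try {
            // A non-null envp replaces the child's environment instead of extending it.
            Process p = Runtime.getRuntime().exec(cmd, envp);
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8))) {
                String line = reader.readLine();
                p.waitFor();
                return line == null ? "" : line.trim();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Only an unrelated variable is passed, as in the FE log: prints "unset".
        System.out.println(javaHomeSeenByChild(new String[] {"LC_ALL=C"}));
        // JAVA_HOME passed explicitly: prints "/usr/local/jdk1.8.0_241".
        System.out.println(javaHomeSeenByChild(
                new String[] {"LC_ALL=C", "JAVA_HOME=/usr/local/jdk1.8.0_241"}));
    }
}
```

Under this reading, a JAVA_HOME that is set in the login shell never reaches the yarn script when FE launches it, and the default in the wrapper (${JAVA_HOME:-/usr/local/jdk1.8.0_241}) fills the gap.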
