Spark_Spark on YARN: Submitting and Reading Configuration Files

 

The official Spark on YARN documentation, based on Spark 2.1.1:

http://spark.apache.org/docs/2.1.1/running-on-yarn.html


To use a custom log4j configuration for the application master or executors, here are the options:

  • upload a custom log4j.properties using spark-submit, by adding it to the --files list of files to be uploaded with the application.
  • add -Dlog4j.configuration=<location of configuration file> to spark.driver.extraJavaOptions (for the driver) or spark.executor.extraJavaOptions (for executors). Note that if using a file, the file: protocol should be explicitly provided, and the file needs to exist locally on all the nodes.
  • update the $SPARK_CONF_DIR/log4j.properties file and it will be automatically uploaded along with the other configurations. Note that the other 2 options have higher priority than this option if multiple options are specified.

Note that for the first option, both executors and the application master will share the same log4j configuration, which may cause issues when they run on the same node (e.g. trying to write to the same log file).

 

Single file

From the description above, we can see that predefined configuration files can be passed with --files.

For example:

--files log4j.config

 

Multiple files

What about submitting multiple files? In that case, separate them with commas (,):

--files redis.config,mysql.config

 

Here is the multi-file submission written as a script:

ROOT_PATH=$(dirname $(readlink -f $0))

## Build a comma-separated list of the job config files under ./config
## (this leaves a trailing comma, visible in the test output below)
config=""
for file in ${ROOT_PATH}/config/*
do
        config="${file},${config}"
done


# Ship the config files to the YARN containers via --files; the same comma-separated
# list is also passed as a program argument below so the job knows the file paths.
nohup /usr/bin/spark2-submit \
    --class ${class_name} \
    --name ${JOB_NAME} \
    --files ${config} \
    --master yarn \
    --driver-memory 2G \
    --driver-cores 1 \
    --num-executors 3 \
    --executor-cores 2 \
    --executor-memory 2G \
    --jars ${classpath} \
    ${ROOT_PATH}/libs/${APP_NAME}-${libVersion}-SNAPSHOT.jar online ${config} \
    > ${ROOT_PATH}/logs/start.error 2> ${ROOT_PATH}/logs/start.log &

 

When the script runs, ${config} expands to the real paths shown below (the same string is passed both to --files and as a program argument):

--- Test ---
config files : /data-hdd/00/project/cloudera-scm/spark-workspace/onlineJob/TD-clickImp-blacklist-redis/0.0.6/config/redis_cluster.conf,/data-hdd/00/project/cloudera-scm/spark-workspace/onlineJob/TD-clickImp-blacklist-redis/0.0.6/config/LAN_ip,/data-hdd/00/project/cloudera-scm/spark-workspace/onlineJob/TD-clickImp-blacklist-redis/0.0.6/config/kafka_cluster.conf,

which splits into these 3 files:

/data-hdd/00/project/cloudera-scm/spark-workspace/onlineJob/TD-clickImp-blacklist-redis/0.0.6/config/redis_cluster.conf,

/data-hdd/00/project/cloudera-scm/spark-workspace/onlineJob/TD-clickImp-blacklist-redis/0.0.6/config/LAN_ip,

/data-hdd/00/project/cloudera-scm/spark-workspace/onlineJob/TD-clickImp-blacklist-redis/0.0.6/config/kafka_cluster.conf,
 

 

 

So how does the job get hold of these files? There are various approaches online; here is a fairly general approach I put together for Java:

 

 

First, note that we pass these paths into the main method as program arguments. In the submit script above the application arguments are "online ${config}", so the comma-separated file list arrives as args[1].

// Get the full paths of the uploaded files (passed in as the 2nd program argument)

Set<String> fileFullPathContainer = null;

if (args.length < 2 && (GlobalVariable.env != EnvType.IDE)) {
    return;
} else {
    fileFullPathContainer = FileUtil.filesSplit(args[1], ",");
}

The filesSplit method:

public static Set<String> filesSplit(String filesString, String separator) {

	Set<String> resultSet = new HashSet<>();
	String[] tmpArr = filesString.split(separator);
	resultSet.addAll(Arrays.asList(tmpArr));

	return resultSet;
}
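
As a quick sanity check, here is a minimal sketch (with hypothetical paths) of what filesSplit returns for a list like the one built by the shell loop above. The trailing comma left by that loop is harmless at this point, because String.split with the default limit drops trailing empty strings:

Set<String> files = FileUtil.filesSplit(
        "/data/project/config/redis_cluster.conf,/data/project/config/kafka_cluster.conf,", ",");

// files now contains exactly 2 entries:
//   /data/project/config/redis_cluster.conf
//   /data/project/config/kafka_cluster.conf
System.out.println(files.size());   // prints 2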

 

 

With this in place we can read each configuration file by its absolute path. Here is an example:

if (GlobalVariable.env == EnvType.IDE) {
	ConfigCenter.init(GlobalVariable.env);
	configCenter = ConfigCenter.getInstance();
} else {
	ConfigCenter.init(null);
	configCenter = ConfigCenter.getInstance();

	// Load the Kafka configuration
	Properties kafkaProps = new Properties();
	kafkaProps.load(new FileInputStream(FileUtil.findFileFullPath(fileFullPathContainer, "kafka_cluster.conf")));
	configCenter.setKafkaConfig(kafkaProps);

	// Load the Redis configuration
	Properties redisProps = new Properties();
	redisProps.load(new FileInputStream(FileUtil.findFileFullPath(fileFullPathContainer, "redis_cluster.conf")));
	configCenter.setRedisConfig(redisProps);
}

 

The findFileFullPath method:

public static String findFileFullPath(Set<String> container, String shortName) {

	for (String tmpString : container) {
		if (tmpString.contains(shortName)) {
			return tmpString;
		}
	}

	return null;
}
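
findFileFullPath matches by substring and returns null when nothing matches, so if a config file might be missing it is safer to check the result before opening a stream. Below is a small sketch of such a defensive lookup, reusing the hypothetical set from the earlier example (one possible pattern, not part of the original code):

String kafkaPath = FileUtil.findFileFullPath(files, "kafka_cluster.conf");
if (kafkaPath == null) {
	throw new IllegalStateException("kafka_cluster.conf was not submitted via --files");
}
Properties kafkaProps = new Properties();
try (FileInputStream in = new FileInputStream(kafkaPath)) {
	kafkaProps.load(in);
}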

 
