ChunJun Source Code Analysis: Job Submission
- Parsing the submission arguments and configuration
- ChunJun job: SYNC
- ChunJun job: SQL
- Version: ChunJun 1.12
- Note: if you have read the Spark or Flink job-submission source code, you will notice that the path "from the shell submission all the way to reflectively executing the actual Flink application code (the main method)" is very similar to Spark and Flink.
Parsing the submission arguments and configuration
- Example job-submission command, followed by the actual start command it produces:
sh bin/chunjun-local.sh -job my-examples/task_script_multi_table.json
start command: /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.292.b10-1.el7_9.x86_64/bin/java -cp /home/chunjun/chunjun-dist/../lib/* com.dtstack.chunjun.client.Launcher -job my-examples/task_script_multi_table.json -mode local -jobType sync -chunjunDistDir /home/chunjun/chunjun-dist
- This command starts a Java process and runs the main method of the class com.dtstack.chunjun.client.Launcher:
package com.dtstack.chunjun.client;
public class Launcher {
public static void main(String[] args) throws Exception {
OptionParser optionParser = new OptionParser(args);
Options launcherOptions = optionParser.getOptions();
findDefaultConfigDir(launcherOptions);
List<String> argList = optionParser.getProgramExeArgList();
HashMap<String, String> temp = new HashMap<>(16);
for (int i = 0; i < argList.size(); i += 2) {
temp.put(argList.get(i), argList.get(i + 1));
}
argList.clear();
for (int i = 0; i < temp.size(); i++) {
argList.add(temp.keySet().toArray()[i].toString());
argList.add(temp.values().toArray()[i].toString());
}
JobDeployer jobDeployer = new JobDeployer(launcherOptions, argList);
ClusterClientHelper clusterClientHelper;
switch (ClusterMode.getByName(launcherOptions.getMode())) {
case local:
clusterClientHelper = new LocalClusterClientHelper();
break;
case standalone:
clusterClientHelper = new StandaloneClusterClientHelper();
break;
case yarnSession:
clusterClientHelper = new YarnSessionClusterClientHelper();
break;
case yarnPerJob:
clusterClientHelper = new YarnPerJobClusterClientHelper();
break;
case yarnApplication:
throw new ClusterDeploymentException(
"Application Mode not supported by Yarn deployments.");
case kubernetesSession:
clusterClientHelper = new KubernetesSessionClusterClientHelper();
break;
case kubernetesPerJob:
throw new ClusterDeploymentException(
"Per-Job Mode not supported by Kubernetes deployments.");
case kubernetesApplication:
clusterClientHelper = new KubernetesApplicationClusterClientHelper();
break;
default:
throw new ClusterDeploymentException(
launcherOptions.getMode() + " Mode not supported.");
}
URLClassLoader urlClassLoader = (URLClassLoader) Launcher.class.getClassLoader();
List<URL> jarUrlList = ExecuteProcessHelper.getExternalJarUrls(launcherOptions.getAddjar());
ClassLoaderManager.loadExtraJar(jarUrlList, urlClassLoader);
clusterClientHelper.submit(jobDeployer);
}
}
- When optionParser.getProgramExeArgList() is called, the value of the job option (a file path) is used to read the file content (the sync or SQL script):
package com.dtstack.chunjun.options;
public class OptionParser {
@VisibleForTesting protected static final String OPTION_JOB = "job";
public List<String> getProgramExeArgList() throws Exception {
Map<String, Object> mapConf = MapUtil.objectToMap(properties);
List<String> args = new ArrayList<>();
for (Map.Entry<String, Object> one : mapConf.entrySet()) {
String key = one.getKey();
Object value = one.getValue();
if (value == null) {
continue;
} else if (OPTION_JOB.equalsIgnoreCase(key)) {
File file = new File(value.toString());
try (FileInputStream in = new FileInputStream(file)) {
byte[] fileContent = new byte[(int) file.length()];
in.read(fileContent);
value =
URLEncoder.encode(
new String(fileContent, StandardCharsets.UTF_8),
StandardCharsets.UTF_8.name());
}
}
args.add("-" + key);
args.add(value.toString());
}
return args;
}
}
- Different deployment modes go through different ClusterClientHelper implementations, for example LocalClusterClientHelper:
package com.dtstack.chunjun.client.local;
public class LocalClusterClientHelper implements ClusterClientHelper {
@Override
public ClusterClient submit(JobDeployer jobDeployer) throws Exception {
String[] args = jobDeployer.getProgramArgs().toArray(new String[0]);
Main.main(args);
return null;
}
}
- LocalClusterClientHelper then executes the main method of Main, passing the arguments through (the other modes work in a similar way, using PluginInfoUtil.getMainClass() to obtain the class to execute; a rough illustration follows).
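- A minimal sketch of that reflective hand-off, assuming only that the class name comes from PluginInfoUtil.getMainClass(); the method and class names below are illustrative, not ChunJun's actual helper code, and the real helpers also assemble the classpath and, for cluster modes, build and deploy the Flink job:
import java.lang.reflect.Method;
// Sketch only: reflectively invoke an entry class, e.g. the one returned by
// PluginInfoUtil.getMainClass().
public class ReflectiveMainSketch {
    public static void invokeMain(String mainClassName, String[] programArgs) throws Exception {
        Class<?> mainClass = Class.forName(mainClassName);
        Method mainMethod = mainClass.getMethod("main", String[].class);
        // main is static, so the target instance is null; the cast keeps varargs from splitting the array
        mainMethod.invoke(null, (Object) programArgs);
    }
}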
- The args submitted earlier are passed into that main method. After parsing and preprocessing, the job type is dispatched as SQL or SYNC, and replacedJob (the job script we wrote) is handed to the corresponding branch:
package com.dtstack.chunjun;
public class Main {
public static Logger LOG = LoggerFactory.getLogger(Main.class);
public static void main(String[] args) throws Exception {
LOG.info("------------program params-------------------------");
Arrays.stream(args).forEach(arg -> LOG.info("{}", arg));
LOG.info("-------------------------------------------");
Options options = new OptionParser(args).getOptions();
String job = URLDecoder.decode(options.getJob(), StandardCharsets.UTF_8.name());
String replacedJob = JobUtil.replaceJobParameter(options.getP(), job);
Properties confProperties = PropertiesUtil.parseConf(options.getConfProp());
StreamExecutionEnvironment env = EnvFactory.createStreamExecutionEnvironment(options);
StreamTableEnvironment tEnv =
EnvFactory.createStreamTableEnvironment(env, confProperties, options.getJobName());
LOG.info(
"Register to table configuration:{}",
tEnv.getConfig().getConfiguration().toString());
switch (EJobType.getByName(options.getJobType())) {
case SQL:
exeSqlJob(env, tEnv, replacedJob, options);
break;
case SYNC:
exeSyncJob(env, tEnv, replacedJob, options);
break;
default:
throw new ChunJunRuntimeException(
"unknown jobType: ["
+ options.getJobType()
+ "], jobType must in [SQL, SYNC].");
}
LOG.info("program {} execution success", options.getJobName());
}
}
ChunJun job: SYNC
- Taking SYNC as the example, exeSyncJob is called next.
- Parsing the job script: parseConf and then SyncConf.parseJob are called in turn, and Gson finally deserializes the job script into a com.dtstack.chunjun.conf.SyncConf object. The key code is as follows:
package com.dtstack.chunjun;
public class Main {
private static void exeSyncJob(
StreamExecutionEnvironment env,
StreamTableEnvironment tableEnv,
String job,
Options options)
throws Exception {
SyncConf config = parseConf(job, options);
}
}
package com.dtstack.chunjun;
public class Main {
public static SyncConf parseConf(String job, Options options) {
SyncConf config;
try {
config = SyncConf.parseJob(job);
} catch (Exception e) {
throw new ChunJunRuntimeException(e);
}
return config;
}
}
package com.dtstack.chunjun.conf;
public class SyncConf {
public static SyncConf parseJob(String jobJson) {
SyncConf config = GsonUtil.GSON.fromJson(jobJson, SyncConf.class);
checkJob(config);
return config;
}
}
- com.dtstack.chunjun.conf.SyncConf contains member fields such as JobConf job:
package com.dtstack.chunjun.conf;
public class SyncConf implements Serializable {
private static final long serialVersionUID = 1L;
private JobConf job;
private String pluginRoot;
private String remotePluginPath;
private String savePointPath;
private List<String> syncJarList;
}
- com.dtstack.chunjun.conf.JobConf in turn holds the content and setting sections of the job script we wrote (a minimal example script is shown after the class below).
- Note that although content is a JSON array in the script, parsing currently only uses the first entry:
package com.dtstack.chunjun.conf;
public class JobConf implements Serializable {
private static final long serialVersionUID = 1L;
private LinkedList<ContentConf> content;
private SettingConf setting = new SettingConf();
public OperatorConf getReader() {
return content.get(0).getReader();
}
public OperatorConf getWriter() {
return content.get(0).getWriter();
}
public CdcConf getCdcConf() {
return content.get(0).getRestoration();
}
public MappingConf getNameMapping() {
return content.get(0).getNameMapping();
}
public TransformerConf getTransformer() {
return content.get(0).getTransformer();
}
public LinkedList<ContentConf> getContent() {
return content;
}
public void setContent(LinkedList<ContentConf> content) {
this.content = content;
}
public SettingConf getSetting() {
return setting;
}
public void setSetting(SettingConf setting) {
this.setting = setting;
}
@Override
public String toString() {
return "JobConf{" + "content=" + content + ", setting=" + setting + '}';
}
}
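- For reference, a minimal sync job script of the kind Gson deserializes into these classes might look like the following; the plugin names and parameters here are illustrative, not taken from the example file used above:
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [ { "name": "id", "type": "int" } ],
                        "sliceRecordCount": ["30"]
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": { "print": true }
                }
            }
        ],
        "setting": {
            "speed": { "channel": 1 }
        }
    }
}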
- Jumping back to com.dtstack.chunjun.Main, let's look at the exeSyncJob method in full:
package com.dtstack.chunjun;
public class Main {
private static void exeSyncJob(
StreamExecutionEnvironment env,
StreamTableEnvironment tableEnv,
String job,
Options options)
throws Exception {
SyncConf config = parseConf(job, options);
configStreamExecutionEnvironment(env, options, config);
SourceFactory sourceFactory = DataSyncFactoryUtil.discoverSource(config, env);
DataStream<RowData> dataStreamSource = sourceFactory.createSource();
SpeedConf speed = config.getSpeed();
if (speed.getReaderChannel() > 0) {
dataStreamSource =
((DataStreamSource<RowData>) dataStreamSource)
.setParallelism(speed.getReaderChannel());
}
dataStreamSource = addMappingOperator(config, dataStreamSource);
if (null != config.getCdcConf()
&& (null != config.getCdcConf().getDdl()
&& null != config.getCdcConf().getCache())) {
CdcConf cdcConf = config.getCdcConf();
DDLHandler ddlHandler = DataSyncFactoryUtil.discoverDdlHandler(cdcConf, config);
CacheHandler cacheHandler = DataSyncFactoryUtil.discoverCacheHandler(cdcConf, config);
dataStreamSource =
dataStreamSource.flatMap(
new RestorationFlatMap(ddlHandler, cacheHandler, cdcConf));
}
DataStream<RowData> dataStream;
boolean transformer =
config.getTransformer() != null
&& StringUtils.isNotBlank(config.getTransformer().getTransformSql());
if (transformer) {
dataStream = syncStreamToTable(tableEnv, config, dataStreamSource);
} else {
dataStream = dataStreamSource;
}
if (speed.isRebalance()) {
dataStream = dataStream.rebalance();
}
SinkFactory sinkFactory = DataSyncFactoryUtil.discoverSink(config);
DataStreamSink<RowData> dataStreamSink = sinkFactory.createSink(dataStream);
if (speed.getWriterChannel() > 0) {
dataStreamSink.setParallelism(speed.getWriterChannel());
}
JobExecutionResult result = env.execute(options.getJobName());
if (env instanceof MyLocalStreamEnvironment) {
PrintUtil.printResult(result.getAllAccumulatorResults());
}
}
}
- In DataSyncFactoryUtil.discoverSource, the reader name configured in the job script is resolved to the plugin's class name, and the corresponding SourceFactory is then instantiated via reflection.
- Every connector plugin has its own SourceFactory implementation, such as MysqlSourceFactory, KafkaSourceFactory, and Elasticsearch7SourceFactory.
- DataSyncFactoryUtil also provides a discoverSink method that builds the SinkFactory on the same principle:
public static SourceFactory discoverSource(SyncConf config, StreamExecutionEnvironment env) {
try {
String pluginName = config.getJob().getReader().getName();
String pluginClassName = PluginUtil.getPluginClassName(pluginName, OperatorType.source);
return ClassLoaderManager.newInstance(
config.getSyncJarList(),
cl -> {
Class<?> clazz = cl.loadClass(pluginClassName);
Constructor<?> constructor =
clazz.getConstructor(
SyncConf.class, StreamExecutionEnvironment.class);
return (SourceFactory) constructor.newInstance(config, env);
});
} catch (Exception e) {
throw new ChunJunRuntimeException(e);
}
}
public static SinkFactory discoverSink(SyncConf config) {
    // body omitted in this article: it mirrors discoverSource, resolving the writer plugin
    // name to a class and instantiating the SinkFactory via reflection
}
- Next, exeSyncJob calls sourceFactory.createSource() to create a Flink DataStream.
- Looking inside createSource, it essentially builds the source by calling Flink's env.addSource(...) (a simplified sketch follows).
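- A minimal sketch of that idea, assuming the connector has already built a SourceFunction<RowData> and its TypeInformation; the real SourceFactory implementations do considerably more setup, and the operator name below is illustrative:
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import org.apache.flink.table.data.RowData;
// Sketch only: the essence of a connector's createSource(); sourceFunction and typeInfo
// stand for whatever the concrete connector builds.
public class CreateSourceSketch {
    public static DataStreamSource<RowData> createSource(
            StreamExecutionEnvironment env,
            SourceFunction<RowData> sourceFunction,
            TypeInformation<RowData> typeInfo) {
        return env.addSource(sourceFunction, "mysqlsource", typeInfo);
    }
}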
- exeSyncJob then adjusts the parallelism of the DataStreamSource according to the speed.readerChannel setting in the job script.
- By default the channel setting is used, which can be seen from the configStreamExecutionEnvironment method called inside exeSyncJob.
- exeSyncJob then calls addMappingOperator, which uses the nameMapping configuration in the job script to transform table metadata.
- exeSyncJob then handles the CDC configuration (the cdcConf / RestorationFlatMap branch shown in the code above).
- Next, exeSyncJob transforms the raw data. The transformation logic is the transformSql configured in the job script, and the code mainly relies on Flink's tableEnv, as follows (see the sketch after this list):
- The DataStreamSource created earlier is registered as a Table and exposed as a view (named after the reader's tableName in the job script).
- tableEnv.sqlQuery(...) is called with transformSql to transform that view.
- Finally, the resulting DataStream is registered as another view (named after the writer's tableName in the job script). Note: this last step currently appears to serve no purpose; it is not used afterwards.
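- A minimal, self-contained sketch of that round trip; the names here (applyTransformSql, readerView, writerView) are illustrative and it uses Row instead of ChunJun's RowData for brevity, while the real logic lives in Main.syncStreamToTable and derives field names and types from the reader configuration:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;
// Sketch only: the Table API round trip behind transformSql.
public class TransformSqlSketch {
    static DataStream<Row> applyTransformSql(
            StreamTableEnvironment tableEnv,
            String readerView,    // tableName configured for the reader
            String writerView,    // tableName configured for the writer
            String transformSql,  // SQL from the transformer section of the job script
            DataStream<Row> source) {
        tableEnv.createTemporaryView(readerView, source);    // register the source stream as a view
        Table adapted = tableEnv.sqlQuery(transformSql);      // run the configured transformation SQL
        DataStream<Row> result =
                tableEnv.toRetractStream(adapted, Row.class)  // (isAccumulate, row) pairs
                        .filter(t -> t.f0)
                        .map(t -> t.f1)
                        .returns(Row.class);
        tableEnv.createTemporaryView(writerView, result);     // register the result view (currently unused downstream)
        return result;
    }
}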
- Next, exeSyncJob calls DataSyncFactoryUtil.discoverSink to build the corresponding SinkFactory, then calls SinkFactory.createSink, which uses the Flink API (dataStream.addSink(...)) to wire up the output side (a simplified sketch follows).
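- As on the source side, a minimal sketch of the essence of createSink, assuming the connector has already built a SinkFunction<RowData>; concrete SinkFactory implementations configure more than this, and the operator name is illustrative:
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSink;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.table.data.RowData;
// Sketch only: the essence of a connector's createSink(); sinkFunction stands for whatever
// the concrete connector builds.
public class CreateSinkSketch {
    public static DataStreamSink<RowData> createSink(
            DataStream<RowData> dataStream, SinkFunction<RowData> sinkFunction) {
        return dataStream.addSink(sinkFunction).name("mysqlsink");
    }
}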
- Finally, env.execute(options.getJobName()) is called and the Flink job starts executing.
ChunJun job: SQL
- Taking SQL as the example, exeSqlJob is called next:
package com.dtstack.chunjun;
public class Main {
private static void exeSqlJob(
StreamExecutionEnvironment env,
StreamTableEnvironment tableEnv,
String job,
Options options) {
try {
configStreamExecutionEnvironment(env, options, null);
List<URL> jarUrlList = ExecuteProcessHelper.getExternalJarUrls(options.getAddjar());
StatementSet statementSet = SqlParser.parseSql(job, jarUrlList, tableEnv);
TableResult execute = statementSet.execute();
if (env instanceof MyLocalStreamEnvironment) {
Optional<JobClient> jobClient = execute.getJobClient();
if (jobClient.isPresent()) {
PrintUtil.printResult(jobClient.get().getAccumulators().get());
}
}
} catch (Exception e) {
throw new ChunJunRuntimeException(e);
} finally {
FactoryUtil.getFactoryHelperThreadLocal().remove();
TableFactoryService.getFactoryHelperThreadLocal().remove();
}
}
}
- The key step in exeSqlJob is the call to SqlParser.parseSql(job, jarUrlList, tableEnv), which produces a Flink StatementSet:
package com.dtstack.chunjun.sql.parser;
public class SqlParser {
private static final char SQL_DELIMITER = ';';
public static StatementSet parseSql(
String sql, List<URL> urlList, StreamTableEnvironment tableEnvironment) {
if (StringUtils.isBlank(sql)) {
throw new IllegalArgumentException("SQL must be not empty!");
}
sql = DtStringUtil.dealSqlComment(sql);
StatementSet statement = tableEnvironment.createStatementSet();
Splitter splitter = new Splitter(SQL_DELIMITER);
List<String> stmts = splitter.splitEscaped(sql);
AbstractStmtParser stmtParser = createParserChain();
stmts.stream()
.filter(stmt -> !Strings.isNullOrEmpty(stmt.trim()))
.forEach(
stmt -> {
try {
stmtParser.handleStmt(stmt, tableEnvironment, statement, urlList);
} catch (Exception e) {
throw new ChunJunSqlParseException(
PwdUtil.desensitization(stmt), e.getMessage(), e);
}
});
return statement;
}
}
- parseSql performs a series of parsing steps on the SQL:
- DtStringUtil.dealSqlComment(sql) is called to strip comments.
- The Flink API is used to create a StatementSet, which is returned at the end and executed later.
- A Splitter splits the SQL string on the ; character into individual SQL statements.
- A parser chain is then built using the chain-of-responsibility pattern: UploadFileStmtParser -> CreateFunctionStmtParser -> InsertStmtParser.
- Finally, a stream processes each individual statement in order through that chain.
- The chain created by createParserChain tries the parsers in sequence for each statement: UploadFileStmtParser -> CreateFunctionStmtParser -> InsertStmtParser (a generic sketch of this dispatch follows the list below).
- UploadFileStmtParser
  - Match condition: the statement matches the regex "(?i).*add\\s+file\\s+.+|(?i).*add\\s+jar\\s+.+"
  - Action: currently empty
- CreateFunctionStmtParser
  - Match condition: the statement matches the regex "(?i)\\s*CREATE\\s+(scalar|table|aggregate)\\s+FUNCTION\\s+(\\S+)\\s+WITH\\s+(\\S+)"
  - Action: loads the class with a ClassLoader and registers the user-defined function (UDF)
- InsertStmtParser
  - Match condition: the statement starts with insert, i.e. an INSERT statement
  - Action: calls statementSet.addInsertSql(sql), adding the statement to the StatementSet
- When a statement matches none of the parsers above, tEnv.executeSql(stmt) is called to execute it directly (e.g. CREATE statements).
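- A generic sketch of that dispatch, for illustration only; the class and method names below (StmtParser, canHandle, execute) are assumptions and not ChunJun's exact AbstractStmtParser API:
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
// Illustration of the chain-of-responsibility pattern described above.
abstract class StmtParser {
    private StmtParser next;
    StmtParser setNext(StmtParser next) {
        this.next = next;
        return next;
    }
    void handle(String stmt, StreamTableEnvironment tEnv, StatementSet statementSet) {
        if (canHandle(stmt)) {
            execute(stmt, tEnv, statementSet);
        } else if (next != null) {
            next.handle(stmt, tEnv, statementSet);
        } else {
            // no parser matched: run the statement directly, e.g. a CREATE TABLE statement
            tEnv.executeSql(stmt);
        }
    }
    abstract boolean canHandle(String stmt);
    abstract void execute(String stmt, StreamTableEnvironment tEnv, StatementSet statementSet);
}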
- Once all statements have been parsed and handled, the StatementSet containing the INSERT statements is returned, and statementSet.execute() is finally called.
- The CREATE statements have already been executed during the parseSql phase.
- Only afterwards are the remaining INSERT statements processed.
- This avoids the ordering problem of "INSERT into a table before CREATE-ing it". An example script illustrating this split is shown below.
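- For illustration, a typical ChunJun SQL job script has the shape below; the table definitions and connector options are illustrative only. The two CREATE TABLE statements fall through the parser chain and are run immediately via tEnv.executeSql during parseSql, while the INSERT is collected by InsertStmtParser and only runs when statementSet.execute() is called:
-- executed directly by tEnv.executeSql during parseSql
CREATE TABLE source_table (
    id   INT,
    name STRING
) WITH (
    -- illustrative connector options
    'connector' = 'stream-x'
);
CREATE TABLE sink_table (
    id   INT,
    name STRING
) WITH (
    'connector' = 'stream-x'
);
-- collected into the StatementSet by InsertStmtParser, executed at statementSet.execute()
INSERT INTO sink_table
SELECT id, name
FROM source_table;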