If you write it like this:
new SparkConf().setMaster("yarn-client")
debugging inside IDEA fails with:
Exception in thread "main" java.lang.IllegalStateException: Library directory '....../data-platform-task/assembly/target/scala-2.11/jars' does not exist; make sure Spark is built.
at org.apache.spark.launcher.CommandBuilderUtils.checkState(CommandBuilderUtils.java:248)
According to the official Spark documentation, spark.yarn.jars or spark.yarn.archive needs to be set. Update the program:
new SparkConf().setMaster("yarn-client")
.set("spark.yarn.archive", getProperty(HDFS_SPARK_ARCHIVE))
Debugging in IDEA again, it fails with:
Caused by: java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.x$334 of type org.apache.spark.api.java.function.PairFunction in instance of org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2287)
This is because the classes the task depends on cannot be found: spark.yarn.archive only ships the Spark runtime jars, not the project's own classes or its third-party dependencies.
Reading further in the documentation, there is also spark.yarn.dist.jars, a comma-separated list of jars to be placed in the working directory of each executor.
Update the program again:
new SparkConf().setMaster("yarn-client")
.set("spark.yarn.archive", getProperty(HDFS_SPARK_ARCHIVE))
.set("spark.yarn.dist.jars", getProperty(TASK_JARS))
Debug once more, and now it runs:
19/06/27 12:53:12 INFO yarn.YarnAllocator: Will request 2 executor container(s), each with 1 core(s) and 1408 MB memory (including 384 MB of overhead)
19/06/27 12:53:12 INFO yarn.YarnAllocator: Submitted 2 unlocalized container requests.
19/06/27 12:53:12 INFO yarn.ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
19/06/27 12:53:12 INFO impl.AMRMClientImpl: Received new token for : leishu-OptiPlex-7060:39105
19/06/27 12:53:12 INFO yarn.YarnAllocator: Launching container container_1561543784696_0031_01_000002 on host leishu-OptiPlex-7060 for executor with ID 1
19/06/27 12:53:13 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
19/06/27 12:53:13 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/06/27 12:53:13 INFO impl.ContainerManagementProtocolProxy: Opening proxy : leishu-OptiPlex-7060:39105
19/06/27 12:53:13 INFO yarn.YarnAllocator: Launching container container_1561543784696_0031_01_000003 on host leishu-OptiPlex-7060 for executor with ID 2
19/06/27 12:53:13 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
19/06/27 12:53:13 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0
19/06/27 12:53:13 INFO impl.ContainerManagementProtocolProxy: Opening proxy : leishu-OptiPlex-7060:39105
19/06/27 12:53:16 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 0 of them.
19/06/27 12:53:18 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
19/06/27 12:53:18 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.16.209.105:33251
19/06/27 12:53:18 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. 172.16.209.105:33251
19/06/27 12:53:18 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0
19/06/27 12:53:18 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
19/06/27 12:53:18 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
19/06/27 12:53:18 INFO yarn.ApplicationMaster: Deleting staging directory file:/home/.../.sparkStaging/application_1561543784696_0031
19/06/27 12:53:18 INFO util.ShutdownHookManager: Shutdown hook called
For the spark.yarn.archive parameter, I compressed the jars it needs (all of the files under /spark-2.4.3-bin-hadoop2.7/jars) into a single zip file and uploaded it to HDFS.
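Creating the zip itself can be done with any archiver; as a minimal sketch, packaging the jars directory in code might look like the following (the local paths are illustrative; the jars are placed at the archive root, which is the layout spark.yarn.archive expects):

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch: pack every jar under $SPARK_HOME/jars into one zip,
// with all entries at the archive root.
public class ZipSparkJars {
    public static void main(String[] args) throws IOException {
        Path jarsDir = Paths.get("/spark-2.4.3-bin-hadoop2.7/jars");   // source jars
        Path zipFile = Paths.get("/tmp/spark-2.4.3-hadoop2.7.7.zip");  // illustrative output path
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile));
             DirectoryStream<Path> jars = Files.newDirectoryStream(jarsDir, "*.jar")) {
            for (Path jar : jars) {
                zos.putNextEntry(new ZipEntry(jar.getFileName().toString()));
                Files.copy(jar, zos);
                zos.closeEntry();
            }
        }
    }
}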
The upload to HDFS is implemented with the following code:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SparkJar2Hdfs {

    public static void main(String[] args) throws Exception {
        // Local path of the zip to upload, and the target directory on HDFS.
        Path src = new Path(getProperty(SPARK_JARS_ZIP));
        Path dst = new Path(getProperty(HDFS_SPARK_JARS_PATH));
        removeDir(dst);
        if (createDir(dst) && uploadPath(src, dst)) {
            listStatus(dst);
        }
    }

    private static FileSystem getCorSys() {
        Configuration conf = new Configuration();
        try {
            return FileSystem.get(URI.create(getProperty(HDFS_SPARK_ROOT)), conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return null;
    }

    // Create the target directory if it does not exist yet.
    private static boolean createDir(Path path) {
        try (FileSystem coreSys = getCorSys()) {
            if (coreSys.exists(path)) {
                return true;
            }
            return coreSys.mkdirs(path);
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    // Delete the directory recursively if it exists.
    private static boolean removeDir(Path path) {
        try (FileSystem coreSys = getCorSys()) {
            if (coreSys.exists(path)) {
                return coreSys.delete(path, true);
            }
            return true;
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    // Upload the local file into the HDFS directory.
    private static boolean uploadPath(Path srcPath, Path desPath) {
        try (FileSystem coreSys = getCorSys()) {
            if (coreSys.isDirectory(desPath)) {
                coreSys.copyFromLocalFile(srcPath, desPath);
                return true;
            }
            throw new IOException("desPath does not exist");
        } catch (IOException e) {
            e.printStackTrace();
            return false;
        }
    }

    // List the files under the target directory.
    private static void listStatus(Path desPath) {
        try (FileSystem coreSys = getCorSys()) {
            for (FileStatus file : coreSys.listStatus(desPath)) {
                System.out.println(file.getPath());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
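Note that getProperty and the SPARK_JARS_ZIP / HDFS_SPARK_JARS_PATH / HDFS_SPARK_ROOT constants above come from the project's own configuration, which is not shown here. A hypothetical stand-in, assuming a task.properties file on the classpath, might look like:

import java.io.InputStream;
import java.util.Properties;

// Hypothetical configuration helper; the real project reads these keys
// from its own properties file. Statically import getProperty and the
// constants so SparkJar2Hdfs compiles as written above.
final class TaskProps {
    static final String SPARK_JARS_ZIP = "spark.jars.zip";         // local zip of $SPARK_HOME/jars
    static final String HDFS_SPARK_JARS_PATH = "hdfs.spark.jars";  // target directory on HDFS
    static final String HDFS_SPARK_ROOT = "hdfs.spark.root";       // e.g. hdfs://localhost:9000

    private static final Properties PROPS = new Properties();
    static {
        try (InputStream in = TaskProps.class.getResourceAsStream("/task.properties")) {
            PROPS.load(in);
        } catch (Exception e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static String getProperty(String key) {
        return PROPS.getProperty(key);
    }
}

Since new Configuration() loads core-site.xml from the classpath, the value behind HDFS_SPARK_ROOT should match fs.defaultFS (hdfs://localhost:9000 in the run below).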
Running it prints the file's URL:
Connected to the target VM, address: '127.0.0.1:39539', transport: 'socket'
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
hdfs://localhost:9000/user/.../spark-libs/spark-2.4.3-hadoop2.7.7.zip
Disconnected from the target VM, address: '127.0.0.1:39539', transport: 'socket'
Process finished with exit code 0
For the spark.yarn.dist.jars parameter, maven-shade-plugin can be used:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>3.2.1</version>
    <configuration>
        <shadedArtifactAttached>false</shadedArtifactAttached>
        <outputFile>${project.build.directory}/shaded/data-platform-task-${project.version}-shaded.jar</outputFile>
        <artifactSet>
            <includes>
                <include>com.alibaba:druid</include>
                <include>com.aliyun:emr-core</include>
                <include>com.google.inject:guice</include>
                <include>log4j:log4j</include>
                <include>org.postgresql:postgresql</include>
                <include>org.slf4j:slf4j-api</include>
                <include>org.slf4j:slf4j-log4j12</include>
                <include>org.projectlombok:lombok</include>
                <include>org.springframework:spring-jdbc</include>
            </includes>
        </artifactSet>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
        </execution>
    </executions>
</plugin>
This packages the task project's own classes, together with all the third-party jars it depends on, into a single jar.
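Putting both pieces together, a sketch of the final configuration would look like this (the archive URL is the one printed by SparkJar2Hdfs above; the path to the shaded jar is illustrative):

// Sketch: concrete values behind the two properties used earlier.
SparkConf conf = new SparkConf()
        .setMaster("yarn-client")
        .set("spark.yarn.archive",
                "hdfs://localhost:9000/user/.../spark-libs/spark-2.4.3-hadoop2.7.7.zip")
        .set("spark.yarn.dist.jars",
                "/.../target/shaded/data-platform-task-1.0-shaded.jar");  // illustrative path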