References
Kyuubi in practice | Compiling Spark 3.1 for CDH 5 and integrating Kyuubi (proginn.com): https://jishuin.proginn.com/p/763bfbd67cf6
SPARK-35758: https://issues.apache.org/jira/browse/SPARK-35758
[SPARK-35758][DOCS] Update the document about building Spark with Hadoop for Hadoop 2.x and 3.x (apache/spark PR #32917): https://github.com/apache/spark/pull/32917
Spark 3 no longer officially supports Hadoop versions below 2.7, while our production environment still runs CDH 5.16.2 (hadoop-2.6.0-cdh5.16.2), whose Hadoop version is older than that, so we have to compile Spark 3 ourselves.
The procedure below has been used to successfully build spark-3.0.3, spark-3.1.1, spark-3.1.2, spark-3.1.3 and spark-3.2.1. Since we decided to use the second-most-recent release in production, this article takes spark-3.1.3 as the example.
Prepare the Java, Scala and Maven environments in advance:
java -version #1.8.0_311
mvn -v #Apache Maven 3.6.3
scala -version #2.12.10
Add an environment variable (in /etc/profile) so that Maven can use more memory during compilation:
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"
# Create the working directory
sudo mkdir /bi_bigdata/user_shell/spark
# Download the source tarball into the directory
wget https://archive.apache.org/dist/spark/spark-3.1.3/spark-3.1.3.tgz -P /bi_bigdata/user_shell/spark
# Extract into the target directory
tar -zxvf /bi_bigdata/user_shell/spark/spark-3.1.3.tgz -C /bi_bigdata/user_shell/spark/
cd /bi_bigdata/user_shell/spark/spark-3.1.3
The changes below target Hadoop versions lower than 2.6.4 and were made by working through the compile errors that come up.
vim resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
/* Comment out the original block:
    sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
      try {
        val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
        logAggregationContext.setRolledLogsIncludePattern(includePattern)
        sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
          logAggregationContext.setRolledLogsExcludePattern(excludePattern)
        }
        appContext.setLogAggregationContext(logAggregationContext)
      } catch {
        case NonFatal(e) =>
          logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
            "does not support it", e)
      }
    }
    appContext.setUnmanagedAM(isClientUnmanagedAMEnabled)

    sparkConf.get(APPLICATION_PRIORITY).foreach { appPriority =>
      appContext.setPriority(Priority.newInstance(appPriority))
    }
    appContext
  }
*/
/* Replace it with: */
    sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
      try {
        val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
        // These two methods were added in Hadoop 2.6.4, so we still need to use reflection to
        // avoid compile error when building against Hadoop 2.6.0 ~ 2.6.3.
        val setRolledLogsIncludePatternMethod =
          logAggregationContext.getClass.getMethod("setRolledLogsIncludePattern", classOf[String])
        setRolledLogsIncludePatternMethod.invoke(logAggregationContext, includePattern)
        sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
          val setRolledLogsExcludePatternMethod =
            logAggregationContext.getClass.getMethod("setRolledLogsExcludePattern", classOf[String])
          setRolledLogsExcludePatternMethod.invoke(logAggregationContext, excludePattern)
        }
        appContext.setLogAggregationContext(logAggregationContext)
      } catch {
        case NonFatal(e) =>
          logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
            "does not support it", e)
      }
    }
    appContext
  }
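Note that, compared with the commented-out original, this replacement (as in the referenced article) also drops the appContext.setUnmanagedAM(...) and APPLICATION_PRIORITY lines. The reflection only moves the failure from compile time to run time: on a YARN client that really lacks these setters, getMethod throws and the existing NonFatal handler logs the warning. A minimal sketch of the same probe, runnable in spark-shell or any JVM with the Hadoop YARN API on the classpath (the class and method names come from the patch above; the probe itself is only illustrative):

import scala.util.Try

// Look up the rolled-log setter that the patched Client.scala calls reflectively.
// Succeeds on Hadoop 2.6.4+ clients, fails (and is caught) on 2.6.0 ~ 2.6.3.
val hasRolledLogSetter = Try {
  Class.forName("org.apache.hadoop.yarn.api.records.LogAggregationContext")
    .getMethod("setRolledLogsIncludePattern", classOf[String])
}.isSuccess

println(s"setRolledLogsIncludePattern available: $hasRolledLogSetter")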
vim core/src/main/scala/org/apache/spark/util/Utils.scala
// Comment out:
//import org.apache.hadoop.util.{RunJar, StringUtils}
// Replace with:
import org.apache.hadoop.util.RunJar
def unpack(source: File, dest: File): Unit = {
  // StringUtils.toLowerCase cannot be resolved when building against hadoop-2.6.0,
  // so the import is dropped and the equivalent logic is inlined here.
  // val lowerSrc = StringUtils.toLowerCase(source.getName)
  if (source.getName == null) {
    throw new NullPointerException
  }
  val lowerSrc = source.getName.toLowerCase()
  if (lowerSrc.endsWith(".jar")) {
    RunJar.unJar(source, dest, RunJar.MATCH_ANY)
  } else if (lowerSrc.endsWith(".zip")) {
    FileUtil.unZip(source, dest)
  } else if (
    lowerSrc.endsWith(".tar.gz") || lowerSrc.endsWith(".tgz") || lowerSrc.endsWith(".tar")) {
    FileUtil.unTar(source, dest)
  } else {
    logWarning(s"Cannot unpack $source, just copying it to $dest.")
    copyRecursive(source, dest)
  }
}
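One small difference from the removed code: Hadoop's StringUtils helper lower-cases with a fixed locale (so file-name matching does not depend on the JVM's default locale, e.g. the Turkish dotted/dotless "i"), while the plain toLowerCase() above uses the default locale. If you want to keep that guarantee, a locale-safe sketch of the same helper looks like this (toLowerCaseSafe is a made-up name, not part of the patch):

import java.util.Locale

// Locale-independent lower-casing, so ".TGZ" still matches ".tgz" under any default locale.
def toLowerCaseSafe(s: String): String = s.toLowerCase(Locale.ROOT)

println(toLowerCaseSafe("SPARK-3.1.3.TGZ"))  // spark-3.1.3.tgz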
vim core/src/main/scala/org/apache/spark/ui/HttpSecurityFilter.scala
private val parameterMap: Map[String, Array[String]] = {
  super.getParameterMap().asScala.map { case (name, values) =>
    // The original `values.map(stripXSS)` fails to compile in this build with:
    // "Unapplied methods are only converted to functions when a function type is expected.
    //  You can make this conversion explicit by writing `stripXSS _` or `stripXSS(_)` instead of `stripXSS`."
    // stripXSS(name) -> values.map(stripXSS)
    stripXSS(name) -> values.map(stripXSS(_))
  }.toMap
}
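The quoted compiler message is about eta-expansion in Scala 2: a method is only converted to a function value automatically when a function type is expected; otherwise the conversion has to be written out. A Spark-independent sketch of the same message (shout is just an illustrative method):

def shout(s: String): String = s.toUpperCase

// val f = shout                               // error: unapplied method, no function type expected
val f: String => String = shout                // fine: a function type is expected, eta-expansion applies
val g = shout _                                // fine: explicit eta-expansion
val h = Array("yarn", "spark").map(shout(_))   // fine: explicit placeholder, as in the patch above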
vim pom.xml
vim pom.xml
central
Maven Repository
https://mvnrepository.com/repos/central
true
false
cloudera
cloudera Repository
https://repository.cloudera.com/artifactory/cloudera-repos/
cloudera
Cloudera Repositories
https://repository.cloudera.com/artifactory/cloudera-repos/
In the extracted Spark source directory, build a distributable binary tarball.
Per the official guidance, pass -Phadoop-2.7 when building against a Hadoop 2.x version (see SPARK-35758 and apache/spark PR #32917 in the references above).
./dev/make-distribution.sh --name 2.6.0-cdh5.16.2 --pip --tgz -Phive -Phive-thriftserver -Pmesos -Pyarn -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.6.0-cdh5.16.2 -Dscala.version=2.12.10 -X
After the build completes, the distributable tgz package appears in the current directory and can then be deployed to the production environment.
# List the generated tgz package
ll -h |grep tgz |grep spark
#spark-3.1.3-bin-2.6.0-cdh5.16.2.tgz
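After deploying the new distribution, a quick way to confirm that Spark was really linked against the CDH Hadoop client is to print the Hadoop version from a spark-shell started from this build (org.apache.hadoop.util.VersionInfo is a long-standing Hadoop API; the expected output below is an assumption based on the -Dhadoop.version used above):

import org.apache.hadoop.util.VersionInfo

// Prints the Hadoop version bundled into this Spark build.
println(VersionInfo.getVersion)  // expected: 2.6.0-cdh5.16.2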