Compiling Spark 3.x from source to support CDH 5.16.2 (hadoop-2.6.0-cdh5.16.2)

References

Kyuubi in practice | Compiling Spark 3.1 for CDH 5 and integrating Kyuubi (proginn.com): https://jishuin.proginn.com/p/763bfbd67cf6

https://issues.apache.org/jira/browse/SPARK-35758

[SPARK-35758][DOCS] Update the document about building Spark with Hadoop for Hadoop 2.x and 3.x by sarutak · Pull Request #32917 · apache/spark: https://github.com/apache/spark/pull/32917

Spark 3 no longer directly supports Hadoop versions as old as 2.6 (the oldest supported line is 2.7), while our production environment still runs CDH 5.16.2 (hadoop-2.6.0-cdh5.16.2), so we have to compile Spark 3 ourselves.

The method described here has been used to successfully build spark3.0.3, spark3.1.1, spark3.1.2, spark3.1.3 and spark3.2.1. Since we decided to run the second-newest release in production, the rest of this article uses spark3.1.3 as the example.

1) Prepare the build environment

Prepare the Java, Scala and Maven environments in advance:

java -version   #1.8.0_311
mvn -v          #Apache Maven 3.6.3
scala -version  #2.12.10

Add an environment variable (to /etc/profile) so that Maven can use more memory during the build:

export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=1g"

2) Download the source code

# Create the directory
sudo mkdir /bi_bigdata/user_shell/spark
# Download the source tarball into it
wget https://archive.apache.org/dist/spark/spark-3.1.3/spark-3.1.3.tgz -P /bi_bigdata/user_shell/spark

# Extract to the target directory
tar -zxvf /bi_bigdata/user_shell/spark/spark-3.1.3.tgz -C /bi_bigdata/user_shell/spark/
cd /bi_bigdata/user_shell/spark/spark-3.1.3

3) Patch the incompatible code

These changes are mainly needed because our Hadoop version is below 2.6.4; they were worked out by following the compilation errors one by one.

① First change: the yarn module

vim resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
 /* Comment out the following original block:
    sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
      try {
        val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
        logAggregationContext.setRolledLogsIncludePattern(includePattern)
        sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
          logAggregationContext.setRolledLogsExcludePattern(excludePattern)
        }
        appContext.setLogAggregationContext(logAggregationContext)
      } catch {
        case NonFatal(e) =>
          logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
            "does not support it", e)
      }
    }
    appContext.setUnmanagedAM(isClientUnmanagedAMEnabled)

    sparkConf.get(APPLICATION_PRIORITY).foreach { appPriority =>
      appContext.setPriority(Priority.newInstance(appPriority))
    }
    appContext
  }
*/

/* Replace it with the following: */
    sparkConf.get(ROLLED_LOG_INCLUDE_PATTERN).foreach { includePattern =>
      try {
        val logAggregationContext = Records.newRecord(classOf[LogAggregationContext])
 
        // These two methods were added in Hadoop 2.6.4, so we still need to use reflection to
        // avoid compile error when building against Hadoop 2.6.0 ~ 2.6.3.
        val setRolledLogsIncludePatternMethod =
          logAggregationContext.getClass.getMethod("setRolledLogsIncludePattern", classOf[String])
        setRolledLogsIncludePatternMethod.invoke(logAggregationContext, includePattern)
 
        sparkConf.get(ROLLED_LOG_EXCLUDE_PATTERN).foreach { excludePattern =>
          val setRolledLogsExcludePatternMethod =
            logAggregationContext.getClass.getMethod("setRolledLogsExcludePattern", classOf[String])
          setRolledLogsExcludePatternMethod.invoke(logAggregationContext, excludePattern)
        }
 
        appContext.setLogAggregationContext(logAggregationContext)
      } catch {
        case NonFatal(e) =>
          logWarning(s"Ignoring ${ROLLED_LOG_INCLUDE_PATTERN.key} because the version of YARN " +
            "does not support it", e)
      }
      
    }
    appContext
  }
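
The key idea in this patch is to call setRolledLogsIncludePattern/setRolledLogsExcludePattern through reflection, so the class compiles against Hadoop 2.6.0 (where the methods do not exist) and simply degrades gracefully at runtime if YARN lacks them. A minimal standalone sketch of that pattern (the object and helper names below are hypothetical, not part of Spark):

import scala.util.control.NonFatal

object ReflectiveCallSketch {
  // Look up `methodName(String)` on `target` at runtime and invoke it if present.
  // Because the lookup happens at runtime, this compiles even when the class on the
  // compile-time classpath (e.g. Hadoop 2.6.0's LogAggregationContext) does not
  // declare the method.
  def callIfAvailable(target: AnyRef, methodName: String, arg: String): Boolean = {
    try {
      val method = target.getClass.getMethod(methodName, classOf[String])
      method.invoke(target, arg)
      true
    } catch {
      case _: NoSuchMethodException => false // the running YARN/Hadoop lacks this API
      case NonFatal(_) => false              // any other reflective failure
    }
  }

  def main(args: Array[String]): Unit = {
    // java.lang.StringBuilder#append(String) exists, so this prints true.
    println(callIfAvailable(new java.lang.StringBuilder, "append", "hello"))
  }
}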

② Second change: the Utils module

vim core/src/main/scala/org/apache/spark/util/Utils.scala
// Comment out the original import
//import org.apache.hadoop.util.{RunJar, StringUtils}
// and replace it with
import org.apache.hadoop.util.RunJar

def unpack(source: File, dest: File): Unit = {
    // StringUtils is not available in hadoop 2.6.0, so the import is dropped and equivalent behaviour is implemented inline
    // val lowerSrc = StringUtils.toLowerCase(source.getName)
    if (source.getName == null) {
      throw new NullPointerException
    }
    val lowerSrc = source.getName.toLowerCase()
    if (lowerSrc.endsWith(".jar")) {
      RunJar.unJar(source, dest, RunJar.MATCH_ANY)
    } else if (lowerSrc.endsWith(".zip")) {
      FileUtil.unZip(source, dest)
    } else if (
      lowerSrc.endsWith(".tar.gz") || lowerSrc.endsWith(".tgz") || lowerSrc.endsWith(".tar")) {
      FileUtil.unTar(source, dest)
    } else {
      logWarning(s"Cannot unpack $source, just copying it to $dest.")
      copyRecursive(source, dest)
    }
  }
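
One small behavioural note: Hadoop's StringUtils.toLowerCase lower-cases with a fixed locale, while the plain String.toLowerCase() used above follows the JVM's default locale. For these particular suffixes that rarely matters, but if you prefer to keep the lower-casing locale-independent, a pinned-locale variant like the hypothetical sketch below works (not part of the patch):

import java.util.Locale

// Hypothetical helper: lower-case with a fixed locale so the suffix check does not
// depend on the JVM's default locale settings.
object SuffixCheck {
  def hasSuffix(fileName: String, suffix: String): Boolean =
    fileName.toLowerCase(Locale.ROOT).endsWith(suffix)

  def main(args: Array[String]): Unit = {
    println(hasSuffix("spark-3.1.3.TGZ", ".tgz"))     // true
    println(hasSuffix("archive.tar.gz", ".tar.gz"))   // true
  }
}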

③ Third change: the HttpSecurityFilter module

vim core/src/main/scala/org/apache/spark/ui/HttpSecurityFilter.scala
  private val parameterMap: Map[String, Array[String]] = {
    super.getParameterMap().asScala.map { case (name, values) =>
      // Unapplied methods are only converted to functions when a function type is expected.
      // You can make this conversion explicit by writing `stripXSS _` or `stripXSS(_)` instead of `stripXSS`.
      // Original line: stripXSS(name) -> values.map(stripXSS)
      stripXSS(name) -> values.map(stripXSS(_))
    }.toMap
  }
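
For context, the compiler message quoted in those comments concerns eta-expansion: stripXSS is a method, and writing stripXSS(_) (or stripXSS _) converts it into a function value explicitly rather than relying on automatic conversion. A tiny self-contained illustration, using stand-in names rather than Spark's actual filter code:

object EtaExpansionSketch {
  // Stand-in for the filter's stripXSS method.
  def stripXSS(value: String): String = value.replaceAll("[<>]", "")

  def main(args: Array[String]): Unit = {
    val values = Array("<b>bold</b>", "plain")
    // Explicit eta-expansion, mirroring values.map(stripXSS(_)) in the patch above.
    println(values.map(stripXSS(_)).mkString(", "))
  }
}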

4) Edit Spark's parent pom.xml: change the central repository URL to https://mvnrepository.com/repos/central and add the Cloudera (CDH) repository

vim pom.xml

<!-- Replace the original central repository entry under <repositories> -->
    <repository>
      <id>central</id>
      <name>Maven Repository</name>
      <url>https://mvnrepository.com/repos/central</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>

<!-- Add the Cloudera repository under <repositories> -->
    <repository>
      <id>cloudera</id>
      <name>cloudera Repository</name>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>

<!-- Add the Cloudera repository under <pluginRepositories> -->
    <pluginRepository>
      <id>cloudera</id>
      <name>Cloudera Repositories</name>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </pluginRepository>

5) Build the distribution package

Run the build from the extracted Spark source directory to produce a distributable binary tarball.

Per the official guidance, specify -Phadoop-2.7 when building against Hadoop 2.x; see [SPARK-35758][DOCS] Update the document about building Spark with Hadoop for Hadoop 2.x and 3.x (apache/spark PR #32917, https://github.com/apache/spark/pull/32917).

./dev/make-distribution.sh --name 2.6.0-cdh5.16.2 --pip --tgz -Phive -Phive-thriftserver  -Pmesos -Pyarn -Pkubernetes -Phadoop-2.7 -Dhadoop.version=2.6.0-cdh5.16.2 -Dscala.version=2.12.10 -X

When the build finishes, the distributable tgz package can be found in the current directory and is ready to deploy to production.

# List the generated tgz package
ll -h | grep tgz | grep spark
# spark-3.1.3-bin-2.6.0-cdh5.16.2.tgz
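
After deploying, a quick smoke test from spark-shell started from the new build can confirm it really links against the CDH Hadoop client (a sketch of what to check; adjust to your deployment):

// Paste into spark-shell launched from the freshly built distribution:
org.apache.hadoop.util.VersionInfo.getVersion   // expect 2.6.0-cdh5.16.2
scala.util.Properties.versionNumberString       // expect 2.12.x
spark.range(10).count()                         // expect 10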
