Compiling and Modifying the Spark 2.1 Source Code on Windows

Building the Spark source code on Windows
After modifying the Spark source code it has to be recompiled. Since the Linux VM at hand cannot reach the network through a proxy and the company Maven repository cannot be pinged either, the only option is to compile the Spark source on Windows.
The compilation steps are as follows:
1. Download the Spark source code from the official site at http://spark.apache.org/downloads.html, selecting the 2.1.0 source package.
2. Import the Spark source project into IDEA (with IDEA's Maven settings configured correctly) and build the project. Only move on to compiling once the build succeeds.
Problem encountered during the build:
1) The SparkFlumeProtocol class cannot be found. The cause is that the Spark Flume Sink module is an external module, so its generated classes are not picked up during the build.
Open View => Tool Windows => Maven Projects, locate the Spark Project External Flume Sink module, right-click it and choose Generate Sources and Update Folders, then run compile on that module from its Lifecycle.
3. Build the Spark source code in Git Bash
The Spark build has to run in a bash environment; building directly on Windows fails because the bash command cannot be found:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project spark-core_2.11: An Ant BuildException has occured: Execute failed: java.io.IOException: Cannot run program "bash" (in directory "D:\workspace\spark-2.1.0\core"): CreateProcess error=2, The system cannot find the file specified.
In Git Bash, change into the Spark source directory.
Set the JVM memory for Maven; the build needs a lot of memory and will fail with an out-of-memory error if the limit is too small.
export MAVEN_OPTS="-Xmx4g -XX:MaxPermSize=1000m -XX:ReservedCodeCacheSize=1000m"
Then start the build, specifying the Hadoop version:
mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -DskipTests clean package
A long build follows; output like the following means the build has completed.
[INFO] Spark Project Parent POM .......................... SUCCESS [3.035s]
[INFO] Spark Project Tags ................................ SUCCESS [5.896s]
[INFO] Spark Project Sketch .............................. SUCCESS [9.240s]
[INFO] Spark Project Networking .......................... SUCCESS [10.402s]
[INFO] Spark Project Shuffle Streaming Service ........... SUCCESS [7.100s]
[INFO] Spark Project Unsafe .............................. SUCCESS [11.549s]
[INFO] Spark Project Launcher ............................ SUCCESS [8.769s]
[INFO] Spark Project Core ................................ SUCCESS [2:46.378s]
[INFO] Spark Project ML Local Library .................... SUCCESS [29.300s]
[INFO] Spark Project GraphX .............................. SUCCESS [36.614s]
[INFO] Spark Project Streaming ........................... SUCCESS [1:05.139s]
[INFO] Spark Project Catalyst ............................ SUCCESS [2:45.713s]
[INFO] Spark Project SQL ................................. SUCCESS [3:32.211s]
[INFO] Spark Project ML Library .......................... SUCCESS [2:21.122s]
[INFO] Spark Project Tools ............................... SUCCESS [6.362s]
[INFO] Spark Project Hive ................................ SUCCESS [2:31.883s]
[INFO] Spark Project REPL ................................ SUCCESS [15.956s]
[INFO] Spark Project YARN Shuffle Service ................ SUCCESS [6.727s]
[INFO] Spark Project YARN ................................ SUCCESS [33.774s]
[INFO] Spark Project Assembly ............................ SUCCESS [6.532s]
[INFO] Spark Project External Flume Sink ................. SUCCESS [14.954s]
[INFO] Spark Project External Flume ...................... SUCCESS [19.958s]
[INFO] Spark Project External Flume Assembly ............. SUCCESS [2.034s]
[INFO] Spark Integration for Kafka 0.8 ................... SUCCESS [26.434s]
[INFO] Spark Project Examples ............................ SUCCESS [33.904s]
[INFO] Spark Project External Kafka Assembly ............. SUCCESS [10.769s]
[INFO] Spark Integration for Kafka 0.10 .................. SUCCESS [26.168s]
[INFO] Spark Integration for Kafka 0.10 Assembly ......... SUCCESS [10.089s]
[INFO] Kafka 0.10 Source for Structured Streaming ........ SUCCESS [27.096s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 21:05.732s
[INFO] Finished at: Fri May 05 10:31:06 CST 2017
[INFO] Final Memory: 135M/1845M
[INFO] ------------------------------------------------------------------------

4. Repackage the spark-sql module after modifying its source

A business requirement calls for loading, updating, and deleting data in HBase through Spark SQL, which means the spark-sql source code has to be modified.

Supporting load/update/delete only requires changing part of the spark-sql source; after the change there is no need to rebuild the whole Spark project, repackaging the spark-sql module with Maven is enough.
In IDEA's terminal, change into the spark-sql core module (sql/core) and run:
mvn clean install -DskipTests
The rebuilt spark-sql jar can then be found under spark-2.1.0\sql\core\target in the source tree.
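The actual patch is not shown in this post. Purely as an illustration, the sketch below shows the kind of HBase 1.1.2 client calls (Put for load/update, Delete for delete) that such a modification ultimately issues from Spark. The object name HBaseWriteSketch, the row-key convention (first column as key), and the "cf"/"value" column layout are placeholders, not part of the real change.

// Hypothetical sketch only: HBase client calls a modified spark-sql write path could issue.
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.DataFrame

object HBaseWriteSketch {

  // LOAD/UPDATE: write every row as an HBase Put; Puts overwrite existing
  // cells with the same row key, so the same path also covers updates.
  def upsert(df: DataFrame, table: String, family: String = "cf"): Unit = {
    df.rdd.foreachPartition { rows =>
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val tbl  = conn.getTable(TableName.valueOf(table))
      try {
        rows.foreach { row =>
          val put = new Put(Bytes.toBytes(row.getString(0)))            // column 0 = row key
          put.addColumn(Bytes.toBytes(family), Bytes.toBytes("value"),
                        Bytes.toBytes(row.getString(1)))                // column 1 = cell value
          tbl.put(put)
        }
      } finally { tbl.close(); conn.close() }
    }
  }

  // DELETE: remove the rows whose keys appear in the DataFrame.
  def delete(df: DataFrame, table: String): Unit = {
    df.rdd.foreachPartition { rows =>
      val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
      val tbl  = conn.getTable(TableName.valueOf(table))
      try {
        rows.foreach(row => tbl.delete(new Delete(Bytes.toBytes(row.getString(0)))))
      } finally { tbl.close(); conn.close() }
    }
  }
}

Whatever form the real modification takes, calls like these are why hbase-common and hbase-server must be on the spark-sql classpath, which is what the pom changes below provide.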
When modifying the source for the Spark-HBase connector, the HBase jars need to be referenced, so add the HBase dependencies to the spark-sql module's pom file:

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-common</artifactId>
  <version>1.1.2</version>
  <exclusions>
    <exclusion>
      <groupId>asm</groupId>
      <artifactId>asm</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.jboss.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.jruby</groupId>
      <artifactId>jruby-complete</artifactId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-server</artifactId>
  <version>1.1.2</version>
  <exclusions>
    <exclusion>
      <groupId>asm</groupId>
      <artifactId>asm</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.jboss.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>io.netty</groupId>
      <artifactId>netty</artifactId>
    </exclusion>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.jruby</groupId>
      <artifactId>jruby-complete</artifactId>
    </exclusion>
  </exclusions>
</dependency>

5. Replace the jar in the Spark 2 installation and test the modified code
In the Spark 2 installation directory /usr/hdp/2.3.4.0-3485/spark2/jars/, locate spark-sql_2.11-2.1.0.jar, back it up, and replace it with the newly built spark-sql jar.
Then launch spark-sql and test the modified functionality.
