Mixed Java/Scala development of Spark applications under Maven

Developers who know Java well often run into a problem when writing Spark applications: the Java API documentation is incomplete, or a Java interface simply is not provided. If you can write the Spark parts of a Java project directly in Scala, while still using Java for everything else, the difficulty of developing a Spark project drops considerably. Below we explore how to set up a development environment that combines Java, Scala, Spark, and Maven.

 

 

1. Download the Scala SDK

 

Download the SDK directly from http://www.scala-lang.org/download/. At the time of writing, the latest stable release is 2.11.7; just download it and unpack it.

(Later, when you create a .scala source file in IntelliJ IDEA, the IDE will detect it and prompt you to set up a Scala SDK; simply point it at the directory you unpacked.)

You can also configure the Scala SDK manually: IDEA => File => Project Structure... => Libraries => +

2. Download the Scala plugin for IntelliJ IDEA

[Figure 1: the Plugins dialog in IntelliJ IDEA]

 

As shown in the screenshot above, simply search for "Scala" under Plugins and install it. If you cannot get online, or the download is too slow, you can also fetch the plugin zip manually from http://plugins.jetbrains.com/plugin/?idea&id=1347. When downloading manually, pay close attention to the version number: it must match your IntelliJ IDEA version, otherwise the plugin cannot be installed. Once downloaded, click "Install plugin from disk..." in the dialog shown above and select the plugin zip.

3. Integrating with Maven

To package the project with Maven, the scala-maven-plugin must be configured in the pom file. In addition, since this is Spark development and the jar has to be packaged as an executable Java jar, the maven-assembly-plugin and maven-shade-plugin must also be configured, with the mainClass set. After some experimentation, here is a working pom file; to reuse it, just add or remove dependencies as needed.

 
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>my-project-groupid</groupId>
  <artifactId>sparkTest</artifactId>
  <packaging>jar</packaging>
  <version>1.0-SNAPSHOT</version>
  <name>sparkTest</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <hbase.version>0.98.3</hbase.version>
    <spark.version>1.6.0</spark.version>
    <jdk.version>1.7</jdk.version>
    <scala.version>2.10.5</scala.version>
  </properties>

  <repositories>
    <repository>
      <id>repo1.maven.org</id>
      <url>http://repo1.maven.org/maven2</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
    <repository>
      <id>repository.jboss.org</id>
      <url>http://repository.jboss.org/nexus/content/groups/public/</url>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
    <repository>
      <id>cloudhopper</id>
      <name>Repository for Cloudhopper</name>
      <url>http://maven.cloudhopper.com/repos/third-party/</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
    <repository>
      <id>mvnr</id>
      <name>Repository maven</name>
      <url>http://mvnrepository.com/</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
    <repository>
      <id>scala</id>
      <name>Scala Tools</name>
      <url>https://mvnrepository.com/</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala</id>
      <name>Scala Tools</name>
      <url>https://mvnrepository.com/</url>
      <releases><enabled>true</enabled></releases>
      <snapshots><enabled>false</enabled></snapshots>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
      <version>${scala.version}</version>
      <scope>compile</scope>
    </dependency>
    <dependency>
      <groupId>javax.mail</groupId>
      <artifactId>javax.mail-api</artifactId>
      <version>1.4.7</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.10</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.10</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_2.10</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-graphx_2.10</artifactId>
      <version>${spark.version}</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.30</version>
    </dependency>
    <dependency>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
      <version>14.0.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>2.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>2.6.0</version>
    </dependency>
    <dependency>
      <groupId>com.alibaba</groupId>
      <artifactId>fastjson</artifactId>
      <version>1.2.3</version>
    </dependency>
    <dependency>
      <groupId>p6spy</groupId>
      <artifactId>p6spy</artifactId>
      <version>1.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math3</artifactId>
      <version>3.3</version>
    </dependency>
    <dependency>
      <groupId>org.jdom</groupId>
      <artifactId>jdom</artifactId>
      <version>2.0.2</version>
    </dependency>
    <dependency>
      <groupId>redis.clients</groupId>
      <artifactId>jedis</artifactId>
      <version>2.6.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>0.98.6-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase</artifactId>
      <version>0.98.6-hadoop2</version>
      <type>pom</type>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-common</artifactId>
      <version>0.98.6-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>0.98.6-hadoop2</version>
    </dependency>
    <dependency>
      <groupId>org.testng</groupId>
      <artifactId>testng</artifactId>
      <version>6.8.8</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.jaxrs</groupId>
      <artifactId>jackson-jaxrs-json-provider</artifactId>
      <version>2.4.4</version>
    </dependency>
    <dependency>
      <groupId>com.fasterxml.jackson.core</groupId>
      <artifactId>jackson-databind</artifactId>
      <version>2.4.4</version>
    </dependency>
    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.4</version>
      <classifier>jdk15</classifier>
    </dependency>
  </dependencies>

  <build>
    <plugins>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <appendAssemblyId>false</appendAssemblyId>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass>rrkd.dt.sparkTest.HelloWorld</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>assembly</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>${jdk.version}</source>
          <target>${jdk.version}</target>
          <encoding>${project.build.sourceEncoding}</encoding>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.1</version>
        <configuration>
          <createDependencyReducedPom>false</createDependencyReducedPom>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <shadedArtifactAttached>true</shadedArtifactAttached>
              <shadedClassifierName>allinone</shadedClassifierName>
              <artifactSet>
                <includes>
                  <include>*:*</include>
                </includes>
              </artifactSet>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers>
                <transformer
                    implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                  <resource>reference.conf</resource>
                </transformer>
                <transformer
                    implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass>rrkd.dt.sparkTest.HelloWorld</mainClass>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.2.0</version>
        <executions>
          <execution>
            <id>compile-scala</id>
            <phase>compile</phase>
            <goals>
              <goal>add-source</goal>
              <goal>compile</goal>
            </goals>
          </execution>
          <execution>
            <id>test-compile-scala</id>
            <phase>test-compile</phase>
            <goals>
              <goal>add-source</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>

The build section is the part that matters most; the rest does not need much attention.

 

The project's directory structure largely follows Maven's default conventions, except that there is an extra scala directory under src, mainly to keep the Java and Scala sources organized separately, as shown in the figure and the sketch below:

 

[Figure 2: project directory structure]
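For reference, a sketch of this layout (assuming the standard Maven conventions plus the extra scala source directory; the package and file names follow the example classes below):

sparkTest
├── pom.xml
└── src
    ├── main
    │   ├── java
    │   │   └── test
    │   │       └── HelloWorld.java
    │   ├── scala
    │   │   └── test
    │   │       └── Hello.scala
    │   └── resources
    └── test
        └── java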

 

Create a HelloWorld class (HelloWorld.java) under the java directory:

 
package test;

import test.Hello;

/**
 * Created by L on 2017/1/5.
 */
public class HelloWorld {

    public static void main(String[] args) {
        System.out.print("test");
        Hello.sayHello("scala");
        Hello.runSpark();
    }
}

 

Create a Hello object (Hello.scala) under the scala directory:

 
package test

import org.apache.spark.graphx.{Graph, Edge, VertexId, GraphLoader}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import breeze.linalg.{Vector, DenseVector, squaredDistance}

/**
 * Created by L on 2017/1/5.
 */
object Hello {
  def sayHello(x: String): Unit = {
    println("hello," + x)
  }

  // def main(args: Array[String]) {
  def runSpark() {
    val sparkConf = new SparkConf().setAppName("SparkKMeans").setMaster("local[*]")
    val sc = new SparkContext(sparkConf)
    // Create an RDD for the vertices
    val users: RDD[(VertexId, (String, String))] =
      sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
        (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
        (4L, ("peter", "student"))))
    // Create an RDD for edges
    val relationships: RDD[Edge[String]] =
      sc.parallelize(Array(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
        Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
        Edge(4L, 0L, "student"), Edge(5L, 0L, "colleague")))
    // Define a default user in case there are relationships with a missing user
    val defaultUser = ("John Doe", "Missing")
    // Build the initial Graph
    val graph = Graph(users, relationships, defaultUser)
    // Notice that there is a user 0 (for which we have no information) connected to users
    // 4 (peter) and 5 (franklin).
    graph.triplets.map(
      triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
    ).collect.foreach(println(_))
    // Remove missing vertices as well as the edges connected to them
    val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
    // The valid subgraph will disconnect users 4 and 5 by removing user 0
    validGraph.vertices.collect.foreach(println(_))
    validGraph.triplets.map(
      triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
    ).collect.foreach(println(_))
    sc.stop()
  }
}

In this way, the Scala code calls the Spark API to run Spark jobs, and the Java code in turn calls the Scala code.

4. Compiling and packaging the Scala project with Maven

For a mixed Java/Scala project, to compile the Scala sources first and then the Java sources, use the following Maven command to compile and package:

mvn clean scala:compile assembly:assembly
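If the build succeeds, the fat jars end up under target/. A sketch of the expected output, assuming the assembly and shade settings in the pom above (appendAssemblyId set to false and a shaded artifact attached with the allinone classifier); the exact names follow from the artifactId and version:

target/sparkTest-1.0-SNAPSHOT.jar            (assembly output with dependencies included)
target/sparkTest-1.0-SNAPSHOT-allinone.jar   (shaded jar attached with the allinone classifier)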

5. Running the packaged Spark jar

During development you will probably run the application in local mode inside IDEA, and then package it with the command above. The packaged Spark project must then be deployed to a Spark cluster and submitted with spark-submit.
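As a sketch, a minimal spark-submit invocation might look like the following; the master URL and jar name are placeholders that depend on your cluster and on the packaging output, and --class should point at your own main class (the pom above uses rrkd.dt.sparkTest.HelloWorld, while the test.HelloWorld example here would use test.HelloWorld):

spark-submit \
  --class rrkd.dt.sparkTest.HelloWorld \
  --master spark://your-master-host:7077 \
  sparkTest-1.0-SNAPSHOT-allinone.jar

Note that the example code hard-codes setMaster("local[*]") in the SparkConf; for a cluster run you would normally remove that call and let spark-submit supply the master.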

Source: this article is from 大愚若智_'s CSDN blog; the original is at https://blog.csdn.net/zbc1090549839/article/details/54290233
