A prerequisite for writing Spark programs in IDEA is that IDEA already has a working Scala development environment; for that, see "Scala – IDEA configuration and Maven project creation".
Here we take Hadoop's classic wordcount as the example: we write a Scala program, then test it in local mode and in YARN mode. The process of writing and debugging a Spark program in IDEA is as follows:
1. Create a Scala Maven project. The pom.xml file looks like this:
<properties>
    <log4j.version>1.2.17</log4j.version>
    <slf4j.version>1.7.22</slf4j.version>
    <spark.version>2.1.1</spark.version>
    <scala.version>2.11.8</scala.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>jcl-over-slf4j</artifactId>
        <version>${slf4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>${slf4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>${slf4j.version}</version>
    </dependency>
    <dependency>
        <groupId>log4j</groupId>
        <artifactId>log4j</artifactId>
        <version>${log4j.version}</version>
    </dependency>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
<build>
    <finalName>wordcount</finalName>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>com.m.jd.WordCount</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
    <pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.0.0</version>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </pluginManagement>
</build>
Note: the scala directory must be marked as Sources (in IDEA: right-click the directory → Mark Directory as → Sources Root).
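If you want the Maven build to be explicit about this as well, the source directory can also be declared in the pom (a minimal sketch, assuming the standard src/main/scala layout; it goes inside the existing <build> section above):
<sourceDirectory>src/main/scala</sourceDirectory>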
For example, my WordCount lives in the com.m.jd package; the complete code is as follows:
package com.m.jd

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount extends App {
  // Run in local mode with as many worker threads as logical cores
  private val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("wordCount")
  private val sc = new SparkContext(sparkConf)
  // Read the input file from HDFS
  private val dataFile: RDD[String] = sc.textFile("hdfs://hadoop0:9000/README.md")
  // Split each line into words
  private val words: RDD[String] = dataFile.flatMap(_.split(" "))
  // Pair every word with an initial count of 1
  private val word2Count: RDD[(String, Int)] = words.map((_, 1))
  // Sum the counts per word
  private val result: RDD[(String, Int)] = word2Count.reduceByKey(_ + _)
  result.saveAsTextFile("hdfs://hadoop0:9000/out")
  // Close the connection to Spark
  sc.stop()
}
Compare the code above with the equivalent command in the Spark shell and it becomes straightforward:
sc.textFile("hdfs://hadoop0:9000/README.md").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://hadoop0:9000/out")
Note the following line: the code itself sets the master to local[*], i.e. local mode. If the code does not set it, the master must be specified elsewhere, e.g. via --master on the spark-submit command line.
private val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("wordCount")
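Conversely, a version meant to be submitted to a cluster would normally omit setMaster and let spark-submit provide it (a minimal sketch of that variant):
// No setMaster here; supply it at submit time, e.g. spark-submit --master yarn
val sparkConf: SparkConf = new SparkConf().setAppName("wordCount")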
Preparation for debugging:
1) Start the Hadoop cluster.
2) Upload a file for WordCount to process to HDFS; for example, I uploaded the README.md from the Spark distribution to the / directory on HDFS (see the example command below).
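The upload can be done with hadoop fs -put (a hedged example, assuming the Spark distribution lives at /opt/module/spark-2.1.1-bin-hadoop2.7, as in the spark-submit command used later):
$ hadoop fs -put /opt/module/spark-2.1.1-bin-hadoop2.7/README.md /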
Debugging:
Simply right-click inside WordCount and choose either Run or Debug. Here, run it directly first and watch the console for error logs; fix any errors, and if there are none, check the result on HDFS:
$ hadoop fs -cat /out/*
Possible problems:
1) Permission errors when accessing HDFS. Configure HADOOP_USER_NAME=root in IDEA (for example as an environment variable in the run configuration); the program will then run as the root user.
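If you would rather keep this in code, the same user can be set as a JVM system property (a minimal sketch, not from the original article; the Hadoop client reads HADOOP_USER_NAME from the environment or from system properties):
// Must be set before the SparkContext is created, or it has no effect
System.setProperty("HADOOP_USER_NAME", "root")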
Next, use IDEA's Maven package to build the WordCount project into a jar and upload the jar to a host of the Hadoop cluster; in my setup, hadoop0 is the master and hadoop1 and hadoop2 are the workers.
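Both steps can also be done from a terminal (a hedged sketch: the jar name follows from the finalName plus the jar-with-dependencies descriptor in the pom above, and the scp target assumes SSH access to hadoop0):
$ mvn clean package
$ scp target/wordcount-jar-with-dependencies.jar root@hadoop0:/opt/spark-jar/wordcount.jar
Then run the following command on hadoop0: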
/opt/module/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
--class com.m.jd.WordCount \
--master yarn \
--deploy-mode client \
/opt/spark-jar/wordcount.jar
Here /opt/spark-jar/wordcount.jar is the path of the jar I uploaded, and --class com.m.jd.WordCount is the main class to run inside wordcount.jar.
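With --deploy-mode client the driver runs on the submitting host (hadoop0). To have the driver run inside the YARN cluster instead, the same job can be submitted in cluster mode (a hedged variant of the command above):
/opt/module/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
--class com.m.jd.WordCount \
--master yarn \
--deploy-mode cluster \
/opt/spark-jar/wordcount.jar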
After running the command above, if no errors are reported, check the output:
$ hadoop fs -cat /out/*
Possible problems:
1) The /out directory already exists on HDFS. The earlier local-mode run already wrote its result there, so before debugging in YARN mode, delete /out:
$ hadoop fs -rm -r /out
2) The /directory directory does not exist. This is because the History Server was configured when the Spark cluster in my image was set up, and spark-defaults.conf sets spark.eventLog.dir:
spark.eventLog.dir hdfs://hadoop0:9000/directory
So if /directory does not exist on HDFS, create it first:
$ hadoop fs -mkdir /directory
That is the complete example of writing Spark's WordCount program in IDEA and testing it in both Spark local mode and YARN mode.