Spark Series (5): Writing and Debugging a Spark WordCount Program in IDEA

The prerequisite for writing Spark programs in IDEA is that IDEA already has a working Scala development environment; for that, see "Scala: IDEA Setup and Maven Project Creation".

Here we take Hadoop's classic WordCount as an example, write it as a Scala program, and test it in local mode and YARN mode respectively. The workflow for writing and debugging a Spark program in IDEA is as follows:

1. Creating the Project

Create a Scala Maven project; its pom.xml file looks like this:

    <properties>
        <log4j.version>1.2.17</log4j.version>
        <slf4j.version>1.7.22</slf4j.version>
        <spark.version>2.1.1</spark.version>
        <scala.version>2.11.8</scala.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>jcl-over-slf4j</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>${log4j.version}</version>
        </dependency>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
    </dependencies>

    <build>
        <finalName>wordcount</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass>com.m.jd.WordCount</mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>

        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-assembly-plugin</artifactId>
                    <version>3.0.0</version>
                    <executions>
                        <execution>
                            <id>make-assembly</id>
                            <phase>package</phase>
                            <goals>
                                <goal>single</goal>
                            </goals>
                        </execution>
                    </executions>
                </plugin>
            </plugins>
        </pluginManagement>
    </build>

2. Writing the WordCount Program

2.1 Write WordCount under the scala directory

Note: the scala directory must be marked as Sources Root.

For example, my WordCount lives in the com.m.jd package; the complete code is as follows:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount extends App {

  // Run with the local master, using all available cores
  private val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("wordCount")

  private val sc = new SparkContext(sparkConf)

  // Read the input file from HDFS
  private val dataFile: RDD[String] = sc.textFile("hdfs://hadoop0:9000/README.md")

  // Split each line into words
  private val words: RDD[String] = dataFile.flatMap(_.split(" "))

  // Map each word to a (word, 1) pair
  private val word2Count: RDD[(String, Int)] = words.map((_,1))

  // Sum the counts per word
  private val result: RDD[(String, Int)] = word2Count.reduceByKey(_+_)

  // Write the result back to HDFS
  result.saveAsTextFile("hdfs://hadoop0:9000/out")

  // Close the connection to Spark
  sc.stop()

}
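One caveat worth noting (from the Spark documentation, not something this example runs into here): Spark recommends defining a main() method rather than extending scala.App, because scala.App's delayed initialization can leave fields uninitialized when closures run on executors. An equivalent structure would be:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("wordCount")
    val sc = new SparkContext(sparkConf)

    // Same pipeline as above, written as one chain inside main()
    sc.textFile("hdfs://hadoop0:9000/README.md")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs://hadoop0:9000/out")

    sc.stop()
  }
}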

Compare the WordCount code with the equivalent spark-shell one-liner and it becomes clear at a glance:

sc.textFile("hdfs://hadoop0:9000/README.md").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://hadoop0:9000/out")

3. Debugging in Local Mode

Note the line of code below: it already sets --master to local[*], i.e. local mode. If the master is not specified in the code, it must be specified elsewhere, for example via the --master option on the spark-submit command line.

private val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("wordCount")
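If you plan to submit the same jar to a cluster later, one common option (my own suggestion, not what the code above does) is to leave the master out of the code entirely and supply it at launch time. A one-line variant of the line above:

// No setMaster here; supply the master at launch time instead, e.g.:
//   spark-submit --master yarn ...        (cluster)
//   spark-submit --master "local[*]" ...  (local)
private val sparkConf: SparkConf = new SparkConf().setAppName("wordCount")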

Preparation for debugging:
1) Start the Hadoop cluster.
2) Upload the file to be word-counted to HDFS. For example, I uploaded the README.md from the Spark distribution to the / directory on HDFS.

Debugging:
Right-click inside WordCount and choose either Run or Debug. Here we simply Run it first and watch the console for error logs. If there are errors, fix them; if not, check the result on HDFS:

$ hadoop fs -cat /out/*

Possible problems:
1) Permission errors when accessing HDFS. Set HADOOP_USER_NAME=root in the IDEA run configuration, and the program will then run as the root user.
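Alternatively (my own workaround, not from the steps above), the same user can be set in code, since Hadoop's UserGroupInformation checks the HADOOP_USER_NAME system property as well as the environment variable:

// Must run before the SparkContext (and hence the HDFS client) is created;
// equivalent to setting HADOOP_USER_NAME=root in the IDEA run configuration
System.setProperty("HADOOP_USER_NAME", "root")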


4. Debugging in YARN Mode

In IDEA, package the WordCount project into a jar (via Maven's package phase) and upload the jar to a host in the Hadoop cluster; in my setup, hadoop0 is the master and hadoop1 and hadoop2 are the workers. Note that setMaster("local[*]") in the code takes precedence over the --master flag of spark-submit, so remove that call (or make the master configurable, as sketched in section 3) before packaging for YARN. Then run the following command:

/opt/module/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
--class com.m.jd.WordCount \
--master yarn \
--deploy-mode client \
/opt/spark-jar/wordcount.jar

/opt/spark-jar/wordcount.jar is the path of the jar I uploaded, and
--class com.m.jd.WordCount is the main class to run inside wordcount.jar.

If the command above completes without errors, check the output:

$ hadoop fs -cat /out/*

Possible problems:

1) The /out directory already exists on HDFS: the earlier local-mode run already wrote its result there, so before debugging in YARN mode, delete the /out directory:

$ hadoop fs -rm -r /out
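If you would rather not delete the directory by hand before every run, a small sketch (my own addition; the URI and path mirror the ones used above) deletes it from the driver before saveAsTextFile:

import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Remove the old output directory (if present) before saveAsTextFile,
// since Spark refuses to write to an existing output path
val fs = FileSystem.get(new URI("hdfs://hadoop0:9000"), new Configuration())
val outPath = new Path("/out")
if (fs.exists(outPath)) fs.delete(outPath, true) // true = recursive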

2) The /directory directory does not exist. This is because a History Server was configured when I set up the Spark cluster in my image, and spark-defaults.conf sets spark.eventLog.dir:

spark.eventLog.dir      hdfs://hadoop0:9000/directory

So if /directory does not exist on HDFS, create it first:

$ hadoop fs -mkdir /directory

That completes the full example of writing a Spark WordCount program in IDEA and test-running it in both Spark local mode and YARN mode.
