Setting up a Spark platform on CentOS 7 and packaging a word count job with sbt

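This walkthrough covers the following:

  • Install Scala
  • Install Spark
  • Use the Spark shell
  • a) Read a local file
  • b) Read a file from HDFS
  • c) Write a word count program
  • Bonus: install sbt, package the program, and run the word count as a job

CentOS 7 machines used: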
192.168.189.135 bigdata128
192.168.189.136 bigdata129
192.168.189.137 bigdata131

1. Install Scala
Download: https://www.scala-lang.org/download/
Upload the archive to the Linux machine (steps omitted).

Extract: tar zxvf scala-2.12.8.tgz
Rename: mv scala-2.12.8 scala
Configure environment variables: vi /etc/profile
Add the following (SCALA_HOME is the directory where you extracted Scala; /root/scala here):

#scala
export SCALA_HOME=/root/scala
export PATH=$SCALA_HOME/bin:$PATH

Save and quit with :wq!, then run the following so the change takes effect:

source /etc/profile

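Run: scala
If a banner like the following appears, the installation succeeded (the version shown reflects the release you installed):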

Welcome to Scala 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_111).
Type in expressions for evaluation. Or try :help.

Once this works, repeat the same steps on the other two worker machines.
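Repeating the install by hand on each worker is easy to get wrong, so one option is to copy the extracted directory from the first machine. The sketch below only prints the commands it would run. It assumes passwordless SSH as root between the nodes (not stated in this post) and the /root/scala path used above; DRY_RUN and the run helper are illustrative names, not part of any tool.

```shell
#!/bin/sh
# Worker hostnames from the machine list at the top of this post.
WORKERS="bigdata129 bigdata131"

# Print the command when DRY_RUN=1 (the default here); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Copy the extracted Scala directory to each worker (assumes root SSH access).
sync_scala() {
  for host in $WORKERS; do
    run scp -r /root/scala "root@${host}:/root/"
  done
}

sync_scala
```

Each worker still needs the two export lines in its own /etc/profile.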

2. Install Spark
Download Spark: http://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.4.2/spark-2.4.2-bin-hadoop2.7.tgz
Upload the archive to the Linux machine (steps omitted).
Extract: tar -zxvf spark-2.4.2-bin-hadoop2.7.tgz
Rename: mv spark-2.4.2-bin-hadoop2.7 spark
Add the Spark settings to the profile: vi /etc/profile

# Adjust to the path where you extracted Spark
export SPARK_HOME=/opt/module/spark
export PATH=$PATH:$SPARK_HOME/bin

Apply the change: source /etc/profile

Edit spark-env.sh
Run these commands from the Spark installation directory (spark):

cp conf/spark-env.sh.template conf/spark-env.sh
vi conf/spark-env.sh
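# Java installation
export JAVA_HOME=/opt/module/jdk1.8.0_121
# Scala installation
export SCALA_HOME=/root/scala
# Hadoop installation
export HADOOP_HOME=/opt/module/hadoop-2.7.3/

# Master web UI port
export SPARK_MASTER_WEBUI_PORT=8080
# Master hostname and port; spark-master must resolve to the master node (bigdata128 in this cluster)
export SPARK_MASTER_HOST=spark-master
export SPARK_MASTER_PORT=7077
# Master address used by the workers (older name for SPARK_MASTER_HOST, deprecated in Spark 2.x)
export SPARK_MASTER_IP=spark-master
# Worker web UI port; 8081 is the Spark default and avoids clashing with the master UI on a node that runs both
export SPARK_WORKER_WEBUI_PORT=8081
# Maximum memory each worker can hand out to executors
export SPARK_WORKER_MEMORY=4g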

Configure slaves:

cp conf/slaves.template conf/slaves

vi conf/slaves

Add the hostnames of the three virtual machines:

bigdata128
bigdata129
bigdata131
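Every name used in conf/slaves and spark-env.sh has to resolve on every node, and a missing entry is a common cause of workers that never register. The sketch below lists names that have no entry in a hosts-style file. The function name missing_hosts and the throwaway demo files are illustrative; it assumes name resolution comes from /etc/hosts rather than DNS.

```shell
#!/bin/sh
# Print every name from a slaves-style file that has no entry in a hosts-style file.
missing_hosts() {
  hosts_file=$1
  names_file=$2
  while read -r name; do
    # Skip blank lines and comments.
    case $name in ''|'#'*) continue ;; esac
    # Hostnames are every field after the IP address on a hosts line.
    awk -v n="$name" '
      { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }
    ' "$hosts_file" || echo "$name"
  done < "$names_file"
}

# Demo on throwaway copies of the data from this post, plus spark-master.
demo_dir=$(mktemp -d)
printf '%s\n' '192.168.189.135 bigdata128' '192.168.189.136 bigdata129' \
  '192.168.189.137 bigdata131' > "$demo_dir/hosts"
printf '%s\n' bigdata128 bigdata129 bigdata131 spark-master > "$demo_dir/names"
missing_hosts "$demo_dir/hosts" "$demo_dir/names"   # prints: spark-master
rm -r "$demo_dir"
```

On a real node the call would be missing_hosts /etc/hosts /opt/module/spark/conf/slaves.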

Edit spark-defaults.conf:

vi conf/spark-defaults.conf
spark.eventLog.enabled=true
spark.eventLog.compress=true
# To keep event logs on the local filesystem:
#spark.eventLog.dir=file:///opt/module/hadoop-2.7.3/logs/userlogs
#spark.history.fs.logDirectory=file:///opt/module/hadoop-2.7.3/logs/userlogs
# To keep event logs on HDFS:
spark.eventLog.dir=hdfs://bigdata128:9000/tmp/logs/root/logs
spark.history.fs.logDirectory=hdfs://bigdata128:9000/tmp/logs/root/logs
spark.yarn.historyServer.address=spark-master:18080
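As far as I know, Spark does not create the spark.eventLog.dir directory itself, and with event logging enabled a job can fail at startup if the directory is missing. A minimal sketch for creating it up front follows. It only prints the command by default; DRY_RUN and run are illustrative helper names, and the path is copied from the config above.

```shell
#!/bin/sh
# Same path as spark.eventLog.dir in spark-defaults.conf.
EVENT_LOG_DIR="hdfs://bigdata128:9000/tmp/logs/root/logs"

# Print the command when DRY_RUN=1 (the default here); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create the event log directory, including any missing parents.
ensure_event_log_dir() {
  run hadoop fs -mkdir -p "$EVENT_LOG_DIR"
}

ensure_event_log_dir
```

Set DRY_RUN=0 on a node where HDFS is running to actually create the directory.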

Start Spark. The script has the same name as Hadoop's start script, so run it from the Spark installation directory with the sbin path spelled out:

sbin/start-all.sh 

Output like the following means it worked. Running jps on the master node now also lists two new processes, Master and Worker:

starting org.apache.spark.deploy.master.Master, logging to /opt/module/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-bigdata128.out
bigdata129: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-bigdata129.out
bigdata131: starting org.apache.spark.deploy.worker.Worker, logging to /opt/module/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-bigdata131.out
jps

7952 NodeManager
7666 SecondaryNameNode
8275 Master
7381 NameNode
8405 Jps
7833 ResourceManager
8345 Worker
7484 DataNode
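Scanning the jps listing by eye works for one node; for a repeatable check, a small filter can do it. The sketch below reads jps output on stdin and succeeds only if both Master and Worker appear. The function name has_spark_daemons is illustrative, and the sample lines are taken from the listing above.

```shell
#!/bin/sh
# Read jps output on stdin; exit 0 only if both Master and Worker are listed.
has_spark_daemons() {
  awk '$2 == "Master" { m = 1 } $2 == "Worker" { w = 1 } END { exit !(m && w) }'
}

# Sample lines from the jps listing in this post.
printf '%s\n' '8275 Master' '7381 NameNode' '8345 Worker' | has_spark_daemons \
  && echo "Spark daemons are running"
```

On a real node the call would be jps | has_spark_daemons. Worker-only nodes would need a variant that checks for Worker alone.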

Open the master web UI: http://192.168.189.135:8080/

3. Use spark-shell
Run spark-shell
a) Load a local file ("file:///opt/module/code/wordcount/word.txt"). The file must already exist on the CentOS 7 machine.

scala> val textFile = sc.textFile("file:///opt/module/code/wordcount/word.txt")

Read the first line of the file:

scala> textFile.first()

b) Read a file from HDFS (/user/root/word.txt must exist in HDFS; upload it manually if it does not)

val textFile = sc.textFile("hdfs://bigdata128:9000/user/root/word.txt")

textFile.first()

c) Word count:

val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)

wordCount.collect()
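The pipeline above has three steps: split each line into words, pair each word with 1, and sum the pairs per word. To sanity-check Spark's result on a small file, the same idea can be sketched with standard Unix tools. The function name word_count and the sample input are illustrative. Unlike split(" "), this version drops the empty strings that consecutive spaces would produce.

```shell
#!/bin/sh
# Count words on stdin: one word per line, tally with awk, sort by word.
word_count() {
  tr -s ' ' '\n' | awk 'NF { n[$0]++ } END { for (w in n) print w, n[w] }' | sort
}

# Made-up two-line sample.
printf 'a b a\nc b a\n' | word_count
# prints:
# a 3
# b 2
# c 1
```

For the local file used above, the call would be word_count < /opt/module/code/wordcount/word.txt.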

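4. Install sbt and package the program (the repo download may report that the network is unreachable; in my case this did not affect the result, so be patient). Bintray has since been retired, so the repo URL below may no longer respond; the sbt documentation lists the current location of the RPM repo file: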

curl https://bintray.com/sbt/rpm/rpm > bintray-sbt-rpm.repo

mv bintray-sbt-rpm.repo /etc/yum.repos.d/

yum install sbt

Start sbt:

sbt

Once sbt starts successfully, press CTRL+C to exit.
Go back to root's home directory and create a folder named test:

cd ~
mkdir test

Create WordCount.scala:

cd test 
vi WordCount.scala
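import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._


object WordCount {
  def main(args: Array[String]): Unit = {
    val inputPath = "hdfs://bigdata128:9000/user/root/word.txt" // must already exist in HDFS
    val outputPath = "hdfs://bigdata128:9000/test02"            // must NOT exist yet; the job creates it
    val sc = new SparkContext() // master and app name are supplied by spark-submit
    val texts = sc.textFile(inputPath)
    println(sc.master)
    val wordCounts = texts.flatMap { a => a.split(" ") } // split each line into words
      .map(word => (word, 1))                            // pair each word with a count of 1
      .reduceByKey(_ + _)                                // sum the counts per word
    wordCounts.saveAsTextFile(outputPath)
  }
}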

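Create word-count.sbt with the following contents:

vi word-count.sbt
name := "wordcount"
version := "0.1.0"
scalaVersion := "2.12.8"
// "provided": the Spark jars come from the cluster at run time, so they are not bundled.
// Ideally this version matches the installed Spark (2.4.2 in this post).
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.1" % "provided"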

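Build the package. The first run can take a long time because sbt downloads its dependencies, so make a cup of coffee first. (sbt package also builds the jar, but in my case it ran for over an hour, while sbt clean package finished in a few minutes, so I suggest the latter.)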

[root@bigdata128 test]# sbt clean package
[info] Updated file /root/test/project/build.properties: set sbt.version to 1.2.8
[info] Loading project definition from /root/test/project
[info] Updating ProjectRef(uri("file:/root/test/project/"), "test-build")...
[info] Done updating.
[info] Loading settings for project test from word-count.sbt ...
[info] Set current project to wordcount (in build file:/root/test/)
[success] Total time: 1 s, completed 2019-5-6 23:48:37
[info] Updating ...
[info] downloading https://repo1.maven.org/maven2/org/apache/avro/avro/1.8.2/avro-1.8.2.jar ...
[info] downloading https://repo1.maven.org/maven2/org/apache/avro/avro-ipc/1.8.2/avro-ipc-1.8.2.jar ...
[info] 	[SUCCESSFUL ] org.apache.avro#avro-ipc;1.8.2!avro-ipc.jar (21970ms)
[info] 	[SUCCESSFUL ] org.apache.avro#avro;1.8.2!avro.jar (22810ms)
[info] Done updating.
[info] Compiling 1 Scala source to /root/test/target/scala-2.12/classes ...
[info] Done compiling.
[info] Packaging /root/test/target/scala-2.12/wordcount_2.12-0.1.0.jar ...
[info] Done packaging.
[success] Total time: 149 s, completed 2019-5-6 23:51:08
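Packaging is done!

The jar is now under /root/test/target/scala-2.12. Copy it into the bin directory of the Spark installation, /opt/module/spark/bin here; your installation path may differ.

cd /root/test/target/scala-2.12
cp wordcount_2.12-0.1.0.jar /opt/module/spark/bin
cd /opt/module/spark/bin

Run: spark-submit --class "WordCount" wordcount_2.12-0.1.0.jar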
The first attempt fails with an error reporting that the output directory already exists.
Look again at the program we wrote earlier:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._


object WordCount {
  def main(args: Array[String]): Unit = {

    // word.txt must be uploaded to /user/root/ in HDFS beforehand
    val inputPath = "hdfs://bigdata128:9000/user/root/word.txt"

    // /test02 must not exist in HDFS before the job runs
    val outputPath = "hdfs://bigdata128:9000/test02"
    val sc = new SparkContext()
    val texts = sc.textFile(inputPath)
    println(sc.master)
    val wordCounts = texts.flatMap { a => a.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.saveAsTextFile(outputPath)
  }
}

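The error says that the test02 directory already exists in my HDFS. The job creates its output path itself and fails if the path is already there. To fix it, delete the directory (hadoop fs -rm -r -skipTrash /test02) and run the job again: spark-submit --class "WordCount" wordcount_2.12-0.1.0.jar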
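Since every rerun hits the same error, the delete and the submit can be wrapped into one step. The sketch below prints the commands by default instead of running them. DRY_RUN, run and resubmit are illustrative names; the path and jar name come from this post. Note that -skipTrash deletes permanently, so check the path first.

```shell
#!/bin/sh
OUTPUT_DIR="/test02"                 # output path used by WordCount.scala
JAR="wordcount_2.12-0.1.0.jar"       # jar produced by sbt clean package

# Print the command when DRY_RUN=1 (the default here); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Remove a stale output directory if one exists, then submit the job.
resubmit() {
  if run hadoop fs -test -d "$OUTPUT_DIR"; then
    run hadoop fs -rm -r -skipTrash "$OUTPUT_DIR"
  fi
  run spark-submit --class WordCount "$JAR"
}

resubmit
```

In dry-run mode the existence test always "succeeds", so all three commands are printed; with DRY_RUN=0 the delete only runs when the directory is really there.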

Output like the following means the job succeeded:

Warning: Ignoring non-spark config property: park.eventLog.enabled=true
19/05/07 00:00:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/05/07 00:00:23 INFO SparkContext: Running Spark version 2.4.2
19/05/07 00:00:24 INFO SparkContext: Submitted application: WordCount
19/05/07 00:00:24 INFO SecurityManager: Changing view acls to: root
19/05/07 00:00:24 INFO SecurityManager: Changing modify acls to: root
19/05/07 00:00:24 INFO SecurityManager: Changing view acls groups to: 
19/05/07 00:00:24 INFO SecurityManager: Changing modify acls groups to: 
19/05/07 00:00:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
19/05/07 00:00:24 INFO Utils: Successfully started service 'sparkDriver' on port 42229.
19/05/07 00:00:24 INFO SparkEnv: Registering MapOutputTracker
19/05/07 00:00:24 INFO SparkEnv: Registering BlockManagerMaster
19/05/07 00:00:24 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
19/05/07 00:00:24 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
19/05/07 00:00:25 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-7e0e2460-2395-44ff-9d66-2563fee1578c
19/05/07 00:00:25 INFO MemoryStore: MemoryStore started with capacity 413.9 MB
19/05/07 00:00:25 INFO SparkEnv: Registering OutputCommitCoordinator
19/05/07 00:00:25 INFO Utils: Successfully started service 'SparkUI' on port 4040.
19/05/07 00:00:25 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://bigdata128:4040
19/05/07 00:00:25 INFO SparkContext: Added JAR file:/opt/module/spark/bin/wordcount_2.12-0.1.0.jar at spark://bigdata128:42229/jars/wordcount_2.12-0.1.0.jar with timestamp 1557158425829
19/05/07 00:00:26 INFO Executor: Starting executor ID driver on host localhost
19/05/07 00:00:26 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39319.
19/05/07 00:00:26 INFO NettyBlockTransferService: Server created on bigdata128:39319
19/05/07 00:00:26 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/05/07 00:00:26 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, bigdata128, 39319, None)
19/05/07 00:00:26 INFO BlockManagerMasterEndpoint: Registering block manager bigdata128:39319 with 413.9 MB RAM, BlockManagerId(driver, bigdata128, 39319, None)
19/05/07 00:00:26 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, bigdata128, 39319, None)
19/05/07 00:00:26 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, bigdata128, 39319, None)
19/05/07 00:00:28 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 236.7 KB, free 413.7 MB)
19/05/07 00:00:28 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.9 KB, free 413.7 MB)
19/05/07 00:00:28 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on bigdata128:39319 (size: 22.9 KB, free: 413.9 MB)
19/05/07 00:00:28 INFO SparkContext: Created broadcast 0 from textFile at WordCount.scala:10
local[*]
19/05/07 00:00:30 INFO FileInputFormat: Total input paths to process : 1
19/05/07 00:00:30 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
19/05/07 00:00:30 INFO HadoopMapRedCommitProtocol: Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
19/05/07 00:00:30 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/05/07 00:00:30 INFO SparkContext: Starting job: runJob at SparkHadoopWriter.scala:78
19/05/07 00:00:31 INFO DAGScheduler: Registering RDD 3 (map at WordCount.scala:13)
19/05/07 00:00:31 INFO DAGScheduler: Got job 0 (runJob at SparkHadoopWriter.scala:78) with 1 output partitions
19/05/07 00:00:31 INFO DAGScheduler: Final stage: ResultStage 1 (runJob at SparkHadoopWriter.scala:78)
19/05/07 00:00:31 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
19/05/07 00:00:31 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 0)
19/05/07 00:00:31 INFO DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:13), which has no missing parents
19/05/07 00:00:31 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.6 KB, free 413.7 MB)
19/05/07 00:00:31 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.3 KB, free 413.7 MB)
19/05/07 00:00:31 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on bigdata128:39319 (size: 3.3 KB, free: 413.9 MB)
19/05/07 00:00:31 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
19/05/07 00:00:32 INFO DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:13) (first 15 tasks are for partitions Vector(0))
19/05/07 00:00:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
19/05/07 00:00:32 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 7369 bytes)
19/05/07 00:00:32 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
19/05/07 00:00:32 INFO Executor: Fetching spark://bigdata128:42229/jars/wordcount_2.12-0.1.0.jar with timestamp 1557158425829
19/05/07 00:00:32 INFO TransportClientFactory: Successfully created connection to bigdata128/192.168.189.135:42229 after 126 ms (0 ms spent in bootstraps)
19/05/07 00:00:32 INFO Utils: Fetching spark://bigdata128:42229/jars/wordcount_2.12-0.1.0.jar to /tmp/spark-aa22f921-0a8b-427a-95ca-c808fa16442b/userFiles-15969c6f-3f19-4b84-9257-a7128b9e10e3/fetchFileTemp4516604376544410216.tmp
19/05/07 00:00:32 INFO Executor: Adding file:/tmp/spark-aa22f921-0a8b-427a-95ca-c808fa16442b/userFiles-15969c6f-3f19-4b84-9257-a7128b9e10e3/wordcount_2.12-0.1.0.jar to class loader
19/05/07 00:00:33 INFO HadoopRDD: Input split: hdfs://bigdata128:9000/user/root/word.txt:0+60
19/05/07 00:00:34 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1157 bytes result sent to driver
19/05/07 00:00:34 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 2531 ms on localhost (executor driver) (1/1)
19/05/07 00:00:34 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
19/05/07 00:00:34 INFO DAGScheduler: ShuffleMapStage 0 (map at WordCount.scala:13) finished in 2.820 s
19/05/07 00:00:34 INFO DAGScheduler: looking for newly runnable stages
19/05/07 00:00:34 INFO DAGScheduler: running: Set()
19/05/07 00:00:34 INFO DAGScheduler: waiting: Set(ResultStage 1)
19/05/07 00:00:34 INFO DAGScheduler: failed: Set()
19/05/07 00:00:34 INFO DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at WordCount.scala:15), which has no missing parents
19/05/07 00:00:34 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 73.5 KB, free 413.6 MB)
19/05/07 00:00:34 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 26.8 KB, free 413.6 MB)
19/05/07 00:00:34 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on bigdata128:39319 (size: 26.8 KB, free: 413.9 MB)
19/05/07 00:00:34 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1161
19/05/07 00:00:34 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at WordCount.scala:15) (first 15 tasks are for partitions Vector(0))
19/05/07 00:00:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
19/05/07 00:00:34 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, ANY, 7141 bytes)
19/05/07 00:00:34 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
19/05/07 00:00:35 INFO BlockManagerInfo: Removed broadcast_1_piece0 on bigdata128:39319 in memory (size: 3.3 KB, free: 413.9 MB)
19/05/07 00:00:35 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks including 1 local blocks and 0 remote blocks
19/05/07 00:00:35 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 37 ms
19/05/07 00:00:35 INFO HadoopMapRedCommitProtocol: Using output committer class org.apache.hadoop.mapred.FileOutputCommitter
19/05/07 00:00:35 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/05/07 00:00:36 INFO FileOutputCommitter: Saved output of task 'attempt_20190507000030_0005_m_000000_0' to hdfs://bigdata128:9000/test02/_temporary/0/task_20190507000030_0005_m_000000
19/05/07 00:00:36 INFO SparkHadoopMapRedUtil: attempt_20190507000030_0005_m_000000_0: Committed
19/05/07 00:00:36 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1508 bytes result sent to driver
19/05/07 00:00:36 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 1413 ms on localhost (executor driver) (1/1)
19/05/07 00:00:36 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
19/05/07 00:00:36 INFO DAGScheduler: ResultStage 1 (runJob at SparkHadoopWriter.scala:78) finished in 1.487 s
19/05/07 00:00:36 INFO DAGScheduler: Job 0 finished: runJob at SparkHadoopWriter.scala:78, took 5.558129 s
19/05/07 00:00:36 INFO SparkHadoopWriter: Job job_20190507000030_0005 committed.
19/05/07 00:00:36 INFO SparkContext: Invoking stop() from shutdown hook
19/05/07 00:00:36 INFO SparkUI: Stopped Spark web UI at http://bigdata128:4040
19/05/07 00:00:36 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/05/07 00:00:36 INFO MemoryStore: MemoryStore cleared
19/05/07 00:00:36 INFO BlockManager: BlockManager stopped
19/05/07 00:00:36 INFO BlockManagerMaster: BlockManagerMaster stopped
19/05/07 00:00:36 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/05/07 00:00:36 INFO SparkContext: Successfully stopped SparkContext
19/05/07 00:00:36 INFO ShutdownHookManager: Shutdown hook called
19/05/07 00:00:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-8a2d49ab-f86e-4aae-8015-b0e92e92efd6
19/05/07 00:00:36 INFO ShutdownHookManager: Deleting directory /tmp/spark-aa22f921-0a8b-427a-95ca-c808fa16442b
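The log prints local[*] for sc.master, which suggests this run executed in local mode inside the driver rather than on the standalone cluster started earlier. spark-submit accepts a --master option to target the cluster. The sketch below prints such a command. The URL is an assumption built from SPARK_MASTER_HOST and SPARK_MASTER_PORT in spark-env.sh; the value shown at the top of the master web UI is the one to trust. DRY_RUN and run are illustrative names.

```shell
#!/bin/sh
MASTER_URL="spark://spark-master:7077"   # assumption: check the master web UI for the real URL
JAR="wordcount_2.12-0.1.0.jar"

# Print the command when DRY_RUN=1 (the default here); execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Submit the word count to the standalone master instead of local mode.
submit_to_cluster() {
  run spark-submit --class WordCount --master "$MASTER_URL" "$JAR"
}

submit_to_cluster
```

When the job really runs on the cluster, println(sc.master) should print the spark:// URL instead of local[*].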

Now open the HDFS web UI: http://192.168.189.135:50070
Browse to test02; the part-00000 file is the program's output.
Download part-00000 and open it in a text editor to see the word counts.
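The result can also be read from the command line without the web UI. saveAsTextFile writes each pair as a (word,count) line, so a small filter can rank the words. The function name top_words and the sample lines below are made up for illustration.

```shell
#!/bin/sh
# Turn "(word,count)" lines into "count word", highest count first.
top_words() {
  sed 's/^(\(.*\),\([0-9][0-9]*\))$/\2 \1/' | sort -k1,1nr -k2
}

# Made-up sample in the (word,count) layout.
printf '%s\n' '(spark,2)' '(hello,3)' '(hadoop,1)' | top_words
# prints:
# 3 hello
# 2 spark
# 1 hadoop
```

On the cluster the call would be hadoop fs -cat /test02/part-00000 | top_words.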
Done!
A slogan matters: I am the almighty Apollo Zero!
