Software versions:
JDK:1.7.0_67
Scala:2.10.4
Hadoop:2.5.0-cdh5.3.6
Spark:1.6.1
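Spark run modes:
Local: runs locally
Standalone: runs Spark applications on Spark's own built-in resource management framework
Yarn: submits Spark applications to Yarn, the same way MapReduce jobs are submitted
Mesos: a resource management framework similar to Yarn
(1) Extract the Spark package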
$ cd /opt/softwares/cdh
cdh]$ tar -zxf spark-1.6.1-bin-2.5.0-cdh5.3.6.tgz -C /opt/cdh-5.3.6/
(2) Edit the configuration file /opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/conf/spark-env.sh
$ cd /opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/conf
conf]$ cp spark-env.sh.template spark-env.sh
conf]$ vim /opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/conf/spark-env.sh
JAVA_HOME=/opt/modules/jdk1.7.0_67
SCALA_HOME=/opt/modules/scala-2.10.4
HADOOP_CONF_DIR=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/etc/hadoop
SPARK_LOCAL_IP=bigdata-senior.ibeifeng.com
(3) Run the example programs bundled with Spark
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ bin/run-example SparkPi
19/01/24 01:46:41 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 3.507869 s
Pi is roughly 3.14112
bin]$ ./run-example mllib.LinearRegression
For example, the following command runs this app on a synthetic dataset:
bin/spark-submit --class org.apache.spark.examples.mllib.LinearRegression \
examples/target/scala-*/spark-examples-*.jar \
data/mllib/sample_linear_regression_data.txt
(4) Start the NameNode and DataNode
(5) Start the spark-shell application
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ bin/spark-shell
Started SparkUI at http://192.168.74.132:4040
Spark context available as sc.
SQL context available as sqlContext.
(6) Test the Spark Local environment
1) Count the lines in a file
scala> val textFile = sc.textFile("README.md")
scala> textFile.count
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://bigdata-senior.ibeifeng.com:8020/user/beifeng/README.md
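The Hadoop environment is configured through HADOOP_CONF_DIR=/opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6/etc/hadoop, and core-site.xml in that directory sets fs.defaultFS to the HDFS file system, so Spark looks for README.md on HDFS. Because the path is relative, it is resolved against the HDFS home directory of the current user beifeng, /user/beifeng/. The file was never uploaded to HDFS, so the job fails with "Input path does not exist".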
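The resolution rule described above can be sketched in plain Scala. This is only a simplified model for illustration, not Hadoop's actual implementation: the object name PathResolutionSketch and the helper qualify are hypothetical, and the /user/<name> home directory layout is assumed from the error message above.

```scala
import java.net.URI

object PathResolutionSketch {
  // Simplified model of how a path string is qualified (hypothetical helper,
  // not Hadoop's real code).
  def qualify(path: String, defaultFs: String, user: String): String = {
    val uri = new URI(path)
    if (uri.getScheme != null) path                  // explicit scheme (file://, hdfs://): used as given
    else if (path.startsWith("/")) defaultFs + path  // absolute path: placed on the default file system
    else s"$defaultFs/user/$user/$path"              // relative path: placed under the user's home directory
  }

  def main(args: Array[String]): Unit = {
    val fs = "hdfs://bigdata-senior.ibeifeng.com:8020"
    println(qualify("README.md", fs, "beifeng"))
    println(qualify("file:///opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/README.md", fs, "beifeng"))
    println(qualify("/user/beifeng/spark/core/data/README.md", fs, "beifeng"))
  }
}
```

Under this model, the first call yields the same HDFS path that appears in the InvalidInputException, while a path with an explicit file:// scheme bypasses fs.defaultFS entirely.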
Read README.md from the local file system instead:
scala> val rdd = sc.textFile("file:///opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/README.md")
scala> rdd.count
res3: Long = 95
Upload README.md to HDFS:
cd /opt/cdh-5.3.6/hadoop-2.5.0-cdh5.3.6
hadoop-2.5.0-cdh5.3.6]$ bin/hdfs dfs -mkdir -p /user/beifeng/spark/core/data
hadoop-2.5.0-cdh5.3.6]$ bin/hdfs dfs -put /opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/README.md /user/beifeng/spark/core/data
hadoop-2.5.0-cdh5.3.6]$ bin/hdfs dfs -ls /user/beifeng/spark/core/data
19/01/24 02:44:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r-- 1 beifeng supergroup 3359 2019-01-24 02:44 /user/beifeng/spark/core/data/README.md
scala> val textFile = sc.textFile("/user/beifeng/spark/core/data/README.md")
scala> textFile.count
res4: Long = 95
scala> textFile.first
res5: String = # Apache Spark
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[6] at filter at <console>:29
scala> linesWithSpark.count
res7: Long = 17
scala> linesWithSpark.collect // list the lines containing "Spark"
res8: Array[String] = Array(# Apache Spark, Spark is a fast and general cluster computing system for Big Data. It provides, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, and Spark Streaming for stream processing., You can find the latest Spark documentation, including a programming, ## Building Spark, Spark is built using [Apache Maven](http://maven.apache.org/)., To build Spark and its example programs, run:, ["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html)., The easiest way to start using Spark is through the Scala shell:, Spark also comes with several sample programs in the `examples` directory., " ./bin/run-example SparkPi", " MASTER=spark://host:7077 ./bin/run-example SparkPi", Testing first requires [building Spark](#b...
scala> rdd.map(line => line.split(" "))
res9: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[7] at map at <console>:30
scala> rdd.map(line => line.split(" ")).count
res11: Long = 95
scala> rdd.flatMap(line => line.split(" "))
res12: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at flatMap at <console>:30
scala> rdd.flatMap(line => line.split(" ")).count
res13: Long = 507
scala> exit
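The difference between the map and flatMap counts above (95 versus 507) can be reproduced on an ordinary Scala collection. The sketch below uses local Seq values rather than an RDD, and the two sample lines are made up for illustration; the object and method names are hypothetical.

```scala
object MapVsFlatMapSketch {
  // map keeps one element per input line: each element is that line's array of tokens.
  def tokensPerLine(lines: Seq[String]): Seq[Array[String]] =
    lines.map(line => line.split(" "))

  // flatMap concatenates the per-line arrays into one flat sequence of tokens.
  def allTokens(lines: Seq[String]): Seq[String] =
    lines.flatMap(line => line.split(" "))

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input standing in for the contents of README.md.
    val lines = Seq("# Apache Spark", "Spark is fast")
    println(tokensPerLine(lines).size) // number of lines
    println(allTokens(lines).size)     // number of tokens across all lines
  }
}
```

With map the count always equals the number of lines, which is why rdd.map(...).count still returns 95; flatMap counts the tokens themselves.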
(1) WordCount
1) Read a file from HDFS into an RDD:
scala> val lines = sc.textFile("/beifeng/spark/data/word.txt")
2) Split each line into words:
scala> val words = lines.flatMap(line => line.split(" "))
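3) MapReduce-style grouping (all occurrences of a word are collected into one group, so this performs poorly and may fail with an Out Of Memory exception):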
scala> words.groupBy(word => word).map(t => (t._1,t._2.toList.size)).take(10)
4) Use the reduceByKey API:
scala> val words2 = words.map(word => (word,1))
scala> val wordCountRDD = words2.reduceByKey(_ + _)
5) Save the result (the output directory must not already exist):
scala> wordCountRDD.saveAsTextFile("/beifeng/spark/core/resulut0")
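The contrast between steps 3) and 4) can be modeled on local collections. This is only an analogy, assuming nothing beyond the Scala standard library: countByGrouping materializes every occurrence of a word before counting, as groupBy does, while countByReducing keeps a single running total per word, which is the idea behind reduceByKey combining values before the shuffle. The object and method names are hypothetical.

```scala
object WordCountSketch {
  // groupBy style: build the full group of occurrences for each word, then measure it.
  def countByGrouping(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, occurrences) => (word, occurrences.size) }

  // reduceByKey style: hold only one running total per word.
  def countByReducing(words: Seq[String]): Map[String, Int] =
    words.foldLeft(Map.empty[String, Int]) { (totals, word) =>
      totals.updated(word, totals.getOrElse(word, 0) + 1)
    }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample input standing in for word.txt.
    val words = "spark hadoop spark hive spark".split(" ").toSeq
    println(countByGrouping(words))
    println(countByReducing(words))
  }
}
```

Both functions return the same counts; they differ only in how much intermediate data they hold per key.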
(2) Get the top 10 words
1) scala> wordCountRDD.sortBy(t => t._2 * -1).take(10)
2) scala> wordCountRDD.map(t => (t._2 * -1, t)).sortByKey().map(t => t._2).take(10)
3) Custom ordering
wordCountRDD.map(_.swap).top(10).map(_.swap)
wordCountRDD.top(10)(ord = new scala.math.Ordering[(String,Int)]{
override def compare(x: (String,Int), y: (String,Int)): Int = {
x._2.compare(y._2)
}
})
Get the 10 least frequent words:
wordCountRDD.top(10)(ord = new scala.math.Ordering[(String,Int)]{
override def compare(x: (String,Int), y: (String,Int)): Int = {
y._2.compare(x._2)
}
})
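The effect of swapping x and y in compare can be checked on a local collection. The sketch below models top(n) as "sort in descending order under the given Ordering, then take n", which matches the documented behavior of RDD.top; the local top helper, the object name, and the sample counts are hypothetical and do not use Spark.

```scala
object TopOrderingSketch {
  // Local stand-in for RDD.top: the n largest elements according to ord, largest first.
  def top[T](xs: Seq[T], n: Int)(implicit ord: Ordering[T]): Seq[T] =
    xs.sorted(ord.reverse).take(n)

  // Orders (word, count) pairs by count only, like the first custom Ordering above.
  val byCount: Ordering[(String, Int)] = Ordering.by[(String, Int), Int](_._2)

  def main(args: Array[String]): Unit = {
    // Hypothetical word counts.
    val counts = Seq(("spark", 3), ("hive", 1), ("hadoop", 2))
    println(top(counts, 2)(byCount))         // most frequent words
    println(top(counts, 2)(byCount.reverse)) // least frequent words
  }
}
```

Reversing the Ordering makes the smallest count rank as "largest", so the same top call returns the least frequent words.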
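How would a TopN program be implemented in MapReduce?
- Sort all the data, then take the first N records
Partitioner: send all the data to a single partition (one ReduceTask)
Sort comparator: sort the data in descending order
Grouping comparator: put all the data into the same group
In the reduce method, output the first N records
- Optimization
MapTask: keep a collection of size N+1 (a priority queue) in the current JVM, holding the largest N+1 records of the current task in sorted order, and output the top N records in the cleanup method
ReduceTask: aggregate all the data into a single ReduceTask, apply the same logic as in the MapTask, and output the result in cleanup to obtain the final data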
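The optimized approach can be sketched in plain Scala, without the MapReduce API. Here topN plays the role of one task: it keeps a min-heap that briefly grows to N+1 entries and then evicts the smallest, so memory stays bounded. Each inner Seq stands in for the input of one MapTask, and the final call stands in for the single ReduceTask. The names and sample data are hypothetical.

```scala
import scala.collection.mutable

object TopNSketch {
  // Returns the n records with the largest counts, largest first.
  def topN(records: Iterator[(String, Int)], n: Int): List[(String, Int)] = {
    // PriorityQueue dequeues the largest element under its ordering;
    // with the ordering reversed, dequeue() removes the smallest count.
    val heap = mutable.PriorityQueue.empty[(String, Int)](
      Ordering.by[(String, Int), Int](_._2).reverse)
    records.foreach { record =>
      heap.enqueue(record)
      if (heap.size > n) heap.dequeue() // the heap held n + 1 entries: drop the smallest
    }
    heap.toList.sortBy(-_._2)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical per-task inputs.
    val partitions = Seq(
      Seq(("a", 5), ("b", 1), ("c", 3)),
      Seq(("d", 4), ("e", 2)))
    val perTask = partitions.map(p => topN(p.iterator, 2)) // map side: local top N per task
    val result = topN(perTask.flatten.iterator, 2)         // reduce side: top N of the candidates
    println(result)
  }
}
```

Because every task forwards at most N records, the single reducer only has to merge N records per MapTask instead of the full data set.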
(1) Edit the configuration file /opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/conf/spark-env.sh
SPARK_MASTER_IP=bigdata-senior.ibeifeng.com
SPARK_MASTER_PORT=7070
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_PORT=7071
SPARK_WORKER_WEBUI_PORT=8081
SPARK_WORKER_INSTANCES=2
(2) Configure the slave hosts
$ cd /opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/conf
conf]$ cp slaves.template slaves
conf]$ vim slaves
bigdata-senior.ibeifeng.com
(3) Start the master and slaves (start them separately, or run sbin/start-all.sh to start both at once)
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ sbin/start-master.sh
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ sbin/start-slaves.sh
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ sbin/start-all.sh
$ jps
4301 Master
2440 DataNode
4403 Worker
2368 NameNode
4496 Jps
4443 Worker
(4) Kill one worker process, then start a slave on its own
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ kill -9 4443
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ sbin/start-slave.sh spark://bigdata-senior.ibeifeng.com:7070
$ jps
4301 Master
2440 DataNode
4403 Worker
2368 NameNode
4786 Jps
4733 Worker
(5) Test Spark on Standalone
Format: --master MASTER_URL, e.g. --master spark://host:port, mesos://host:port, yarn, or local.
Start the spark-shell application:
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ bin/spark-shell --master spark://bigdata-senior.ibeifeng.com:7070
(6) Stop the master and slaves
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ sbin/stop-master.sh
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ sbin/stop-slaves.sh
Reference: http://spark.apache.org/docs/1.6.1/spark-standalone.html#high-availability
(1)Single-Node Recovery with Local File System
Recovery for a single Master, based on the local file system:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM -Dspark.deploy.recoveryDirectory=/tmp"
(2)Standby Masters with ZooKeeper
Master HA based on ZooKeeper (hot standby):
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=hadoop1:2181,hadoop2:2181,hadoop3:2181 -Dspark.deploy.zookeeper.dir=/spark"
(1) Create the HDFS directory that stores the execution logs of Spark applications
hadoop-2.5.0-cdh5.3.6]$ bin/hdfs dfs -mkdir -p /user/beifeng/spark/history
(2) Edit the configuration file /opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/conf/spark-defaults.conf to enable event logging
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ cp conf/spark-defaults.conf.template conf/spark-defaults.conf
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ vim conf/spark-defaults.conf
spark.eventLog.enabled true
spark.eventLog.dir hdfs://bigdata-senior.ibeifeng.com:8020/user/beifeng/spark/history
(3) Edit the configuration file /opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/conf/spark-env.sh and add the SPARK_HISTORY_OPTS parameter
$ cd /opt/cdh-5.3.6/spark-1.6.1-bin-2.5.0-cdh5.3.6/conf
conf]$ vim spark-env.sh
SPARK_MASTER_IP=bigdata-senior.ibeifeng.com
SPARK_MASTER_PORT=7070
SPARK_MASTER_WEBUI_PORT=8080
SPARK_WORKER_CORES=2
SPARK_WORKER_MEMORY=2g
SPARK_WORKER_PORT=7071
SPARK_WORKER_WEBUI_PORT=8081
SPARK_WORKER_INSTANCES=2
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs://bigdata-senior.ibeifeng.com:8020/user/beifeng/spark/history"
(4) Start the Spark history server
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ sbin/start-history-server.sh
$ jps
4301 Master
2440 DataNode
4403 Worker
5101 Jps
2368 NameNode
5046 HistoryServer
4733 Worker
(5) Open the web UI
http://bigdata-senior.ibeifeng.com:18080/
(6) Spark Job History REST API:
http://bigdata-senior:18080/api/v1/applications
http://bigdata-senior:18080/api/v1/applications/local-1494752327417/jobs
http://bigdata-senior:18080/api/v1/applications/local-1494752327417/logs
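These endpoints return JSON and can also be queried from code. The following is a minimal sketch that uses only the Scala standard library and assumes the history server from the steps above is reachable at bigdata-senior:18080. The object HistoryRestSketch and its helpers are hypothetical names, and the application ID must be replaced with one listed by your own history server.

```scala
import scala.io.Source

object HistoryRestSketch {
  // Base URL of the history server REST API (host and port taken from the URLs above).
  val base = "http://bigdata-senior:18080/api/v1"

  // Endpoint that lists all applications known to the history server.
  def applicationsUrl: String = s"$base/applications"

  // Endpoint that lists the jobs of one application.
  def jobsUrl(appId: String): String = s"$base/applications/$appId/jobs"

  // Downloads the response body as text and always closes the connection.
  def fetch(url: String): String = {
    val source = Source.fromURL(url, "UTF-8")
    try source.mkString finally source.close()
  }

  def main(args: Array[String]): Unit = {
    println(fetch(applicationsUrl))
    println(fetch(jobsUrl("local-1494752327417")))
  }
}
```

The main method only succeeds while the history server is running; the URL helpers can be used on their own.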
(7) Stop the Spark history server
spark-1.6.1-bin-2.5.0-cdh5.3.6]$ sbin/stop-history-server.sh