spark-submit is the counterpart of the hadoop jar command ---> hadoop jar submits a MapReduce job (jar file)
spark-submit submits a Spark job (jar file)
# java python r resources scala
#resources ----> test data (formats: txt, json, avro, parquet columnar storage files) --> used in Spark SQL
[root@BigData11 spark-2.1.0-bin-hadoop2.7]# bin/spark-submit --master spark://BigData11:7077 --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.1.0.jar 100
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/10/12 21:22:36 INFO SparkContext: Running Spark version 2.1.0
Pi is roughly 3.141484157074208
(*) Local mode: bin/spark-shell
Spark context Web UI available at
Spark context available as 'sc' (master = local[*], app id = local-1528291341116).
Spark session available as 'spark'.
(*) Cluster mode: connect to the cluster and run tasks on the cluster, similar to Storm's cluster mode
bin/spark-shell --master spark://bigdata111:7077
Spark context Web UI available at
Spark context available as 'sc' (master = spark://bigdata111:7077, app id = app-20180606212511-0001).
Spark session available as 'spark'.
sc.textFile reading from HDFS: sc.textFile("hdfs://bigdata111:9000/input/data.txt")
sc.textFile reading a local file: sc.textFile("/root/temp/data.txt")
sc.textFile("hdfs://bigdata111:9000/input/data.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("hdfs://bigdata111:9000/output/0606/spark")
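var rdd1 = sc.textFile("hdfs://bigdata111:9000/input/data.txt"): lazy, the data is not read until an action runs
var rdd2 = rdd1.flatMap(_.split(" ")): splits each line into words, then merges the words into one flat collection (shown as an Array by collect)
var rdd3 = rdd2.map((_,1)) : counts each word once
Full form: rdd2.map(word=>(word,1))
var rdd4 = rdd3.reduceByKey(_+_): sums the values that share the same key
Note: reduceByKey(_+_) in full form: reduceByKey((a,b)=>a+b)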
1+2 = 3
3+4 = 7
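The pairwise accumulation above can be sketched with plain Scala collections, without a cluster. This is only a local stand-in for reduceByKey, not Spark's implementation; the names addShort, addFull, valuesForKey and pairs are made up for illustration, and the values 1, 2, 4 are assumed so that the totals match the arithmetic above.

```scala
// The shorthand and the full form describe the same function.
val addShort: (Int, Int) => Int = _ + _
val addFull: (Int, Int) => Int = (a, b) => a + b

// Hypothetical values for one key, chosen to match the arithmetic above.
val valuesForKey = Seq(1, 2, 4)

// Running totals: 1, then 1+2 = 3, then 3+4 = 7.
val runningTotals = valuesForKey.scanLeft(0)(addShort).tail
val total = valuesForKey.reduce(addFull)

// Local stand-in for reduceByKey: group the pairs by key,
// then reduce the values of each group with the same function.
val pairs = Seq(("love", 1), ("I", 1), ("love", 1))
val counts = pairs.groupBy(_._1).map { case (key, kvs) =>
  (key, kvs.map(_._2).reduce(addShort))
}
```

On a real RDD the values of one key are combined inside each partition first and then across partitions, so the function passed to reduceByKey should be associative and commutative, as addition is.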
scala> var rdd1 = sc.textFile("hdfs://")
rdd1: org.apache.spark.rdd.RDD[String] = hdfs:// MapPartitionsRDD[1] at textFile at :24
scala> rdd1.
scala> rdd1.collect
res0: Array[String] = Array(I love Beijing, I love China, Beijing is the captial of the China)
scala> var rdd2=rdd1.flatMap(_.split(" "))
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at :26
scala> rdd2.collect
res2: Array[String] = Array(I, love, Beijing, I, love, China, Beijing, is, the, captial, of, the, China)
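To see why flatMap is used here instead of map, the same split can be tried on a local Seq. This is a sketch on plain Scala collections rather than an RDD; the names lines, nested and words are illustrative only.

```scala
// Two of the sample lines from the transcript.
val lines = Seq("I love Beijing", "I love China")

// map keeps one element per line: a collection of word collections.
val nested = lines.map(_.split(" ").toSeq)

// flatMap splits each line and flattens the result into a single collection of words.
val words = lines.flatMap(_.split(" "))
```

nested still has two elements (one per line), while words has six (one per word), which is the shape that rdd2.collect shows.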
scala> var rdd3 = rdd2.map((_,1))
rdd3: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at :28
scala> rdd3.collect
res3: Array[(String, Int)] = Array((I,1), (love,1), (Beijing,1), (I,1), (love,1), (China,1), (Beijing,1), (is,1), (the,1), (captial,1), (of,1), (the,1), (China,1))
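The placeholder form (_,1) and the explicit lambda build the same pairs. A small local check, assuming a plain Seq in place of the RDD (the names sample, shortForm and fullForm are illustrative):

```scala
val sample = Seq("I", "love", "Beijing")

// Placeholder syntax: the underscore stands for the current element.
val shortForm = sample.map((_, 1))

// The same mapping written with a named parameter.
val fullForm = sample.map(word => (word, 1))
```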
scala> rdd3.
scala> var rdd4 = rdd3.reduceByKey(_+_)
rdd4: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at :30
scala> rdd4.collect
res4: Array[(String, Int)] = Array((is,1), (love,2), (captial,1), (Beijing,2), (China,2), (I,2), (of,1), (the,2))
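The counts in res4 can be cross-checked without a cluster by running the same steps on a local Seq. This is a sketch with plain Scala collections, where groupBy plus sum stands in for reduceByKey; docLines and wordCounts are illustrative names, and the spelling "captial" is kept as it appears in the sample data.

```scala
// The three lines returned by rdd1.collect in the transcript.
val docLines = Seq(
  "I love Beijing",
  "I love China",
  "Beijing is the captial of the China"
)

val wordCounts = docLines
  .flatMap(_.split(" "))   // split every line into words
  .map((_, 1))             // pair each word with the count 1
  .groupBy(_._1)           // local stand-in for the shuffle by key
  .map { case (word, ones) => (word, ones.map(_._2).sum) }
```

The result is a Map, so the order of the entries may differ from the Array printed by rdd4.collect, but the count for each word should be the same.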
scala> var rdd5= rdd4.saveAsTextFile("hdfs://")
saveAsTextFile("hdfs://") # the output directory must not already exist, otherwise an exception is thrown
org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs:// already exists
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)
Listing the output on HDFS shows two partitions (one part file per partition)
scala> rdd4.r
scala> rdd4.repartition(1).saveAsTextFile("hdfs://")