./bin/spark-shell
scala> val textFile = spark.read.textFile("README.md")
textFile: org.apache.spark.sql.Dataset[String] = [value: string]
scala> textFile.count() // Number of items in this Dataset
res0: Long = 126 // May be different from yours as README.md will change over time, similar to other outputs
scala> textFile.first() // First item in this Dataset
res1: String = # Apache Spark
filter
filter returns a new Dataset containing a subset of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
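The next few lines switch from the Dataset API to the lower-level RDD API and filter an RDD named distData, which is not created anywhere above. A minimal sketch of how such an RDD could be built in the shell, assuming sc is the SparkContext provided by spark-shell (the contents 1 through 5 are an assumption inferred from the filtered output shown further down):
scala> val data = Array(1, 2, 3, 4, 5)        // assumed sample data
scala> val distData = sc.parallelize(data)    // distribute it as an RDD[Int]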
// The filter function transforms the distData RDD into a new RDD
scala> val distDataFiltered = distData.filter(e => e > 2)
distDataFiltered: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[6] at filter at <console>:25
// Trigger an action (covered in more detail later) to inspect the filtered contents
// Note: collect is only suitable when the amount of data is small
scala> distDataFiltered.collect
res3: Array[Int] = Array(3, 4, 5)
A transformation produces a new RDD from an existing one. It is important to note that all transformations are lazy: much like lazy evaluation in Scala, a transformation is not executed right away. Spark merely records the transformation to be applied to the dataset and only runs it when a result is actually needed. For example, distData.filter(e => e > 2) does not execute when it is defined; it only runs when an action such as distDataFiltered.collect is called.
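As a small illustrative sketch of this laziness (reusing the assumed distData RDD from above; not part of the original shell session), the transformation line returns immediately without touching the data, and only the action triggers the computation:
scala> val squares = distData.map(e => e * e)    // transformation: only recorded, nothing runs yet
scala> squares.count()                           // action: this is what actually executes the map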
Using transformations and actions together
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.sql.Dataset[String] = [value: string]
reduce
Suppose we want to find the line with the most words. map first converts each line into an integer value, producing a new Dataset, and reduce is then called on that Dataset to find the largest word count. The arguments to map and reduce are Scala function literals (closures), and they can use any language feature or Scala/Java library.
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a > b) a else b)
res4: Int = 15
scala> import java.lang.Math
import java.lang.Math
scala> textFile.map(line => line.split(" ").size).reduce((a, b) => Math.max(a, b))
res5: Int = 15
scala> val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
wordCounts: org.apache.spark.sql.Dataset[(String, Long)] = [value: string, count(1): bigint]
flatMap
Here flatMap converts the Dataset of lines into a Dataset of words, and groupByKey and count are then combined to compute the per-word counts in the file as a Dataset of (String, Long) pairs. To collect the word counts in our shell, we can call collect:
scala> wordCounts.collect()
res6: Array[(String, Long)] = Array((means,1), (under,2), (this,3), (Because,1), (Python,2), (agree,1), (cluster.,1), ...)
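For comparison, the same word count can also be expressed with the RDD API using reduceByKey; this is a sketch rather than part of the original session:
scala> val rddCounts = textFile.rdd.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> rddCounts.take(5)    // inspect a few (word, count) pairs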
Spark also supports pulling datasets into a cluster-wide in-memory cache. Let's mark the linesWithSpark Dataset to be cached:
scala> linesWithSpark.cache()
res7: linesWithSpark.type = [value: string]
scala> linesWithSpark.count()
res8: Long = 15
scala> linesWithSpark.count()
res9: Long = 15
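When the cached data is no longer needed, it can be released from memory again; a brief sketch:
scala> linesWithSpark.unpersist()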
/* SimpleApp.scala */
import org.apache.spark.sql.SparkSession

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val spark = SparkSession.builder.appName("Simple Application").getOrCreate()
    val logData = spark.read.textFile(logFile).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println(s"Lines with a: $numAs, Lines with b: $numBs")
    spark.stop()
  }
}
Note that applications should define a main() method instead of extending scala.App; subclasses of scala.App may not work correctly.
This program simply counts the number of lines containing 'a' and the number of lines containing 'b' in the Spark README.
Note that you will need to replace YOUR_SPARK_HOME with the location where Spark is installed. Unlike the earlier examples, where the Spark shell initializes its own SparkSession, here we initialize the SparkSession as part of the program: we call SparkSession.builder to construct a SparkSession, set the application name, and finally call getOrCreate to obtain the SparkSession instance.
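For quick experiments that run the application locally without spark-submit, the builder can also set a master URL directly; a minimal sketch (the local[*] master is an assumption for local testing and is not part of the original program, where the master is supplied by spark-submit):
val spark = SparkSession.builder
  .appName("Simple Application")
  .master("local[*]") // assumption: run locally on all cores; omit when launching via spark-submit
  .getOrCreate()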
Our application depends on the Spark API, so we also include an sbt configuration file, build.sbt, which declares Spark as a dependency:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
For sbt to work correctly, we need to lay out SimpleApp.scala and build.sbt according to the typical directory structure. Once that is in place, we can use sbt to create a JAR package containing the application's code, and then run our program with the spark-submit script.
# Your directory layout should look like this
$ find .
.
./build.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
# Package a jar containing your application
$ sbt package
...
[info] Packaging {..}/{..}/target/scala-2.11/simple-project_2.11-1.0.jar
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--class "SimpleApp" \
--master local[4] \
target/scala-2.11/simple-project_2.11-1.0.jar
...
Lines with a: 46, Lines with b: 23