[Spark] Implementing SparkSQL's Filter Operation by Calling the RDD Directly

Filtering the data with SQL

import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark sql whole stage example")
      .master("local")
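      // disable whole-stage codegen so the physical plan is easier to inspect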
      .config("spark.sql.codegen.wholeStage", false)
      .getOrCreate()

    // assign the DataFrame so it can be registered as a temp view below
    val ds = spark.read.json("/tmp/people.json")
    ds.createOrReplaceTempView("people")
    val data = spark.sql("select * from people where age > 5")
    data.show()
    spark.stop()
  }
}

Program output

+---+------+
|age|  name|
+---+------+
| 30|   AAA|
| 30|   BBB|
| 30|   CCC|
| 30|   DDD|
| 30|  Andy|
| 19|Justin|
+---+------+
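
Between writing SQL and dropping all the way down to the RDD there is a middle ground: the DataFrame/Dataset API. Here is a minimal sketch of the same filter through that API, assuming the same /tmp/people.json and SparkSession as above:

import org.apache.spark.sql.functions.col

val df = spark.read.json("/tmp/people.json")
// equivalent to "select * from people where age > 5";
// Catalyst plans this the same way as the SQL version
df.filter(col("age") > 5).show()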

Filtering directly with the RDD filter function

import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark sql whole stage example")
      .master("local")
      .config("spark.sql.codegen.wholeStage", false)
      .getOrCreate()

    val ds = spark.read.json("/tmp/people.json")
    // drop from the DataFrame down to an RDD[Row]
    val ds_rdd = ds.rdd
    val filter_rdd = ds_rdd.filter(
      // column 0 is "age": JSON schema inference orders fields alphabetically
      i => i.getAs[Long](0) > 5
    )
    // RDD.foreach; the println output is visible here because master is "local"
    for (i <- filter_rdd) {
      println((i.get(0), i.get(1)))
    }
    spark.stop()
  }
}

Program output

(30,AAA)
(30,BBB)
(30,CCC)
(30,DDD)
(30,Andy)
(19,Justin)
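
Indexing the Row by position works here only because JSON schema inference orders the fields alphabetically, which puts age at column 0. Below is a more defensive sketch (the safe_filter_rdd name is just for illustration) that resolves the column by name and skips rows with a missing age, such as the Michael record in the input:

val safe_filter_rdd = ds.rdd.filter { row =>
  val idx = row.fieldIndex("age")  // resolve the column by name, not position
  !row.isNullAt(idx) && row.getAs[Long](idx) > 5
}
// collect() first so the println runs on the driver even outside local mode
safe_filter_rdd.collect().foreach { row =>
  println((row.getAs[Long]("age"), row.getAs[String]("name")))
}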

people.json

{"name":"AAA", "age":30}
{"name":"BBB", "age":30}
{"name":"CCC", "age":30}
{"name":"DDD", "age":30}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
{"name":"Anar", "age":3}
{"name":"Michael"}

The data structure of each row in the RDD is as follows:
(Figure: UnsafeRow example)
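
To look at the UnsafeRow instances themselves rather than the external Row objects returned by ds.rdd, one possible route is queryExecution.toRdd, which exposes the physical plan's output as RDD[InternalRow]. A small sketch follows, with the caveat that QueryExecution and InternalRow are internal Catalyst APIs that can change between Spark versions:

// internal API: the elements are InternalRow (concretely, UnsafeRow)
val internalRdd = ds.queryExecution.toRdd
// copy() each row, since Spark may reuse the underlying row buffers
internalRdd.map(_.copy()).collect().foreach { row =>
  // UnsafeRow.toString prints the backing bytes as hex words
  println((row.getClass.getSimpleName, row))
}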

References
UnsafeRow
