1. Spark SQL: creating DataFrames from a case class (reflection)
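This walkthrough uses the people.txt file that ships with the Spark distribution under examples/src/main/resources. Judging from the query result at the end, it holds three comma-separated name,age records along these lines:

Michael, 29
Andy, 30
Justin, 19

Each line is split on the comma; the first field becomes the name and the second, trimmed, becomes the age.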

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object TestDataFrame1 {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDToDataFrame").setMaster("local")
    val sc = new SparkContext(conf)
    val context = new SQLContext(sc)

    // Read the local file into an RDD and bind each line to the People case class:
    // the first column becomes name, the second (trimmed) becomes age
    val peopleRDD = sc
      .textFile("D:\\Users\\shashahu\\Desktop\\work\\spark-2.4.4\\examples\\src\\main\\resources\\people.txt")
      .map { line =>
        val fields = line.split(",")
        People(fields(0), fields(1).trim.toInt)
      }

    // Implicit conversions that add toDF() to RDDs of case classes
    import context.implicits._
    val df = peopleRDD.toDF() // convert the RDD to a DataFrame by reflection

    // Register the DataFrame as a temporary in-memory view
    df.createOrReplaceTempView("people")

    // Query the view with SQL
    context.sql("select * from people").show()
  }

  case class People(var name: String, var age: Int)
}
/** Running the code above against the official Spark sample txt data
  * produces the following output:
  *
  * 19/11/12 20:07:13 INFO FileInputFormat: Total input paths to process : 1
  * 19/11/12 20:07:13 INFO SparkContext: Starting job: show at TestDataFrame1.scala:21
  * 19/11/12 20:07:13 INFO DAGScheduler: Got job 0 (show at TestDataFrame1.scala:21) with 1 output partitions
  * 19/11/12 20:07:13 INFO DAGScheduler: Final stage: ResultStage 0 (show at TestDataFrame1.scala:21)
  * 19/11/12 20:07:13 INFO DAGScheduler: Parents of final stage: List()
  * 19/11/12 20:07:13 INFO DAGScheduler: Missing parents: List()
  * 19/11/12 20:07:13 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[6] at show at TestDataFrame1.scala:21), which has no missing parents
  * 19/11/12 20:07:13 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 10.0 KB, free 4.1 GB)
  * 19/11/12 20:07:13 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.8 KB, free 4.1 GB)
  * 19/11/12 20:07:13 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on dst001825.cn1.global.ctrip.com:53625 (size: 4.8 KB, free: 4.1 GB)
  * 19/11/12 20:07:13 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1161
  * 19/11/12 20:07:13 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at show at TestDataFrame1.scala:21) (first 15 tasks are for partitions Vector(0))
  * 19/11/12 20:07:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
  * 19/11/12 20:07:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7947 bytes)
  * 19/11/12 20:07:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
  * 19/11/12 20:07:13 INFO HadoopRDD: Input split: file:/D:/Users/shashahu/Desktop/work/spark-2.4.4/examples/src/main/resources/people.txt:0+32
  * 19/11/12 20:07:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1264 bytes result sent to driver
  * 19/11/12 20:07:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 123 ms on localhost (executor driver) (1/1)
  * 19/11/12 20:07:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
  * 19/11/12 20:07:13 INFO DAGScheduler: ResultStage 0 (show at TestDataFrame1.scala:21) finished in 0.222 s
  * 19/11/12 20:07:13 INFO DAGScheduler: Job 0 finished: show at TestDataFrame1.scala:21, took 0.260352 s
  * +-------+---+
  * |   name|age|
  * +-------+---+
  * |Michael| 29|
  * |   Andy| 30|
  * | Justin| 19|
  * +-------+---+
  *
  * 19/11/12 20:07:13 INFO SparkContext: Invoking stop() from shutdown hook
  */
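As an aside, the same case-class reflection also yields a typed Dataset. A small hedged sketch of lines that could be appended inside main right after df is created (context.implicits._ is already in scope there):

// View the same data as a typed Dataset[People];
// the encoder is derived from the case class by reflection
val ds = df.as[People]
ds.filter(_.age > 20).show() // typed filter on the age field

// The SQL query above, written with the DataFrame API instead
df.select("name", "age").show()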

The official sample txt file is converted into a table; the code and the demonstrated output are shown above.
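SQLContext still works in Spark 2.4, but the idiomatic entry point since Spark 2.0 is SparkSession. A minimal sketch of the same reflection-based flow written against SparkSession (same file path and case class as above; the object name TestDataFrame1Session is made up for this example):

import org.apache.spark.sql.SparkSession

object TestDataFrame1Session {
  case class People(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    // SparkSession bundles SparkContext and SQLContext into one entry point
    val spark = SparkSession.builder()
      .appName("RDDToDataFrame")
      .master("local")
      .getOrCreate()
    import spark.implicits._

    val df = spark.sparkContext
      .textFile("D:\\Users\\shashahu\\Desktop\\work\\spark-2.4.4\\examples\\src\\main\\resources\\people.txt")
      .map { line =>
        val fields = line.split(",")
        People(fields(0), fields(1).trim.toInt)
      }
      .toDF()

    df.createOrReplaceTempView("people")
    spark.sql("select * from people").show()
    spark.stop()
  }
}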
