24-SparkSQL04

Function

functions.scala

hobbies.txt (tab-separated)

alice  jogging,Coding,cooking  3

lina    travel,dance            2

case class Likes(name:String,likes:String)

val likes = spark.sparkContext.textFile("file:///home/hadoop/data/hobbies.txt")

val likeDF = likes.map(_.split("\t")).map(x=>Likes(x(0),x(1))).toDF()

likeDF.createOrReplaceTempView("t_likes")

spark.udf.register("likes_num", (x:String)=>x.split(",").size)

spark.sql("select name,likes,likes_num(likes) from t_likes").show(false)
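For comparison, the same count can be computed without registering a UDF, using the built-in `split` and `size` functions defined in functions.scala. This is a minimal sketch, assuming Spark with Scala 2.12/2.13; the local SparkSession, the inline rows standing in for hobbies.txt, and the `likes_num` column name are illustrative choices, not part of the original notes.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, size, split}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("likes-num").getOrCreate()
import spark.implicits._

// Inline rows standing in for hobbies.txt (name, comma-separated likes).
val likeDF = Seq(
  ("alice", "jogging,Coding,cooking"),
  ("lina", "travel,dance")
).toDF("name", "likes")

// split() turns the string into an array column; size() counts its elements.
val withNum = likeDF.withColumn("likes_num", size(split(col("likes"), ",")))
withNum.show(false)
```

Built-in functions are visible to the optimizer, whereas a UDF is opaque to it, so the built-in form is usually worth trying first.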

Spark SQL vision

1) Write less code

2) Read less data

3) Let the optimizer do the hard work

json

{"name":"zhangsan", "gender":"F", "height":160}

{"name":"lisi", "gender":"M", "height":175, "age":30}

{"name":"wangwu", "gender":"M", "height":180.3}
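Records like the ones above, where fields differ from line to line, can be loaded without a hand-written schema, which is one way "write less code" shows up in practice. The following is a sketch under assumptions: a local SparkSession and the three records supplied inline as a `Dataset[String]` rather than read from a file.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-infer").getOrCreate()
import spark.implicits._

// The three JSON records from the notes, one per element.
val records = Seq(
  """{"name":"zhangsan", "gender":"F", "height":160}""",
  """{"name":"lisi", "gender":"M", "height":175, "age":30}""",
  """{"name":"wangwu", "gender":"M", "height":180.3}"""
).toDS()

// Spark infers one schema covering all records; missing fields become null.
val people = spark.read.json(records)
people.printSchema()
people.select("name", "age").show()
```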

case class Person(name:String, age:Int, salary:Double)

Requirement: salary greater than 30000; only name is needed, not age or salary

sc.textFile(path).map(x=>{

val Array(name,age,salary) = x.split(",")

Person(name,age.toInt,salary.toDouble)

}).filter(_.salary>30000).map(_.name)

select name

from (

select name,salary from xxx

) t

where t.salary>30000;

select name

from (

select name,salary from xxx where salary>30000

) t;

Automatic optimization: pushing the filter down into the subquery is done by Spark SQL, not by Spark Core
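The rewrite between the two queries above can be inspected by asking Spark SQL for its plans. This is a sketch under assumptions: a local SparkSession and two made-up rows registered as the view `xxx`; the exact plan text varies by Spark version and data source, so none is shown here.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("pushdown").getOrCreate()
import spark.implicits._

// Hypothetical sample rows (name, age, salary) registered under the name used in the notes.
Seq(("a", 30, 40000.0), ("b", 25, 20000.0))
  .toDF("name", "age", "salary")
  .createOrReplaceTempView("xxx")

// The first form from the notes: the filter is written outside the subquery.
val result = spark.sql(
  """select name
    |from (select name, salary from xxx) t
    |where t.salary > 30000""".stripMargin)

// Prints the parsed, analyzed, optimized and physical plans for comparison.
result.explain(true)
result.show()
```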

Spark2.x

Catalog
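In Spark 2.x the catalog is reachable from the session as `spark.catalog`. A minimal sketch of browsing it follows, assuming a local SparkSession without Hive support; the view name `t_demo` is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("catalog").getOrCreate()

// Register a small temporary view so the catalog has something to list.
spark.range(3).createOrReplaceTempView("t_demo")

val catalog = spark.catalog
catalog.listDatabases().show(false)        // databases known to this session
catalog.listTables().show(false)           // tables and temporary views
catalog.listColumns("t_demo").show(false)  // columns of one table or view
```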

DF vs DS

DataFrame = Dataset[Row]

tableName  id,name

                    SQL        DF            DS

Syntax errors       Runtime    Compile time  Compile time

Analysis errors     Runtime    Runtime       Compile time

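seletc name from xx        -- SQL syntax error: found at runtime

df.seletc("name")          // DF syntax error: found at compile time

df.select("nname")         // DF analysis error: found at runtime

ds.seletc("name")          // DS syntax error: found at compile time

ds.map(_.nname)            // DS analysis error: found at compile time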

Analysis errors are reported before a distributed job starts.
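The DataFrame and Dataset cases can be tried side by side. This is a sketch under assumptions: Spark with Scala 2.12/2.13, a local SparkSession, and a hypothetical case class `User` matching the `id,name` layout mentioned earlier. The misspelled DataFrame column compiles but fails in analysis as soon as `select` is called, before any job runs; the misspelled Dataset field would not compile at all, so it appears only in a comment.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.Try

// Hypothetical record type for a table with columns id and name.
case class User(id: Long, name: String)

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("df-vs-ds").getOrCreate()
import spark.implicits._

val df = Seq((1L, "a"), (2L, "b")).toDF("id", "name")
val ds = df.as[User]

// DataFrame: "nname" is just a string to the compiler; analysis rejects it at runtime.
val dfAttempt = Try(df.select("nname"))
println(dfAttempt.isFailure)

// Dataset: ds.map(_.nname) is rejected by the compiler; the correct field works.
val names = ds.map(_.name).collect()
println(names.mkString(","))
```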

1) LogApp ==> DF/DS

XXXXXX

DESC

DOMAIN1

TOP10

DOMAIN2

TOP10

DOMAIN3

TOP10
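One common way to get a top-N per domain with the DataFrame API is a window function. This is only a sketch, not the expected solution to the exercise: the column names `domain`, `url` and `metric` are hypothetical (`metric` stands in for the unnamed sort column above), the rows are made up, and `topN` is set to 2 to keep the output short.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("topn").getOrCreate()
import spark.implicits._

// Hypothetical log rows: (domain, url, metric).
val logs = Seq(
  ("domain1", "/a", 30L), ("domain1", "/b", 10L), ("domain1", "/c", 20L),
  ("domain2", "/x", 5L), ("domain2", "/y", 7L)
).toDF("domain", "url", "metric")

val topN = 2 // the exercise asks for 10

// Rank rows within each domain by metric, descending, then keep the first topN.
val byDomain = Window.partitionBy("domain").orderBy(col("metric").desc)
val top = logs
  .withColumn("rn", row_number().over(byDomain))
  .where(col("rn") <= topN)
  .drop("rn")

top.orderBy(col("domain"), col("metric").desc).show(false)
```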

2) Optional: an external data source for text
