Function
functions.scala
hobbies.txt (tab-separated; the first two fields are name and a comma-separated list of likes)
alice jogging,Coding,cooking 3
lina travel,dance 2
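import spark.implicits._  // needed for .toDF(); already in scope in spark-shell
case class Likes(name: String, likes: String)
// Each line is tab-separated: name, then a comma-separated list of likes.
val likes = spark.sparkContext.textFile("file:///home/hadoop/data/hobbies.txt")
val likeDF = likes.map(_.split("\t")).map(x => Likes(x(0), x(1))).toDF()
likeDF.createOrReplaceTempView("t_likes")
// Register a UDF that counts the comma-separated entries in a string.
spark.udf.register("likes_num", (x: String) => x.split(",").size)
spark.sql("select name,likes,likes_num(likes) from t_likes").show(false)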
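The function passed to spark.udf.register is an ordinary Scala function, so its logic can be checked without a Spark session. The sketch below is a minimal, Spark-free check. The object name LikesNumCheck and the parse helper are hypothetical, and the sample lines assume name and likes are separated by a tab, as the split("\t") call above implies.

```scala
// Spark-free check of the likes_num logic (LikesNumCheck and parse are hypothetical names).
object LikesNumCheck {
  // Same function body as the one registered as the likes_num UDF.
  val likesNum: String => Int = (x: String) => x.split(",").size

  // Parses one tab-separated line into (name, likes).
  def parse(line: String): (String, String) = {
    val fields = line.split("\t")
    (fields(0), fields(1))
  }

  def main(args: Array[String]): Unit = {
    // Sample lines, assuming a tab between name and likes.
    val lines = Seq("alice\tjogging,Coding,cooking", "lina\ttravel,dance")
    lines.map(parse).foreach { case (name, likes) =>
      println(s"$name $likes ${likesNum(likes)}")
    }
  }
}
```

Note that split(",") on an empty string yields one empty element, so this function returns 1, not 0, for an empty likes field.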
Spark SQL vision
1) Write less code
2) Read less data
3) Let the optimizer do the hard work
json
{"name":"zhangsan", "gender":"F", "height":160}
{"name":"lisi", "gender":"M", "height":175, "age":30}
{"name":"wangwu", "gender":"M", "height":180.3}
case class Person(name:String, age:Int, salary:Double)
Salary greater than 30000: only name is needed, not age or salary
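sc.textFile(path).map(x => {
  val Array(name, age, salary) = x.split(",")
  Person(name, age.toInt, salary.toDouble)  // fields arrive as strings; convert them
}).filter(_.salary > 30000).map(_.name)  // keep salary > 30000, then keep only name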
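-- Query as written: the filter sits outside the subquery
select name
from (
select name,salary from xxx
) t
where t.salary>30000;
-- Equivalent query with the filter pushed down into the subquery
select name
from (
select name,salary from xxx where salary>30000
) t;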
Automatic optimization: provided by Spark SQL, not Spark Core
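To see why the two query shapes are interchangeable, here is a small sketch using plain Scala collections in place of the table xxx (no Spark involved). The rows and the object name PushdownSketch are made up for illustration. It only demonstrates that filtering before or after the inner projection returns the same names; in Spark SQL it is the optimizer that performs this kind of rewrite.

```scala
// Plain Scala stand-in for the two SQL shapes (PushdownSketch is a hypothetical name).
object PushdownSketch {
  final case class Person(name: String, age: Int, salary: Double)

  // Made-up rows standing in for table xxx.
  val xxx: Seq[Person] = Seq(
    Person("zhangsan", 28, 25000.0),
    Person("lisi", 30, 42000.0),
    Person("wangwu", 35, 31000.0)
  )

  // Shape 1: project (name, salary) for every row, then filter outside.
  def filterOutside(rows: Seq[Person]): Seq[String] =
    rows.map(p => (p.name, p.salary))
      .filter { case (_, salary) => salary > 30000 }
      .map { case (name, _) => name }

  // Shape 2: filter first, so only matching rows reach the projection.
  def filterPushedDown(rows: Seq[Person]): Seq[String] =
    rows.filter(_.salary > 30000)
      .map(p => (p.name, p.salary))
      .map { case (name, _) => name }

  def main(args: Array[String]): Unit = {
    println(filterOutside(xxx))
    println(filterPushedDown(xxx))
  }
}
```

Both functions return the same list; the second simply hands fewer rows to the projection step, which is the "read less data" point.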
Spark2.x
Catalog
DF vs DS
DataFrame = Dataset[Row]
tableName id,name
                 SQL       DF        DS
Syntax errors    Runtime   Compile   Compile
Analysis errors  Runtime   Runtime   Compile
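seletc name from xx      SQL: syntax error, found at runtime
df.seletc("name")        DF: syntax error, found at compile time
df.select("nname")       DF: analysis error (no such column), found at runtime
ds.seletc("name")        DS: syntax error, found at compile time
ds.map(_.nname)          DS: analysis error (no such field), found at compile time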
Analysis errors are reported before a distributed job starts.
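The difference between the DF and DS columns can be sketched in plain Scala, without Spark: a lookup by string name is only checked when the code runs, while access to a case class field is checked by the compiler. This is an analogy, not Spark code. The Map stands in for an untyped row, and the names TypedVsUntyped, Person and untypedRow are hypothetical.

```scala
import scala.util.Try

// Analogy only: string-keyed lookup vs. compiler-checked field access.
object TypedVsUntyped {
  final case class Person(name: String, age: Int)

  // Untyped stand-in for a row: fields are looked up by string at runtime.
  val untypedRow: Map[String, Any] = Map("name" -> "lisi", "age" -> 30)

  // Typed value: field names are checked by the compiler.
  val typed: Person = Person("lisi", 30)

  // The misspelled key compiles, and fails only when evaluated.
  def lookupMisspelled(): Try[Any] = Try(untypedRow("nname"))

  def main(args: Array[String]): Unit = {
    println(lookupMisspelled().isFailure)
    println(typed.name)
    // typed.nname  // would not compile: nname is not a member of Person
  }
}
```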
1) LogApp ==> DF/DS
XXXXXX
DESC
DOMAIN1
TOP10
DOMAIN2
TOP10
DOMAIN3
TOP10
2) Optional: an external data source for text