Example: a data file (users.txt) with the following contents:
anne 22 NY
joe 39 CO
alison 35 NY
mike 69 VA
marie 27 OR
jim 21 OR
bob 71 CA
mary 53 NY
dave 36 VA
dude 50 CA
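A minimal setup sketch for the snippets below, assuming the data is saved as users.txt (space-separated) and loaded via the case-class approach shown in the full example later in this section; the file path is a placeholder:
import org.apache.spark.sql.SparkSession
case class Person(name: String, age: Int, address: String)
val session = SparkSession.builder().master("local").appName("df-demo").getOrCreate()
val person_df_rdd = session.createDataFrame(
  session.sparkContext.textFile("file:///f:/dataTest/users.txt")
    .map(_.split(" "))
    .map(x => Person(x(0), x(1).toInt, x(2))))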
1. Displaying data: show()
df.show()
【with no argument, shows up to the first 20 rows】
df.show(5)
【pass a number n to show only the first n rows】
2. Projection: select()
Call select(column names):
import session.implicits._ // bring the current session's implicit conversions ($"..." column syntax) into scope
person_df_rdd.select($"name",$"age").show()
+------+---+
| name|age|
+------+---+
| anne| 22|
| joe| 39|
|alison| 35|
| mike| 69|
| marie| 27|
| jim| 21|
| bob| 71|
| mary| 53|
| dave| 36|
| dude| 50|
+------+---+
person_df_rdd.select($"name",$"age"+2).show(); //将列的别名执行一些运算
+------+---------+
| name|(age + 2)|
+------+---------+
| anne| 24|
| joe| 41|
|alison| 37|
| mike| 71|
| marie| 29|
| jim| 23|
| bob| 73|
| mary| 55|
| dave| 38|
| dude| 52|
+------+---------+
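The computed column is labeled (age + 2) by default; a small sketch (an addition, not in the original) renaming it with alias():
person_df_rdd.select($"name", ($"age" + 2).alias("age_plus_2")).show(2)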
3. Conditional queries: where()
// Range query (the Column form and the SQL-string form below are equivalent):
person_df_rdd.select($"name",$"age"+2).where($"age" > 45).show(3)
person_df_rdd.select($"name",$"age"+2).where("age > 45").show(3)
+----+---------+
|name|(age + 2)|
+----+---------+
|mike| 71|
| bob| 73|
|mary| 55|
+----+---------+
only showing top 3 rows
// Equality query (string literals must be wrapped in single quotes):
person_df_rdd.select($"name",$"age"+2).where("address='NY'").show(2)
person_df_rdd.select("name","age").where("age=35").show()
+------+---------+
| name|(age + 2)|
+------+---------+
| anne| 24|
|alison| 37|
+------+---------+
only showing top 2 rows
+------+---+
| name|age|
+------+---+
|alison| 35|
+------+---+
// LIKE query (pattern matching with SQL wildcards):
person_df_rdd.select($"name",$"address").where("name like '%d%'").show()
person_df_rdd.select("name","address").where("name like '%d%'").show()
+----+-------+
|name|address|
+----+-------+
|dave| VA|
|dude| CA|
+----+-------+
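where() also accepts combined predicates; a hedged sketch using the Column operators && and === (the particular conditions are illustrative):
person_df_rdd.select($"name", $"age").where($"age" > 30 && $"address" === "NY").show()
// equivalent SQL-string form:
person_df_rdd.select($"name", $"age").where("age > 30 and address = 'NY'").show()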
4. Filtering: filter()
person_df_rdd.filter("name='anne'").show()
+----+---+-------+
|name|age|address|
+----+---+-------+
|anne| 22| NY|
+----+---+-------+
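filter() is equivalent to where(); on a typed Dataset it can also take a plain Scala lambda. A sketch assuming ds_rdd is the Dataset[Person] created in the full example below:
person_df_rdd.filter($"age" > 60).show() // Column-based, same as where()
ds_rdd.filter(p => p.age > 60).show()    // typed lambda, Dataset only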
5. Grouping: groupBy()
person_df_rdd.groupBy($"address").count().show()
+-------+-----+
|address|count|
+-------+-----+
| OR| 2|
| VA| 2|
| CA| 2|
| NY| 3|
| CO| 1|
+-------+-----+
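Beyond count(), groupBy returns a grouped dataset that supports other aggregates; a minimal sketch (avg and max come from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.{avg, max}
person_df_rdd.groupBy($"address").agg(avg($"age"), max($"age")).show()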
6. Sorting: orderBy()
person_df_rdd.orderBy("age").show()
+------+---+-------+
| name|age|address|
+------+---+-------+
| jim| 21| OR|
| anne| 22| NY|
| marie| 27| OR|
|alison| 35| NY|
| dave| 36| VA|
| joe| 39| CO|
| dude| 50| CA|
| mary| 53| NY|
| mike| 69| VA|
| bob| 71| CA|
+------+---+-------+
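orderBy defaults to ascending order; a sketch of descending and multi-column sorts:
person_df_rdd.orderBy($"age".desc).show()
person_df_rdd.orderBy($"address", $"age".desc).show()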
Registering a DataFrame as a table for SQL queries:
1. Temporary view: valid only within the current session
val df_rdd = session1.createDataFrame(rdd2)
df_rdd.createOrReplaceTempView("person") // register the temporary view "person"
session1.sql("select * from person where age=21").show() // run SQL directly in the current session
session1.sql("select * from person where name like '%d%'").show()
session1.newSession().sql("select * from person where name like '%d%'").show() // run the same SQL from a new session spawned off the current one
+----+---+-------+
|name|age|address|
+----+---+-------+
| jim| 21| OR|
+----+---+-------+
+----+---+-------+
|name|age|address|
+----+---+-------+
|dave| 36| VA|
|dude| 50| CA|
+----+---+-------+
The newSession() query throws, because a temporary view is only visible in the session that created it:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Table or view not found: person; line 1 pos 15
2. Global temporary view: visible to all sessions
Note: when querying a global temporary view, the view name must be prefixed with the global_temp database:
df_rdd.createGlobalTempView("gperson")
session1.sql("select * from global_temp.gperson where age>40").show()
session1.sql("select age,name from global_temp.gperson where name like 'a%'").show()
+----+---+-------+
|name|age|address|
+----+---+-------+
|mike| 69| VA|
| bob| 71| CA|
|mary| 53| NY|
|dude| 50| CA|
+----+---+-------+
+---+------+
|age| name|
+---+------+
| 22| anne|
| 35|alison|
+---+------+
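Unlike the session-scoped temporary view above, a global temporary view should also be reachable from a new session; a minimal sketch reusing the gperson view registered above (the query text is an assumption):
session1.newSession().sql("select * from global_temp.gperson where age>40").show() // succeeds where the temp-view query failed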
// Full worked example:
case class Person(name: String, age: Int, address: String) // case class defining the schema
def main(args: Array[String]): Unit = {
  val session1 = SparkSession.builder().master("local").appName("df-register-table").getOrCreate()
  val rdd1 = session1.sparkContext.textFile("file:///f:\\dataTest\\users.txt")
  val rdd2 = rdd1.map(x => x.split(" ")).map(x => Person(x(0), x(1).toInt, x(2)))
  val df_rdd = session1.createDataFrame(rdd2) // create a DataFrame
  val _rdd2 = df_rdd.rdd // RDD[Row]: the original Person element type is lost, a limitation of DataFrame
  import session1.implicits._
  val ds_rdd = session1.createDataset(rdd2) // create a Dataset[Person]
  val _rdd22 = ds_rdd.rdd // RDD[Person]: the same type as rdd2 (the Dataset preserves the element type)
}
Relationship among the three:
RDD[Person] --(case-class reflection)--> DataFrame[Row] <--> Dataset[Person]
Each Person element of the RDD becomes an untyped Row("name","age","address") in the DataFrame, while the Dataset keeps the typed Person("name","age","address") for every element.
Conversions among the three:
a. RDD --> DataFrame: SparkSession.createDataFrame(rdd)
b. RDD --> Dataset: SparkSession.createDataset(rdd)
c. DataFrame/Dataset --> RDD: DF.rdd gives RDD[Row]; DS.rdd gives RDD[Person]
d. DataFrame --> Dataset: DF.as[Person] (needs an implicit Encoder, via import session.implicits._)
e. Dataset --> DataFrame: DS.toDF()
Summary: a DataFrame is the special case of Dataset whose elements are Rows, i.e. DataFrame = Dataset[Row].
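A minimal sketch of conversions a–e, assuming the session1, rdd2 and Person definitions from the example above:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import session1.implicits._
val df: DataFrame = session1.createDataFrame(rdd2)     // a. RDD -> DataFrame
val ds: Dataset[Person] = session1.createDataset(rdd2) // b. RDD -> Dataset
val rowRdd: RDD[Row] = df.rdd                          // c. DataFrame -> RDD[Row]
val personRdd: RDD[Person] = ds.rdd                    //    Dataset -> RDD[Person]
val dsFromDf: Dataset[Person] = df.as[Person]          // d. DataFrame -> Dataset (via Encoder)
val dfFromDs: DataFrame = ds.toDF()                    // e. Dataset -> DataFrame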
The Scala interface of Spark SQL supports automatically converting an RDD containing case class instances into a DataFrame. The case class defines the table schema: its parameter names are read via reflection and become the column names. Case classes may be nested or contain complex types such as Seq or Array. Such an RDD can be implicitly converted to a DataFrame and then registered as a table, which can subsequently be queried with SQL statements.
Pipeline: read the txt file -> split each line -> pass the split fields to the Person constructor -> convert to a Dataset/DataFrame, as sketched below.
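A hedged sketch of that implicit conversion (toDF comes from the session's implicits import; names reuse the example above, and the query text is illustrative):
import session1.implicits._
val df = rdd2.toDF() // implicit conversion RDD[Person] -> DataFrame via reflection
df.createOrReplaceTempView("person")
session1.sql("select name, age from person where age < 30").show()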
When the columns are not known in advance, the schema can instead be specified programmatically. The steps are as follows:
1. Create an RDD of Rows from the original RDD:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
val session = SparkSession.builder().master("local[2]").appName("DF").getOrCreate()
val rdd = session.sparkContext.textFile("file:///F:\\测试数据\\users.txt")
val row_rdd = rdd.map(x => x.split(" ")).map(x => Row(x(0), x(1), x(2)))
2. Create a StructType matching the structure of the Rows in the RDD; this StructType is the schema:
val schemaString = "name age address"
val fields = schemaString.split(" ").map(fieldName =>
StructField(fieldName, StringType, nullable = true))
【Or hard-code the fields (rarely used):
val fields = List(
  StructField("name1", StringType, nullable = true),
  StructField("age1", StringType, nullable = true),
  StructField("address1", StringType, nullable = true)
)
】
val schema = StructType(fields)
3. Create the DataFrame via SparkSession.createDataFrame, passing the Row RDD and the schema:
val df = session.createDataFrame(row_rdd,schema)
df.printSchema()
// Output:
root
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- address: string (nullable = true)
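Since every field in the programmatic schema was declared StringType, age comes out as a string column; a small sketch (an addition, not in the original) casting it to an integer:
import session.implicits._
val typed = df.withColumn("age", $"age".cast("int"))
typed.printSchema() // age is now: integer (nullable = true)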