In Spark, a DataFrame is a distributed dataset built on top of RDDs, similar to a two-dimensional table in a traditional database and very much like a Pandas DataFrame in Python. The main difference between a DataFrame and an RDD is that a DataFrame carries schema metadata: every column of the two-dimensional table a DataFrame represents has a name and a type.
A Dataset is a distributed collection of data, added in Spark 1.6 as a new abstraction; it is an extension of the DataFrame API.
A DataFrame is essentially also a Dataset; it is a special case of Dataset, i.e. DataFrame = Dataset[Row], and the two can be converted back and forth.
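A minimal sketch of that relationship (the names spark, sensorDs, and sensorDf below are made up for this illustration): a Dataset carries a concrete element type, a DataFrame is simply Dataset[Row], and toDF / as[...] switch between the two views.
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().master("local").appName("DfVsDsSketch").getOrCreate()
import spark.implicits._

// A strongly typed Dataset built from local data
val sensorDs: Dataset[(String, Double)] = Seq(("sensor1", 12.1), ("sensor2", 13.1)).toDS()
// toDF drops the static element type: a DataFrame is just Dataset[Row]
val sensorDf: DataFrame = sensorDs.toDF("id", "temperature")
// as[...] recovers a typed view from the DataFrame
val typedAgain: Dataset[(String, Double)] = sensorDf.as[(String, Double)]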
The Maven dependency for Spark SQL:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>3.3.0</version>
</dependency>
SparkSession is the entry point for working with Spark SQL.
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
// Create the SparkConf
val conf: SparkConf = new SparkConf()
conf.setMaster("local").setAppName("SparkSQLTest")
// Create the SparkSession
val sparkSession: SparkSession = SparkSession.builder().config(conf).getOrCreate()
// Read a JSON file
val df = sparkSession.read.json("data/sensor.json")
df.show()
// Output:
+-------+-----------+----------+
| id|temperature| timestamp|
+-------+-----------+----------+
|sensor1| 12.1|1639968630|
|sensor2| 13.1|1639968620|
+-------+-----------+----------+
// Read a CSV file
val df = sparkSession.read.csv("data/sensor.csv")
df.show()
// Output:
+--------+----------+----+
| _c0| _c1| _c2|
+--------+----------+----+
|sensor_1|1639968630|13.4|
+--------+----------+----+
// Read a plain text file
val input = "data/sensor.txt"
val df2 = sparkSession.read.text(input)
df2.show()
// Output: each line of the file becomes one row of the DataFrame, stored in a single column named value
+--------------------+
| value|
+--------------------+
|sensor_1,16399686...|
|sensor_1,16399686...|
+--------------------+
// Read from a JDBC source (the MySQL host below is a placeholder)
import java.util.Properties
val mysql_jdbc_url = "jdbc:mysql://localhost:3307/news"
val properties = new Properties()
properties.put("user", "root")
properties.put("password", "root")
val df4 = sparkSession.read.jdbc(mysql_jdbc_url, "sensor", properties)
df4.show()
// Output:
+-------+------------+-----------+
| id| timestamp|temperature|
+-------+------------+-----------+
|sensor1|159262321421| 32.00|
+-------+------------+-----------+
package com.hjt.yxh.hw.sparksql

import com.hjt.yxh.hw.rdd.SensorReading
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

object SparkDataFrameApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    conf.setMaster("local")
    conf.setAppName("Spark SQL basic example")
    // 1. Create the SparkSession
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()
    import spark.implicits._
    val inpath = "D:\\java_workspace\\BigData\\Spark\\SparkApp\\SparkLearn\\src\\main\\resources\\sensor.txt"
    // 2. Read the text file as an RDD, parse each line, and convert to a DataFrame
    val sensorDf: DataFrame = spark.sparkContext.textFile(inpath)
      .filter(_.nonEmpty)
      .map(data => {
        val arrs = data.split(",")
        SensorReading(arrs(0), arrs(1).toLong, arrs(2).toDouble)
      }).toDF()
    // Trigger execution and display the result
    sensorDf.show()
    spark.stop()
  }
}
Tips: note that spark.implicits._ must be imported for the implicit conversions (such as toDF) to be available.
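The SensorReading case class used above is imported from another package and its source is not shown in this article; judging from how the split fields are used (String, Long, Double), it is presumably defined roughly as follows (an assumption, not the original source):
// Presumed shape of the imported case class (assumption)
case class SensorReading(id: String, timestamp: Long, temperature: Double)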
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val df1: DataFrame = sparkSession.read
  .schema(
    StructType(Array(
      StructField("id", DataTypes.StringType, false),
      StructField("temperature", DataTypes.DoubleType, false),
      StructField("timestamp", DataTypes.LongType, false)
    ))
  )
  .json("data/sensor.json")
The benefit of specifying a schema is that it lets you constrain the data type of each DataFrame column, which can also improve performance. For example, when reading the number 10 from a file without a schema, the inferred column type is bigint, even though an int may be all we actually need.
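As a quick sketch of that point (assuming the same data/sensor.json file as above): schema inference maps the whole-number timestamp column to bigint (LongType), while an explicit schema can pin it to int (IntegerType).
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

// Inferred schema: whole numbers come back as bigint (LongType)
sparkSession.read.json("data/sensor.json").printSchema()

// Explicit schema: the same column constrained to int (IntegerType)
val intSchema = StructType(Array(
  StructField("id", DataTypes.StringType, false),
  StructField("temperature", DataTypes.DoubleType, false),
  StructField("timestamp", DataTypes.IntegerType, false)
))
sparkSession.read.schema(intSchema).json("data/sensor.json").printSchema()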
val sparkSession:SparkSession = SparkSession.builder().config(conf).getOrCreate()
val df1:DataFrame= sparkSession.read.json("data/sensor.json")
// Display the first 5 rows
df1.show(5)
// Return the first 5 rows as an Array[Row]
df1.head(5)
// Return the last 5 rows as an Array[Row] (Spark 3.0+)
df1.tail(5)
var df2 = df1.select("id","timestamp")
df2.show()
import sparkSession.implicits._
df2 = df1.select($"id",$"timestamp")
df2.show()
df2 = df1.select(df1.col("id"),df1.col("timestamp"))
df2.show()
// Column expression with an alias
df2 = df1.select(df1.col("id"),df1.col("temperature")*10 as("temp"))
df2.show()
// Filter by condition
var resultdf1 = df1.filter(df1.col("temperature")>20)
resultdf1 = df1.filter("temperature>30 and id='sensor1'")
// Equivalent: resultdf1 = df1.where("temperature > 30")
// Filter with a lambda (each element is a Row)
resultdf1 = df1.filter(data=>{
data.getString(0).equals("sensor1")&& data.getDouble(1)>30.0
})
import org.apache.spark.api.java.function.FilterFunction
import org.apache.spark.sql.Row

class MyFilterFunction extends FilterFunction[Row] {
  override def call(value: Row): Boolean = {
    value.getString(0).equals("sensor1") && value.getDouble(1) > 30.0
  }
}
// Filter with a custom FilterFunction
resultdf1 = df1.filter(new MyFilterFunction)
resultdf1.show()
import sparkSession.implicits._
// Group-by aggregation
var df2 = df1.groupBy(df1.col("id")).count().select($"id",$"count" as "ttl")
df2.show()
df2 = df1.groupBy($"id").max("temperature").select($"id",$"max(temperature)" as "max_temp")
df2.show()
df2 = df1.groupBy($"id").avg("temperature").select($"id",$"avg(temperature)" as "avg_temp")
df2.show()
// Aggregate functions (from org.apache.spark.sql.functions)
import org.apache.spark.sql.functions._
df2 = df1.groupBy($"id").agg($"id",max("temperature"),min("temperature"))
df2.show()
df2 = df1.groupBy($"id").agg(Map(
"temperature"->"max",
"timestamp"->"min",
))
df2.show()
// Register as a temporary view, visible only in the current session
df1.createOrReplaceTempView("sensor")
var df2 = sparkSession.sql("SELECT * FROM SENSOR WHERE ID = 'sensor1'")
df2.show()
Tips: a view created with df.createOrReplaceTempView has the following characteristic: it is session-scoped, i.e. it is only visible inside the SparkSession that created it and is dropped when that session ends.
If the view needs to be shared across sessions, create a global temporary view instead:
df1.createOrReplaceGlobalTempView("sensor")
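Global temporary views are registered under Spark's reserved global_temp database, so queries must use that qualifier; a short sketch, continuing with the sparkSession and view created above:
// Query the global temp view through the global_temp database
sparkSession.sql("SELECT * FROM global_temp.sensor WHERE id = 'sensor1'").show()
// The view is also visible from a new session of the same application
sparkSession.newSession().sql("SELECT * FROM global_temp.sensor").show()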
As mentioned earlier, a DataFrame is just a special case of Dataset. A Dataset is a strongly typed collection of data, so an element type must be provided; other than that, creating a Dataset is very similar to creating a DataFrame.
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import sparkSession.implicits._

val df1: Dataset[SensorReading] = sparkSession.read
  .schema(
    StructType(Array(
      StructField("id", DataTypes.StringType, false),
      StructField("temperature", DataTypes.DoubleType, false),
      StructField("timestamp", DataTypes.LongType, false)
    ))
  )
  .json("data/sensor.json").as[SensorReading]
df1.foreach(data => {
  println(s"id\t${data.id}\ttemperature\t${data.temperature}\ttimestamp\t${data.timestamp}")
})
In Spark SQL, Spark provides us with two new abstractions: DataFrame and DataSet. How do they differ from RDD? First, looking at when each was introduced: RDD came first (Spark 1.0), DataFrame was added in Spark 1.3, and Dataset in Spark 1.6.
What the three have in common:
- All three are distributed, resilient datasets provided by Spark for processing large volumes of data.
- All three are lazily evaluated: transformations only build a plan, and nothing is computed until an action (count, show, collect, ...) is triggered.
- All three support caching/persistence and have the concept of partitions.
- Many operations on them, including the conversions below, require importing sparkSession.implicits._.
(A short sketch of this shared behavior follows the conversion code below.)
Converting among the three:
// Create an RDD of SensorReading from the text file
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset}
val input: String = "data/sensor.txt"
val rdd: RDD[SensorReading] = sparkSession.sparkContext
  .textFile(input)
  .filter(_.nonEmpty)
  .map(data => {
    val arr = data.split(",")
    SensorReading(arr(0), arr(1).toLong, arr(2).toDouble)
  })
// RDD to DataFrame
import sparkSession.implicits._
val df1: DataFrame = rdd.toDF()
df1.show()
import sparkSession.implicits._
// RDD to Dataset
val ds: Dataset[SensorReading] = rdd.toDS()
// Dataset to DataFrame / RDD / JavaRDD
ds.toDF()
ds.rdd
ds.toJavaRDD
// DataFrame to Dataset (needs an explicit element type) / RDD / JavaRDD
df1.as[SensorReading]
df1.rdd
df1.toJavaRDD
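As a small illustration of what the three share, the same filter can be expressed on the rdd, df1, and ds values just created; all of these transformations are lazy, and only an action actually runs the job. A sketch:
// Lazy transformations on all three abstractions
val hotRdd = rdd.filter(_.temperature > 30.0)
val hotDs = ds.filter(_.temperature > 30.0)
val hotDf = df1.filter($"temperature" > 30.0)
// Actions trigger the actual computation
println(s"rdd=${hotRdd.count()}, ds=${hotDs.count()}, df=${hotDf.count()}")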