Contents
1. RDD to DataFrame
2. RDD to Dataset
3. DataFrame/Dataset to RDD
4. DataFrame to Dataset
5. Dataset to DataFrame
RDD to DataFrame
1. Build the schema
There are three main steps: wrap each line's fields in a Row, define the schema as a StructType, and call createDataFrame():
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.storage.StorageLevel

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDataFrame").master("local").getOrCreate()
    val rdd = spark.sparkContext.textFile("file:///d:/data/words.txt")
      .persist(StorageLevel.MEMORY_ONLY)
    // 1. Build an RDD[Row]: split each line and put its fields into a Row()
    val rdd2 = rdd.map(_.split(",")).map(t => Row(t(0).toLong, t(1)))
    // 2. Build the schema
    val schema = StructType(
      List(
        StructField("id", LongType, nullable = true),
        StructField("user", StringType, nullable = true)
      ))
    // 3. createDataFrame()
    val df = spark.createDataFrame(rdd2, schema)
    spark.stop()
  }
}
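For reference, assuming words.txt holds comma-separated lines such as 1,alice and 2,bob (hypothetical sample data), printing the result should look roughly like this:
df.printSchema()
// root
//  |-- id: long (nullable = true)
//  |-- user: string (nullable = true)
df.show()
// +---+-----+
// | id| user|
// +---+-----+
// |  1|alice|
// |  2|  bob|
// +---+-----+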
2. Automatic inference
Put each line's fields into a tuple and pass the column names to toDF(); this requires importing the implicit conversions.
import org.apache.spark.sql.SparkSession

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDataFrame").master("local").getOrCreate()
    val rdd = spark.sparkContext.textFile("file:///d:/data/words.txt")
    // Import the implicit conversions that provide toDF()
    import spark.implicits._
    val df = rdd.map { x =>
      val tmp = x.split(",")
      (tmp(0).toInt, tmp(1))
    }.toDF("id", "name")
    spark.stop()
  }
}
3. Get the schema via reflection
Much like automatic inference, except you create a case class and define its fields. Spark maps the case class's fields to the table structure via reflection, so the column names are picked up automatically. The implicit conversions are still required.
import org.apache.spark.sql.SparkSession

// Case class whose fields define the column names and types
case class Words(id: Long, name: String) extends Serializable

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDataFrame").master("local").getOrCreate()
    val rdd = spark.sparkContext.textFile("file:///d:/data/words.txt")
    import spark.implicits._
    val df = rdd.map { x =>
      val tmp = x.split(",")
      Words(tmp(0).toLong, tmp(1))
    }.toDF()
    spark.stop()
  }
}
RDD to Dataset
Much the same as converting to a DataFrame.
Since Spark 2.x, DataFrame in the Scala API is only a type alias for Dataset[Row], so converting to a Dataset does not require specifying the Row type.
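The alias is defined in the org.apache.spark.sql package object in the Spark 2.x source:
type DataFrame = Dataset[Row]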
1. createDataset
import org.apache.spark.sql.SparkSession

object RddToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDataset").master("local").getOrCreate()
    val rdd = spark.sparkContext.textFile("file:///d:/data/words.txt")
    import spark.implicits._
    val rdd2 = rdd.map(_.split(",")).map(x => (x(0), x(1)))
    val ds = spark.createDataset(rdd2)
    spark.stop()
  }
}
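Note that createDataset() takes an implicit Encoder for the element type; import spark.implicits._ supplies encoders for the common Scala types (primitives, tuples, case classes), which is why the import is needed even though toDF()/toDS() are not called here.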
2. 自动推断
import org.apache.spark.sql.SparkSession

object RddToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDataset").master("local").getOrCreate()
    val rdd = spark.sparkContext.textFile("file:///d:/data/words.txt")
    import spark.implicits._
    val ds = rdd.map { x =>
      val tmp = x.split(",")
      (tmp(0).toInt, tmp(1))
    }.toDS()
    spark.stop()
  }
}
3. Get the schema via reflection
The column names are picked up via reflection.
import org.apache.spark.sql.SparkSession

object RddToDataset {
  case class Words(id: Long, name: String) extends Serializable
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToDataset").master("local").getOrCreate()
    val rdd = spark.sparkContext.textFile("file:///d:/data/words.txt")
    import spark.implicits._
    val ds = rdd.map { x =>
      val tmp = x.split(",")
      Words(tmp(0).toLong, tmp(1))
    }.toDS()
    spark.stop()
  }
}
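The payoff of the typed Dataset is compile-checked field access. A minimal sketch, continuing from the example above:
// Fields are checked at compile time; this yields a Dataset[String]
val names = ds.map(_.name)
// The untyped DataFrame equivalent resolves the column only at runtime:
// df.select("name")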
Dataset to RDD
Just call .rdd on the Dataset directly.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DatasetToRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DatasetToRdd")
      .master("local").getOrCreate()
    val rdd = spark.sparkContext.textFile("file:///d:/data/words.txt")
    import spark.implicits._
    val ds = rdd.map { x => val tmp = x.split(","); (tmp(0).toInt, tmp(1)) }.toDS()
    // Note the element type: the RDD keeps the Dataset's type parameter
    val rdd2: RDD[(Int, String)] = ds.rdd
    spark.stop()
  }
}
DataFrame to RDD
Also via .rdd, but a DataFrame yields an RDD[Row]; convert it to an RDD[String] with a map, extracting values by index with apply()/get() or by type with .getAs().
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

object DataFrameToRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameToRdd")
      .master("local").getOrCreate()
    val df = spark.read.text("file:///d:/data/words.txt")
    // The element type is Row
    val rdd: RDD[Row] = df.rdd
    val rdd2: RDD[String] = rdd.map(_(0).toString)      // apply() by index
    val rdd3: RDD[String] = rdd.map(_.get(0).toString)  // get() returns Any
    val rdd4: RDD[String] = rdd.map(_.getAs[String](0)) // getAs[T]() is typed
    spark.stop()
  }
}
DataFrame to Dataset
Call .as[T] with a case class; the DataFrame's column names and types must match the case class's fields.
import org.apache.spark.sql.SparkSession

object DataFrameToDataset {
  case class Words(id: Long, name: String) extends Serializable
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameToDataset")
      .master("local").getOrCreate()
    import spark.implicits._
    // Build a DataFrame whose columns line up with the Words fields
    // (spark.read.text would give a single "value" column, which .as[Words] cannot map)
    val df = spark.sparkContext.textFile("file:///d:/data/words.txt")
      .map { x => val tmp = x.split(","); (tmp(0).toLong, tmp(1)) }
      .toDF("id", "name")
    val ds = df.as[Words]
    spark.stop()
  }
}
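If the column types don't line up with the case class (say, id was read as a string), cast the columns before converting. A hedged sketch, assuming a DataFrame df with string columns id and name:
// Hypothetical df with id and name both as strings
val ds2 = df.select($"id".cast("long"), $"name").as[Words]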
Dataset to DataFrame
Just convert directly with .toDF().
import org.apache.spark.sql.SparkSession

object DatasetToDataFrame {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DatasetToDataFrame")
      .master("local").getOrCreate()
    val rdd = spark.sparkContext.textFile("file:///d:/data/words.txt")
    import spark.implicits._
    val rdd2 = rdd.map(_.split(",")).map(x => (x(0), x(1)))
    val ds = spark.createDataset(rdd2)
    val df = ds.toDF()
    spark.stop()
  }
}
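As a usage note, toDF() also accepts column names, which is handy when the Dataset's columns otherwise carry tuple names like _1 and _2:
val dfNamed = ds.toDF("id", "name")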