SparkSession was designed to unify SparkContext and SQLContext. My advice: use SparkSession whenever you can. If an API you need is not exposed on SparkSession, you can still reach the SparkContext and SQLContext through it:
val context: SparkContext = sparkSession.sparkContext
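The legacy SQLContext is reachable the same way; SparkSession keeps it around for backward compatibility:
val sqlContext: SQLContext = sparkSession.sqlContext // legacy entry point, still exposed by SparkSession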
First, the Spark SQL example from the official docs:
import org.apache.spark.sql.SparkSession
val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()
// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
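As the comment says, these implicits are what enable conversions such as calling toDF on a local collection or an RDD. A tiny sketch (the column names and sample rows are made up for illustration):

// Assumes the `spark` session and `import spark.implicits._` from above.
val people = Seq(("alice", 30), ("bob", 25)).toDF("name", "age") // implicit Seq -> DataFrame conversion
people.show()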
Now let's write a minimal SparkSession starter program:
def main(args: Array[String]): Unit = {
  val spark = SparkSession
    .builder()
    .master("local")            // run locally, single thread
    .getOrCreate()

  val df = spark.read.text("file:///home/hadoop/IdeaProjects/sparksql-train/data/in.txt")
  df.show()

  spark.stop() // release resources when done
}
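One thing worth knowing: spark.read.text always yields a DataFrame with a single string column named value, one row per line of the input file. Adding this inside main shows it:

df.printSchema()
// root
//  |-- value: string (nullable = true)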
SparkSession became the unified entry point in Spark 2.x.
In Spark 1.x, the entry points were SQLContext and HiveContext.
Code:
def main(args: Array[String]): Unit = {
  val sparkConf: SparkConf = new SparkConf().setMaster("local").setAppName("SqlContextApp")
  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc) // 1.x entry point, superseded by SparkSession
  val frame = sqlContext.read.text("file:///home/hadoop/IdeaProjects/sparksql-train/data/in.txt")
  frame.show()
  sc.stop()
}
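The other 1.x entry point, HiveContext, extended SQLContext with Hive support (HiveQL, Hive UDFs, the Hive metastore). A sketch of how it was created, assuming the spark-hive dependency is on the classpath; both classes are deprecated since 2.0:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
hiveContext.sql("show tables").show() // HiveQL runs against the Hive metastore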
How are Dataset (DS) and DataFrame (DF) related? In Spark's own source, DataFrame is simply a type alias:
type DataFrame = Dataset[Row]
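So the two are literally the same type: every DataFrame is a Dataset whose element type is the generic, untyped Row. A quick way to convince yourself in the shell (using range just to get a small Dataset):

import org.apache.spark.sql.{DataFrame, Dataset, Row}

val df: DataFrame = sparkSession.range(3).toDF() // Dataset[java.lang.Long] -> DataFrame
val rows: Dataset[Row] = df                      // compiles: a DataFrame IS a Dataset[Row]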
Reading a file:
val dataFrame = sparkSession.read.json("file")
A DataFrame can be thought of as a table.
Inspect the DataFrame's schema:
dataFrame.printSchema()
Select a single column:
dataFrame.select($"name").show()
WHERE clause (filtering):
dataFrame.filter($"age" > 21).show()
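filter and where are aliases; both also accept a plain SQL expression string, which can be handy:

dataFrame.where("age > 21").show() // identical result to the filter call above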
groupBy (grouping):
dataFrame.groupBy($"age").count().show()
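groupBy is not limited to count: agg accepts arbitrary aggregate functions from org.apache.spark.sql.functions, and you can also aggregate the whole table without grouping at all:

import org.apache.spark.sql.functions.{avg, max, min}

dataFrame.agg(min($"age"), avg($"age"), max($"age")).show() // one summary row for the whole table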
Select the name and age columns, with age increased by 10:
dataFrame.select($"name",($"age"+10).as("new_age")).show()
Create a temporary view:
dataFrame.createOrReplaceTempView("people")
Run SQL directly against it:
sparkSession.sql("select * from people").show()
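Anything you can say in SQL works against the view, so the earlier filter and groupBy can also be written as:

sparkSession.sql("select age, count(*) as cnt from people where age > 21 group by age").show()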
Ways to take the first N rows:
dataFrame.head(3)
dataFrame.take(3)
dataFrame.first
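head(n) and take(n) are equivalent, and both pull the rows back to the driver as an Array[Row]; first is shorthand for head(1). To stay inside the DataFrame API instead, use limit:

dataFrame.limit(3).show() // limit returns a DataFrame, not an Array[Row]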
Rename a column:
val renamed = dataFrame.withColumnRenamed("name", "new_name") // returns a new DataFrame; the original is untouched
Sort by age in descending order:
dataFrame.orderBy(desc("age")).show()
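desc comes from org.apache.spark.sql.functions (so it needs an import), and you can combine several sort keys with mixed directions:

import org.apache.spark.sql.functions.{asc, desc}

dataFrame.orderBy(desc("age"), asc("name")).show() // age descending, ties broken by name ascending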
Use a case class to convert a DataFrame into a Dataset:
case class Person(name: String, age: Long) // Scala convention: capitalized case class name

val df = sparkSession.read.json("file:///home/hadoop/IdeaProjects/sparksql-train/data/people.json")
import sparkSession.implicits._ // provides the Encoder needed by .as[Person]
val ds: Dataset[Person] = df.as[Person]
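Once the Dataset is typed, field access is checked at compile time, and toDF() goes back the other way whenever you need the untyped API. A short sketch, assuming the Person case class above:

val adults: Dataset[Person] = ds.filter(_.age > 21) // _.age would not compile if Person lacked the field
val backToDf: DataFrame = adults.toDF()             // back to the untyped Row-based API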