Spark SQL API: Using Spark Dataset and DataFrame

1. SparkSession

SparkSession was designed to unify SparkContext and SQLContext. My advice is to use SparkSession whenever you can. If you find an API that is not exposed on SparkSession, you can still obtain the SparkContext and SQLContext from it:

val context: SparkContext = sparkSession.sparkContext
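
Similarly, the legacy SQLContext is reachable from the session (a minimal sketch; sparkSession is assumed to be an existing SparkSession):

import org.apache.spark.sql.SQLContext

// Exposed mainly for backward compatibility with 1.x code
val sqlContext: SQLContext = sparkSession.sqlContext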

Let's start with the example from the official documentation:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
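
With those implicits in scope, a local collection can be turned into a DataFrame directly (a small sketch; the column name "number" is just for illustration):

val numbers = Seq(1, 2, 3).toDF("number")
numbers.show()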

Here is a minimal SparkSession starter program:

import org.apache.spark.sql.SparkSession

object SparkSessionApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .master("local")
      .appName("SparkSessionApp")
      .getOrCreate()

    // read.text loads each line of the file as a Row with a single "value" column
    val df = spark.read.text("file:///home/hadoop/IdeaProjects/sparksql-train/data/in.txt")
    df.show()

    spark.stop()
  }
}

2. Understanding SparkSession

SparkSession became the unified entry point in Spark 2.x.
In 1.x the entry points were SQLContext and HiveContext.
Code:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object SQLContextApp {

  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf().setMaster("local").setAppName("SQLContextApp")
    val sc = new SparkContext(sparkConf)
    // SQLContext is deprecated since Spark 2.0; shown here for the 1.x style
    val sqlContext = new SQLContext(sc)

    val frame = sqlContext.read.text("file:///home/hadoop/IdeaProjects/sparksql-train/data/in.txt")
    frame.show()

    sc.stop()
  }
}
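
HiveContext worked the same way but added Hive support such as HiveQL and access to the Hive metastore (a sketch only; it needs the spark-hive dependency on the classpath and is likewise deprecated in 2.x):

import org.apache.spark.sql.hive.HiveContext

// sc is the SparkContext created above
val hiveContext = new HiveContext(sc)
hiveContext.sql("show tables").show()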

3. Spark Dataset and DataFrame

1: What is the relationship between a Dataset (DS) and a DataFrame (DF)? In Spark's source code, DataFrame is simply a type alias:

type DataFrame = Dataset[Row]
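
Because Row is untyped, DataFrame columns are accessed by name or position at runtime (a minimal sketch; it assumes a dataFrame with a "name" column, as in the examples below):

import org.apache.spark.sql.Row

val firstRow: Row = dataFrame.first()
val name: String = firstRow.getAs[String]("name")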

4. Basic DataFrame API usage

Read a JSON file (the schema is inferred from the JSON keys):

val dataFrame = sparkSession.read.json("file")

You can think of a DataFrame as a table.

Print the DataFrame's schema:

dataFrame.printSchema()

Select a single column (the $"..." syntax comes from spark.implicits._):

dataFrame.select($"name").show()

WHERE clause (filtering):

dataFrame.filter($"age">21).show()
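
filter also accepts a SQL expression string, which is equivalent to the column-based form above:

dataFrame.filter("age > 21").show()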

groupBy grouping:

dataFrame.groupBy($"age").count().show()
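
Besides count, groupBy can apply aggregate functions through agg (a sketch; avg comes from org.apache.spark.sql.functions, and the name/age columns match the people data used here):

import org.apache.spark.sql.functions.avg

dataFrame.groupBy($"name").agg(avg($"age").as("avg_age")).show()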

Select the name and age columns, with 10 added to age:

dataFrame.select($"name",($"age"+10).as("new_age")).show()

Create a temporary view:

dataFrame.createOrReplaceTempView("people")

Run SQL against it directly:

sparkSession.sql("select * from people").show()

Ways to take the first N rows:

dataFrame.head(3)  // first 3 rows as Array[Row]
dataFrame.take(3)  // same as head(3)
dataFrame.first    // the first row, same as head()

Rename a column:

// Returns a new DataFrame; the original is left unchanged
val renamed = dataFrame.withColumnRenamed("name", "new_name")

Sort by age in descending order:

dataFrame.orderBy(desc("age")).show()  // desc comes from org.apache.spark.sql.functions

5. Basic Dataset API usage

Use a case class to convert a DataFrame into a typed Dataset:

case class Person(name: String, age: Long) // define at top level so Spark can derive an Encoder

import sparkSession.implicits._ // provides the Encoder used by .as[Person]
val df = sparkSession.read.json("file:///home/hadoop/IdeaProjects/sparksql-train/data/people.json")
val ds: Dataset[Person] = df.as[Person]
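
Once converted, operations are checked against the case class at compile time (a short sketch; it assumes the Person case class and ds defined above):

// Typed filter: p is a Person, so p.age is checked by the compiler
ds.filter(p => p.age > 21).show()

// map produces a Dataset[String] of names (uses the implicits already imported)
ds.map(_.name).show()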
