22-SparkSQL03

External Data Source API

External data sources

MapReduce  Hive  Spark

Loading data

Formats: json, parquet, text, jdbc, ...  + compression

user.json

id:1,name:xxx

id:"xx",name:xxx,session_id:xxx
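A minimal sketch of loading such a file in spark-shell, assuming each record is a proper one-line JSON object (e.g. {"id":1,"name":"xxx"}) and the path is a placeholder:

// Inferred schema is the union of all fields seen (id, name, session_id);
// a record missing session_id just gets null there, and a field whose type
// conflicts across records (id as 1 vs "xx") falls back to string.
val userDF = spark.read.format("json").load("file:///home/hadoop/data/user.json")
userDF.printSchema()
userDF.show()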

FileSystem: HDFS, HBase, S3, OSS

hdfs://......

s3a://

s3n://

endpoint / ak (access key) / sk (secret key)
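A hedged sketch of wiring the endpoint / ak / sk into an s3a:// read (needs the hadoop-aws jar on the classpath; endpoint, keys and bucket below are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("S3ReadApp")
  .config("spark.hadoop.fs.s3a.endpoint", "s3.example-endpoint.com")  // endpoint
  .config("spark.hadoop.fs.s3a.access.key", "YOUR_AK")                // ak
  .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SK")                // sk
  .getOrCreate()

// Once the credentials are set, an s3a:// path reads like any other file system.
val df = spark.read.format("json").load("s3a://your-bucket/data/user.json")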

HDFS join MySQL  <===  one statement
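A sketch of that cross-source join (file name dept.json and column deptno are assumptions): both sources become DataFrames, and the join itself is a single statement.

val deptDF = spark.read.format("json").load("hdfs://hadoop000:8020/data/dept.json")

val empDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://hadoop000:3306/sqoop?user=root&password=root")
  .option("dbtable", "emp")
  .option("driver", "com.mysql.jdbc.Driver")
  .load()

// HDFS data joined with MySQL data in one statement
empDF.join(deptDF, empDF("deptno") === deptDF("deptno")).show()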

raw data ==> Ext DS API ==> DataFrame/Dataset

Core

SQL

Streaming

MLlib

....

==> RDD ==> DataFrame/Dataset: the tool that ties the whole Spark ecosystem stack together
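A small sketch of that hand-off (path is a placeholder): the DataFrame returned by the data source API can drop down to the Core layer as an RDD or be queried through SQL.

val df = spark.read.format("json").load("file:///home/hadoop/data/user.json")

// Core: the same data as RDD[Row]
val rowRDD = df.rdd
println(rowRDD.count())

// SQL: register and query
df.createOrReplaceTempView("user")
spark.sql("select id, name from user").show()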

json, csv, hdfs, hive, jdbc, s3, parquet, es, redis, cassandra, hbase, ...

Two big categories: built-in and third-party

https://spark-packages.org/
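Built-in sources are addressed by their short name; a third-party source is pulled in first (e.g. spark-shell --packages <groupId>:<artifactId>:<version>, coordinates depend on the package) and then addressed by its short name or full class/package name, like the custom source sketched later in this note.

// Built-in: short name
val jsonDF = spark.read.format("json").load("file:///home/hadoop/data/user.json")

// Third-party / custom: full package or class name
val customDF = spark.read.format("com.ruoze.com").option("path", "file:///home/hadoop/data/input").load()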

Two big benefits:

read

write

spark.read.format("csv/json/com.ruoze.com.XXX").load()

df.write.format("").save()

Parquet ==> Text
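A sketch of that format conversion through the unified read/write API (the users.parquet file ships with the Spark examples; the exact paths here are assumptions):

val parquetDF = spark.read.format("parquet")
  .load("file:///home/hadoop/app/spark/examples/src/main/resources/users.parquet")

// The text sink only accepts a single string column, so project one column first
parquetDF.select("name").write.format("text").save("file:///home/hadoop/data/out_text")

// Keep all columns by writing JSON instead; compression is just another option
parquetDF.write.format("json").option("compression", "gzip").save("file:///home/hadoop/data/out_json")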

val emp = spark.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:mysql://hadoop000:3306/sqoop?user=root&password=root",
    "dbtable" -> "emp",
    "driver" -> "com.mysql.jdbc.Driver"))
  .load()

dept : hive/sparksql

emp  : mysql

CREATE TEMPORARY TABLE emp_mysql
USING org.apache.spark.sql.jdbc
OPTIONS (
  url "jdbc:mysql://hadoop000:3306/sqoop?user=root&password=root",
  dbtable "emp",
  driver "com.mysql.jdbc.Driver"
)
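With emp_mysql registered, one SQL statement joins the Hive/SparkSQL table with the MySQL table (column names empno/ename/deptno/dname are assumed from the classic emp/dept schema):

spark.sql(
  """
    |SELECT e.empno, e.ename, d.dname
    |FROM emp_mysql e JOIN dept d ON e.deptno = d.deptno
  """.stripMargin).show()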

empDF.write.mode("overwrite").format("jdbc")
  .option("url", "jdbc:mysql://hadoop000:3306/sqoop?user=root&password=root")
  .option("dbtable", "test_emp")
  .option("user", "root")
  .option("password", "root")
  .option("driver", "com.mysql.jdbc.Driver")
  .save()

abstract class BaseRelation <== loads the data of the external data source and defines its schema

trait RelationProvider <== produces the Relation

trait TableScan <== how the data of the external data source is read (the pruned variants below read it more efficiently)
  def buildScan(): RDD[Row]

trait PrunedScan {
  def buildScan(requiredColumns: Array[String]): RDD[Row]   // column pruning
}

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]   // column pruning + filter pushdown
}

trait InsertableRelation {
  def insert(data: DataFrame, overwrite: Boolean): Unit
}
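A minimal custom-source sketch wiring these pieces together (package name follows the com.ruoze.com.XXX placeholder above; the "id,name" line format is an assumption):

package com.ruoze.com

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// RelationProvider: produces the Relation from the passed options
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("path is required"))
    new TextDatasourceRelation(sqlContext, path)
  }
}

// BaseRelation: declares the schema; TableScan: defines how the data is read
class TextDatasourceRelation(override val sqlContext: SQLContext, path: String)
  extends BaseRelation with TableScan {

  override def schema: StructType = StructType(
    StructField("id", LongType) :: StructField("name", StringType) :: Nil
  )

  // Full scan: parse each "id,name" line into a Row
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path)
      .map(_.split(","))
      .map(fields => Row(fields(0).trim.toLong, fields(1).trim))
}

// Usage: spark.read.format("com.ruoze.com").option("path", "file:///home/hadoop/data/input").load()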

emp1.txt ==> partitionBy("year")

xxxxx/year=1987/
    1-1987.txt
    2-1987.txt
xxxxx/year=1980/
    1-1980.txt
    2-1980.txt
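A sketch of the write that produces that layout (treating emp1.txt as headered CSV and assuming a year column; format and paths are placeholders):

val emp1DF = spark.read.format("csv")
  .option("header", "true")
  .load("file:///home/hadoop/data/emp1.txt")

emp1DF.write
  .partitionBy("year")
  .format("parquet")
  .save("file:///home/hadoop/data/emp_partitioned")
// => .../emp_partitioned/year=1987/..., .../emp_partitioned/year=1980/...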

emp2.txt ==> partitionBy("year")

The idempotence problem: rerunning the same partitioned write (e.g. loading emp2.txt into the same target directory) must leave the target in the same state, not append duplicate data.
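A sketch of keeping that write idempotent (paths and column name as in the previous sketch):

import org.apache.spark.sql.SaveMode

val emp2DF = spark.read.format("csv")
  .option("header", "true")
  .load("file:///home/hadoop/data/emp2.txt")

// Overwrite instead of Append: rerunning the job yields the same result rather than duplicated rows.
// By default Overwrite replaces everything under the target path; since Spark 2.3,
// spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
// restricts the overwrite to the partitions actually present in emp2DF.
emp2DF.write
  .mode(SaveMode.Overwrite)
  .partitionBy("year")
  .format("parquet")
  .save("file:///home/hadoop/data/emp_partitioned")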
