External Data Source API
External data sources
Processing engines: MapReduce, Hive, Spark
Loading data
Formats: json, parquet, text, jdbc, ... (+ compression)
user.json (note that a single file can mix record shapes):
id:1,name:xxx
id:"xx",name:xxx,session_id:xxx
Storage systems: HDFS, HBase, S3, OSS
hdfs://......
s3a://
s3n://
endpoint, access key (AK), secret key (SK)
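For the S3-style stores, endpoint and AK/SK are typically wired through the Hadoop configuration before reading; a sketch with placeholder values:

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.endpoint", "s3.example.com")  // placeholder endpoint
hadoopConf.set("fs.s3a.access.key", "YOUR_AK")
hadoopConf.set("fs.s3a.secret.key", "YOUR_SK")
val s3DF = spark.read.format("parquet").load("s3a://bucket/path/")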
HDFS join MySQL <=== in a single statement
raw data ==> Ext DS API ==> DataFrame/Dataset
Core
SQL
Streaming
MLlib
....
==> RDD ==> DataFrame/Dataset: the tool that ties the whole Spark ecosystem stack together
json, csv, hdfs, hive, jdbc, s3, parquet, es, redis, cassandra, hbase, ...
Two broad categories: built-in and third-party
https://spark-packages.org/
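Third-party sources from spark-packages.org can be pulled in at launch time with --packages; one illustrative coordinate (the Avro connector, before it became built-in):

spark-shell --packages com.databricks:spark-avro_2.11:4.0.0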
Two major benefits:
read
write
spark.read.format("csv/json/com.ruoze.com.XXX").load()  // built-in short name, or the full class name of a custom source
df.write.format("").save()
Parquet ==> Text
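A sketch of that conversion (paths are placeholders; the text source needs a single string column, hence the concat_ws):

import org.apache.spark.sql.functions.{col, concat_ws}

val parquetDF = spark.read.format("parquet").load("/path/to/input")
parquetDF
  .select(concat_ws("\t", parquetDF.columns.map(col): _*))
  .write.format("text").save("/path/to/output")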
val emp = spark.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:mysql://hadoop000:3306/sqoop?user=root&password=root",
    "dbtable" -> "emp",
    "driver" -> "com.mysql.jdbc.Driver"))
  .load()
dept: Hive / Spark SQL
emp: MySQL (registered as a temporary table below; the cross-source join follows the definition)
CREATE TEMPORARY TABLE emp_mysql
USING org.apache.spark.sql.jdbc
OPTIONS (
url "jdbc:mysql://hadoop000:3306/sqoop?user=root&password=root",
dbtable "emp",
driver "com.mysql.jdbc.Driver"
)
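With emp_mysql registered, the cross-source join promised above really is a single statement; this sketch assumes dept already exists as a Hive/Spark SQL table and uses the classic emp/dept sample column names:

spark.sql(
  """SELECT e.empno, e.ename, d.dname
    |FROM emp_mysql e
    |JOIN dept d ON e.deptno = d.deptno""".stripMargin).show()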
empDF.write.mode("overwrite").format("jdbc")
  .option("url", "jdbc:mysql://hadoop000:3306/sqoop?user=root&password=root")
  .option("dbtable", "test_emp")
  .option("user", "root")
  .option("password", "root")
  .option("driver", "com.mysql.jdbc.Driver")
  .save()
abstract class BaseRelation  <== loads the external source's data and defines its schema
trait RelationProvider       <== produces the Relation
trait TableScan              <== how to read the source's data (full scan; the pruned variants below read more efficiently)
def buildScan(): RDD[Row]
trait PrunedScan {
  // column pruning: read only the required columns
  def buildScan(requiredColumns: Array[String]): RDD[Row]
}
trait PrunedFilteredScan {
  // column pruning plus filter pushdown
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
trait InsertableRelation {
  // write support: insert a DataFrame back into the source
  def insert(data: DataFrame, overwrite: Boolean): Unit
}
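Putting those interfaces together, a minimal custom text source could look like the sketch below (the package name, option handling, and single-column schema are all assumptions, not a fixed recipe):

package com.example.ds  // hypothetical package

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// RelationProvider: builds the Relation from the options map
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation = {
    val path = parameters.getOrElse("path", sys.error("'path' is required"))
    new SimpleTextRelation(path)(sqlContext)
  }
}

// BaseRelation supplies the schema; TableScan does a full scan into RDD[Row]
class SimpleTextRelation(path: String)(@transient val sqlContext: SQLContext)
  extends BaseRelation with TableScan {

  override def schema: StructType =
    StructType(StructField("line", StringType, nullable = true) :: Nil)

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map(Row(_))
}

It would then be loaded with spark.read.format("com.example.ds").load("/some/path").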
emp1.txt ==> partitionBy("year")
  xxxxx/year=1987/
    1-1987.txt
    2-1987.txt
  xxxxx/year=1980/
    1-1980.txt
    2-1980.txt
emp2.txt ==> partitionBy("year"), written into the same directory tree
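A sketch of the writes that produce (and extend) that layout; emp1DF/emp2DF and the parquet format are placeholders:

import org.apache.spark.sql.SaveMode

emp1DF.write.partitionBy("year").format("parquet").save("xxxxx/")
// the second dataset lands in the same tree, so the save mode matters:
emp2DF.write.partitionBy("year").mode(SaveMode.Append).format("parquet").save("xxxxx/")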
Idempotence problem: rerunning an append job writes the same records into the existing partitions again, while an overwrite can wipe data the job did not produce; choose the save mode with reruns in mind.