1. The process of reading a database with sparkSession.read().format("jdbc").load()
sparkSession.read().format("jdbc")
    .option("driver", "xxx")
    .option("url", "xxx")
    .option("user", "xxx")
    .option("password", "xxx")
    .option("dbtable", "dbtable")
    .option("fetchsize", 100)
    .option("numPartitions", 1)
    .option("customSchema", "ID DECIMAL(38,0),NAME STRING")
    .load().show();
Option reference:
dbtable: either a table name or a query. A query must be wrapped in parentheses and given a table alias, e.g.:
(select x1, x2, ... from xxx) a
customSchema:
The custom schema to use when reading data from the JDBC connector, e.g. "id DECIMAL(38, 0), name STRING". You can also specify only some of the fields and let the others fall back to the default type mapping, e.g. "id DECIMAL(38, 0)". Column names must match the corresponding column names of the JDBC table; for each one you can specify the Spark SQL data type to use instead of the default. This option applies to reads only.
numPartitions:
The maximum number of partitions that can be used for parallelism in table reading and writing; it also determines the maximum number of concurrent JDBC connections (see the sketch below).
fetchsize:
The number of rows to fetch per round trip (JDBC driver defaults are small, e.g. 10 rows for Oracle); raising it can improve the performance of the JDBC driver. This option applies to reads only.
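Since numPartitions only parallelizes a read when a partition column and its value range are also supplied, here is a minimal Scala sketch of a partitioned read. The driver class, URL, credentials, table name, column name, and bounds are all placeholder assumptions:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-read").getOrCreate()

// Hypothetical table "orders" with a numeric primary key "id" in [0, 1000000).
// Spark splits the range into 4 strides and opens at most 4 concurrent JDBC connections.
val df = spark.read.format("jdbc")
  .option("driver", "com.mysql.jdbc.Driver")   // placeholder driver
  .option("url", "jdbc:mysql://host:3306/db")  // placeholder URL
  .option("user", "user")
  .option("password", "password")
  .option("dbtable", "orders")
  .option("partitionColumn", "id")
  .option("lowerBound", "0")
  .option("upperBound", "1000000")
  .option("numPartitions", "4")
  .option("fetchsize", "1000")
  .load()

println(df.rdd.getNumPartitions) // 4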
2. Source code walkthrough
1) DataFrameReader.format
/**
 * Specifies the input data source format.
 *
 * @since 1.4.0
 */
def format(source: String): DataFrameReader = {
  this.source = source
  this
}
2) DataFrameReader.option
/**
 * Adds an input option for the underlying data source.
 *
 * You can set the following option(s):
 *
 * - `timeZone` (default session local timezone): sets the string that indicates a timezone
 *   to be used to parse timestamps in the JSON/CSV datasources or partition values.
 *
 * @since 1.4.0
 */
def option(key: String, value: String): DataFrameReader = {
  this.extraOptions += (key -> value)
  this
}
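Both methods are plain builder mutations: format records the source name, and each option call appends a key-value pair to extraOptions, which is later wrapped in a case-insensitive map. The same options can also be supplied in a single call via DataFrameReader.options; a small sketch with placeholder connection values:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("options-demo").getOrCreate()

// Equivalent to chaining individual .option(...) calls; keys are matched case-insensitively.
val df = spark.read.format("jdbc")
  .options(Map(
    "url" -> "jdbc:mysql://host:3306/db",  // placeholder URL
    "dbtable" -> "orders",                  // placeholder table
    "user" -> "user",
    "password" -> "password"))
  .load()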
3) DataFrameReader.load
/**
 * Loads input in as a `DataFrame`, for data sources that don't require a path (e.g. external
 * key-value stores).
 *
 * @since 1.4.0
 */
def load(): DataFrame = {
  load(Seq.empty: _*) // force invocation of `load(...varargs...)`
}
/**
 * Loads input in as a `DataFrame`, for data sources that support multiple paths.
 * Only works if the source is a HadoopFsRelationProvider.
 *
 * @since 1.6.0
 */
@scala.annotation.varargs
def load(paths: String*): DataFrame = {
  if (source.toLowerCase(Locale.ROOT) == DDLUtils.HIVE_PROVIDER) {
    throw new AnalysisException("Hive data source can only be used with tables, you can not " +
      "read files of Hive data source directly.")
  }

  val cls = DataSource.lookupDataSource(source, sparkSession.sessionState.conf)
  if (classOf[DataSourceV2].isAssignableFrom(cls)) {
    val ds = cls.newInstance()
    val sessionOptions = DataSourceV2Utils.extractSessionConfigs(
      ds = ds.asInstanceOf[DataSourceV2],
      conf = sparkSession.sessionState.conf)
    val options = new DataSourceOptions((sessionOptions ++ extraOptions).asJava)

    // Streaming also uses the data source V2 API. So it may be that the data source implements
    // v2, but has no v2 implementation for batch reads. In that case, we fall back to loading
    // the dataframe as a v1 source.
    val reader = (ds, userSpecifiedSchema) match {
      case (ds: ReadSupportWithSchema, Some(schema)) =>
        ds.createReader(schema, options)

      case (ds: ReadSupport, None) =>
        ds.createReader(options)

      case (ds: ReadSupportWithSchema, None) =>
        throw new AnalysisException(s"A schema needs to be specified when using $ds.")

      case (ds: ReadSupport, Some(schema)) =>
        val reader = ds.createReader(options)
        if (reader.readSchema() != schema) {
          throw new AnalysisException(s"$ds does not allow user-specified schemas.")
        }
        reader

      case _ => null // fall back to v1
    }

    if (reader == null) {
      loadV1Source(paths: _*)
    } else {
      Dataset.ofRows(sparkSession, DataSourceV2Relation(reader))
    }
  } else {
    loadV1Source(paths: _*)
  }
}
Look at

val cls = DataSource.lookupDataSource(source, sparkSession.sessionState.conf)

The source passed in is "jdbc", so the lookup falls into

serviceLoader.asScala.filter(_.shortName().equalsIgnoreCase(provider1)).toList match {

and ultimately returns org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.
if (classOf[DataSourceV2].isAssignableFrom(cls)) is therefore false, so execution goes straight to loadV1Source(paths: _*).
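The short-name lookup can be reproduced directly; a minimal sketch against the same DataSourceRegister service interface that Spark's serviceLoader iterates (requires spark-sql on the classpath):

import java.util.ServiceLoader
import scala.collection.JavaConverters._
import org.apache.spark.sql.sources.DataSourceRegister

// Enumerate all registered data sources and apply the same short-name filter
// that lookupDataSource uses for the provider string "jdbc".
val loader = Thread.currentThread().getContextClassLoader
val jdbcProviders = ServiceLoader.load(classOf[DataSourceRegister], loader)
  .asScala.filter(_.shortName().equalsIgnoreCase("jdbc")).toList
jdbcProviders.foreach(p => println(p.getClass.getName))
// expected: org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider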
4) DataFrameReader.loadV1Source
private def loadV1Source(paths: String*) = {
  // Code path for data source v1.
  sparkSession.baseRelationToDataFrame(
    DataSource.apply(
      sparkSession,
      paths = paths,
      userSpecifiedSchema = userSpecifiedSchema,
      className = source,
      options = extraOptions.toMap).resolveRelation())
}
Here the paths passed in are empty, and userSpecifiedSchema is empty too: it is only set when schema(...) is called explicitly, typically when reading CSV or Parquet files, as in the sketch below. After that, we continue down into DataSource.resolveRelation.
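A hypothetical example where userSpecifiedSchema would be non-empty (the path and field names are placeholders, and an active SparkSession named spark is assumed):

import org.apache.spark.sql.types._

// Explicitly supplying a schema populates userSpecifiedSchema in DataFrameReader.
val userSchema = StructType(Seq(
  StructField("ID", DecimalType(38, 0)),
  StructField("NAME", StringType)))
val csvDf = spark.read.schema(userSchema).csv("/path/to/data.csv") // placeholder path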
5) DataSource.resolveRelation
This creates a resolved BaseRelation that can be used to read data from, or write data into, this data source.
/**
 * Create a resolved [[BaseRelation]] that can be used to read data from or write data into this
 * [[DataSource]]
 *
 * @param checkFilesExist Whether to confirm that the files exist when generating the
 *                        non-streaming file based datasource. StructuredStreaming jobs already
 *                        list file existence, and when generating incremental jobs, the batch
 *                        is considered as a non-streaming file based data source. Since we know
 *                        that files already exist, we don't need to check them again.
 */
def resolveRelation(checkFilesExist: Boolean = true): BaseRelation = {
  val relation = (providingClass.newInstance(), userSpecifiedSchema) match {
    // TODO: Throw when too much is given.
    case (dataSource: SchemaRelationProvider, Some(schema)) =>
      dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions, schema)
    case (dataSource: RelationProvider, None) =>
      dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
    case (_: SchemaRelationProvider, None) =>
      throw new AnalysisException(s"A schema needs to be specified when using $className.")
    case (dataSource: RelationProvider, Some(schema)) =>
      val baseRelation =
        dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)
      if (baseRelation.schema != schema) {
        throw new AnalysisException(s"$className does not allow user-specified schemas.")
      }
      baseRelation

    // We are reading from the results of a streaming query. Load files from the metadata log
    // instead of listing them using HDFS APIs.
    case (format: FileFormat, _)
        if FileStreamSink.hasMetadata(
          caseInsensitiveOptions.get("path").toSeq ++ paths,
          sparkSession.sessionState.newHadoopConf()) =>
      val basePath = new Path((caseInsensitiveOptions.get("path").toSeq ++ paths).head)
      val tempFileCatalog = new MetadataLogFileIndex(sparkSession, basePath, None)
      val fileCatalog = if (userSpecifiedSchema.nonEmpty) {
        val partitionSchema = combineInferredAndUserSpecifiedPartitionSchema(tempFileCatalog)
        new MetadataLogFileIndex(sparkSession, basePath, Option(partitionSchema))
      } else {
        tempFileCatalog
      }
      val dataSchema = userSpecifiedSchema.orElse {
        format.inferSchema(
          sparkSession,
          caseInsensitiveOptions,
          fileCatalog.allFiles())
      }.getOrElse {
        throw new AnalysisException(
          s"Unable to infer schema for $format at ${fileCatalog.allFiles().mkString(",")}. " +
            "It must be specified manually")
      }

      HadoopFsRelation(
        fileCatalog,
        partitionSchema = fileCatalog.partitionSchema,
        dataSchema = dataSchema,
        bucketSpec = None,
        format,
        caseInsensitiveOptions)(sparkSession)

    // This is a non-streaming file based datasource.
    case (format: FileFormat, _) =>
      val allPaths = caseInsensitiveOptions.get("path") ++ paths
      val hadoopConf = sparkSession.sessionState.newHadoopConf()
      val globbedPaths = allPaths.flatMap(
        DataSource.checkAndGlobPathIfNecessary(hadoopConf, _, checkFilesExist)).toArray

      val fileStatusCache = FileStatusCache.getOrCreate(sparkSession)
      val (dataSchema, partitionSchema) = getOrInferFileFormatSchema(format, fileStatusCache)

      val fileCatalog = if (sparkSession.sqlContext.conf.manageFilesourcePartitions &&
          catalogTable.isDefined && catalogTable.get.tracksPartitionsInCatalog) {
        val defaultTableSize = sparkSession.sessionState.conf.defaultSizeInBytes
        new CatalogFileIndex(
          sparkSession,
          catalogTable.get,
          catalogTable.get.stats.map(_.sizeInBytes.toLong).getOrElse(defaultTableSize))
      } else {
        new InMemoryFileIndex(
          sparkSession, globbedPaths, options, Some(partitionSchema), fileStatusCache)
      }

      HadoopFsRelation(
        fileCatalog,
        partitionSchema = partitionSchema,
        dataSchema = dataSchema.asNullable,
        bucketSpec = bucketSpec,
        format,
        caseInsensitiveOptions)(sparkSession)

    case _ =>
      throw new AnalysisException(
        s"$className is not a valid Spark SQL Data Source.")
  }

  relation match {
    case hs: HadoopFsRelation =>
      SchemaUtils.checkColumnNameDuplication(
        hs.dataSchema.map(_.name),
        "in the data schema",
        equality)
      SchemaUtils.checkColumnNameDuplication(
        hs.partitionSchema.map(_.name),
        "in the partition schema",
        equality)
    case _ =>
      SchemaUtils.checkColumnNameDuplication(
        relation.schema.map(_.name),
        "in the data schema",
        equality)
  }

  relation
}
providingClass.newInstance() returns a JdbcRelationProvider, so execution takes the second case branch:

case (dataSource: RelationProvider, None) =>
  dataSource.createRelation(sparkSession.sqlContext, caseInsensitiveOptions)

which leads into JdbcRelationProvider.createRelation.
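The RelationProvider branch matches because of how JdbcRelationProvider is declared in the Spark source (2.3.x, body elided here):

class JdbcRelationProvider extends CreatableRelationProvider
  with RelationProvider with DataSourceRegister {

  override def shortName(): String = "jdbc"
  // createRelation overloads for read and write follow
}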
6) JdbcRelationProvider.createRelation
override def createRelation(
    sqlContext: SQLContext,
    parameters: Map[String, String]): BaseRelation = {
  import JDBCOptions._

  val jdbcOptions = new JDBCOptions(parameters)
  val partitionColumn = jdbcOptions.partitionColumn
  val lowerBound = jdbcOptions.lowerBound
  val upperBound = jdbcOptions.upperBound
  val numPartitions = jdbcOptions.numPartitions

  val partitionInfo = if (partitionColumn.isEmpty) {
    assert(lowerBound.isEmpty && upperBound.isEmpty, "When 'partitionColumn' is not specified, " +
      s"'$JDBC_LOWER_BOUND' and '$JDBC_UPPER_BOUND' are expected to be empty")
    null
  } else {
    assert(lowerBound.nonEmpty && upperBound.nonEmpty && numPartitions.nonEmpty,
      s"When 'partitionColumn' is specified, '$JDBC_LOWER_BOUND', '$JDBC_UPPER_BOUND', and " +
        s"'$JDBC_NUM_PARTITIONS' are also required")
    JDBCPartitioningInfo(
      partitionColumn.get, lowerBound.get, upperBound.get, numPartitions.get)
  }
  val parts = JDBCRelation.columnPartition(partitionInfo)
  JDBCRelation(parts, jdbcOptions)(sqlContext.sparkSession)
}
This is the core of the JDBC connection, and the overall logic is simple: it returns a JDBCRelation (the partition predicates it builds are sketched below). Back in DataSource.resolveRelation, the relation is checked for duplicate column names and returned, and finally sparkSession.baseRelationToDataFrame turns it into a DataFrame.
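To make the partitioning concrete, here is a simplified re-implementation of the stride logic inside JDBCRelation.columnPartition. This is a sketch that mirrors the Spark source but omits its edge-case handling, and the helper name wherePredicates is mine:

def wherePredicates(column: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
  // Each partition covers one stride of the value range and becomes a WHERE clause.
  val stride = upper / numPartitions - lower / numPartitions
  (0 until numPartitions).map { i =>
    val lBound = if (i == 0) null else s"$column >= ${lower + stride * i}"
    val uBound = if (i == numPartitions - 1) null else s"$column < ${lower + stride * (i + 1)}"
    if (uBound == null) lBound
    else if (lBound == null) s"$uBound or $column is null" // first partition also picks up NULLs
    else s"$lBound AND $uBound"
  }
}

// wherePredicates("id", 0, 100, 4) yields:
//   id < 25 or id is null
//   id >= 25 AND id < 50
//   id >= 50 AND id < 75
//   id >= 75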
7) DataFrame.show
Whether you call show or write the DataFrame somewhere else, the buildScan method of JDBCRelation is eventually invoked; that is where the database is actually read. See PrunedFilteredScan, reproduced below.
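JDBCRelation extends BaseRelation with PrunedFilteredScan; the trait's contract (from org.apache.spark.sql.sources) is:

trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}

buildScan receives the pruned column list and the filters that can be pushed down; JDBCRelation turns them into the SELECT list and WHERE clause of the SQL it executes over JDBC (see JDBCRDD.scanTable).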