The physical execution node for reading a file-based data source is FileSourceScanExec. For a non-bucketed scan it calls createNonBucketedReadRDD, defined as follows:
private def createNonBucketedReadRDD(
    readFile: (PartitionedFile) => Iterator[InternalRow],
    selectedPartitions: Seq[PartitionDirectory],
    fsRelation: HadoopFsRelation): RDD[InternalRow] = {
  // upper bound on the size of a read partition when packing files; default 256 MB,
  // roughly one HDFS block
  val defaultMaxSplitBytes =
    fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
  // cost of opening a file, counted as if at least this many bytes were read; default 4 MB
  val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
  // total number of available cores (400 in the examples below)
  val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
  // total bytes to read, taken from the file system, plus one open cost per file
  val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
  // bytes each core would read if the work were spread evenly
  val bytesPerCore = totalBytes / defaultParallelism
  // the resulting split size never exceeds the configured 256 MB
  val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
    s"open cost is considered as scanning $openCostInBytes bytes.")
  val splitFiles = selectedPartitions.flatMap { partition =>
    partition.files.flatMap { file =>
      val blockLocations = getBlockLocations(file)
      // If the file format is splittable, cut the file into consecutive chunks of at
      // most maxSplitBytes each, yielding a sequence of PartitionedFiles.
      if (fsRelation.fileFormat.isSplitable(
          fsRelation.sparkSession, fsRelation.options, file.getPath)) {
        (0L until file.getLen by maxSplitBytes).map { offset =>
          val remaining = file.getLen - offset
          // If more than maxSplitBytes remain, emit a full-sized split; otherwise the
          // remainder becomes the last, smaller split. A file therefore becomes several
          // 256 MB (or maxSplitBytes-sized) splits plus one small tail split.
          val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
          val hosts = getBlockHosts(blockLocations, offset, size)
          PartitionedFile(
            partition.values, file.getPath.toUri.toString, offset, size, hosts)
        }
      } // else: a non-splittable file yields a single PartitionedFile (omitted)
    }
  }
  // ... the splits are then sorted by size and bin packed into partitions (omitted)

Example: partition hdfs:/day=2023-07-04/hour=10, 53.3 G of data (159.9 G with replication).
It holds 103 parquet files of roughly 530 MB each; 53.3 GB in total = 54579.2 MB.
1. Observed result. Resources requested: --num-executors 200 --executor-memory 8G --executor-cores 2 (400 cores in total).
Tasks actually launched: 412.
defaultMaxSplitBytes = 256 MB
openCostInBytes = 4 MB
totalBytes = 54579.2 + 103 * 4 MB = 54991.2 MB
bytesPerCore = 54991.2 MB / 400 = 137.5 MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 137.5 MB
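With maxSplitBytes at about 137.5 MB, each ~530 MB file is cut into three full splits plus one tail. A minimal sketch of the per-file split loop, extracted on its own (sizes in KB; the helper name splitSizes is mine, not Spark's):

```scala
// Mirrors the (0L until file.getLen by maxSplitBytes) loop in createNonBucketedReadRDD:
// returns the sizes of the splits produced for a single file.
def splitSizes(fileLen: Long, maxSplitBytes: Long): Seq[Long] =
  (0L until fileLen by maxSplitBytes).map { offset =>
    val remaining = fileLen - offset
    if (remaining > maxSplitBytes) maxSplitBytes else remaining
  }

// A 530 MB file with maxSplitBytes = 137.5 MB, both expressed in KB:
// three full 140800 KB splits and a 120320 KB tail -> 4 splits per file,
// so 103 files * 4 = 412 splits.
println(splitSizes(530 * 1024, (137.5 * 1024).toLong))
```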
// Where 412 comes from: the split logic extracted into a standalone test (all sizes in KB)
val openCostInBytes = 4 * 1024                                  // 4 MB
val totalBytes = (53.3 * 1024 * 1024 + 103 * 4 * 1024).toInt    // data plus open costs
val maxSplitBytes = (137.5 * 1024).toInt                        // 137.5 MB
// 103 files of ~530 MB each, each cut into chunks of at most maxSplitBytes
val data = Array.fill(103)(530 * 1024).flatMap { fileLen =>
  (0 until fileLen by maxSplitBytes).map { offset =>
    val remaining = fileLen - offset
    val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
    size
  }
}.sortBy(-_)
data.foreach(println(_))
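The 412 sizes printed above take only two distinct values; the distribution can be checked compactly (same KB units; the grouping is my addition, not part of Spark):

```scala
// Group the simulated split sizes: 103 files * 4 splits each = 412 splits in total,
// of which 309 are full 140800 KB splits and 103 are 120320 KB tails.
val maxSplit = (137.5 * 1024).toInt
val splits = Array.fill(103)(530 * 1024).flatMap { fileLen =>
  (0 until fileLen by maxSplit).map(offset => math.min(fileLen - offset, maxSplit))
}
val dist = splits.groupBy(identity).map { case (sz, xs) => sz -> xs.length }
println(dist)            // two sizes: 140800 KB (x309) and 120320 KB (x103)
println(splits.length)   // 412
```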
import scala.collection.mutable.ArrayBuffer

val partitions = new ArrayBuffer[Int]
val currentFiles = new ArrayBuffer[Int]
var currentSize = 0L
/** Close the current partition and move to the next. */
def closePartition(): Unit = {
  if (currentFiles.nonEmpty) {
    // Record one more partition (Spark copies currentFiles into a FilePartition here).
    partitions.append(1)
  }
  currentFiles.clear()
  currentSize = 0
}
// Assign files to partitions using "Next Fit Decreasing"
data.foreach { file =>
  if (currentSize + file > maxSplitBytes) {
    closePartition()
  }
  // Add the given file to the current partition.
  currentSize += file + openCostInBytes
  currentFiles += file
}
closePartition()
println(partitions.size) // 412
2. Observed result. Resources requested: --num-executors 100 --executor-memory 8G --executor-cores 1 (100 cores in total).
Tasks actually launched: 215 = ceil(54991.2 / 256).
defaultMaxSplitBytes = 256 MB
openCostInBytes = 4 MB
totalBytes = 54579.2 + 103 * 4 MB = 54991.2 MB
bytesPerCore = 54991.2 MB / 100 = 549.91 MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 256 MB
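The 215 figure quoted above equals a straight ceiling division of the padded total by the 256 MB cap; a quick arithmetic check (values in MB, taken from the figures above):

```scala
// totalMB already includes one 4 MB open cost per file
val totalMB = 54579.2 + 103 * 4.0      // = 54991.2 MB
val maxSplitMB = 256.0                 // capped by filesMaxPartitionBytes
println(math.ceil(totalMB / maxSplitMB).toInt)  // 215
```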
Another example: partition hdfs:///day=2023-07-03/hour=22/minute=00, 4.2 G of data (12.7 G with replication), 120 files.
defaultMaxSplitBytes = 256 MB
openCostInBytes = 4 MB
totalBytes = 4300.8 + 120 * 4 MB = 4780.8 MB
bytesPerCore = 4780.8 MB / 400 = 11.95 MB
maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore)) = 11.95 MB
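The three scenarios differ only in the inputs to the same formula; a one-function summary (values in MB; the function name and default arguments are mine, mirroring the defaults of filesMaxPartitionBytes and filesOpenCostInBytes):

```scala
// maxSplitBytes = min(defaultMaxSplitBytes, max(openCostInBytes, bytesPerCore))
def maxSplitMB(totalMB: Double, cores: Int,
               defaultMaxMB: Double = 256.0, openCostMB: Double = 4.0): Double =
  math.min(defaultMaxMB, math.max(openCostMB, totalMB / cores))

println(maxSplitMB(54991.2, 400))  // ~137.5 MB -> 412 tasks (example 1)
println(maxSplitMB(54991.2, 100))  // 256.0 MB, capped  -> 215 tasks (example 2)
println(maxSplitMB(4780.8, 400))   // ~11.95 MB (example 3)
```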