sc.textFile("/blabla/{*.gz}")
After we create a SparkContext and read files with textFile, what exactly determines how the data is partitioned, and how large is each partition?
textFile creates a HadoopRDD, and that RDD uses the TextInputFormat class to decide how to partition the input.
For every RDD, the getPartitions() method is what produces the partitions. Below is the implementation of HadoopRDD.getPartitions:
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  val inputFormat = getInputFormat(jobConf)
  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
  val array = new Array[Partition](inputSplits.size)
  for (i <- 0 until inputSplits.size) {
    array(i) = new HadoopPartition(id, i, inputSplits(i))
  }
  array
}
As we can see, it delegates to inputFormat.getSplits to split the input files.
The code also shows that HadoopRDD uses TextInputFormat.getSplits, which is in fact FileInputFormat.getSplits.
The code is as follows:
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  Stopwatch sw = new Stopwatch().start();
  FileStatus[] files = listStatus(job);

  // Save the number of input files for metrics/loadgen
  job.setLong(NUM_INPUT_FILES, files.length);
  long totalSize = 0;                           // compute total size
  for (FileStatus file: files) {                // check we have valid files
    if (file.isDirectory()) {
      throw new IOException("Not a file: "+ file.getPath());
    }
    totalSize += file.getLen();
  }

  long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
  long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
    FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

  // generate splits
  ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
  NetworkTopology clusterMap = new NetworkTopology();
  for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      FileSystem fs = path.getFileSystem(job);
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(fs, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
              length-bytesRemaining, splitSize, clusterMap);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
              splitHosts[0], splitHosts[1]));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length
              - bytesRemaining, bytesRemaining, clusterMap);
          splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
              splitHosts[0], splitHosts[1]));
        }
      } else {
        String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, 0, length, clusterMap);
        splits.add(makeSplit(path, 0, length, splitHosts[0], splitHosts[1]));
      }
    } else {
      // Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.elapsedMillis());
  }
  return splits.toArray(new FileSplit[splits.size()]);
}

protected long computeSplitSize(long goalSize, long minSize,
                                long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
This function takes two parameters: jobConf, which carries the Hadoop/cluster configuration, and numSplits, the desired number of splits. numSplits comes from the second argument of textFile(); when we do not pass it, the default is:
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
defaultParallelism is a Spark configuration value (you can look it up yourself); in practice it is usually at least 2, so defaultMinPartitions normally ends up being 2.
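By the way (a minimal sketch; the paths here are placeholders), you can also pass the second argument explicitly, which changes the numSplits hint that getSplits receives:
// Uses defaultMinPartitions (normally 2) as the numSplits hint
val rdd1 = sc.textFile("/blabla/input.txt")

// Passes 8 as the numSplits hint; the actual partition count still depends
// on the file size, block size and SPLIT_SLOP discussed below
val rdd2 = sc.textFile("/blabla/input.txt", minPartitions = 8)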
The org.apache.hadoop.mapreduce.lib.input.FileInputFormat.SPLIT_MINSIZE value is controlled by the mapreduce.input.fileinputformat.split.minsize configuration property and defaults to 1, and minSplitSize also defaults to 1.
So we end up with:
long minSize = Math.max(1, 1); // minSize = 1
long goalSize = totalSize / (2 == 0 ? 1 : 2); // i.e. goalSize = totalSize / 2
Whether a file can be split is decided by the isSplitable method, and that depends on the compression format. Here I assume the file is uncompressed, so it is splittable:
protected boolean isSplitable(FileSystem fs, Path file) {
  final CompressionCodec codec = compressionCodecs.getCodec(file);
  if (null == codec) {
    return true;
  }
  return codec instanceof SplittableCompressionCodec;
}
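As a quick cross-check (a sketch only; the file name is made up), you can reproduce this decision with Hadoop's CompressionCodecFactory. A .gz file, like the ones in the textFile call at the top, maps to GzipCodec, which is not a SplittableCompressionCodec, so such a file would not be split at all:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.{CompressionCodecFactory, SplittableCompressionCodec}

// Mirror the isSplitable check for a hypothetical path
val factory = new CompressionCodecFactory(new Configuration())
val codec = factory.getCodec(new Path("/blabla/part-00000.gz"))
val splittable = codec == null || codec.isInstanceOf[SplittableCompressionCodec]
println(s"splittable = $splittable") // false for .gz, true for an uncompressed .txt
In the non-splittable case, getSplits falls into the else branch above and produces exactly one split per file, regardless of size.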
Next it gets the HDFS block size:
long blockSize = file.getBlockSize(); // blockSize = 128M here
Finally it computes the split size:
long splitSize = computeSplitSize(goalSize, minSize, blockSize); // Math.max(minSize, Math.min(goalSize, blockSize))
Suppose we have a 20M file: splitSize = max(1, min(10, 128)) = 10M, so the file is split into 2 partitions of 10M each.
Suppose we have a 520M file: splitSize = max(1, min(260, 128)) = 128M. Here SPLIT_SLOP, which equals 1.1, also comes into play: after three 128M splits, 520 - 128*3 = 136M remain, and 136/128 ≈ 1.06 < 1.1, so the last 136M becomes the final split. The file therefore ends up with 4 partitions.
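To double-check this arithmetic, here is a small standalone sketch (not the Hadoop code itself, just a re-implementation of computeSplitSize and the SPLIT_SLOP loop under the assumptions above: minSize = 1, numSplits = 2, blockSize = 128M):
val SPLIT_SLOP = 1.1
val MB = 1024L * 1024L

def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

// Count splits the same way FileInputFormat.getSplits does for a single file
def countSplits(length: Long, minSize: Long = 1L, numSplits: Int = 2,
                blockSize: Long = 128 * MB): Int = {
  val goalSize = length / (if (numSplits == 0) 1 else numSplits)
  val splitSize = computeSplitSize(goalSize, minSize, blockSize)
  var bytesRemaining = length
  var splits = 0
  while (bytesRemaining.toDouble / splitSize > SPLIT_SLOP) {
    splits += 1
    bytesRemaining -= splitSize
  }
  if (bytesRemaining != 0) splits += 1
  splits
}

println(countSplits(20 * MB))  // 2
println(countSplits(520 * MB)) // 4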
In short, splitting the input boils down to computing a split size (how many MB per chunk) and cutting each file into pieces of that size; the final partition count is simply the number of splits produced.
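If you want to sanity-check this on a real cluster (the path is hypothetical and assumes a single 520M uncompressed file with a 128M block size), the partition count reported by the RDD should match the split count computed above:
val rdd = sc.textFile("/blabla/big-520mb-file.txt")
println(rdd.getNumPartitions) // expected: 4 under the assumptions above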