Whether Spark Core reads from HDFS or from the local file system, textFile ultimately goes through hadoopFile. The difference lies in where the block size comes from: for a local file you can control it by putting an hdfs-site.xml into the resources directory and setting fs.local.block.size, whereas for an HDFS file the blockSize is the actual block size the file is stored with on HDFS. Also note that, while blockSize influences the number of partitions, it is not the only deciding factor. Let's take a quick walk through the source code. (Spark version: 2.2.1)
val value: RDD[String] = sc.textFile("hdfs://bd-offcn-01:8020/winput/w2.txt")
val value: RDD[String] = sc.textFile("file:\\C:\\Users\\王俊\\Desktop\\work\\hbase.txt")
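A quick hedged sketch before diving in (it reuses the sc from the examples above; the output depends on your file and block sizes): textFile's second argument is the requested minimum number of partitions, and getNumPartitions shows how many splits were actually produced.
// Ask for at least 4 partitions on the same HDFS file; as the getSplits
// walkthrough below shows, this is a lower bound, not an exact count.
val withMin: RDD[String] = sc.textFile("hdfs://bd-offcn-01:8020/winput/w2.txt", 4)
println(withMin.getNumPartitions)   // usually >= 4, depending on file and block sizes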
First of all, no matter whether textFile reads a local file or an HDFS file, the splits are computed by getSplits:
public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
  Stopwatch sw = new Stopwatch().start();
  // This is where the block size lookup starts; step into listStatus if you are curious
  FileStatus[] files = listStatus(job);

  // Save the number of input files for metrics/loadgen
  job.setLong(NUM_INPUT_FILES, files.length);
  long totalSize = 0;                       // compute total size
  for (FileStatus file: files) {            // check we have valid files
    if (file.isDirectory()) {
      throw new IOException("Not a file: "+ file.getPath());
    }
    // totalSize is the combined size of all files under the input path
    totalSize += file.getLen();
  }

  long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
  long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
    FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

  // generate splits
  ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
  NetworkTopology clusterMap = new NetworkTopology();
  for (FileStatus file: files) {
    Path path = file.getPath();
    long length = file.getLen();
    if (length != 0) {
      FileSystem fs = path.getFileSystem(job);
      BlockLocation[] blkLocations;
      if (file instanceof LocatedFileStatus) {
        blkLocations = ((LocatedFileStatus) file).getBlockLocations();
      } else {
        blkLocations = fs.getFileBlockLocations(file, 0, length);
      }
      if (isSplitable(fs, path)) {
        long blockSize = file.getBlockSize();
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);

        long bytesRemaining = length;
        while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
              length-bytesRemaining, splitSize, clusterMap);
          splits.add(makeSplit(path, length-bytesRemaining, splitSize,
              splitHosts[0], splitHosts[1]));
          bytesRemaining -= splitSize;
        }

        if (bytesRemaining != 0) {
          String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, length
              - bytesRemaining, bytesRemaining, clusterMap);
          splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
              splitHosts[0], splitHosts[1]));
        }
      } else {
        String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations, 0, length, clusterMap);
        splits.add(makeSplit(path, 0, length, splitHosts[0], splitHosts[1]));
      }
    } else {
      // Create empty hosts array for zero length files
      splits.add(makeSplit(path, 0, length, new String[0]));
    }
  }
  sw.stop();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Total # of splits generated by getSplits: " + splits.size()
        + ", TimeTaken: " + sw.elapsedMillis());
  }
  return splits.toArray(new FileSplit[splits.size()]);
}
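If you want to watch getSplits at work outside of Spark, here is a minimal sketch using the same old mapred API that textFile goes through; it assumes the HDFS path from the earlier example is reachable from your classpath configuration, and the requested split count of 3 is arbitrary.
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}

// Point a JobConf at the same file and ask for 3 splits, which is the role
// that textFile's minPartitions plays when Spark calls getSplits.
val conf = new JobConf()
FileInputFormat.setInputPaths(conf, "hdfs://bd-offcn-01:8020/winput/w2.txt")
val inputFormat = new TextInputFormat()
inputFormat.configure(conf)            // sets up the compression codec factory
val splits = inputFormat.getSplits(conf, 3)
println(splits.length)                 // the partition count the RDD would get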
This method has already been covered in detail elsewhere, so I won't rehash all of it here; instead, let's focus on two core spots:
// Compute the size of each split
long splitSize = computeSplitSize(goalSize, minSize, blockSize);

// Below is the computeSplitSize method. It shows why the second argument of
// textFile (the requested number of partitions) is only a minimum: when
// goalSize > blockSize the actual number of partitions ends up larger than
// what you asked for, and when goalSize < blockSize it roughly equals it.
protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
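A quick worked check of that formula (the file and block sizes below are made up): with 128 MB HDFS blocks and a single 200 MB file, asking for 2 partitions gives goalSize = 100 MB < blockSize, so splitSize = 100 MB and you get exactly 2 splits; asking for 1 partition gives goalSize = 200 MB > blockSize, so splitSize falls back to 128 MB and you still end up with 2 splits, more than requested.
// A standalone re-implementation of the formula just to check the numbers;
// this is not Hadoop's code and the sizes are hypothetical.
def computeSplitSize(goalSize: Long, minSize: Long, blockSize: Long): Long =
  math.max(minSize, math.min(goalSize, blockSize))

val blockSize = 128L * 1024 * 1024      // assumed HDFS block size
val length    = 200L * 1024 * 1024      // assumed size of a single file
val minSize   = 1L

println(computeSplitSize(length / 2, minSize, blockSize))  // 104857600 -> 2 splits of ~100 MB
println(computeSplitSize(length / 1, minSize, blockSize))  // 134217728 -> still 2 splits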
private static final double SPLIT_SLOP = 1.1;

// A file only keeps being split while the remaining bytes exceed 1.1 times
// splitSize. length is the size of a single file; if the input path contains
// several files, the split counts computed for each file are added together.
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
  String[][] splitHosts = getSplitHostsAndCachedHosts(blkLocations,
      length-bytesRemaining, splitSize, clusterMap);
  splits.add(makeSplit(path, length-bytesRemaining, splitSize,
      splitHosts[0], splitHosts[1]));
  bytesRemaining -= splitSize;
}
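To see when the 1.1 slop matters, here is a tiny sketch with made-up sizes: a file only slightly larger than one splitSize is not cut again, which avoids a near-empty trailing partition.
// Hypothetical sizes, only to illustrate the SPLIT_SLOP check above.
val SPLIT_SLOP = 1.1
val splitSize  = 128.0 * 1024 * 1024
println(135.0 * 1024 * 1024 / splitSize > SPLIT_SLOP)  // false -> stays 1 split
println(145.0 * 1024 * 1024 / splitSize > SPLIT_SLOP)  // true  -> cut into 2 splits
The last ingredient is the block size itself. For local files it comes from the local file system's getDefaultBlockSize, quoted next: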
@Deprecated
public long getDefaultBlockSize() {
  // default to 32MB: large enough to minimize the impact of seeks
  return getConf().getLong("fs.local.block.size", 32 * 1024 * 1024);
}
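As mentioned at the top, one way to change this for local reads is an hdfs-site.xml in resources. A hedged alternative sketch is to set fs.local.block.size on the SparkContext's Hadoop configuration before reading; whether an already-cached local FileSystem picks the value up can depend on when it is set, so treat this as a sketch, not a guarantee.
// Ask for 64 MB local "blocks" instead of the 32 MB default; the size is arbitrary.
sc.hadoopConfiguration.setLong("fs.local.block.size", 64L * 1024 * 1024)
val local: RDD[String] = sc.textFile("file:\\C:\\Users\\王俊\\Desktop\\work\\hbase.txt")
println(local.getNumPartitions)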
// For HDFS files, on the other hand, the block size comes back as part of the file status;
// both of the methods below are from CodedInputStream (protobuf), which decodes it
/** Read a raw Varint from the stream. */
public long readRawVarint64() throws IOException {
  int shift = 0;
  long result = 0;
  while (shift < 64) {
    final byte b = readRawByte();
    result |= (long)(b & 0x7F) << shift;
    if ((b & 0x80) == 0) {
      return result;
    }
    shift += 7;
  }
  throw InvalidProtocolBufferException.malformedVarint();
}

// Below is the readRawByte method
public byte readRawByte() throws IOException {
  if (bufferPos == bufferSize) {
    refillBuffer(true);
  }
  return buffer[bufferPos++];
}
From this you can see that for an HDFS file, the blockSize used in the split computation is the file's actual block size as stored on HDFS, decoded from the returned file status rather than taken from any local setting.
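To get a feel for the varint decoding above, here is a minimal standalone sketch of the same loop, fed with hand-written bytes; the bytes encode 134217728 (a 128 MB block size) and are made up for illustration, not captured from a real NameNode response.
// Mirrors readRawVarint64: 7 payload bits per byte, high bit means "more bytes follow".
def readRawVarint64(bytes: Iterator[Byte]): Long = {
  var shift = 0
  var result = 0L
  while (shift < 64) {
    val b = bytes.next()
    result |= (b & 0x7FL) << shift
    if ((b & 0x80) == 0) return result
    shift += 7
  }
  throw new IllegalStateException("malformed varint")
}

// 0x80 0x80 0x80 0x40 is the varint encoding of 128 * 1024 * 1024
println(readRawVarint64(Iterator(0x80, 0x80, 0x80, 0x40).map(_.toByte)))  // 134217728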
That's all for this post; thanks for reading!