A "split" is a logical concept, in contrast to a "block", which is the real storage unit.
When a client submits a job to the JobTracker (JT), it computes the splits file by file; the TaskTracker (TT) then hands an InputSplit to each map task.
So splits are used to spawn mappers. If you use FileInputFormat and make isSplitable() return false, the file will NOT be split, so the whole file goes to one mapper.
The RecordReader turns the data of a split, which the client cut before submitting the job to the JT, back into records inside the map task. So you can also read a whole split as one record.
Combining FileInputFormat and RecordReader, you can get exactly one record for a whole file:
a make isSplitable() return false;
b rewrite next() in the RecordReader to read the whole split at once.
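Step b can be pictured with a small stand-alone sketch. It is only a model of the idea: WholeSplitReader, its constructor and getCurrentValue() are hypothetical names, and the class does not implement the real Hadoop RecordReader interface. It just shows a next() that succeeds once and exposes the entire split as a single record.

```java
/**
 * Hypothetical stand-in for a record reader (NOT the Hadoop RecordReader
 * interface): the whole, unsplit file is delivered as exactly one record.
 */
class WholeSplitReader {
    private final byte[] split;      // bytes of the single, unsplit file
    private boolean consumed = false;
    private byte[] current = null;

    WholeSplitReader(byte[] split) {
        this.split = split;
    }

    /** Returns true exactly once; afterwards there are no more records. */
    boolean next() {
        if (consumed) {
            current = null;
            return false;
        }
        current = split.clone();     // the one record is the whole split
        consumed = true;
        return true;
    }

    /** The record produced by the last successful next(), or null. */
    byte[] getCurrentValue() {
        return current;
    }
}
```

A mapper fed by such a reader would be called once per file, with the complete file content as its value.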
How is the split size computed?
Formula in the new version:
split size = max(min-split-size,min(max-split-size,blocksize))
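Note that the final number of splits is not simply the file length divided by the split size: a split slop (a factor of 1.1 in FileInputFormat) is applied, so a small remainder is folded into the last split instead of becoming a tiny split of its own.
Is that meant to account for seek performance?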
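The formula and the slop rule can be sketched in plain Java. This is a minimal model, not the Hadoop source: the class and method names are made up for illustration, and the loop is written from my reading of how FileInputFormat applies its 1.1 slop factor.

```java
/** Illustrative model of new-API split sizing; names are hypothetical. */
class NewApiSplitSizing {
    // Assumed slop factor: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    /** split size = max(min-split-size, min(max-split-size, blocksize)) */
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Counts the splits of one non-empty file when the slop rule is applied. */
    static int countSplits(long fileLength, long splitSize) {
        int count = 0;
        long remaining = fileLength;
        // Cut full-size splits only while more than 1.1 split sizes remain.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            count++;
            remaining -= splitSize;
        }
        // Whatever is left (at most 1.1 * splitSize) becomes the last split.
        if (remaining != 0) {
            count++;
        }
        return count;
    }
}
```

With a 64 MB split size, a 130 MB file yields 2 splits under this rule (64 MB and 66 MB), whereas plain division rounded up would give 3.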
old version formula:
split size = max(min-split-size,min(goalsize,blocksize))
The goalsize is obtained by dividing the total size of all input files by numMapTasks.
Of course, a split slop is applied here as well.
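A short sketch of the old formula, again as a stand-alone model with hypothetical names rather than the Hadoop code itself. Guarding against numMapTasks being 0 is my assumption about how the division is kept safe.

```java
/** Illustrative model of old-API split sizing; names are hypothetical. */
class OldApiSplitSizing {
    /** goalsize = total size of all input files / numMapTasks */
    static long goalSize(long totalSize, int numMapTasks) {
        // Assumption: treat 0 requested map tasks as 1 to avoid dividing by zero.
        return totalSize / (numMapTasks == 0 ? 1 : numMapTasks);
    }

    /** split size = max(min-split-size, min(goalsize, blocksize)) */
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }
}
```

For 1 GB of input and a 64 MB block size, asking for 4 map tasks gives a goalsize of 256 MB, so the block size wins and splits are 64 MB. Asking for 100 map tasks gives a goalsize of about 10 MB, which becomes the split size, so in the old API the requested number of maps can shrink splits below a block.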
Finally, the client writes a split file to the DFS that summarizes the info of all splits. So a split is only a logical description, which gives the application a second chance to adapt to the input size once it is running in the mapper.
how to restore records from split file?
Yes, this is an exciting subject, because when the client generates the splits before submitting the job, it considers neither line length (a line may exceed the threshold of mapred.linerecordreader.maxlength) nor whether a cut lands in the middle of a non-ASCII character.
In local mode, LocalJobRunner is what runs the tasks. Of course, a LineReader is used to turn every split (a fragment of the file, actually) back into lines that are pushed to a mapper. These are the important points that make it work:
A each split keeps its raw file (parent file) as a property, and
it keeps a pair of values: the current data offset (relative to the raw file) and the current data length of the split.
B CR and LF are both ASCII codes (that means each is a single byte that cannot itself be split, which avoids problems when processing splits).
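Points A and B can be put together in a small sketch of how lines are recovered from an (offset, length) pair over the raw file. It is a simplified model under my own assumptions, not the LocalJobRunner or LineReader code: the class and method names are hypothetical, the raw file is an in-memory byte array, and it follows the convention that a split not starting at offset 0 backs up one byte and discards everything up to the first LF, while every split finishes the line it is in even if that runs past its end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustrative line recovery for one split; names are hypothetical. */
class SplitLineReader {
    /**
     * @param raw    bytes of the raw (parent) file
     * @param start  offset of the split relative to the raw file
     * @param length length of the split in bytes
     */
    static List<String> readSplit(byte[] raw, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Back up one byte and skip to just past the next LF: the line cut
            // by the boundary belongs to the previous split.
            pos = start - 1;
            while (pos < raw.length && raw[pos] != '\n') pos++;
            pos++;
        }
        List<String> lines = new ArrayList<>();
        // Start a new line only inside the split, but always read it to its end.
        while (pos < end && pos < raw.length) {
            int lineStart = pos;
            while (pos < raw.length && raw[pos] != '\n') pos++;
            int lineEnd = pos;
            if (lineEnd > lineStart && raw[lineEnd - 1] == '\r') lineEnd--; // drop CR
            lines.add(new String(raw, lineStart, lineEnd - lineStart, StandardCharsets.UTF_8));
            pos++; // step over the LF
        }
        return lines;
    }
}
```

Because the LF byte never appears inside a multi-byte UTF-8 character, scanning for it stays safe even when the split boundary falls in the middle of a non-ASCII character: the damaged line is skipped by one split and read whole by its neighbour.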
That is how local mode does it. What about a real cluster? TODO :)
By the way, there is a trick in LocalJobRunner to avoid re-splitting the raw file; have a look at its job.run():
if (job.getUseNewMapper()) { ... }else{ .. }
You can use JobClient.getSplits() instead of it; maybe this is an "optimization" :)