The original TeraInputFormat extends FileInputFormat and relies on the parent class's method to compute the splits:
lastResult = super.getSplits(job);
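The splits obtained this way, however, are not neatly laid out row by row, so how does it read the key and value?
It turns out that the data TeraGen generates by default is fixed-length: every record is a 10-byte key followed by a 90-byte value, exactly 100 B in total. The reader can therefore reach the next complete record simply by advancing the offset 100 B at a time, and read the key and value from it: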
static final int KEY_LENGTH = 10;
static final int VALUE_LENGTH = 90;
static final int RECORD_LENGTH = KEY_LENGTH + VALUE_LENGTH;

int read = 0;
// Keep reading until one full 100-byte record is in the buffer.
while (read < RECORD_LENGTH) {
  long newRead = in.read(buffer, read, RECORD_LENGTH - read);
  if (newRead == -1) {
    if (read == 0) {
      return false;                              // clean end of input
    } else {
      throw new EOFException("read past eof");   // input ended mid-record
    }
  }
  read += newRead;
}
if (key == null) {
  key = new Text();
}
if (value == null) {
  value = new Text();
}
key.set(buffer, 0, KEY_LENGTH);                  // first 10 bytes are the key
value.set(buffer, KEY_LENGTH, VALUE_LENGTH);     // remaining 90 bytes are the value
offset += RECORD_LENGTH;
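To make the fixed-length idea concrete, here is a minimal sketch that can be run without Hadoop. It reads 100-byte records from an ordinary InputStream and also shows the boundary-alignment formula that the original TeraInputFormat code uses in initialize (it appears in the commented-out lines later in this post). The class name FixedRecordDemo, the method names readRecords and alignOffset, and the ISO-8859-1 decoding are my own assumptions for illustration; they are not part of TeraSort.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone demo, not the real TeraRecordReader.
class FixedRecordDemo {
    static final int KEY_LENGTH = 10;
    static final int VALUE_LENGTH = 90;
    static final int RECORD_LENGTH = KEY_LENGTH + VALUE_LENGTH;

    // Number of bytes to skip from a split start to reach the next record boundary.
    static long alignOffset(long start) {
        return (RECORD_LENGTH - (start % RECORD_LENGTH)) % RECORD_LENGTH;
    }

    // Reads every complete 100-byte record and returns {key, value} pairs.
    static List<String[]> readRecords(InputStream in) {
        List<String[]> out = new ArrayList<>();
        byte[] buffer = new byte[RECORD_LENGTH];
        try {
            while (true) {
                int read = 0;
                while (read < RECORD_LENGTH) {
                    int n = in.read(buffer, read, RECORD_LENGTH - read);
                    if (n == -1) {
                        if (read == 0) {
                            return out; // clean end of input
                        }
                        throw new EOFException("read past eof");
                    }
                    read += n;
                }
                String key = new String(buffer, 0, KEY_LENGTH, StandardCharsets.ISO_8859_1);
                String value = new String(buffer, KEY_LENGTH, VALUE_LENGTH, StandardCharsets.ISO_8859_1);
                out.add(new String[] {key, value});
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two synthetic records: keys "aaaaaaaaaa" and "bbbbbbbbbb", values filled with 'v'.
        byte[] data = new byte[2 * RECORD_LENGTH];
        Arrays.fill(data, (byte) 'v');
        Arrays.fill(data, 0, KEY_LENGTH, (byte) 'a');
        Arrays.fill(data, RECORD_LENGTH, RECORD_LENGTH + KEY_LENGTH, (byte) 'b');
        for (String[] kv : readRecords(new ByteArrayInputStream(data))) {
            System.out.println(kv[0] + " -> " + kv[1].length() + " value bytes");
        }
        System.out.println("skip from byte 250: " + alignOffset(250));
    }
}
```

Because every record has the same size, a split that starts at an arbitrary byte only needs this one modulo computation to find the next record boundary. That shortcut is exactly what is lost once the records become variable-length.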
The data in the Lineitem table that we generate, on the other hand, is variable-length, as shown below:
1|6370|371|3|8|10210.96|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
1|214|465|4|28|31197.88|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
1|2403|160|5|24|31329.60|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
1|1564|67|6|32|46897.92|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
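Each record consists of 16 columns, but some of the columns vary in length. On top of that, when I was debugging, the split that got read in did not appear to be broken into lines: the records were simply run together one after another. This gave me a real headache, because I could not think of a way to find the boundary of each record, and without boundaries there is no way to extract the key and value record by record. In the end, out of desperation, I came up with a clumsy, brute-force idea:
(1) Although the length of each record is not fixed, one thing is certain: every record has exactly 16 columns;
(2) Every pair of adjacent columns is separated by "|";
Putting these two points together, I decided to split the entire split on "|" and store all the pieces in a single array. Yes, it really is that clumsy and crude. I then keep a field count, starting at 0, and increment it every time a key and value are read, so that the next call picks up the next key and value. In other words, every 16 elements of the array form one record: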
public void initialize(InputSplit split, TaskAttemptContext context)
    throws IOException, InterruptedException {
  // Original TeraInputFormat logic, commented out:
  // Path p = ((FileSplit)split).getPath();
  // FileSystem fs = p.getFileSystem(context.getConfiguration());
  // in = fs.open(p);
  // long start = ((FileSplit)split).getStart();
  // // find the offset to start at a record boundary
  // offset = (RECORD_LENGTH - (start % RECORD_LENGTH)) % RECORD_LENGTH;
  // in.seek(start + offset);
  // length = ((FileSplit)split).getLength();

  Path p = ((FileSplit)split).getPath();
  FileSystem fs = p.getFileSystem(context.getConfiguration());
  in = fs.open(p);
  // Read up to 20000000 bytes into the buffer. Note that this reads from the
  // beginning of the file (there is no seek to the split start) and that a
  // single read() call may return fewer bytes than requested.
  in.read(buffer, 0, 20000000);
  record.set(buffer, 0, 20000000);
  // Break the whole chunk into fields on the '|' separator.
  re = record.toString().split("\\|");
  // count = 0;
}
public boolean nextKeyValue() throws IOException {
  // Original TeraInputFormat logic, commented out:
  // if (offset >= length) {
  //   return false;
  // }
  // int read = 0;
  // while (read < RECORD_LENGTH) {
  //   long newRead = in.read(buffer, read, RECORD_LENGTH - read);
  //   if (newRead == -1) {
  //     if (read == 0) {
  //       return false;
  //     } else {
  //       throw new EOFException("read past eof");
  //     }
  //   }
  //   read += newRead;
  // }
  // if (key == null) {
  //   key = new Text();
  // }
  // if (value == null) {
  //   value = new Text();
  // }
  // key.set(buffer, 0, KEY_LENGTH);
  // value.set(buffer, KEY_LENGTH, VALUE_LENGTH);
  // offset += RECORD_LENGTH;

  if (count < (re.length / 16)) {
    // The first 15 fields of the record, each followed by '|', form the value.
    String valueS = re[count*16] + "|" + re[count*16+1] + "|" + re[count*16+2] + "|"
        + re[count*16+3] + "|" + re[count*16+4] + "|" + re[count*16+5] + "|"
        + re[count*16+6] + "|" + re[count*16+7] + "|" + re[count*16+8] + "|"
        + re[count*16+9] + "|" + re[count*16+10] + "|" + re[count*16+11] + "|"
        + re[count*16+12] + "|" + re[count*16+13] + "|" + re[count*16+14] + "|";
    // The 16th field is used as the key.
    String keyS = re[count*16+15];
    if (value == null) {
      value = new Text();
    }
    if (key == null) {
      key = new Text();
    }
    value.set(valueS);
    key.set(keyS);
    count = count + 1;
    return true;
  }
  // No complete 16-field record is left in the array: signal end of input.
  return false;
}
}
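The 16-fields-per-record grouping can also be tried outside Hadoop. The following is a minimal sketch under my own assumptions: the class name PipeRecordSplitter and the method toKeyValues are hypothetical, and the input is a plain String holding records that are run together with no line breaks, as I saw in the debugger. As in the reader above, the 16th field becomes the key and the first 15 fields (each followed by "|") become the value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone demo of the "every 16 fields is one record" idea.
class PipeRecordSplitter {
    static final int FIELDS_PER_RECORD = 16;

    // Splits a chunk on '|' and groups the fields into {key, value} pairs.
    static List<String[]> toKeyValues(String chunk) {
        String[] fields = chunk.split("\\|");
        List<String[]> out = new ArrayList<>();
        // Integer division drops any incomplete trailing record.
        for (int rec = 0; rec < fields.length / FIELDS_PER_RECORD; rec++) {
            int base = rec * FIELDS_PER_RECORD;
            StringBuilder value = new StringBuilder();
            for (int i = 0; i < FIELDS_PER_RECORD - 1; i++) {
                value.append(fields[base + i]).append('|');
            }
            String key = fields[base + FIELDS_PER_RECORD - 1];
            out.add(new String[] {key, value.toString()});
        }
        return out;
    }

    public static void main(String[] args) {
        // Two of the Lineitem rows shown earlier, concatenated without a line break.
        String chunk =
            "1|6370|371|3|8|10210.96|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|"
            + "TAKE BACK RETURN|REG AIR|riously. regular, express dep|"
            + "1|214|465|4|28|31197.88|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|"
            + "NONE|AIR|lites. fluffily even de|";
        for (String[] kv : toKeyValues(chunk)) {
            System.out.println("key=[" + kv[0] + "] value=[" + kv[1] + "]");
        }
    }
}
```

The sketch also makes the weak points easy to see. A record that is cut in half at the end of the chunk is silently dropped, and every field after it would be shifted if the chunk started in the middle of a record. If the data does contain line breaks between records, the newline ends up attached to the first field of the following record.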
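Testing showed that this does pick up the correct key and value, but the method is frankly rather comical. When I reported it to my teacher I was, unsurprisingly, teased about it, haha. The teacher was startled by the buffer I had set to 20000000 and called it an undergraduate's kind of idea (well, I am an undergraduate, haha). Within minutes the teacher tossed over a far more sophisticated way of writing it, and I will write up a summary once I have studied it!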