A brief introduction to STREAMTABLE (streamed tables) in Hive

In a reduce-side join, the values from the different tables are usually tagged so that the reducer can tell which table each value came from.
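As a rough illustration of tagging, the sketch below uses plain Java rather than Hadoop or Hive classes. The names `TaggingSketch`, `TaggedRow` and `shuffleOrder`, and the convention that tag 0 means table a and tag 1 means table b, are all assumptions made for this example. It only mimics how rows might be tagged on the map side, and how sorting on (join key, tag) delivers one table's rows ahead of the other's within each key group.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class TaggingSketch {
    // Hypothetical stand-in for a tagged map output record (not a Hadoop or Hive class).
    static final class TaggedRow {
        final String joinKey; // value of the join column
        final int tag;        // assumed convention: 0 = table a, 1 = table b
        final String payload; // remaining column data

        TaggedRow(String joinKey, int tag, String payload) {
            this.joinKey = joinKey;
            this.tag = tag;
            this.payload = payload;
        }

        @Override
        public String toString() {
            return joinKey + "/" + tag + "/" + payload;
        }
    }

    // Mimics the shuffle: order by join key, then by tag, so that within a key
    // group every tag-0 row reaches the reducer before any tag-1 row.
    static List<TaggedRow> shuffleOrder(List<TaggedRow> mapOutput) {
        List<TaggedRow> sorted = new ArrayList<>(mapOutput);
        sorted.sort(Comparator.comparing((TaggedRow r) -> r.joinKey)
                              .thenComparingInt(r -> r.tag));
        return sorted;
    }

    public static void main(String[] args) {
        List<TaggedRow> mapOutput = new ArrayList<>();
        mapOutput.add(new TaggedRow("k2", 1, "b-row-1"));
        mapOutput.add(new TaggedRow("k1", 1, "b-row-2"));
        mapOutput.add(new TaggedRow("k1", 0, "a-row-1"));
        mapOutput.add(new TaggedRow("k2", 0, "a-row-2"));
        for (TaggedRow row : shuffleOrder(mapOutput)) {
            System.out.println(row);
        }
    }
}
```

Because the tag takes part in the sort, the reducer can treat the first tag it sees in a key group as the table to buffer and everything after it as the table to stream.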

Consider a join of two tables. For each join key, the reducer works as follows:

On each reduce call, the reducer iterates over the mixed values from both tables for that key.

During the iteration, the values for one tag (table) are stored locally in an ArrayList. This is the buffering step.

As the remaining values are streamed through and values with the other tag are detected, the buffered values of the first tag are read back from the ArrayList. Each streamed value is joined with every buffered value, and the result is written to the output collector.

Because only one table is held in memory, the buffered table should be the smaller one. If the larger table's values were kept in the ArrayList instead, the list could outgrow the memory of the container's JVM and cause an out-of-memory (OOM) error.


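// Imports needed by the enclosing class. TextPair is assumed to be a user-defined
// Writable pair of Text fields (first = data, second = table tag).
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public void reduce(TextPair key, Iterator<TextPair> values,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Buffer for the values of table1 (the table whose rows arrive first)
    ArrayList<Text> table1Values = new ArrayList<Text>();
    // Tag of table1, taken from the key; copied because Hadoop may reuse objects
    Text table1Tag = new Text(key.getSecond());
    while (values.hasNext()) {
        TextPair value = values.next();
        if (value.getSecond().equals(table1Tag)) {
            // Buffer a copy: the iterator reuses the same value object
            table1Values.add(new Text(value.getFirst()));
        } else {
            // Streamed value from the other table: join it with every buffered value
            for (Text val : table1Values) {
                output.collect(key.getFirst(),
                        new Text(val.toString() + "\t" + value.getFirst().toString()));
            }
        }
    }
}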
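
The reducer above depends on Hadoop types and a custom TextPair class, so it cannot be run on its own. As a minimal sketch of the same buffer-and-stream idea, the following plain-Java version can be run directly. The names `StreamJoinSketch`, `Tagged` and the `bufferedTag` parameter are hypothetical. The sketch assumes the values of each key group are already ordered so that all rows of the buffered table come before any row of the streamed table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class StreamJoinSketch {
    // Hypothetical tagged value: the table the row came from, plus its data.
    static final class Tagged {
        final int tag;
        final String value;

        Tagged(int tag, String value) {
            this.tag = tag;
            this.value = value;
        }
    }

    // Joins one key group. Assumes every row tagged bufferedTag arrives
    // before any row of the streamed table.
    static List<String> reduce(String key, Iterable<Tagged> values, int bufferedTag) {
        List<String> buffered = new ArrayList<>(); // held in memory: keep this side small
        List<String> output = new ArrayList<>();
        for (Tagged v : values) {
            if (v.tag == bufferedTag) {
                buffered.add(v.value);
            } else {
                // Streamed row: join it against the buffer, emit, then drop the row.
                for (String b : buffered) {
                    output.add(key + "\t" + b + "\t" + v.value);
                }
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Tagged> group = Arrays.asList(
                new Tagged(0, "small-1"),
                new Tagged(0, "small-2"),
                new Tagged(1, "big-1"),
                new Tagged(1, "big-2"),
                new Tagged(1, "big-3"));
        for (String line : reduce("k1", group, 0)) {
            System.out.println(line);
        }
    }
}
```

In this sketch only the two rows tagged 0 are ever held in the buffer, however many rows the streamed side contributes, which is why the smaller table is the better candidate for buffering.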

By default, Hive streams the last table in the join and buffers the others. You can use the hint below to specify which of the joined tables is streamed on the reduce side:

SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

