Hive map-side join

If all but one of the tables being joined are small, the join can be performed as a map-only job. The query

SELECT  /*+ MAPJOIN(b) */  a.key, a.value
FROM a JOIN b ON a.key = b.key

does not need a reducer. For every mapper of a, b is read completely. The restriction is that a FULL OUTER JOIN or RIGHT OUTER JOIN with b (that is, a FULL/RIGHT OUTER JOIN b) cannot be performed this way.

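If some of the tables being joined are small, a map-side join can be used. The join is then optimized to run only map tasks, with no reduce phase. The limitation is that a FULL/RIGHT OUTER JOIN b is not supported.
Conceptually, the small table is first cached in memory, and the cached small table is then joined against the large table, for example: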
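step 1:
Read the small table's data from HDFS into memory (reading only the small table's key column is enough)
step 2:
On the map side:
for(bigTable.row){
 for(smallTable.row){
  if(bigTable.key==smallTable.key){ out(bigTable.row)}
 }
}
// This is why a right outer join or full outer join is not possible: there is only a map phase, and the output contains only rows of the large table. Each mapper sees just its own part of the large table, so no mapper can tell which small-table rows were never matched anywhere.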
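
The two steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hive's actual implementation: the function names `build_small_table_index` and `map_side_join` and the sample rows are made up for this example, and the nested loop over the small table is replaced by an in-memory hash lookup, which is the usual way to avoid rescanning the small table for every row of the large one.

```python
from collections import defaultdict


def build_small_table_index(small_rows):
    """Step 1: load the small table into memory, grouped by join key.

    Hypothetical helper; rows are (key, value) pairs.
    """
    index = defaultdict(list)
    for key, value in small_rows:
        index[key].append(value)
    return index


def map_side_join(big_split, small_index):
    """Step 2: the work of a single mapper over one split of the large table.

    Emits (key, big_value, small_value) for every matching pair.
    Rows of the large table without a match are simply dropped (inner join).
    """
    for key, big_value in big_split:
        # .get() avoids inserting empty entries into the defaultdict
        for small_value in small_index.get(key, []):
            yield (key, big_value, small_value)


# Made-up sample data: key 4 exists only in the small table.
small = [(1, "x"), (2, "y"), (4, "z")]
index = build_small_table_index(small)

# Two splits of the large table, each handled by an independent mapper.
split_1 = [(1, "a1"), (3, "a3")]
split_2 = [(2, "a2"), (1, "a1b")]

result = [row
          for split in (split_1, split_2)
          for row in map_side_join(split, index)]
print(result)
```

In this sketch, key 4 of the small table never shows up in the output. Each mapper only knows what matched within its own split, so without a reduce phase nothing can decide that key 4 was unmatched globally, which is the reason the right and full outer variants are excluded.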
