Hive Outline - Part I (UDF, JOIN, SELECT, Map-Red)

The Outline of Hive

User-Defined Functions

- UDF

UDF is one-one row function,for example, Substring,trim. A customized class extended from org.apache.hadoop.hive.ql.exec.UDF, And implement method 'evaluate' with multiple signature body. This is not declared as interface method, just because it might need multiple parameters,

this is uncertain on application context. for Example:

public class strip extends UDF {
  public Text evaluate(...){}
}


- UDAF

UDAF is many-one rows (aggregation) function, example, Count,Max. The class extends org.apache.hadoop.hive.ql.exec.UDAF. And The class need multiple nested class that extends from org.apache.hadoop.hive.ql.exec.UDADEvaluator, like:

 class maxint extends UDAF {
  public static class maxintEvaluator extends UDAFEvaluator{
   //very-first
   void init(){}
   
   // for each row in partial
   bool iterator(){}
   
   // finish a partial
   terminatePartial(){}
   
   //merge partial
   bool merge(partialresult);
   
   //final result
   doubleWritable terminate(){}
  }
  public static class maxlongEvaluator extends UDAFEvaluator{
  }
 }

 

- UDTF 

UDTF is one-many (table-generated) function, for example explode(column),
 The class extends org.apache.hadoop.hive.ql.udf.generic.GenericUDTF
 The class implement initialize, process, close() method.
 using with lateral view,like:

 select src.id, mytable.col1, mytable.col2 from src lateral view explode_map
(properties) mytable as col1, col2;


Query Hive

Questions
1. What's LATERAL VIEW? Lateral View is using with split/explode together, which are able to explode Array/Map/Structure to multiple rows. but not able to use join, sort, order and etc. Very limited.

  • View

generate by CVAS (Create View AS Select), it's just a saved query, no cached View in hive. it can show by 'Show tables;', and using 'describe extended view' for details.

  • SubQuery, must have alias, how many levels not sure, need verify on practise.

Select a,b
 From (
   select...
 ) aliashere
 Group by a,b

  • Joins,
    Hive support Inner Join, Left/Right/Full Outer join, On syntax only support eaqual condition, other condition not supported.

    The algorithm will use Reduce Only, which will try to cache left table (FROM table) , and transfer right table (JOIN table). So better set FROM table smaller, this will get best in practise.


       'Explain/Explain Extended' can use rule-based query optimizer, and tell more details about       the query. cost-based optimizer might be adedd in futrue.

    -- Semi Join, which is to repolace IN syntax, example, the right table only be referenced in On condition, and it can't be referenced in SELECT and other key words, for example.
 

Select * from TA left SEMI Join TB on (TA.id = TB.id);

    -- Map-side Join

    If the table is a small data set, then Hive able to optimize the table by In-Memory in mapper, to enhance the speed, that's so called Map size, since it's different with Reduce side jon. Using C style comment, this looks wierd, I guess some unkown reason the developer using this syntax, such as this is a hint, but not a rule. because some condition hive need to justice if it's ready to use map-side join.

SELECT /*+ MAPJOIN(tb2) */ tb1.field1,tb2.filed1
FROM student tb1
JOIN teacher tb2
ON(tb1.teacherId = tb2.id)

In above sample, Table teacher is smaller data set than student.


Without MAPJOIN, hive might use auto convert to optimize common JOIN, some Configurations:

hive.auto.convert.join = true; this enable Auto convert in Hive, if the table is taken as small table, the value is as below setting shows.

hive.smalltable.filesize=10M; small table size standard.

hive.hashtable.max.memory.usage =50; if Map task need memory exceed this percentage, the map task will terminate.

Normally, LEFT OUTER JOIN tb1, and RIGHT OUTER JOIN tb1, try to chck if tb1 is ready to be hashtable.

However, FULL OUTER JOIN never try this, it will use reduce only algorithm. see Joins for details.

- Index

Hive actually create a table, ordered by index fields, and remember the offset to acclerate speed.

in my case the Index Table named by Database__Table_Index__.

CREATE INDEX table02_index ON TABLE table02 (column3) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX table02_index ON table2 REBUILD;
SHOW FORMATTED INDEX ON table02;
DROP INDEX table02_index ON table02;

Note: In previous version of Hive, I still see bucket which is a hash map to enhance map-side join, but now it looks deprecated because Index is a more friendly feature to cover.


- Order By, order by is bad performance without 'limt N' clause.

Difference between Sort By and Order By

Hive supports SORT BY which sorts the data per reducer. The difference between "order by" and "sort by" is that the former guarantees total order in the output while the latter only guarantees ordering of the rows within a reducer. If there are more than one reducer, "sort by" may give partially ordered final results.

Note: It may be confusing as to the difference between SORT BY alone of a single column and CLUSTER BY. The difference is that CLUSTER BY partitions by the field and SORT BY if there are multiple reducers partitions randomly in order to distribute data (and load) uniformly across the reducers.

Basically, the data in each reducer will be sorted according to the order that the user specified. 

FROM (FROM (FROM src
            SELECT TRANSFORM(value)
            USING 'mapper'
            AS value, count) mapped
      SELECT cast(value as double) AS value, cast(count as int) AS count
      SORT BY value, count) sorted
SELECT TRANSFORM(value, count)
USING 'reducer'
AS whatever


Syntax of Cluster By and Distribute By

Cluster By and Distribute By are used mainly with the Transform/Map-Reduce Scripts. But, it is sometimes useful in SELECT statements if there is a need to partition and sort the output of a query for subsequent queries.

Cluster By is a short-cut for both Distribute By and Sort By.

Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.

Instead of specifying Cluster By, the user can specify Distribute By and Sort By, so the partition columns and sort columns can be different. The usual case is that the partition columns are a prefix of sort columns, but that is not required.


SELECT col1, col2 FROM t1 DISTRIBUTE BY col1
 
SELECT col1, col2 FROM t1 DISTRIBUTE BY col1 SORT BY col1 ASC, col2 DESC


Below table run Map-Reduce job and have DISTRIBUTE BY and SORT BY in different fields, if it's same, you may only specify 'CLUSTER BY'.

FROM (
  FROM pv_users
  MAP ( pv_users.userid, pv_users.date )
  USING 'map_script.py'
  AS c1, c2, c3
  DISTRIBUTE BY c2
  SORT BY c2, c1) map_output
INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE ( map_output.c1, map_output.c2, map_output.c3 )
  USING 'reduce_script.py'
  AS date, count;





你可能感兴趣的:(JOIN,hadoop,hive,jdbc,partition,udf)