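Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.
Hive is designed to enable easy data summarization, ad hoc querying, and analysis of large volumes of data. It provides SQL, which lets users query, summarize, and analyze data easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality for custom analysis, such as User Defined Functions (UDFs).
Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.
For details on setting up Hive, HiveServer2, and Beeline, see the GettingStarted guide.
Books about Hive lists some books that may help you get started with Hive.
In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with some examples.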
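In order of granularity, Hive data is organized into:
timestamp: an INT that corresponds to the UNIX timestamp of when the page was viewed.
userid: a BIGINT that identifies the user who viewed the page.
page_url: a STRING that captures the location of the page.
referer_url: a STRING that captures the location of the page from which the user arrived at the current page.
IP: a STRING that captures the IP address from which the page request was made.
Note that tables need not be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.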
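How much data can be pruned depends on how the table is declared. The following is only a sketch: the table name page_views, the partition column dt, the bucket count of 32, and the column name view_time (used instead of timestamp, which is also a type name in Hive) are all assumptions made for illustration and do not come from the text above.

```sql
-- Hypothetical table holding the columns described above.
-- page_views, view_time, dt and the bucket count are illustrative assumptions.
CREATE TABLE page_views (
  view_time   INT,
  userid      BIGINT,
  page_url    STRING,
  referer_url STRING,
  ip          STRING
)
PARTITIONED BY (dt STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;

-- A filter on the partition column lets Hive read only the matching partition.
SELECT page_url
FROM page_views
WHERE dt = '2008-06-08';
```

Partitioning on a column that most queries filter on is what makes the pruning effective; a table without PARTITIONED BY or CLUSTERED BY is still valid.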
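Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.
The types are organized in the following hierarchy (where the parent is a supertype of all its child instances):
This type hierarchy defines how types are implicitly converted in the query language. Implicit conversion is allowed from a child type to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.
Explicit type conversion can be done using the cast operator, as shown in the Built-in Functions section below.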
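The two kinds of conversion can be tried directly in Beeline. This is a minimal sketch that assumes a Hive version accepting SELECT without a FROM clause (0.13 or later); the literal values are arbitrary.

```sql
-- Implicit conversion: the STRING '1.5' is converted to DOUBLE before the addition.
SELECT '1.5' + 2;

-- Explicit conversion: the STRING '1' becomes the BIGINT 1.
SELECT cast('1' AS BIGINT);

-- A cast that cannot succeed yields NULL rather than an error.
SELECT cast('abc' AS INT);
```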
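Complex types can be built up from primitive types and other composite types using structs, maps, and arrays.
Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields:
The operators and functions listed below are not necessarily up to date. (Hive Operators and UDFs has more current information.) In Beeline or the Hive CLI, use these commands to show the latest documentation: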
SHOW FUNCTIONS;
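DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
Case-insensitive
All Hive keywords are case-insensitive, including the names of Hive operators and functions.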
Relational Operator | Operand types | Description |
A = B | all primitive types | TRUE if expression A is equivalent to expression B; otherwise FALSE |
A != B | all primitive types | TRUE if expression A is not equivalent to expression B; otherwise FALSE |
A < B | all primitive types | TRUE if expression A is less than expression B; otherwise FALSE |
A <= B | all primitive types | TRUE if expression A is less than or equal to expression B; otherwise FALSE |
A > B | all primitive types | TRUE if expression A is greater than expression B; otherwise FALSE |
A >= B | all primitive types | TRUE if expression A is greater than or equal to expression B; otherwise FALSE |
A IS NULL | all types | TRUE if expression A evaluates to NULL; otherwise FALSE |
A IS NOT NULL | all types | FALSE if expression A evaluates to NULL; otherwise TRUE |
A LIKE B | strings | TRUE if string A matches the SQL simple regular expression B; otherwise FALSE. The comparison is done character by character. The _ character in B matches any single character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' and 'foobar' LIKE 'foo%' both evaluate to TRUE. |
A RLIKE B | strings | NULL if A or B is NULL; TRUE if any (possibly empty) substring of A matches the Java regular expression B (see Java regular expressions syntax); otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to TRUE, and so does 'foobar' RLIKE '^f.*r$'. |
A REGEXP B | strings | Same as RLIKE |
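The difference between LIKE and RLIKE is easiest to see side by side. The statements below are a small sketch, assuming a Hive version that accepts SELECT without a FROM clause; each comment states the result that follows from the descriptions in the table.

```sql
-- No wildcard: the pattern must match the whole string, so this is FALSE.
SELECT 'foobar' LIKE 'foo';

-- Three underscores match exactly three characters: TRUE.
SELECT 'foobar' LIKE 'foo___';

-- % matches any run of characters, including none: TRUE.
SELECT 'foobar' LIKE 'foo%';

-- RLIKE succeeds if any substring matches the Java regular expression: TRUE.
SELECT 'foobar' RLIKE 'foo';

-- Anchors force the regular expression to cover the whole string: TRUE.
SELECT 'foobar' RLIKE '^f.*r$';

-- A NULL operand makes the result NULL rather than FALSE.
SELECT cast(NULL AS STRING) RLIKE 'foo';
```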
Arithmetic Operators | Operand types | Description |
A + B | all number types | Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. For example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int results in a float. |
A - B | all number types | Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
A * B | all number types | Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy. |
A / B | all number types | Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division. |
A % B | all number types | Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
A & B | all number types | Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
A | B | all number types | Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
A ^ B | all number types | Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
~A | all number types | Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. |
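A few of these operators are shown below on literal values. This is an illustrative sketch, assuming a Hive version that accepts SELECT without a FROM clause; the operand values are arbitrary.

```sql
-- Remainder: 7 % 3 is 1.
SELECT 7 % 3;

-- Bitwise operators on 6 (binary 110) and 3 (binary 011):
-- AND gives 2, OR gives 7, XOR gives 5.
SELECT 6 & 3, 6 | 3, 6 ^ 3;

-- Bitwise NOT of 6 is -7 in two's complement.
SELECT ~6;

-- Avoiding overflow: cast one operand to BIGINT before multiplying,
-- because 4000000000 does not fit in an INT.
SELECT cast(2000000000 AS BIGINT) * 2;
```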
Logical Operators | Operand types | Description |
A AND B | boolean | TRUE if both A and B are TRUE; otherwise FALSE |
A && B | boolean | Same as A AND B |
A OR B | boolean | TRUE if either A or B or both are TRUE; otherwise FALSE |
A || B | boolean | Same as A OR B |
NOT A | boolean | TRUE if A is FALSE; otherwise FALSE |
!A | boolean | Same as NOT A |
Operator | Operand types | Description |
A[n] | A is an Array and n is an int | returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar' |
M[key] | M is a Map | returns the value corresponding to the key in the map. For example, if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar' |
S.x | S is a struct | returns the x field of S. For example, for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct. |
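The three access operators can be exercised without creating a table by building the values inline. This sketch assumes the built-in constructor functions array(), map(), and named_struct(), and a Hive version that accepts a subquery without a FROM clause; the aliases t, arr, m, and s are arbitrary names chosen here.

```sql
-- Build one row holding an array, a map and a struct, then read from each.
-- arr[0] is 'foo', m['all'] is 'foobar', s.foo is 1.
SELECT t.arr[0], t.m['all'], t.s.foo
FROM (
  SELECT array('foo', 'bar')                             AS arr,
         map('f', 'foo', 'b', 'bar', 'all', 'foobar')    AS m,
         named_struct('foo', 1, 'bar', 2)                AS s
) t;
```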
Return Type | Function Name (Signature) | Description |
BIGINT | round(double a) | returns the rounded BIGINT value of the double |
BIGINT | floor(double a) | returns the maximum BIGINT value that is equal to or less than the double |
BIGINT | ceil(double a) | returns the minimum BIGINT value that is equal to or greater than the double |
double | rand(), rand(int seed) | returns a random number (that changes from row to row). Specifying the seed makes the generated random number sequence deterministic. |
string | concat(string A, string B,...) | returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them. |
string | substr(string A, int start) | returns the substring of A starting from the start position to the end of string A. For example, substr('foobar', 4) results in 'bar' |
string | substr(string A, int start, int length) | returns the substring of A starting from the start position with the given length. For example, substr('foobar', 4, 2) results in 'ba' |
string | upper(string A) | returns the string resulting from converting all characters of A to upper case. For example, upper('fOoBaR') results in 'FOOBAR' |
string | ucase(string A) | Same as upper |
string | lower(string A) | returns the string resulting from converting all characters of A to lower case. For example, lower('fOoBaR') results in 'foobar' |
string | lcase(string A) | Same as lower |
string | trim(string A) | returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar' |
string | ltrim(string A) | returns the string resulting from trimming spaces from the beginning (left-hand side) of A. For example, ltrim(' foobar ') results in 'foobar ' |
string | rtrim(string A) | returns the string resulting from trimming spaces from the end (right-hand side) of A. For example, rtrim(' foobar ') results in ' foobar' |
string | regexp_replace(string A, string B, string C) | returns the string resulting from replacing all substrings in A that match the Java regular expression B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb' |
int | size(Map<K, V>) | returns the number of elements in the map type |
int | size(Array<T>) | returns the number of elements in the array type |
value of <type> | cast(<expr> as <type>) | converts the result of the expression expr to <type> |
string | from_unixtime(int unixtime) | converts the number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00" |
string | to_date(string timestamp) | returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" |
int | year(string date) | returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 |
int | month(string date) | returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11 |
int | day(string date) | returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1 |
string | get_json_object(string json_string, string path) | extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. It returns NULL if the input JSON string is invalid. |
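Several of the functions above can be checked on literals. The statements below are a sketch, assuming a Hive version that accepts SELECT without a FROM clause; the JSON document and its keys are made up for illustration.

```sql
-- String functions: the results are 'foobar', 'bar', 'ba' and 'FOOBAR'.
-- Note that substr positions start at 1.
SELECT concat('foo', 'bar'), substr('foobar', 4), substr('foobar', 4, 2), upper('fOoBaR');

-- Every match of the regular expression is replaced by the empty string, leaving 'fb'.
SELECT regexp_replace('foobar', 'oo|ar', '');

-- Date parts taken from date and timestamp strings: 1970, 11 and 1.
SELECT year('1970-01-01 00:00:00'), month('1970-11-01'), day('1970-11-01');

-- Follow the path $.store.fruit into a hypothetical JSON document; the result is apple.
SELECT get_json_object('{"store": {"fruit": "apple"}}', '$.store.fruit');
```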
Return Type | Aggregation Function Name (Signature) | Description |
BIGINT | count(*), count(expr), count(DISTINCT expr[, expr...]) | count(*): returns the total number of retrieved rows, including rows containing NULL values; count(expr): returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]): returns the number of rows for which the supplied expression(s) are unique and non-NULL. |
DOUBLE | sum(col), sum(DISTINCT col) | returns the sum of the elements in the group or the sum of the distinct values of the column in the group |
DOUBLE | avg(col), avg(DISTINCT col) | returns the average of the elements in the group or the average of the distinct values of the column in the group |
DOUBLE | min(col) | returns the minimum value of the column in the group |
DOUBLE | max(col) | returns the maximum value of the column in the group |
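Aggregate functions are normally combined with GROUP BY. The query below is a sketch against a hypothetical table named page_views with the userid and page_url columns described earlier; the table name and the output aliases are assumptions made for illustration.

```sql
-- For each page: total views, views with a known user, and distinct users.
SELECT page_url,
       count(*)               AS views,
       count(userid)          AS views_with_user,
       count(DISTINCT userid) AS unique_users
FROM page_views
GROUP BY page_url;
```

The three counts differ only when userid contains NULL values or repeats within a group, which mirrors the three count variants in the table.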
Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. These operations are:
The original text follows.
Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware.
Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides SQL which enables users to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality to do custom analysis, such as User Defined Functions (UDFs).
Hive is not designed for online transaction processing. It is best used for traditional data warehousing tasks.
For details on setting up Hive, HiveServer2, and Beeline, please refer to the GettingStarted guide.
Books about Hive lists some books that may also be helpful for getting started with Hive.
In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with the help of some examples.
In order of granularity, Hive data is organized into:
timestamp: an INT that corresponds to the UNIX timestamp of when the page was viewed.
userid: a BIGINT that identifies the user who viewed the page.
page_url: a STRING that captures the location of the page.
referer_url: a STRING that captures the location of the page from which the user arrived at the current page.
IP: a STRING that captures the IP address from which the page request was made.
Note that tables need not be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.
Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.
The types are organized in the following hierarchy (where the parent is a supertype of all its child instances):
This type hierarchy defines how types are implicitly converted in the query language. Implicit conversion is allowed from a child type to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.
Explicit type conversion can be done using the cast operator, as shown in the Built-in Functions section below.
Complex types can be built up from primitive types and other composite types using structs, maps, and arrays.
Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise the following fields:
The operators and functions listed below are not necessarily up to date. (Hive Operators and UDFs has more current information.) In Beeline or the Hive CLI, use these commands to show the latest documentation:
SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;
Case-insensitive
All Hive keywords are case-insensitive, including the names of Hive operators and functions.
Relational Operator | Operand types | Description |
A = B | all primitive types | TRUE if expression A is equivalent to expression B; otherwise FALSE |
A != B | all primitive types | TRUE if expression A is not equivalent to expression B; otherwise FALSE |
A < B | all primitive types | TRUE if expression A is less than expression B; otherwise FALSE |
A <= B | all primitive types | TRUE if expression A is less than or equal to expression B; otherwise FALSE |
A > B | all primitive types | TRUE if expression A is greater than expression B; otherwise FALSE |
A >= B | all primitive types | TRUE if expression A is greater than or equal to expression B; otherwise FALSE |
A IS NULL | all types | TRUE if expression A evaluates to NULL; otherwise FALSE |
A IS NOT NULL | all types | FALSE if expression A evaluates to NULL; otherwise TRUE |
A LIKE B | strings | TRUE if string A matches the SQL simple regular expression B; otherwise FALSE. The comparison is done character by character. The _ character in B matches any single character in A (similar to . in POSIX regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in POSIX regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE, whereas 'foobar' LIKE 'foo___' and 'foobar' LIKE 'foo%' both evaluate to TRUE. |
A RLIKE B | strings | NULL if A or B is NULL; TRUE if any (possibly empty) substring of A matches the Java regular expression B (see Java regular expressions syntax); otherwise FALSE. For example, 'foobar' RLIKE 'foo' evaluates to TRUE, and so does 'foobar' RLIKE '^f.*r$'. |
A REGEXP B | strings | Same as RLIKE |
Arithmetic Operators | Operand types | Description |
A + B | all number types | Gives the result of adding A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. For example, since every integer is a float, float is a containing type of integer, so the + operator on a float and an int results in a float. |
A - B | all number types | Gives the result of subtracting B from A. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
A * B | all number types | Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operands to a type higher in the type hierarchy. |
A / B | all number types | Gives the result of dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division. |
A % B | all number types | Gives the remainder resulting from dividing A by B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
A & B | all number types | Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
A | B | all number types | Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
A ^ B | all number types | Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. |
~A | all number types | Gives the result of bitwise NOT of A. The type of the result is the same as the type of A. |
Logical Operators | Operand types | Description |
A AND B | boolean | TRUE if both A and B are TRUE; otherwise FALSE |
A && B | boolean | Same as A AND B |
A OR B | boolean | TRUE if either A or B or both are TRUE; otherwise FALSE |
A || B | boolean | Same as A OR B |
NOT A | boolean | TRUE if A is FALSE; otherwise FALSE |
!A | boolean | Same as NOT A |
Operator | Operand types | Description |
A[n] | A is an Array and n is an int | returns the nth element in the array A. The first element has index 0. For example, if A is an array comprising ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar' |
M[key] | M is a Map | returns the value corresponding to the key in the map. For example, if M is a map comprising {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar' |
S.x | S is a struct | returns the x field of S. For example, for struct foobar {int foo, int bar}, foobar.foo returns the integer stored in the foo field of the struct. |
Return Type | Function Name (Signature) | Description |
BIGINT | round(double a) | returns the rounded BIGINT value of the double |
BIGINT | floor(double a) | returns the maximum BIGINT value that is equal to or less than the double |
BIGINT | ceil(double a) | returns the minimum BIGINT value that is equal to or greater than the double |
double | rand(), rand(int seed) | returns a random number (that changes from row to row). Specifying the seed makes the generated random number sequence deterministic. |
string | concat(string A, string B,...) | returns the string resulting from concatenating B after A. For example, concat('foo', 'bar') results in 'foobar'. This function accepts an arbitrary number of arguments and returns the concatenation of all of them. |
string | substr(string A, int start) | returns the substring of A starting from the start position to the end of string A. For example, substr('foobar', 4) results in 'bar' |
string | substr(string A, int start, int length) | returns the substring of A starting from the start position with the given length. For example, substr('foobar', 4, 2) results in 'ba' |
string | upper(string A) | returns the string resulting from converting all characters of A to upper case. For example, upper('fOoBaR') results in 'FOOBAR' |
string | ucase(string A) | Same as upper |
string | lower(string A) | returns the string resulting from converting all characters of A to lower case. For example, lower('fOoBaR') results in 'foobar' |
string | lcase(string A) | Same as lower |
string | trim(string A) | returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar' |
string | ltrim(string A) | returns the string resulting from trimming spaces from the beginning (left-hand side) of A. For example, ltrim(' foobar ') results in 'foobar ' |
string | rtrim(string A) | returns the string resulting from trimming spaces from the end (right-hand side) of A. For example, rtrim(' foobar ') results in ' foobar' |
string | regexp_replace(string A, string B, string C) | returns the string resulting from replacing all substrings in A that match the Java regular expression B (see Java regular expressions syntax) with C. For example, regexp_replace('foobar', 'oo|ar', '') returns 'fb' |
int | size(Map<K, V>) | returns the number of elements in the map type |
int | size(Array<T>) | returns the number of elements in the array type |
value of <type> | cast(<expr> as <type>) | converts the result of the expression expr to <type> |
string | from_unixtime(int unixtime) | converts the number of seconds from the UNIX epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone, in the format "1970-01-01 00:00:00" |
string | to_date(string timestamp) | returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = "1970-01-01" |
int | year(string date) | returns the year part of a date or a timestamp string: year("1970-01-01 00:00:00") = 1970, year("1970-01-01") = 1970 |
int | month(string date) | returns the month part of a date or a timestamp string: month("1970-11-01 00:00:00") = 11, month("1970-11-01") = 11 |
int | day(string date) | returns the day part of a date or a timestamp string: day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1 |
string | get_json_object(string json_string, string path) | extracts a JSON object from a JSON string based on the JSON path specified, and returns the JSON string of the extracted object. It returns NULL if the input JSON string is invalid. |
Return Type | Aggregation Function Name (Signature) | Description |
BIGINT | count(*), count(expr), count(DISTINCT expr[, expr...]) | count(*): returns the total number of retrieved rows, including rows containing NULL values; count(expr): returns the number of rows for which the supplied expression is non-NULL; count(DISTINCT expr[, expr]): returns the number of rows for which the supplied expression(s) are unique and non-NULL. |
DOUBLE | sum(col), sum(DISTINCT col) | returns the sum of the elements in the group or the sum of the distinct values of the column in the group |
DOUBLE | avg(col), avg(DISTINCT col) | returns the average of the elements in the group or the average of the distinct values of the column in the group |
DOUBLE | min(col) | returns the minimum value of the column in the group |
DOUBLE | max(col) | returns the maximum value of the column in the group |
Hive's SQL provides the basic SQL operations. These operations work on tables or partitions. These operations are: