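The previous chapters covered how to define Hive tables and manipulate their data. This chapter turns to HiveQL queries.
SELECT ... FROM ... Queries
SELECT is the projection operator in SQL. Let's revisit the partitioned table employees that we defined earlier: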
CREATE TABLE employees (
name STRING,
salary FLOAT,
subordinates ARRAY<STRING> COMMENT 'Subordinates',
deductions MAP<STRING,FLOAT> COMMENT 'Deductions',
address STRUCT<street:STRING,city:STRING,state:STRING,zip:INT>
)
PARTITIONED BY(country STRING,state STRING);
A simple SELECT query:
hive> SELECT name,salary FROM employees;
John Doe 100000.0
Mary Smith 80000.0
Todd Jones 70000.0
Bill King 60000.0
You can also give the table, view, or subquery that follows FROM an alias, for example:
hive> SELECT e.name,e.salary FROM employees e;
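The two HiveQL statements above are equivalent. Table aliases are especially useful in JOIN operations.
Next, let's look at how to query the collection-typed columns of the employees table. We start with an ARRAY column, subordinates: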
hive> SELECT name,subordinates FROM employees;
John Doe ["Mary Smith","Todd Jones"]
Mary Smith ["Bill King"]
Todd Jones []
Bill King []
Now a MAP column, deductions:
hive> SELECT name,deductions FROM employees;
John Doe {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Mary Smith {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
Todd Jones {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
Bill King {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
And a STRUCT column, address:
hive> SELECT name,address FROM employees;
John Doe {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
Mary Smith {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
Todd Jones {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
Bill King {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}
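Next, let's see how to reference individual elements inside collection columns. ARRAY elements are accessed by zero-based index, MAP values by key, and STRUCT fields with dot notation: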
hive> SELECT name,subordinates[0],deductions["State Taxes"],address.city FROM employees;
John Doe Mary Smith 0.05 Chicago
Mary Smith Bill King 0.05 Chicago
Todd Jones NULL 0.03 Oak Park
Bill King NULL 0.03 Obscuria
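Besides picking out single elements, a query can inspect a collection as a whole. The following is a minimal sketch, assuming the employees table defined above. size, map_keys, and array_contains are Hive built-in functions, while the column aliases are made up for illustration.

```sql
-- Sketch: summarize collection columns instead of indexing into them.
SELECT name,
       size(subordinates)                        AS num_subordinates, -- number of elements in the ARRAY
       map_keys(deductions)                      AS deduction_names,  -- keys of the MAP, returned as an ARRAY
       array_contains(subordinates, 'Bill King') AS manages_bill      -- TRUE if the ARRAY holds this value
FROM employees;
```

As the output above shows for Todd Jones and Bill King, indexing past the end of an array yields NULL rather than an error, so size() is a convenient way to tell an empty array apart from a missing value.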
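Specifying Columns with Regular Expressions
In a Hive query you can use a regular expression to select the columns you want. The regular expression is written between backquotes, not single quotes, which would make it an ordinary string literal. The following example selects the symbol column and every column whose name begins with "price":
hive> SELECT symbol,`price.*` FROM stocks;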
AAPL 195.69 197.88 194.0 194.12 194.12
AAPL 192.63 196.0 190.85 195.46 195.46
AAPL 196.73 198.37 191.57 192.05 192.05
AAPL 195.17 200.2 194.42 199.23 199.23
AAPL 195.91 196.32 193.38 195.86 195.86
...
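One caveat, offered as a hedged note rather than a definitive rule for every installation: in Hive 0.13.0 and later, backquoted names are treated as literal identifiers by default, so the regular-expression form only works when the hive.support.quoted.identifiers property is set to none. A sketch, assuming the same stocks table:

```sql
-- Assumption: Hive 0.13.0 or later, where backquoted names are literal
-- identifiers unless this property is set to none.
SET hive.support.quoted.identifiers=none;

-- Backquotes mark the column specification as a regular expression.
SELECT symbol, `price.*` FROM stocks;
```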
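Computing with Column Values
In HiveQL you can not only select columns from a table, but also compute values with functions and arithmetic expressions. For example, the following query returns each employee's name in upper case, the salary, the federal tax rate, and the salary after federal taxes, rounded to the nearest integer: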
hive> SELECT upper(name),salary,deductions["Federal Taxes"],
> round(salary * (1 - deductions["Federal Taxes"]))
> FROM employees;
JOHN DOE 100000.0 0.2 80000
MARY SMITH 80000.0 0.2 64000
TODD JONES 70000.0 0.15 59500
BILL KING 60000.0 0.15 51000
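Hive is open source software written in Java, and the type conversions applied when functions and arithmetic expressions compute column values follow the same rules as conversions in Java.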
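When the implicit conversion is not what you want, the type can be stated explicitly with CAST. The following is a small sketch, assuming the employees table above; the aliases salary_int and monthly_salary are hypothetical names chosen for illustration.

```sql
SELECT name,
       salary,
       CAST(salary AS INT)   AS salary_int,     -- explicit FLOAT -> INT conversion
       round(salary / 12, 2) AS monthly_salary  -- mixed FLOAT / INT arithmetic; the integer operand is converted implicitly
FROM employees;
```

If a value cannot be converted, for example CAST('abc' AS INT), Hive returns NULL instead of failing the query.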
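Aggregate Functions
Aggregate functions work without any special configuration, but setting the hive.map.aggr property to true can often improve aggregation performance, for example:
hive> SET hive.map.aggr=true;
hive> SELECT count(*),avg(salary) FROM employees;
The trade-off is that setting hive.map.aggr to true uses more memory.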
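Several aggregates can be computed in a single pass over the table. A minimal sketch, assuming the employees table above; count, sum, min, max, and count(DISTINCT ...) are Hive built-ins, and the aliases are illustrative only.

```sql
-- Sketch: several aggregates in one query; the result is a single row.
SELECT count(*)                      AS num_employees,
       count(DISTINCT address.state) AS num_states,    -- distinct values of a STRUCT field
       sum(salary)                   AS total_salary,
       min(salary)                   AS lowest_salary,
       max(salary)                   AS highest_salary
FROM employees;
```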
LIMIT
A typical HiveQL query returns every row that matches, but the LIMIT clause puts an upper bound on the number of rows returned:
hive> SELECT upper(name),salary,deductions["Federal Taxes"],
> round(salary * (1 - deductions["Federal Taxes"]))
> FROM employees
> LIMIT 2;
JOHN DOE 100000.0 0.2 80000
MARY SMITH 80000.0 0.2 64000
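LIMIT on its own does not say which rows come back, only how many. When a specific subset is wanted, such as the highest salaries, it is usually paired with ORDER BY. A sketch, assuming the employees table above:

```sql
-- Sketch: the two highest-paid employees.
SELECT name, salary
FROM employees
ORDER BY salary DESC  -- sort first, so LIMIT keeps the top rows
LIMIT 2;
```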
Column Aliases
hive> SELECT upper(name),salary,deductions["Federal Taxes"] AS
> fed_taxes,round(salary * (1 - deductions["Federal Taxes"])) AS
> salary_minus_fed_taxes
> FROM employees
> LIMIT 2;
JOHN DOE 100000.0 0.2 80000
MARY SMITH 80000.0 0.2 64000
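Subqueries
Column aliases are especially useful for the columns of a subquery. Let's turn the previous example into a subquery. Note that upper(name) needs an alias of its own so that the outer query can refer to it as e.name:
hive> FROM(
> SELECT upper(name) AS name,salary,deductions["Federal Taxes"] AS
> fed_taxes,round(salary * (1 - deductions["Federal Taxes"]))
> AS salary_minus_fed_taxes
> FROM employees
> ) e
> SELECT e.name,e.salary_minus_fed_taxes
> WHERE e.salary_minus_fed_taxes > 70000;
JOHN DOE 80000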
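A nested query of this kind can also be written with a common table expression, which some readers find easier to follow. This is a sketch under the assumption that the Hive version in use supports the WITH clause (Hive 0.13.0 or later); the CTE name e simply mirrors the subquery alias.

```sql
-- Assumption: Hive 0.13.0 or later (WITH clause support).
WITH e AS (
  SELECT upper(name) AS name,
         round(salary * (1 - deductions["Federal Taxes"])) AS salary_minus_fed_taxes
  FROM employees
)
SELECT e.name, e.salary_minus_fed_taxes
FROM e
WHERE e.salary_minus_fed_taxes > 70000;
```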
CASE ... WHEN ... THEN Statements
As in standard SQL, CASE ... WHEN ... THEN is used in the SELECT list to choose a result based on the value of a column. For example:
hive> SELECT name,salary,
> CASE
> WHEN salary < 50000.0 THEN 'low'
> WHEN salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
> WHEN salary >= 70000.0 AND salary < 100000.0 THEN 'high'
> ELSE 'very high'
> END AS bracket FROM employees;
John Doe 100000.0 very high
Mary Smith 80000.0 high
Todd Jones 70000.0 high
Bill King 60000.0 middle
Boss Man 200000.0 very high
Fred Finance 150000.0 very high
Stacy Accountant 60000.0 middle
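Because the WHEN branches are tested in order and the first match wins, the lower bounds in the example above are redundant. The following sketch, assuming the same employees table, produces the same brackets with shorter conditions:

```sql
SELECT name, salary,
       CASE
         WHEN salary <  50000.0 THEN 'low'
         WHEN salary <  70000.0 THEN 'middle'  -- reached only when salary >= 50000.0
         WHEN salary < 100000.0 THEN 'high'    -- reached only when salary >= 70000.0
         ELSE 'very high'
       END AS bracket
FROM employees;
```

The explicit ranges in the original version are arguably more readable. The shorter form depends on the branch order, so reordering the WHEN clauses would change the result.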
WHERE Clauses
SELECT determines which columns are returned, while WHERE determines which rows are returned:
hive> SELECT name,salary,deductions["Federal Taxes"],
> salary * (1 - deductions["Federal Taxes"])
> FROM employees
> WHERE round(salary * (1 - deductions["Federal Taxes"])) >
> 70000;
John Doe 100000.0 0.2 80000.0
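This example has a problem: the expression salary * (1 - deductions["Federal Taxes"]) is written out in both the SELECT clause and the WHERE clause, which is redundant. Can a column alias for salary * (1 - deductions["Federal Taxes"]) remove the duplication? Unfortunately, it cannot: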
hive> SELECT name,salary,deductions["Federal Taxes"],
> salary * (1 - deductions["Federal Taxes"]) AS
> salary_minus_fed_taxes
> FROM employees
> WHERE round(salary_minus_fed_taxes) > 70000;
FAILED: Error in semantic analysis: Line 4:13 Invalid table alias or
column reference 'salary_minus_fed_taxes': (possible column names
are: name,salary,subordinates,deductions,address)
As the error message says, a column alias cannot be referenced in the WHERE clause. Is there another way to remove the duplication? Yes: use a subquery:
hive> SELECT e.* FROM
> (SELECT name,salary,deductions["Federal Taxes"] AS ded,
> salary * (1 - deductions["Federal Taxes"]) AS
> salary_minus_fed_taxes
> FROM employees) e
> WHERE round(salary_minus_fed_taxes) > 70000;