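As the preceding sections showed, implementing a UDF, UDTF, or UDAF is not trivial and requires a fair command of Java, whereas Hive was designed to be convenient for people who do not write Java. Hive therefore offers another way to process data, Streaming, which needs no Java code and supports scripts in many languages. The trade-off is that Streaming is usually slower than an equivalent UDF or a custom InputFormat, because serializing data into the pipe and deserializing it on the way out is generally inefficient. Streaming programs are also hard to debug by the usual means.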
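Hive provides several syntaxes for Streaming, including MAP(), REDUCE(), and TRANSFORM().
Note, however, that MAP() does not actually force the streaming step to run in the mapper phase, just as REDUCE() does not force it to run in the reducer phase. For the same functionality, TRANSFORM() is therefore generally recommended, because it avoids this confusion.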
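Streaming is built from the TRANSFORM() function and the USING keyword: the arguments of TRANSFORM() are the table columns to send to the script, and USING names the script or command to run. The data in this section is still the employee table used in Hive UDF Tutorial (Part 1).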
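To make the TRANSFORM()/USING contract concrete, here is a minimal local sketch in Python. It is not Hive's implementation. It assumes the default behavior in which the selected columns reach the command as tab-separated text, one row per line, on standard input, and the command's standard output is parsed back the same way. The names `transform` and `IDENTITY` are hypothetical helpers introduced only for this illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for '/bin/cat' so the sketch runs on any host:
# it copies standard input to standard output unchanged.
IDENTITY = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]


def transform(rows, command):
    """Emulate TRANSFORM(cols) USING command for a list of row tuples.

    Assumption: each row is sent as tab-separated text terminated by a
    newline, and every output line is split on tabs into string columns.
    """
    payload = "".join(
        "\t".join(str(col) for col in row) + "\n" for row in rows
    )
    result = subprocess.run(
        command,
        input=payload,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # exchange text rather than bytes
        check=True,               # raise if the command exits with an error
    )
    return [tuple(line.split("\t")) for line in result.stdout.splitlines()]


if __name__ == "__main__":
    sample = [("John Doe", 100000.0), ("Mary Smith", 80000.0)]
    for name, salary in transform(sample, IDENTITY):
        print(name, salary)
```

Every column comes back as a string, which is one reason the pipe round trip costs more than a UDF that works on typed values directly.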
Example 1: Streaming with a Linux command
First, Streaming can query a table by calling the Linux cat command directly. cat.q is a HiveQL file with the following content:
SELECT TRANSFORM(e.name, e.salary) USING '/bin/cat' AS name, salary FROM employee e;
hive (mydb)> SOURCE cat.q;
OK
Time taken: 0.044 seconds
Query ID = root_20160120000909_2de2d4f9-b50c-4ed1-a876-768c0127f067
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453275977382_0001, Tracking URL = http://master:8088/proxy/application_1453275977382_0001/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453275977382_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-20 00:10:16,258 Stage-1 map = 0%, reduce = 0%
2016-01-20 00:10:22,942 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.12 sec
MapReduce Total cumulative CPU time: 1 seconds 120 msec
Ended Job = job_1453275977382_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.12 sec HDFS Read: 1040 HDFS Write: 139 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 120 msec
OK
John Doe	100000.0
Mary Smith	80000.0
Todd Jones	70000.0
Bill King	60000.0
Boss Man	200000.0
Fred Finance	150000.0
Stacy Accountant	60000.0
Time taken: 24.758 seconds, Fetched: 7 row(s)
Example 2: Streaming with a Python script
Next, compare Hive's built-in sum() function with a Python script, sum.py, that does the same job. First, the built-in sum():
hive (mydb)> SELECT sum(salary) FROM employee;
Query ID = root_20160120012525_1abf156b-d44b-4f1c-b2c2-3604e4c1bba0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453281391968_0002, Tracking URL = http://master:8088/proxy/application_1453281391968_0002/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453281391968_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-20 01:25:20,364 Stage-1 map = 0%, reduce = 0%
2016-01-20 01:25:31,620 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.55 sec
2016-01-20 01:25:42,394 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.73 sec
MapReduce Total cumulative CPU time: 2 seconds 730 msec
Ended Job = job_1453281391968_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.73 sec HDFS Read: 1040 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 730 msec
OK
720000.0
Time taken: 33.891 seconds, Fetched: 1 row(s)
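Now the Streaming version. The sum.py script:
#!/usr/bin/env python
# Sum the single numeric column that Hive streams in on stdin.
import sys

total = 0.0


def add(value):
    # Accumulate into the module-level running total.
    global total
    total += value


if __name__ == "__main__":
    for line in sys.stdin:
        add(float(line))
    # print(...) with a single argument works on Python 2 and Python 3.
    print(total)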
The query file sum.q calls the script:
SELECT TRANSFORM(salary) USING 'python /root/experiment/hive/sum.py' AS total FROM employee;
hive> source sum.q;
OK
Time taken: 0.022 seconds
Query ID = root_20160120002626_0ced0b93-e4e8-4f3a-91d0-f2aaa06b5f11
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453278047512_0002, Tracking URL = http://master:8088/proxy/application_1453278047512_0002/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453278047512_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2016-01-20 00:26:28,341 Stage-1 map = 0%, reduce = 0%
2016-01-20 00:26:36,185 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.4 sec
MapReduce Total cumulative CPU time: 1 seconds 400 msec
Ended Job = job_1453278047512_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.4 sec HDFS Read: 1040 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 400 msec
OK
720000.0
Time taken: 17.048 seconds, Fetched: 1 row(s)
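The log above reports one mapper and no reducer, so a single copy of sum.py sees every salary and prints one total. The sketch below, a local illustration rather than anything Hive runs, shows what that implies: each script instance only sums the rows it receives, so a map-only job with several mappers would presumably return one partial total per mapper. The helper names `stream_sum` and `run_per_mapper` and the two-way split are assumptions made for the example; the salary values are the ones listed in Example 1.

```python
# Salaries of the employee table, as the text lines a script would receive.
SALARIES = [
    "100000.0\n", "80000.0\n", "70000.0\n", "60000.0\n",
    "200000.0\n", "150000.0\n", "60000.0\n",
]


def stream_sum(lines):
    """Mimic one run of sum.py: add up every line this instance receives."""
    total = 0.0
    for line in lines:
        total += float(line)  # float() ignores the trailing newline
    return total


def run_per_mapper(splits):
    """Run one independent script instance per input split (hypothetical)."""
    return [stream_sum(split) for split in splits]


if __name__ == "__main__":
    # One mapper, as in the log: a single total.
    print(run_per_mapper([SALARIES]))                    # [720000.0]
    # Two mappers (assumed split): two partial totals, not one sum.
    print(run_per_mapper([SALARIES[:3], SALARIES[3:]]))  # [250000.0, 470000.0]
```

The built-in sum() avoids this because its reduce stage combines the partial results, which is also why its job shows one reducer while the streaming job shows none.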
Example 3: WordCount with Streaming
To close this section, here is WordCount implemented with Hive Streaming. First, the docs table:
hive (mydb)> SELECT * FROM docs;
OK
hello world
hello hadoop
hello spark
Time taken: 0.044 seconds, Fetched: 3 row(s)
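The mapper script, wc_mapper.py:
#!/usr/bin/env python
# Emit "<word>\t1" for every word of every input line.
import sys


def splitWord(row):
    # split() with no argument also skips repeated or trailing whitespace.
    for word in row.split():
        print("%s\t1" % word)


if __name__ == "__main__":
    for line in sys.stdin:
        splitWord(line)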
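The reducer script, wc_reducer.py:
#!/usr/bin/env python
# Add up the counts of consecutive identical keys.
# The input must arrive grouped by key, which CLUSTER BY provides.
import sys

(lastKey, lastCount) = (None, 0)
for line in sys.stdin:
    (key, count) = line.strip().split("\t")
    if lastKey and lastKey != key:
        # The key changed: emit the finished word and start a new count.
        print("%s\t%d" % (lastKey, lastCount))
        (lastKey, lastCount) = (key, int(count))
    else:
        lastKey = key
        lastCount += int(count)
# Emit the last word once the input is exhausted.
if lastKey:
    print("%s\t%d" % (lastKey, lastCount))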
The HiveQL file wc.q creates the result table and chains the two scripts:
CREATE TABLE IF NOT EXISTS wordcount(
    word STRING,
    count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

FROM (
    FROM docs
    SELECT TRANSFORM(line) USING 'python /root/experiment/hive/wc_mapper.py'
    AS word, count
    CLUSTER BY word
) wc
INSERT OVERWRITE TABLE wordcount
SELECT TRANSFORM(wc.word, wc.count) USING 'python /root/experiment/hive/wc_reducer.py'
AS words, counts;
hive (mydb)> SOURCE wc.q;
OK
Time taken: 0.022 seconds
OK
Time taken: 0.066 seconds
Query ID = root_20160120013535_c6e957a9-1981-475a-b21a-e73576df6a99
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453281391968_0003, Tracking URL = http://master:8088/proxy/application_1453281391968_0003/
Kill Command = /root/install/hadoop-2.4.1/bin/hadoop job -kill job_1453281391968_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-20 01:35:53,691 Stage-1 map = 0%, reduce = 0%
2016-01-20 01:36:00,339 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.15 sec
2016-01-20 01:36:08,961 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.98 sec
MapReduce Total cumulative CPU time: 2 seconds 980 msec
Ended Job = job_1453281391968_0003
Loading data to table mydb.wordcount
Table mydb.wordcount stats: [numFiles=1, numRows=4, totalSize=33, rawDataSize=29]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.98 sec HDFS Read: 260 HDFS Write: 103 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 980 msec
OK
Time taken: 25.652 seconds

hive (mydb)> SELECT * FROM wordcount;
OK
hadoop	1
hello	3
spark	1
world	1
Time taken: 0.047 seconds, Fetched: 4 row(s)
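The whole WordCount pipeline can be traced locally with a small Python sketch. It is an emulation under stated assumptions, not what Hive executes: `wc_map` and `wc_reduce` restate the logic of the two scripts as generators, and a plain sort stands in for CLUSTER BY word, which distributes and sorts rows by word so that equal words reach the reducer script next to each other. The function names and the `cluster` flag are hypothetical.

```python
DOCS = ["hello world", "hello hadoop", "hello spark"]


def wc_map(lines):
    """Same idea as wc_mapper.py: emit (word, 1) for every word."""
    for line in lines:
        for word in line.split():
            yield (word, 1)


def wc_reduce(pairs):
    """Same idea as wc_reducer.py: sum counts of consecutive equal keys."""
    last_key, last_count = None, 0
    for key, count in pairs:
        if last_key is not None and key != last_key:
            yield (last_key, last_count)
            last_count = 0
        last_key = key
        last_count += count
    if last_key is not None:
        yield (last_key, last_count)


def word_count(lines, cluster=True):
    """Chain mapper and reducer; sorting stands in for CLUSTER BY word."""
    pairs = list(wc_map(lines))
    if cluster:
        pairs.sort(key=lambda pair: pair[0])
    return list(wc_reduce(pairs))


if __name__ == "__main__":
    # Matches the wordcount table: hadoop 1, hello 3, spark 1, world 1.
    print(word_count(DOCS))
    # Without the sort, "hello" is emitted three times with count 1.
    print(word_count(DOCS, cluster=False))
```

The second call shows why the query needs CLUSTER BY: the reducer only compares each key with the previous one, so ungrouped input yields duplicate words with partial counts.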