关于Hadoop和Hive的安装,可参考我们公司一位Hadoop牛人写的Hadoop一键安装(里面包含了Hive的安装)
https://github.com/hadoop-deployer/hadoop-deployer
Impala 号称在性能上比Hive高出3~30倍,甚至预言说在将来的某一天可能会超过Hive的使用率而成为Hadoop上最流行的实时计算平台(也许我这里有点曲解Impala专家的意思,但其诱惑的言辞足以令Hadoop迷不禁有蠢蠢欲试的激动)。毕竟Impala也是人写出来的,是否真的如想象中的快,还得靠客观数据来验证。下面就这两个星期对Impala的认识小记一下,供日后翻阅。(请原谅我没有告诉你Hadoop是个啥东东,因为我这里假设你已经听过这头在海量数据的世界驰骋几个岁月的大象,但不一定要求你是大牛)
以下内容是对Cloudera官网中关于Impala文档(主要是《Installing and Using Cloudera Impala》)一些内容的个人理解,欠妥之处还请不吝赐教:
Impala的目的不在于替换现有的MapReduce工具,如Hive,而是提供一个统一的平台用于实时查询。事实上Impala的运行也是依赖Hive的元数据。Impala与其它组件之间的关系如下:
与Hive类似,Impala也可以直接与HDFS和HBase库直接交互。只不过Hive和其它建立在MapReduce上的框架适合需要长时间运行的批处理任务。例如那些批量提取,转化,加载(ETL)类型的Job。而Impala主要用于实时查询。
对应进程为 statestored (笔者这里使用的Impala版本为0.4,有些版本的statestore进程名可能不是这样叫的)
用于协调各个运行impalad的实例之间的信息关系,Impala正是通过这些信息去定位查询请求所要的数据。换句话说,state store的作用主要为跟踪各个impalad实例的位置和状态,让各个impalad实例以集群的方式运行起来。
与 HDFS的NameNode不一样,虽然State Store一般只安装一份,但一旦State Store挂掉了,各个impalad实例却仍然会保持集群的方式处理查询请求,只是无法将各自的状态更新到State Store中,如果这个时候新加入一个impalad实例,则新加入的impalad实例不为现有集群中的其他impalad实例所识别(事实上,经笔者测试,如果impalad启动在statestored之后,根本无法正常启动,因为impalad启动时是需要指定statestored的主机信息的)。然而,State Store一旦重启,则所有State Store所服务的各个impalad实例(包括state store挂掉期间新加入的impalad实例)的信息(由impalad实例发给state store)都会进行重建。
对应进程为 impalad(核心进程,数据的计算就靠这个进程来执行)
该进程应运行在DataNode机器上(建议每个DataNode机器运行一个impalad,官方的意思似乎这种建议是必须的),每个impalad实例会接收、规划并调节来自ODBC或Impala Shell等客户端的查询。每个impalad实例会充当一个Worker,处理由其它impalad实例分发出来的查询片段(query fragments)。客户端可以随便连接到任意一个impalad实例,被连接的impalad实例将充当本次查询的协调者(Ordinator),将查询分发给集群内的其它impalad实例进行并行计算。当所有计算完毕时,其它各个impalad实例将会把各自的计算结果发送给充当 Ordinator的impalad实例,由这个Ordinator实例把结果返回给客户端。每个impalad进程可以处理多个并发请求。
这里介绍使用rpm包安装的方式(需有root或sudo权限),基于源码包安装的方式待后续折腾。
impala能使用的内存无法超过系统的硬件可用内存(GA版,查询需要的内存如果超出硬件内存,则查询将失败),对内存要求高,典型的硬件内存为:32~48G
impala(版本0.4)只支持redhat 5.7/centos 5.7或redhat 6.2/centos 6.2以上(好像还要求是64位的,所以建议安装在64位系统上),不支持ubuntu
假设你已经安装了CDH4(即Hadoop 2.0)
假设你已经安装了Hive,并配置一个外部数据库(如MySQL)供Hive存储元数据。可通过执行下面的命令来判断Hive是否安装正常
$ hive
hive> show tables;
OK
Time taken: 2.809 seconds
这里请原谅我没有提到Hadoop和Hive的安装过程,还请尊驾自行搜索。
Impala不支持的特性:
下载Impala的yum repository (考虑到内存和性能问题,如果机器数允许,建议Impalad实例不要跟NameNode运行在同一台机,但却需与DataNode安装在同一台机,以免影响Impala整体性能)。因Impala的rpm包比较大(v0.4版约90M),且需要在多部机器上安装,故建议直接下载rpm包,然后通过rpm -ivh的方式安装。这里给出rpm包的地址:http://beta.cloudera.com/impala/redhat/6/x86_64/impala/0/RPMS/x86_64/。
rpm包如下(发现Impala的版本已经更新到v0.5了,但本文的测试结果还是还是基于Impala v0.4的):
文件名 | 更新时间 | 包大小 |
---|---|---|
impala-0.5-1.p0.491.el6.x86_64.rpm | 01-Feb-2013 20:10 | 97M |
impala-debuginfo-0.5-1.p0.491.el6.x86_64.rpm | 01-Feb-2013 20:10 | 75M |
impala-server-0.5-1.p0.491.el6.x86_64.rpm | 01-Feb-2013 20:10 | 4.2K |
impala-shell-0.5-1.p0.491.el6.x86_64.rpm | 01-Feb-2013 20:10 | 450K |
impala-state-store-0.5-1.p0.491.el6.x86_64.rpm | 01-Feb-2013 20:10 | 4.3K |
其中,除了 impala-debuginfo-0.5-1.p0.491.el6.x86_64.rpm 可以不下载之外,其它几个包都是必须的,尤其是 impala-0.5-1.p0.491.el6.x86_64.rpm ,这里对各个包的作用稍微说明一下:
如果你选择用yum的方式来安装,则请将下面的repo文件拷贝到/etc/yum.repos.d/ 目录下
文件:cloudera-impala.repo
[cloudera-impala]
name=Impala
baseurl=http://beta.cloudera.com/impala/redhat/6/x86_64/impala/0/
gpgkey = http://beta.cloudera.com/impala/redhat/6/x86_64/impala/RPM-GPG-KEY-cloudera
gpgcheck = 1
如果你非要选择yum的方式安装,请执行以下相关命令(这里假设你有sudo权限,不建议用该方式,除非你的repo库是在内网。当然这种方式也有个好处,它会自动安装一些依赖包):
进入Impala安装目录,默认为/usr/lib/impala(可通过rpm -ql impala查看),创建目录conf如果不存在的话。这里创建conf目录是为了存放impalad的配置文件,impalad的配置文件路径由环境变量IMPALA_CONF_DIR指定,默认为/usr/lib/impala/conf。
拷贝hive-site.xml、core-site.xml、hdfs-site.xml(只需从Hadoop和Hive配置文件目录中拷贝过来)至/usr/lib/impala/conf目录下(假设 impalad的配置文件路径为/usr/lib/impala/conf),并作下面修改(这些修改据官方文档,说是为了优化Impala性能,但具体效果如何,笔者目前尚未测出):
在core-site.xml文件中添加如下内容(如果不存在的话):
1
2
3
4
5
6
7
8
9
10
|
<
property
>
<
name
>
dfs
.
client
.
read
.
shortcircuit
<
/
name
>
<
value
>
true
<
/
value
>
<
/
property
>
<
property
>
<
name
>
dfs
.
client
.
read
.
shortcircuit
.
skip
.
checksum
<
/
name
>
<
value
>
false
<
/
value
>
<
/
property
>
|
在hdfs-site.xml文件中添加如下内容(如果不存在的话):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
|
<
property
>
<
name
>
dfs
.
datanode
.
data
.
dir
.
perm
<
/
name
>
<
value
>
755
<
/
value
>
<
/
property
>
<
property
>
<
name
>
dfs
.
block
.
local
-
path
-
access
.
user
<
/
name
>
<
value
>
hadoop
<
/
value
>
<
/
property
>
<
property
>
<
name
>
dfs
.
datanode
.
hdfs
-
blocks
-
metadata
.
enabled
<
/
name
>
<
value
>
true
<
/
value
>
<
/
property
>
|
这里需要提一点的是,如果你用的是hadoop 2.0(即CDH4,虽然官方也称Impala必须得CDH4以上)的HA方式配置NameNode,则Impala的core-site.xml(注意,只有Impala的core-site.xml才需作修改,Hadoop的core-site.xml配置文件不用改)还需作以下修改:
将原来(以NameNode的HA方式配置,其中mycluster代表某个NameService)
1
2
3
4
5
|
<
property
>
<
name
>
fs
.
defaultFS
<
/
name
>
<
value
>
hdfs
:
//mycluster</value>
<
/
property
>
|
改为(以NameNode非HA方式配置,即指定某个具体的NameNode主机信息)
1
2
3
4
5
|
<
property
>
<
name
>
fs
.
defaultFS
<
/
name
>
<
value
>
hdfs
:
//192.168.22.30:12900</value>
<
/
property
>
|
这里配置的NameNode主机信息要求与impalad实例启动时指定的-nn=namenode_host -nn_port=namenode_port参数的信息一致。从这里也初步怀疑Impala目前可能尚不支持NameNode的HA配置(到底是不是如此,还请高人赐教)。
下面为笔者安装后的机器(共5台,节点越多,也许越能测出更有价值的性能数字)及服务(安装中发现impalad的启动似乎需要依赖Hive,所以每台启动impalad实例的机器都需安装Hive,这关系有点诡异):
主机 | 通过jps命令查看到的服务 | 其它服务 |
---|---|---|
192.168.22.30 | JournalNode QuorumPeerMain NodeManager NameNode ResourceManager DataNode DFSZKFailoverController | Hive statestored impalad |
192.168.22.31 | NameNode QuorumPeerMain NodeManager DFSZKFailoverController JournalNode DataNode | Hive impalad |
192.168.22.32 | NodeManager JournalNode DataNode QuorumPeerMain | Hive impalad |
192.168.22.33 | DataNode JournalNode QuorumPeerMain NodeManager | Hive impalad |
192.168.22.34 | QuorumPeerMain JournalNode DataNode NodeManager | Hive impalad |
从上面可见,笔者在5台机器中都启动了impalad实例,而只有192.168.22.30那台机器启动了statestored。你可能会问,为何没看到启动impala-shell客户端的机器,那是因为笔者决定impala-shell随便装在那台机器都可以,只要能连接到上面启动impalad实例的机器便可,故这里没列出。
以下为impala服务启动命令:
先启动statestored(默认端口为24000):
statestored -state_store_port=24000
再启动impalad实例:
HADOOP_CONF_DIR=”/usr/lib/impala/conf” impalad -state_store_host=192.168.22.30 -nn=192.168.22.30 -nn_port=12900 -hosame=192.168.22.34 -ipaddress=192.168.22.34
注意:
这里需要加上HADOOP_CONF_DIR,否则在impala查询数据,可能会报类似 Wrong FS 。。。expect 。。。的错误
其中的-nn和-nn_port,表示NameNode的主机和端口,因Hadoop 2以上的版本对NameNode采用HA的方式,对外提供NameService而不是某个具体的NameNode,然而这里impalad启动时却依然需要知道某个具体的NameNode的主机和端口,怀疑Impala目前尚不支持Hadoop的NameNode的HA方式。
在第一次启动impalad的时候,你可能会遇到impalad报类似找不到JDBC数据库驱动(假设为MYSQL)的问题,其实是因为impalad默认使用的数据库驱动包的位置为:/usr/share/java/mysql-connector-java.jar,该配置默认由/etc/default/impala文件中的MYSQL_CONNECTOR_JAR项指定,读者可在~/.bash_profile文件中修改为自己的驱动文件路径,如下为笔者在~/.bash_profile中添加的项:
export MYSQL_CONNECTOR_JAR=$HOME/hive/lib/mysql-connector-java-5.1.16-bin.jar:$MYSQL_CONNECTOR_JAR
关于impala服务的启动参数,请参见下表:
Argument | Description | Notes | Required? |
---|---|---|---|
-ipaddress | The IP address for the machine that will host Impalad. While there is a default for this argument, it is important to provide a value other than 127.0.0.1 for good performance. To use the local host, provide the local host’s actual IP address. | Default: 127.0.0.1. | Yes |
-state_store_host | The Impala state store host name. | Default: 127.0.0.1. | Yes |
-state_store_port | The Impala state store port. | Default: 24000. | No |
-nn | The HDFS NameNode hostname or IP address. | For example, MyNameNode. Default: 127.0.0.1. | Yes |
-nn_port | The NameNode port. | Default: 20500. | Yes |
-be_port | Impala’s internal service port. | Default: 22000. | No |
-fe_port | Impala’s front end port for external connections. | Default: 21000. | No |
-log_filename | The path to and name of the file that impala will use to store logging information. | Yes | |
-webserver_interface | The network interface the debugging web server uses. | Default: 0.0.0.0. | No |
-webserver_port | The port the debugging web server uses. | Default: 25000. | No |
-web_log_bytes | The maximum number of bytes to display on a debug web server’s log page. | Default: 1048576 | No |
本次测试中,hdfs中存有文件大小为20G,并已装载到了表mytest_impala中,可通过Hive来装载。
在进入性能测试比较前,先简要介绍一下impala-shell的使用。首先确保你已经有一台机器安装了impala-shell客户端。
$ impala-shell
得到下面的Welcome信息:
Welcome to the Impala shell. Press TAB twice to see a list of available commands. Copyright (c) 2012 Cloudera, Inc. All rights reserved. (Build version: Impala v0.1 (cf57fd9) built on Thu Sep 27 10:32:13 PDT 2012)
[Not connected] >
正如上面提示,可通过敲击两次的TAB键来查看impala-shell目前支持的命令:
connect explain history quit select shell use
describe help insert refresh set show version
[Not connected] >
从中可见,Impala目前尚不支持表的创建(即CREATE TABLE)
[Not connected] > connect 192.168.22.30:21000
Connected to 192.168.22.30:21000
[192.168.22.30:21000] >
[192.168.22.30:21000] > show tables;
Query: show tables
Query finished, fetching results …
mytest
mytest_2
mytest_impala
Returned 3 row(s) in 0.17s
[192.168.22.30:21000] > select * from mytest_impala limit 1;
Query: select * from mytest_impala limit 1
Query finished, fetching results …
1 2012-06-19 21:18:09 http://book1.sina.cn/prog/wapsite/books/vipchl.php?bid=39922&PHPSESSID=9743b7325413117a25d1efa7975daea7&vt=4&wm=4002
Returned 1 row(s) in 1.57s
[192.168.22.30:21000] >
因Impala支持的SQL语句是Hive的HQL语句的一个子集,也就说Hive中的一些HQL语句在这里同样适用,具体请参考相关文档,这里不再详述。
测试show tables。
[192.168.22.31:21000] > show tables;
Query: show tables
Query finished, fetching results …
mytest
mytest_2
mytest_impala
Returned 3 row(s) in 0.01s
刚开始还以为Hive每次show tables都那么慢,当执行第二次时才发现其实不然。
hive> show tables;
OK
mytest
mytest_2
mytest_impala
Time taken: 2.785 seconds
hive> show tables;
OK
mytest
mytest_2
mytest_impala
Time taken: 0.103 seconds
测试select count(*) from mytest_impala
[192.168.22.31:21000] > select count(*) from mytest_impala;
Query: select count(*) from mytest_impala
Query finished, fetching results …
69007188
Returned 1 row(s) in 106.58s
hive> select count(*) from mytest_impala;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_1361238384421_0001, Tracking URL = http://192.168.22.30:12088/proxy/application_1361238384421_0001/
Kill Command = /home/zhengzn/hadoop/bin/hadoop job -Dmapred.job.tracker=ignorethis -kill job_1361238384421_0001
Hadoop job information for Stage-1: number of mappers: 44; number of reducers: 1
2013-02-20 12:11:33,026 Stage-1 map = 0%, reduce = 0%
…
2013-02-20 12:13:45,502 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 256.1 sec
MapReduce Total cumulative CPU time: 4 minutes 16 seconds 100 msec
Ended Job = job_1361238384421_0001
MapReduce Jobs Launched:
Job 0: Map: 44 Reduce: 1 Cumulative CPU: 256.1 sec HDFS Read: 11393427897 HDFS Write: 9 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 16 seconds 100 msec
OK
69007188
Time taken: 148.285 seconds
测试select count(*) from mytest_impala where id = ’1205-4721599131-fa2451a7′。
[192.168.22.31:21000] > select count(*) from mytest_impala where id = ’1205-4721599131-fa2451a7′;
Query: select count(*) from mytest_impala where id = ’1205-4721599131-fa2451a7′
Query finished, fetching results …
9
Returned 1 row(s) in 96.54s
hive> select count(*) from mytest_impala where id = ’1205-4721599131-fa2451a7′;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_1361238384421_0002, Tracking URL = http://192.168.22.30:12088/proxy/application_1361238384421_0002/
Kill Command = /home/zhengzn/hadoop/bin/hadoop job -Dmapred.job.tracker=ignorethis -kill job_1361238384421_0002
Hadoop job information for Stage-1: number of mappers: 44; number of reducers: 1
2013-02-20 12:46:19,786 Stage-1 map = 0%, reduce = 0%
…
2013-02-20 12:48:00,077 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 295.09 sec
MapReduce Total cumulative CPU time: 4 minutes 55 seconds 90 msec
Ended Job = job_1361238384421_0002
MapReduce Jobs Launched:
Job 0: Map: 44 Reduce: 1 Cumulative CPU: 295.09 sec HDFS Read: 11393427897 HDFS Write: 2 SUCCESS
Total MapReduce CPU Time Spent: 4 minutes 55 seconds 90 msec
OK
9
Time taken: 107.81 seconds
测试select count(*) from mytest_impala group by id;。
[192.168.22.31:21000] > select count(*) from mytest_impala group by id;
Returned 2587674 row(s) in 146.32s
hive> select count(*) from mytest_impala group by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 12
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_1361238384421_0005, Tracking URL = http://192.168.22.30:12088/proxy/application_1361238384421_0005/
Kill Command = /home/zhengzn/hadoop/bin/hadoop job -Dmapred.job.tracker=ignorethis -kill job_1361238384421_0005
Hadoop job information for Stage-1: number of mappers: 44; number of reducers: 12
2013-02-20 17:39:48,627 Stage-1 map = 0%, reduce = 0%
…
2013-02-20 17:41:36,799 Stage-1 map = 100%, reduce = 92%, Cumulative CPU 469.77 sec
Time taken: 155.724 seconds
对于上面的测试结果我们也觉得有些困惑,为何跟Impala专家号称的比Hive快3~30倍差那么远呢,虽然是快了点,但并没有传说中的神速。到底是我们的测试节点不够呢,还是我们的测试方法欠妥,本文就以该问题做结束,留给你我来共同思考验证,期待高手不吝赐教。。。。。。