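Original post: http://blog.csdn.net/nsrainbow/article/details/41748863 For the latest lessons, please follow the original author's blog.
Hive gives you a way to query data with SQL, so you can write SQL statements on top of Hadoop. It is best not to use Hive for real-time queries, though. Hive works by translating each SQL statement into multiple MapReduce jobs, so it is very slow; according to the official documentation, Hive is suited to high-latency scenarios, and it is also resource-hungry.
As a simple example, you can run a query like this: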
hive> select * from h_employee;
OK
1	1	peter
2	2	paul
Time taken: 9.289 seconds, Fetched: 2 row(s)
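Unlike many tutorials, which introduce concepts first, I prefer to get things installed and then explain the concepts through examples. Let's install Hive first.
First make sure the corresponding yum repository is installed. If it is not, install the CDH yum repository by following this tutorial: http://blog.csdn.net/nsrainbow/article/details/36629339
Hive base package
yum install hive -y
Hive metastore
yum install hive-metastore
Hive server
yum install hive-server2 -y
yum install hive-hbase -y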
yum install mysql-server
Start the service
service mysqld start
Enable it at boot
chkconfig mysqld on
Initialize some MySQL settings, such as the root user's password
$ sudo /usr/bin/mysql_secure_installation
[...]
Enter current password for root (enter for none):
OK, successfully used password, moving on...
[...]
Set root password? [Y/n] y
New password:
Re-enter new password:
Remove anonymous users? [Y/n] Y
[...]
Disallow root login remotely? [Y/n] N
[...]
Remove test database and access to it [Y/n] Y
[...]
Reload privilege tables now? [Y/n] Y
All done!
$ sudo yum install mysql-connector-java
$ ln -s /usr/share/java/mysql-connector-java.jar /usr/lib/hive/lib/mysql-connector-java.jar
$ mysql -u root -p
Enter password:
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.13.0.mysql.sql;
Create the hive user
mysql> CREATE USER 'hive'@'metastorehost' IDENTIFIED BY 'mypassword';
...
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'metastorehost';
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'metastorehost';
mysql> FLUSH PRIVILEGES;
mysql> CREATE USER 'hive'@'%' IDENTIFIED BY 'hive';
mysql> REVOKE ALL PRIVILEGES, GRANT OPTION FROM 'hive'@'%';
mysql> GRANT ALL PRIVILEGES ON metastore.* TO 'hive'@'%';
mysql> FLUSH PRIVILEGES;
mysql> quit;
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://host1/metastore</value>
    <description>the URL of the MySQL database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
  <property>
    <name>datanucleus.autoCreateSchema</name>
    <value>false</value>
  </property>
  <property>
    <name>datanucleus.fixedDatastore</name>
    <value>true</value>
  </property>
  <property>
    <name>datanucleus.autoStartMechanism</name>
    <value>SchemaTable</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.199.126:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>true</value>
  </property>
</configuration>
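To double-check which values a hive-site.xml actually sets, the name/value pairs can be read back with a few lines of Python. This is only a minimal sketch: the `HIVE_SITE` string is a trimmed copy of the configuration above, and `read_properties` is a hypothetical helper name, not part of Hive.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the configuration above; only three properties are kept
# to keep the sketch short.
HIVE_SITE = """<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://host1/metastore</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://192.168.199.126:9083</value>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>true</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Return a dict of property name -> value (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.findall("property")
    }


props = read_properties(HIVE_SITE)
for name, value in props.items():
    print(f"{name} = {value}")
```

On a real machine the same helper could be fed the contents of the actual hive-site.xml file instead of the inline string.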
<property>
  <name>hive.support.concurrency</name>
  <description>Enable Hive's Table Lock Manager Service</description>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <description>Zookeeper quorum used by Hive's Table Lock Manager</description>
  <value>host1,host2</value>
</property>
<property>
  <name>hive.zookeeper.client.port</name>
  <value>2222</value>
  <description>
  The port at which the clients will connect.
  </description>
</property>
service hive-metastore start
service hive-server2 start
Starting Hive Metastore Server
Error creating temp dir in hadoop.tmp.dir /data/hdfs/tmp due to Permission denied
cd /data/hdfs
chmod a+rwx tmp
$ hive
hive>
hive> show tables;
OK
Time taken: 10.345 seconds
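Every table created in Hive is called a metastore table. These tables do not actually store data; they define the mapping between the real data and Hive, much like the meta information of a table in a traditional database, hence the name metastore. There are four storage modes you can define for the actual storage: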
CREATE TABLE worker(id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
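The '\054' in the statement above is an octal escape for the field delimiter. A small Python sketch shows which character it stands for and how a line of the data file would be split on it; `split_row` is a hypothetical helper used only for illustration.

```python
# '\054' is octal: 0o54 == 44 decimal, which is the ASCII code of ','.
delimiter = chr(0o54)


def split_row(line, sep=delimiter):
    """Split one line of a delimited text file into its fields."""
    return line.rstrip("\n").split(sep)


print(repr(delimiter))
print(split_row("1,jack\n"))
```

In other words, FIELDS TERMINATED BY '\054' describes an ordinary comma-separated file.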
$ hdfs dfs -ls /user/hive/warehouse
Found 11 items
drwxrwxrwt - root supergroup 0 2014-12-02 14:42 /user/hive/warehouse/h_employee
drwxrwxrwt - root supergroup 0 2014-12-02 14:42 /user/hive/warehouse/h_employee2
drwxrwxrwt - wlsuser supergroup 0 2014-12-04 17:21 /user/hive/warehouse/h_employee_export
drwxrwxrwt - root supergroup 0 2014-08-18 09:20 /user/hive/warehouse/h_http_access_logs
drwxrwxrwt - root supergroup 0 2014-06-30 10:15 /user/hive/warehouse/hbase_apache_access_log
drwxrwxrwt - username supergroup 0 2014-06-27 17:48 /user/hive/warehouse/hbase_table_1
drwxrwxrwt - username supergroup 0 2014-06-30 09:21 /user/hive/warehouse/hbase_table_2
drwxrwxrwt - username supergroup 0 2014-06-30 09:43 /user/hive/warehouse/hive_apache_accesslog
drwxrwxrwt - root supergroup 0 2014-12-02 15:12 /user/hive/warehouse/hive_employee
CREATE TABLE workers( id INT, name STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
hive> show tables;
OK
h_employee
h_employee2
h_employee_export
h_http_access_logs
hive_employee
workers
Time taken: 0.371 seconds, Fetched: 6 row(s)
$ cat workers.csv
1,jack
2,terry
3,michael
hive> LOAD DATA LOCAL INPATH '/home/alex/workers.csv' INTO TABLE workers;
Copying data from file:/home/alex/workers.csv
Copying file: file:/home/alex/workers.csv
Loading data to table default.workers
Table default.workers stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 25, raw_data_size: 0]
OK
Time taken: 0.655 seconds
hive> select * from workers;
OK
1	jack
2	terry
3	michael
Time taken: 0.177 seconds, Fetched: 3 row(s)
# hdfs dfs -ls /user/hive/warehouse/workers/
Found 1 items
-rwxrwxrwt 2 root supergroup 25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv
# cat workers2.txt
4,peter
5,kate
6,ted
hive> LOAD DATA LOCAL INPATH '/home/alex/workers2.txt' INTO TABLE workers;
Copying data from file:/home/alex/workers2.txt
Copying file: file:/home/alex/workers2.txt
Loading data to table default.workers
Table default.workers stats: [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 46, raw_data_size: 0]
OK
Time taken: 0.79 seconds
# hdfs dfs -ls /user/hive/warehouse/workers/
Found 2 items
-rwxrwxrwt 2 root supergroup 25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv
-rwxrwxrwt 2 root supergroup 21 2014-12-08 15:29 /user/hive/warehouse/workers/workers2.txt
hive> select * from workers;
OK
1	jack
2	terry
3	michael
4	peter
5	kate
6	ted
Time taken: 0.144 seconds, Fetched: 6 row(s)
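The transcripts above suggest a simple mental model: LOAD DATA copies the file into the table's directory unchanged, and a query reads every file found there. The Python sketch below imitates that with a local temporary directory standing in for /user/hive/warehouse/workers. The helper names (`load_into`, `scan`, `total_size`) are hypothetical, and the sketch assumes each data line ends with a newline.

```python
import os
import tempfile


def load_into(table_dir, file_name, content):
    """Mimic LOAD DATA: place the file in the table directory as-is."""
    with open(os.path.join(table_dir, file_name), "w", newline="") as f:
        f.write(content)


def scan(table_dir):
    """Mimic SELECT *: read every file in the directory, split on ','."""
    rows = []
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name), newline="") as f:
            for line in f:
                id_text, worker_name = line.rstrip("\n").split(",")
                rows.append((int(id_text), worker_name))
    return rows


def total_size(table_dir):
    """Sum of file sizes in bytes, comparable to total_size in the stats."""
    return sum(
        os.path.getsize(os.path.join(table_dir, name))
        for name in os.listdir(table_dir)
    )


table_dir = tempfile.mkdtemp(prefix="workers_")
load_into(table_dir, "workers.csv", "1,jack\n2,terry\n3,michael\n")
load_into(table_dir, "workers2.txt", "4,peter\n5,kate\n6,ted\n")

rows = scan(table_dir)
size = total_size(table_dir)
print(rows)
print(size)
```

The two files are 25 and 21 bytes, so the sum agrees with the total_size: 46 reported after the second load.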
create table partition_employee(id int, name string) partitioned by(daytime string) row format delimited fields TERMINATED BY '\054';
# cat 2014-05-05
22,kitty
33,lily
# cat 2014-05-06
14,sami
45,micky
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-05' INTO TABLE partition_employee partition(daytime='2014-05-05');
Copying data from file:/home/alex/2014-05-05
Copying file: file:/home/alex/2014-05-05
Loading data to table default.partition_employee partition (daytime=2014-05-05)
Partition default.partition_employee{daytime=2014-05-05} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.partition_employee stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
OK
Time taken: 1.154 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-06' INTO TABLE partition_employee partition(daytime='2014-05-06');
Copying data from file:/home/alex/2014-05-06
Copying file: file:/home/alex/2014-05-06
Loading data to table default.partition_employee partition (daytime=2014-05-06)
Partition default.partition_employee{daytime=2014-05-06} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.partition_employee stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 42, raw_data_size: 0]
OK
Time taken: 0.763 seconds
hive> select * from partition_employee where daytime='2014-05-05';
OK
22	kitty	2014-05-05
33	lily	2014-05-05
Time taken: 0.173 seconds, Fetched: 2 row(s)
hive> select * from partition_employee where daytime>='2014-05-05';
OK
22	kitty	2014-05-05
33	lily	2014-05-05
14	sami	2014-05-06
45	micky	2014-05-06
Time taken: 0.273 seconds, Fetched: 4 row(s)
# hdfs dfs -ls /user/hive/warehouse/partition_employee
Found 2 items
drwxrwxrwt - root supergroup 0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-05
drwxrwxrwt - root supergroup 0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-06
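The listing shows one directory per partition value, and the partition column itself is not stored in the data files. A rough Python model of a query with a condition on daytime: decide per directory whether it can match, skip the ones that cannot, and append the directory's value as an extra column. This is a simplified sketch of the idea, not Hive's actual planner; the `partitions` dict and the `select` function are hypothetical stand-ins.

```python
# Directory name -> rows stored in that partition's files (from the data above).
partitions = {
    "daytime=2014-05-05": [(22, "kitty"), (33, "lily")],
    "daytime=2014-05-06": [(14, "sami"), (45, "micky")],
}


def select(partitions, predicate):
    """Return rows from partitions whose value satisfies the predicate."""
    rows = []
    for dir_name in sorted(partitions):
        _column, value = dir_name.split("=", 1)
        if not predicate(value):
            continue  # the whole directory is skipped without reading its files
        for emp_id, name in partitions[dir_name]:
            # The partition value comes from the directory name, not the file.
            rows.append((emp_id, name, value))
    return rows


one_day = select(partitions, lambda d: d == "2014-05-05")
from_day = select(partitions, lambda d: d >= "2014-05-05")
print(one_day)
print(from_day)
```

Because daytime is a string, the >= comparison is lexicographic, which happens to order dates correctly in the YYYY-MM-DD format.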
create table p_student(id int, name string) partitioned by(daytime string,country string) row format delimited fields TERMINATED BY '\054';
# cat 2014-09-09-CN
1,tammy
2,eric
# cat 2014-09-10-CN
3,paul
4,jolly
# cat 2014-09-10-EN
44,ivan
66,billy
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-09-CN' INTO TABLE p_student partition(daytime='2014-09-09',country='CN');
Copying data from file:/home/alex/2014-09-09-CN
Copying file: file:/home/alex/2014-09-09-CN
Loading data to table default.p_student partition (daytime=2014-09-09, country=CN)
Partition default.p_student{daytime=2014-09-09, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
OK
Time taken: 0.736 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-CN' INTO TABLE p_student partition(daytime='2014-09-10',country='CN');
Copying data from file:/home/alex/2014-09-10-CN
Copying file: file:/home/alex/2014-09-10-CN
Loading data to table default.p_student partition (daytime=2014-09-10, country=CN)
Partition default.p_student{daytime=2014-09-10, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 38, raw_data_size: 0]
OK
Time taken: 0.691 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-EN' INTO TABLE p_student partition(daytime='2014-09-10',country='EN');
Copying data from file:/home/alex/2014-09-10-EN
Copying file: file:/home/alex/2014-09-10-EN
Loading data to table default.p_student partition (daytime=2014-09-10, country=EN)
Partition default.p_student{daytime=2014-09-10, country=EN} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 3, num_files: 3, num_rows: 0, total_size: 59, raw_data_size: 0]
OK
Time taken: 0.622 seconds
# hdfs dfs -ls /user/hive/warehouse/p_student
Found 2 items
drwxr-xr-x - root supergroup 0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09
drwxr-xr-x - root supergroup 0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-10
# hdfs dfs -ls /user/hive/warehouse/p_student/daytime=2014-09-09
Found 1 items
drwxr-xr-x - root supergroup 0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09/country=CN
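With two partition columns the directories nest in the order the columns were declared in PARTITIONED BY. The sketch below rebuilds the path seen in the listing from a partition spec. `partition_path` is a hypothetical helper, and it ignores the escaping Hive applies to special characters in partition values.

```python
import posixpath

# Warehouse root as it appears in the hdfs listings above.
WAREHOUSE = "/user/hive/warehouse"


def partition_path(table, spec):
    """Build the directory for a partition.

    spec is a list of (column, value) pairs in PARTITIONED BY order.
    """
    parts = [f"{column}={value}" for column, value in spec]
    return posixpath.join(WAREHOUSE, table, *parts)


path = partition_path("p_student", [("daytime", "2014-09-09"), ("country", "CN")])
print(path)
```

This is also why a filter on daytime alone can still skip directories, while a filter on country alone has to look inside every daytime directory.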
hive> select * from p_student;
OK
1	tammy	2014-09-09	CN
2	eric	2014-09-09	CN
3	paul	2014-09-10	CN
4	jolly	2014-09-10	CN
44	ivan	2014-09-10	EN
66	billy	2014-09-10	EN
Time taken: 0.228 seconds, Fetched: 6 row(s)
hive> select * from p_student where daytime='2014-09-10' and country='EN';
OK
44	ivan	2014-09-10	EN
66	billy	2014-09-10	EN
Time taken: 0.224 seconds, Fetched: 2 row(s)
CREATE TABLE b_student(id INT, name STRING) PARTITIONED BY(dt STRING, country STRING) CLUSTERED BY(id) SORTED BY(name) INTO 4 BUCKETS row format delimited fields TERMINATED BY '\054';
hive> select * from b_student;
OK
1	tammy	2014-09-09	CN
2	eric	2014-09-09	CN
3	paul	2014-09-10	CN
4	jolly	2014-09-10	CN
34	allen	2014-09-11	EN
Time taken: 0.727 seconds, Fetched: 5 row(s)
hive> select * from b_student tablesample(bucket 1 out of 4 on id);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406097234796_0041, Tracking URL = http://hadoop01:8088/proxy/application_1406097234796_0041/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1406097234796_0041
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-08 17:35:56,995 Stage-1 map = 0%, reduce = 0%
2014-12-08 17:36:06,783 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.9 sec
2014-12-08 17:36:07,845 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.9 sec
MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_1406097234796_0041
MapReduce Jobs Launched:
Job 0: Map: 1 Cumulative CPU: 2.9 sec HDFS Read: 482 HDFS Write: 22 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
4	jolly	2014-09-10	CN
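Why does the sample return only the row with id 4? One way to reason about it is the sketch below. It rests on an assumption that is not stated in the text above: that Hive hashes an INT column to the integer itself and takes that value modulo the number of buckets, with bucket 1 corresponding to remainder 0. Under that assumption the result matches the query output. `bucket_of` and `tablesample` are hypothetical names.

```python
NUM_BUCKETS = 4

# (id, name) pairs from the b_student table shown above.
rows = [(1, "tammy"), (2, "eric"), (3, "paul"), (4, "jolly"), (34, "allen")]


def bucket_of(id_value, num_buckets=NUM_BUCKETS):
    """Zero-based bucket index for an INT key.

    Assumption: the hash of an INT is its own value; the mask keeps the
    result non-negative before taking the remainder.
    """
    return (id_value & 0x7FFFFFFF) % num_buckets


def tablesample(rows, bucket, out_of):
    """Mimic TABLESAMPLE(BUCKET bucket OUT OF out_of ON id); bucket is 1-based."""
    return [row for row in rows if bucket_of(row[0], out_of) == bucket - 1]


sample = tablesample(rows, 1, 4)
print(sample)
```

Of the ids 1, 2, 3, 4 and 34, only 4 leaves remainder 0 when divided by 4, so only jolly lands in bucket 1.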
hbase(main):005:0> create 'employee','info'
0 row(s) in 0.4740 seconds

=> Hbase::Table - employee
hbase(main):006:0> put 'employee',1,'info:id',1
0 row(s) in 0.2080 seconds

hbase(main):008:0> scan 'employee'
ROW    COLUMN+CELL
 1     column=info:id, timestamp=1417591291730, value=1
1 row(s) in 0.0610 seconds

hbase(main):009:0> put 'employee',1,'info:name','peter'
0 row(s) in 0.0220 seconds

hbase(main):010:0> scan 'employee'
ROW    COLUMN+CELL
 1     column=info:id, timestamp=1417591291730, value=1
 1     column=info:name, timestamp=1417591321072, value=peter
1 row(s) in 0.0450 seconds

hbase(main):011:0> put 'employee',2,'info:id',2
0 row(s) in 0.0370 seconds

hbase(main):012:0> put 'employee',2,'info:name','paul'
0 row(s) in 0.0180 seconds

hbase(main):013:0> scan 'employee'
ROW    COLUMN+CELL
 1     column=info:id, timestamp=1417591291730, value=1
 1     column=info:name, timestamp=1417591321072, value=peter
 2     column=info:id, timestamp=1417591500179, value=2
 2     column=info:name, timestamp=1417591512075, value=paul
2 row(s) in 0.0440 seconds
hive> CREATE EXTERNAL TABLE h_employee(key int, id int, name string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name")
    > TBLPROPERTIES ("hbase.table.name" = "employee");
OK
Time taken: 0.324 seconds
hive> select * from h_employee;
OK
1	1	peter
2	2	paul
Time taken: 1.129 seconds, Fetched: 2 row(s)
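The hbase.columns.mapping string pairs up with the Hive column list by position: the first entry maps to key, the second to id, the third to name, and :key stands for the HBase row key. The Python sketch below spells that pairing out. `pair_columns` and `to_hive_row` are hypothetical helpers, and trimming whitespace around each entry is a choice made in this sketch rather than documented Hive behavior.

```python
hive_columns = ["key", "id", "name"]
mapping = ":key, info:id,info:name"


def pair_columns(hive_columns, mapping):
    """Pair each Hive column with its HBase source, by position."""
    hbase_columns = [entry.strip() for entry in mapping.split(",")]
    if len(hbase_columns) != len(hive_columns):
        raise ValueError("mapping must have one entry per Hive column")
    return dict(zip(hive_columns, hbase_columns))


def to_hive_row(row_key, cells, pairs):
    """Turn one HBase row (row key + cells) into a Hive row tuple."""
    return tuple(
        row_key if source == ":key" else cells[source]
        for source in pairs.values()
    )


pairs = pair_columns(hive_columns, mapping)
row = to_hive_row("1", {"info:id": "1", "info:name": "peter"}, pairs)
print(pairs)
print(row)
```

Values are kept as strings here; converting them to the declared Hive types (int, string) is left out of the sketch.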
hive> select * from h_employee limit 1
    > ;
OK
1	1	peter
Time taken: 0.284 seconds, Fetched: 1 row(s)
However, a starting point such as offset is not supported.