Hive is a data warehouse software package.
- Hive lets you use SQL to read, write, and manage data that already resides in distributed storage.
- When using Hive, you project a structure (a schema mapping) onto data that is already stored.
- Hive provides a command-line tool and a JDBC interface so that users can connect to it.
Note: Hive can only analyze structured data. Hive runs on top of Hadoop, so Hadoop must be installed before Hive can be used.
Hive's characteristics
- Hive is not a relational database.
- Hive is not designed for OLTP (online transaction processing).
- Hive cannot serve real-time queries and does not support row-level updates (UPDATE, DELETE).
- The data Hive analyzes is stored on HDFS; the table structures (schemas) Hive creates for that data are stored in an RDBMS.
- Hive is designed for OLAP (online analytical processing): the focus is on data analysis, not on query latency.
- Hive analyzes data with an SQL-like language called HQL.
- Hive is easy to use, scalable, and flexible.
Installing Hive requires both JAVA_HOME and HADOOP_HOME to be set.
Upload the apache-hive-1.2.1-bin.tar.gz package to /opt/soft on host h2 and extract it to /opt/module:
[hzhao@h2 ~]$ tar -zxvf /opt/soft/apache-hive-1.2.1-bin.tar.gz -C /opt/module/
Configure the Hive environment variables
[hzhao@h2 ~]$ sudo vim /etc/profile
JAVA_HOME=/opt/module/jdk1.8.0_121
HADOOP_HOME=/opt/module/hadoop-2.7.2
HIVE_HOME=/opt/module/apache-hive-1.2.1-bin
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin
export PATH JAVA_HOME HADOOP_HOME HIVE_HOME
[hzhao@h2 ~]$ source /etc/profile
Test the installation
[hzhao@h2 ~]$ hive
Logging initialized using configuration in jar:file:/opt/module/apache-hive-1.2.1-bin/lib/hive-common-1.2.1.jar!/hive-log4j.properties
hive> show databases;
OK
default
Time taken: 0.919 seconds, Fetched: 1 row(s)
hive> create table person(name varchar(20),age int);
OK
Time taken: 0.504 seconds
hive> insert into person values('hzhao',23);
Query ID = hzhao_20210104082731_75a2027f-c485-4fa7-93d4-34fd52c8a297
Total jobs = 3
Launching Job 1 out of 3
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1609392649270_0001, Tracking URL = http://h2:8088/proxy/application_1609392649270_0001/
Kill Command = /opt/module/hadoop-2.7.2/bin/hadoop job -kill job_1609392649270_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-01-04 08:27:43,310 Stage-1 map = 0%, reduce = 0%
2021-01-04 08:27:52,666 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.81 sec
MapReduce Total cumulative CPU time: 2 seconds 810 msec
Ended Job = job_1609392649270_0001
Stage-4 is selected by condition resolver.
Stage-3 is filtered out by condition resolver.
Stage-5 is filtered out by condition resolver.
Moving data to: hdfs://h1:9000/user/hive/warehouse/person/.hive-staging_hive_2021-01-04_08-27-31_001_7447191442705868645-1/-ext-10000
Loading data to table default.person
Table default.person stats: [numFiles=1, numRows=1, totalSize=9, rawDataSize=8]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.81 sec HDFS Read: 3686 HDFS Write: 79 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 810 msec
OK
Time taken: 22.997 seconds
hive> select * from person;
OK
hzhao 23
Time taken: 0.121 seconds, Fetched: 1 row(s)
1. The data Hive analyzes is stored on HDFS. A Hive database corresponds to a directory on HDFS, and a Hive table is also a directory, created as a subdirectory of its database's directory.
The data in a Hive table consists of the files inside that table's directory.
2. The data stored in Hive must be structured, and its format must match the table definition. When a table is created, it is given a delimiter property that tells the MapReduce job which delimiter separates the fields within each row.
Hive's default field delimiter is Ctrl+A (\001). To type it in vim, enter insert mode and press Ctrl+V followed by Ctrl+A.
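The default-delimiter behavior can be sketched in plain Java: a row of the person table is stored on HDFS as one line whose fields are joined by the \001 character, and splitting on that character recovers the columns. The class and method names below are illustrative, not part of Hive.

```java
// Sketch of how a row is laid out in a table file using Hive's default
// field delimiter \001 (shown as ^A / Ctrl+A in the CLI).
public class CtrlADelimiterDemo {
    // Hive's default field delimiter
    static final String FIELD_SEP = "\u0001";

    // Join column values the way a row is stored in the table's data file
    static String encodeRow(String... fields) {
        return String.join(FIELD_SEP, fields);
    }

    // Split a stored line back into columns, as the default SerDe does
    static String[] decodeRow(String line) {
        return line.split(FIELD_SEP, -1);
    }

    public static void main(String[] args) {
        String line = encodeRow("hzhao", "23");
        String[] cols = decodeRow(line);
        System.out.println(cols[0] + " " + cols[1]); // prints: hzhao 23
    }
}
```

A file built this way and placed in the table's directory would be readable by a table created with the default row format.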
[hzhao@h2 ~]$ rpm -qa|grep mysql
mysql-libs-5.1.73-7.el6.x86_64
[hzhao@h2 ~]$ sudo rpm -e --nodeps mysql-libs-5.1.73-7.el6.x86_64
[hzhao@h2 ~]$ sudo rm -rf /var/lib/mysql
[hzhao@h2 ~]$ unzip /opt/soft/mysql-libs.zip -d /opt/module/
[hzhao@h2 ~]$ sudo rpm -ivh /opt/module/mysql-libs/MySQL-client-5.6.24-1.el6.x86_64.rpm
[hzhao@h2 ~]$ sudo rpm -ivh /opt/module/mysql-libs/MySQL-server-5.6.24-1.el6.x86_64.rpm
[hzhao@h2 ~]$ sudo service mysql start
Starting MySQL. SUCCESS!
[hzhao@h2 ~]$ sudo cat /root/.mysql_secret
# The random password set for the root user at Tue Jan 5 03:26:01 2021 (local time): FSygOu627vDTRXYN
[hzhao@h2 ~]$ mysql -u root -pFSygOu627vDTRXYN
mysql> SET PASSWORD = password('123456');
Query OK, 0 rows affected (0.00 sec)
mysql> drop user root@'h2';
Query OK, 0 rows affected (0.01 sec)
mysql> drop user root@'127.0.0.1';
Query OK, 0 rows affected (0.00 sec)
mysql> drop user root@'::1';
Query OK, 0 rows affected (0.00 sec)
mysql> update mysql.user set host = '%' where user = 'root';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
mysql> select user,password,host from mysql.user;
+------+-------------------------------------------+------+
| user | password | host |
+------+-------------------------------------------+------+
| root | *6BB4837EB74329105EE4568DDA7DC67ED2CA2AD9 | % |
+------+-------------------------------------------+------+
1 row in set (0.00 sec)
[hzhao@h2 ~]$ sudo service mysql restart
Shutting down MySQL.. SUCCESS!
Starting MySQL. SUCCESS!
[hzhao@h2 ~]$ vim /opt/module/apache-hive-1.2.1-bin/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://h2:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>123456</value>
</property>
</configuration>
[hzhao@h2 ~]$ tar -zxvf /opt/module/mysql-libs/mysql-connector-java-5.1.27.tar.gz
[hzhao@h2 ~]$ cp mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar /opt/module/apache-hive-1.2.1-bin/lib/
Manually create the metastore database in MySQL, using the latin1 character set.
In the metastore, table metadata is stored in the TBLS table, which references the database metadata in the DBS table through the DB_ID foreign key.
Column metadata is stored in the COLUMNS_V2 table, linked to its table through the CD_ID foreign key.
Start the hiveserver2 service on host h2:
[hzhao@h2 ~]$ hiveserver2
Add the Hive JDBC Maven dependency:
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.2.1</version>
</dependency>
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        // hiveserver2 listens on port 10000 by default;
        // the driver class is org.apache.hive.jdbc.HiveDriver
        try (Connection connection = DriverManager.getConnection("jdbc:hive2://h2:10000", "root", "123456");
             PreparedStatement preparedStatement = connection.prepareStatement("select * from person");
             ResultSet resultSet = preparedStatement.executeQuery()) {
            while (resultSet.next()) {
                System.out.println("name->" + resultSet.getString("name"));
                System.out.println("age->" + resultSet.getInt("age"));
            }
        }
    }
}
The default warehouse location on HDFS is /user/hive/warehouse.
Change the default warehouse location
[hzhao@h2 ~]$ vim /opt/module/apache-hive-1.2.1-bin/conf/hive-site.xml
Add the following property to the configuration file:
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/hive</value>
</property>
Configure the CLI to display column headers and the current database
[hzhao@h2 ~]$ vim /opt/module/apache-hive-1.2.1-bin/conf/hive-site.xml
Add the following properties to the configuration file:
<property>
<name>hive.cli.print.header</name>
<value>true</value>
</property>
<property>
<name>hive.cli.print.current.db</name>
<value>true</value>
</property>
[hzhao@h2 ~]$ mv /opt/module/apache-hive-1.2.1-bin/conf/hive-log4j.properties.template /opt/module/apache-hive-1.2.1-bin/conf/hive-log4j.properties
Edit hive.log.dir in the renamed file:
hive.log.dir=/opt/module/apache-hive-1.2.1-bin/logs
Switching Hive to local mode
hive (default)> set hive.exec.mode.local.auto=true;  // enable local MapReduce
// Maximum input size for local mode: a job whose input is smaller than this value runs as local MR; the default is 134217728 bytes (128 MB)
hive (default)> set hive.exec.mode.local.auto.inputbytes.max=50000000;
// Maximum number of input files for local mode: a job with fewer input files than this value runs as local MR; the default is 4
hive (default)> set hive.exec.mode.local.auto.input.files.max=10;
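These `set` commands apply only to the current session. To make local mode the default, the same properties (using the names and values shown above) could be persisted in hive-site.xml; this is a sketch, not a required configuration:

```xml
<property>
    <name>hive.exec.mode.local.auto</name>
    <value>true</value>
</property>
<property>
    <name>hive.exec.mode.local.auto.inputbytes.max</name>
    <value>50000000</value>
</property>
<property>
    <name>hive.exec.mode.local.auto.input.files.max</name>
    <value>10</value>
</property>
```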