As shown in the figure, Hive has three main roles: HiveServer2, Metastore Server, and the Gateway proxy role.
There are two main server-side daemons:
1. HiveServer2: provides JDBC access via its Thrift service; deployed on the masternode3 node.
2. Metastore Server: serves access to the metadata database; deployed on the toolnode1 node.
Compiler: compiles the HiveQL syntax.
Optimizer: optimizes the HiveQL and produces the best execution plan, which can be viewed with explain select ….
Executor: runs the finally translated jobs (MapReduce jobs).
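The plan produced by the optimizer can be inspected with EXPLAIN; for example, against the test table imported later in this post:

```sql
-- Show the stages (and the underlying jobs) Hive will run for this query
explain select count(*) from test;
```

The output lists the plan's stages and their dependencies, which is useful for checking whether a query really runs the way you expect.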
There are three main user interfaces: the CLI, JDBC/ODBC, and the Web GUI.
1. CLI: the hive shell command line.
2. JDBC/ODBC: Hive's Java client interface, used in much the same way as JDBC against a traditional relational database.
3. Web GUI: access to Hive through a browser; this feature has been deprecated.
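The JDBC interface exposed by HiveServer2 can also be exercised from the command line with beeline, the JDBC-based CLI that ships with Hive. A minimal connection sketch, reusing the host and credentials from the Java example later in this post:

```shell
beeline -u jdbc:hive2://192.168.20.164:10000/default -n hive -p hive
```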
Next, let's test importing a table with tens of millions of rows from MySQL into Hive.
The import command line is:
sqoop import --connect jdbc:mysql://192.168.20.160:3306/test --username root --password 111111 --table card --fields-terminated-by '\t' --delete-target-dir --num-mappers 1 --hive-import --hive-database default --hive-table test
After the import finishes, verify the row count in Hive:
SELECT count(*)
FROM test
To query Hive from Java, add the Hive JDBC dependency to the project's pom.xml (the Jetty aggregate is excluded here, which is commonly done to avoid servlet container conflicts):
<!-- hive jdbc -->
<dependency>
    <groupId>org.apache.hive</groupId>
    <artifactId>hive-jdbc</artifactId>
    <version>1.1.0</version>
    <exclusions>
        <exclusion>
            <groupId>org.eclipse.jetty.aggregate</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
package com.sanshi.hivetest;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Created by lirui on 2019/3/29
 */
public class HiveTest {
    private static String driverName = "org.apache.hive.jdbc.HiveDriver";
    private static String url = "jdbc:hive2://192.168.20.164:10000/default";
    private static String user = "hive";
    private static String password = "hive";

    public static void main(String[] args) {
        try {
            // Register the Hive JDBC driver
            Class.forName(driverName);
        } catch (ClassNotFoundException e) {
            System.out.println(e.toString());
            return;
        }
        String sql = "select * from customers where id = '17254'";
        // try-with-resources closes the connection, statement, and result set
        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement();
             ResultSet res = stmt.executeQuery(sql)) {
            System.out.println("id" + "\t" + "name" + "\t" + "email" + "\t" + "address" + "\t" + "other");
            while (res.next()) {
                System.out.println(res.getString("id") + "\t\t" + res.getString("name") + "\t\t"
                        + res.getString("email_preferences") + "\t\t" + res.getString("addresses")
                        + "\t\t" + res.getString("orders"));
            }
        } catch (SQLException e) {
            e.printStackTrace();
        }
    }
}
A sample output row (the email_preferences, addresses, and orders columns hold nested JSON):
{"email_format":"text","frequency":"daily","categories":{"promos":true,"surveys":true}} {"shipping":{"street_1":"158 Jadewood Drive","street_2":"Apt 2","city":"Gary","state":"IN","zip_code":"46403"},"billing":{"street_1":"4169 Oakwood Lane","street_2":"","city":"Gary","state":"IN","zip_code":"46403"}} [{"order_id":"I72T39","order_date":"2015-03-14T11:00:00-05:00","items":[{"product_id":4112183,"sku":"T513-091-2","name":"Tea for One","price":18.0,"qty":1}]}]
As shown above, this is a simple query; Java calls Hive over JDBC in the same way it calls a relational database.
As is well known, Hive's query latency is high. One reason is that without indexes it must scan the whole table; another is the MapReduce execution framework, which itself has high startup and scheduling overhead, so queries executed via MapReduce inherit that latency. We therefore decided to use Hive on Spark to improve Hive's performance.
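Assuming Spark is already deployed on the cluster, switching Hive over is a matter of changing the execution engine, either per session or globally in hive-site.xml. A minimal session-level sketch:

```sql
-- Default engine is "mr" (MapReduce); switch the current session to Spark
set hive.execution.engine=spark;

-- Queries submitted afterwards run as Spark jobs
select * from customers where id = '54362';
```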
Test table: customers
Test SQL:
select * from customers where id ='54362'
MapReduce query time: 18 s
Hive on Spark, first query: 1 min 13 s
The first query is slow because the Spark compute processes have to be started on each host.
Hive on Spark, second query: 4 s
The tests show that Hive on Spark clearly improves Hive's performance, and this was on a cluster with only three compute nodes of 4 GB RAM each; on a better-provisioned cluster, the gains from Hive on Spark should be even larger.