Hive Code Organization and a Brief Architecture

Introduction

Hive has 3 main components:
Serializers/Deserializers (hive-serde) - This component has the framework libraries that allow users to develop serializers and deserializers for their own data formats. It also contains some built-in serialization/deserialization families.

MetaStore (hive-metastore) - This component implements the metadata server, which holds all the information about the tables and partitions in the warehouse.

Query Processor (hive-exec) - This component implements the processing framework that converts SQL into a graph of map/reduce jobs, and the execution-time framework that runs those jobs in dependency order.
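
Running jobs in dependency order amounts to a topological ordering of the job graph. The sketch below is a minimal Python illustration of that idea, not Hive's actual execution logic; the job names and the `run_in_dependency_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_in_dependency_order(dependencies, run_job):
    """Run every job only after all the jobs it depends on have run.

    dependencies maps a job name to the set of jobs that must finish first.
    Returns the order in which the jobs were run.
    """
    order = []
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    for job in TopologicalSorter(dependencies).static_order():
        run_job(job)
        order.append(job)
    return order


# Hypothetical job graph: the join needs both scans, the aggregation needs the join.
jobs = {
    "scan_orders": set(),
    "scan_users": set(),
    "join": {"scan_orders", "scan_users"},
    "aggregate": {"join"},
}
executed = run_in_dependency_order(jobs, run_job=lambda name: None)
```

The two scans have no dependency on each other, so their relative order is not fixed; only the constraint that each job follows its prerequisites is guaranteed.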

Apart from these major components, Hive also contains a number of other components:
Command Line Interface (hive-cli) - This component has all the Java code used by the Hive command line interface.

Hive Server (hive-service) - This component implements all the APIs that other clients (such as JDBC drivers) can use to talk to Hive.

Common (hive-common) - This component contains common infrastructure needed by the rest of the code. Currently, it contains all the Java sources for managing Hive configurations (HiveConf) and passing them to all the other code components.

Hive Shims - This component contains the shim classes used for compatibility across different Hadoop and Hive versions.

Ant Utilities (hive-ant) - This component contains the implementation of some Ant tasks that are used by the build infrastructure.

Scripts (./bin) - This component contains all the scripts provided in the distribution, including the scripts that run the Hive CLI (bin/hive).

The following top-level directories contain helper libraries, packaged configuration files, etc.:
./conf - This directory contains the packaged hive-default.xml and hive-site.xml.
./data - This directory contains some data sets and configurations used in the Hive tests.
./ivy - This directory contains the Ivy files used by the build infrastructure to manage dependencies on different Hadoop versions.
./lib - This directory contains the runtime libraries needed by Hive.
trunk/testlibs - This directory contains the junit.jar used by the JUnit target in the build infrastructure.
trunk/testutils (Deprecated)

Hive SerDe

What is a SerDe?

  • SerDe is a short name for “Serializer and Deserializer.”
  • Hive uses SerDe (and FileFormat) to read and write table rows.
  • HDFS files --> InputFileFormat --> <key, value> --> Deserializer --> Row object
  • Row object --> Serializer --> <key, value> --> OutputFileFormat --> HDFS files
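
The read and write paths above can be sketched in a few lines of Python. This is only an illustration of how the stages fit together, not Hive's Java API: the `Row` class and the four functions are hypothetical stand-ins for a row object, an InputFileFormat, a Deserializer, a Serializer and an OutputFileFormat, and control-A is assumed as the field delimiter.

```python
from dataclasses import dataclass


@dataclass
class Row:
    name: str
    age: int


def text_input_format(file_content):
    """Stand-in for an InputFileFormat: yield one (key, value) record per line."""
    offset = 0
    for line in file_content.splitlines():
        yield offset, line            # key = character offset, value = raw line
        offset += len(line) + 1


def deserialize(value, delimiter="\x01"):
    """Stand-in for a Deserializer: turn one raw value into a Row object."""
    name, age = value.split(delimiter)
    return Row(name=name, age=int(age))


def serialize(row, delimiter="\x01"):
    """Stand-in for a Serializer: turn a Row object back into a raw value."""
    return delimiter.join([row.name, str(row.age)])


def text_output_format(values):
    """Stand-in for an OutputFileFormat that ignores keys and writes one value per line."""
    return "".join(value + "\n" for value in values)


content = "alice\x0130\nbob\x0125\n"
rows = [deserialize(value) for _key, value in text_input_format(content)]
round_trip = text_output_format(serialize(row) for row in rows)
```

Because the file layout is handled by the format functions and the record layout by the serializer and deserializer, either half can be swapped without touching the other.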

One principle of Hive is that Hive does not own the HDFS file format. Users should be able to read the HDFS files in Hive tables directly with other tools, or use other tools to write HDFS files that can then be loaded into Hive through “CREATE EXTERNAL TABLE” or through “LOAD DATA INPATH,” which just moves the file into Hive’s table directory.

Note that org.apache.hadoop.hive.serde is the deprecated old SerDe library. Please look at org.apache.hadoop.hive.serde2 for the latest version.

Hive currently uses these FileFormat classes to read and write HDFS files:
TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in Hadoop SequenceFile format.

Hive currently uses these SerDe classes to serialize and deserialize data:

MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records such as CSV, tab-separated, or control-A separated records (quoting is not supported yet).
LazySimpleSerDe: This SerDe can be used to read the same data format as MetadataTypedColumnsetSerDe and TCTLSeparatedProtocol. However, it creates objects lazily, which provides better performance. Starting in Hive 0.14.0 it also supports reading and writing data with a specified character encoding, for example:

ALTER TABLE person SET SERDEPROPERTIES ('serialization.encoding'='GBK');

LazySimpleSerDe can treat ‘T’, ‘t’, ‘F’, ‘f’, ‘1’, and ‘0’ as extended, legal boolean literals if the configuration property hive.lazysimple.extended_boolean_literal is set to true (Hive 0.14.0 and later). The default is false, which means only ‘TRUE’ and ‘FALSE’ are treated as legal boolean literals.
ThriftSerDe: This SerDe is used to read/write Thrift serialized objects. The class file for the Thrift object must be loaded first.
DynamicSerDe: This SerDe also reads/writes Thrift serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).
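
To illustrate what creating objects lazily means for LazySimpleSerDe, here is a minimal Python sketch. It is not the Hive implementation: the `LazyRow` class, its `parse_count` counter and the control-A delimiter are assumptions chosen to show the idea that a field is converted only when it is first read.

```python
class LazyRow:
    """Keep the raw delimited record and convert a field only on first access."""

    def __init__(self, raw, column_types, delimiter="\x01"):
        self._raw = raw
        self._types = column_types
        self._delimiter = delimiter
        self._fields = None       # raw strings, split on first access
        self._cache = {}          # column index -> converted object
        self.parse_count = 0      # number of fields actually converted

    def get(self, index):
        if index not in self._cache:
            if self._fields is None:
                self._fields = self._raw.split(self._delimiter)
            self._cache[index] = self._types[index](self._fields[index])
            self.parse_count += 1
        return self._cache[index]


row = LazyRow("42\x01alice\x013.5", [int, str, float])
first = row.get(0)        # converts only column 0
first_again = row.get(0)  # served from the cache, no second conversion
```

A query that reads one column out of many therefore pays the conversion cost for that column alone.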

MetaStore

MetaStore contains metadata about tables, partitions and databases. The Query Processor uses it during plan generation.
Metastore Server - This is the Thrift server (interface defined in metastore/if/hive_metastore.if) that services metadata requests from clients. It delegates most requests to the underlying metadata store and to the Hadoop file system that contains the data.
Object Store - The ObjectStore class handles access to the actual metadata stored in the SQL store. The current implementation uses the JPOX ORM solution, which is based on the JDO specification. It can be used with any database that JPOX supports. New metastores (file-based or XML-based) can be added by implementing the MetaStore interface. FileStore is a partial implementation of an older version of the metastore and may be deprecated soon.
Metastore Client - There are Python, Java, and PHP Thrift clients in metastore/src. The generated Java client is extended by HiveMetaStoreClient, which is used by the Query Processor (ql/metadata). This is the main interface to all other Hive components.
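
The layering described above (a client talks to a server, which delegates to a pluggable store) can be sketched as follows. This is a simplified Python illustration rather than the Thrift interface or the real Java classes: `MetadataStore`, `InMemoryStore`, `MetastoreServer` and `MetastoreClient` are hypothetical names, and a direct method call stands in for the Thrift RPC.

```python
from abc import ABC, abstractmethod


class MetadataStore(ABC):
    """Hypothetical storage interface; a new backend only has to implement it."""

    @abstractmethod
    def put_table(self, db, name, columns): ...

    @abstractmethod
    def get_table(self, db, name): ...


class InMemoryStore(MetadataStore):
    """Toy backend standing in for the SQL-backed object store."""

    def __init__(self):
        self._tables = {}

    def put_table(self, db, name, columns):
        self._tables[(db, name)] = list(columns)

    def get_table(self, db, name):
        return self._tables[(db, name)]


class MetastoreServer:
    """Services metadata requests by delegating to whichever store it was given."""

    def __init__(self, store):
        self._store = store

    def create_table(self, db, name, columns):
        self._store.put_table(db, name, columns)

    def get_columns(self, db, name):
        return self._store.get_table(db, name)


class MetastoreClient:
    """What a query processor would hold; a direct call replaces the RPC here."""

    def __init__(self, server):
        self._server = server

    def get_columns(self, db, name):
        return self._server.get_columns(db, name)


server = MetastoreServer(InMemoryStore())
server.create_table("default", "person", ["name", "age"])
columns = MetastoreClient(server).get_columns("default", "person")
```

Swapping `InMemoryStore` for another `MetadataStore` subclass changes where metadata lives without affecting the server or client code.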

Query Processor

The following are the main components of the Hive Query Processor:

Parse and Semantic Analysis (ql/parse) - This component contains the code for parsing SQL, converting it into Abstract Syntax Trees, converting the Abstract Syntax Trees into operator plans, and finally converting the operator plans into a directed graph of tasks that are executed by Driver.java.

Optimizer (ql/optimizer) - This component contains some simple rule-based optimizations, such as pruning non-referenced columns from table scans (column pruning), that the Hive Query Processor applies while converting SQL to a series of map/reduce tasks.

Plan Components (ql/plan) - This component contains the classes (called descriptors) that the compiler (Parser, SemanticAnalysis and Optimizer) uses to pass information to the operator trees used by the execution code.

MetaData Layer (ql/metadata) - The query processor uses this component to interface with the MetaStore and retrieve information about tables, partitions and table columns. The compiler uses this information to compile SQL to a series of map/reduce tasks.

Map/Reduce Execution Engine (ql/exec) - This component contains all the query operators and the framework that is used to invoke those operators from within the map/reduce tasks.

Hadoop Record Readers, Input and Output Formatters for Hive (ql/io) - This component contains the record readers and the input and output formatters that Hive registers with a Hadoop job.

Sessions (ql/session) - A rudimentary session implementation for Hive.

Type Interfaces (ql/typeinfo) - This component provides all the type information for table columns, which is retrieved from the MetaStore and the SerDes.

Hive Function Framework (ql/udf) - Framework and implementation of Hive operators, functions and aggregate functions. This component also contains the interfaces that a user can implement to create user-defined functions.

Tools (ql/tools) - Some simple tools provided by the query processing framework. Currently, this component contains the implementation of the lineage tool, which can parse a query and show its source and destination tables.
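
As an illustration of the column pruning mentioned under the Optimizer, the Python sketch below keeps only the columns a query references when scanning a table. It is a toy model of the rule, not Hive's optimizer code; `prune_columns`, `scan`, the schema and the sample rows are all hypothetical.

```python
def prune_columns(table_columns, referenced):
    """Return the table's columns, in schema order, that the query references."""
    needed = set(referenced)
    unknown = needed - set(table_columns)
    if unknown:
        raise ValueError(f"unknown columns: {sorted(unknown)}")
    return [column for column in table_columns if column in needed]


def scan(rows, table_columns, kept):
    """Project every row down to the kept columns."""
    positions = [table_columns.index(column) for column in kept]
    return [tuple(row[i] for i in positions) for row in rows]


schema = ["id", "name", "email", "age"]
# A query such as SELECT name FROM t WHERE age > 26 references only name and age.
kept = prune_columns(schema, ["name", "age"])
data = [
    (1, "alice", "a@example.com", 30),
    (2, "bob", "b@example.com", 25),
]
projected = scan(data, schema, kept)
```

Every operator after the scan then handles two values per row instead of four, which is the saving the rule is after.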

Excerpted from the official Hive Developer Guide: https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide
A helpful overview of the Hive query processor can be found in the Hive Anatomy slide deck.
