HBase filters are a powerful feature that can greatly enhance your effectiveness working with data stored in tables. You will find predefined filters, already provided by HBase for your use, but also a
framework you can use to implement your own. You will now be introduced to both.
As the name of the feature implies, filters let you discard unwanted data on the server during get and scan operations. HBase ships with many filter implementations you can use directly, and you can also define custom filters.
A simple example is the compare filter: you supply the comparison criterion (<, >, =) and a comparator that implements the comparison logic:
CompareFilter(CompareOp valueCompareOp, WritableByteArrayComparable valueComparator)
Many filters are built on top of the compare filter, such as RowFilter, FamilyFilter, and QualifierFilter.
RowFilter: this filter gives you the ability to filter data based on row keys.
Filter filter1 = new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL, new BinaryComparator(Bytes.toBytes("row-22")));
Finally, there is a table listing all the filters HBase provides, which you can refer to when needed.
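To make this concrete, here is a minimal sketch that applies the RowFilter from above to a scan; the table name "testtable" is just an assumption for illustration.

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "testtable");
Scan scan = new Scan();
// Only rows whose key is less than or equal to "row-22" are returned.
scan.setFilter(new RowFilter(CompareFilter.CompareOp.LESS_OR_EQUAL,
    new BinaryComparator(Bytes.toBytes("row-22"))));
ResultScanner scanner = table.getScanner(scan);
try {
  for (Result res : scanner) {
    System.out.println(Bytes.toString(res.getRow()));
  }
} finally {
  scanner.close();
  table.close();
}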
Next to the already discussed functionality, HBase offers another advanced feature: counters. Many applications that collect statistics, such as clicks or views in online advertising, used to collect the data in log files that would subsequently be analyzed. Using counters has the potential of switching to live accounting, forgoing the delayed batch processing step completely.
Counters make this even simpler; of course, you could also implement a counting operation yourself:
You would have to lock a row, read the value, increment it, write it back, and eventually unlock the row for other writers to be able to access it subsequently.
But that is cumbersome and requires several client-side calls, so HBase offers a counter operation that runs atomically in a single client-side call; in effect, the steps above are carried out directly on the server.
For example:
long incrementColumnValue(byte[] row, byte[] family, byte[] qualifier, long amount) throws IOException
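A minimal sketch of using this call; the table name "counters", column family "daily", and qualifier "clicks" are illustrative assumptions:

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "counters");
// Atomically add 1 to the cell at row "20110101", column daily:clicks,
// and get the updated value back in the same round trip.
long clicks = table.incrementColumnValue(Bytes.toBytes("20110101"),
    Bytes.toBytes("daily"), Bytes.toBytes("clicks"), 1);
System.out.println("clicks: " + clicks);
table.close();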
So far you have seen how you can use, for example, filters to reduce the amount of data being sent over the network from the servers to the client. Another feature in HBase allows you to even move part of the computation to where the data lives: coprocessors.
Using the client API, combined with specific selector mechanisms, such as filters or column family scoping, it is possible to limit what data is transferred to the client. It would be good though to take this further and, for example, perform certain operations directly on the server side while only returning a small result set. Think of this as a small MapReduce framework that distributes work across the entire cluster.
Coprocessors enable you to run arbitrary code directly on each region server. More precisely, the code is executed on a per-region basis, giving you trigger-like functionality, similar to stored procedures in the RDBMS world. From the client side you do not have to take specific actions, as the framework handles the distributed nature transparently.
To improve efficiency, you can define code that runs on the server side, much like a stored procedure in a database.
Use-cases for coprocessors are, for instance, using hooks into row mutation operations to maintain secondary indexes, or implement some kind of referential integrity. Filters could be enhanced to
become stateful and therefore make decisions across row boundaries. Aggregate functions, such as sum() or avg(), known from RDBMSs and SQL, could be moved to the servers to scan the data locally and only return the single numeric result across the network.
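As a rough illustration of the secondary-index use case, the sketch below hooks into prePut of a region observer. It assumes the 0.92-era observer API; the index table "index-table" and the column names are hypothetical, and this is not a production-ready implementation.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexObserverSketch extends BaseRegionObserver {
  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    // Mirror colfam1:email into a (hypothetical) index table, keyed by the value.
    List<KeyValue> kvs = put.get(Bytes.toBytes("colfam1"), Bytes.toBytes("email"));
    if (kvs == null || kvs.isEmpty()) {
      return;
    }
    Put indexPut = new Put(kvs.get(0).getValue());
    indexPut.add(Bytes.toBytes("colfam1"), Bytes.toBytes("key"), put.getRow());
    HTableInterface index =
        ctx.getEnvironment().getTable(Bytes.toBytes("index-table"));
    try {
      index.put(indexPut);
    } finally {
      index.close();
    }
  }
}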
Instead of creating an HTable instance for every request from your client application, it makes much more sense to create one initially and then reuse it.
The primary reason for doing so is that creating an HTable instance is a fairly expensive operation that takes a few seconds to complete. In a highly contended environment with thousands of requests per second you would not be able to use this approach at all: creating the HTable instance would be too slow. You need to create the instances at startup and use them for the duration of your client's life cycle.
Configuration conf = HBaseConfiguration.create();
HTablePool pool = new HTablePool(conf, 5);              // pool size is 5
HTableInterface[] tables = new HTableInterface[10];
for (int n = 0; n < 10; n++) {
  // Even though the pool size is 5, getting 10 HTable references is fine.
  tables[n] = pool.getTable("testtable");
  System.out.println(Bytes.toString(tables[n].getTableName()));
}
for (int n = 0; n < 5; n++) {
  // The pool keeps at most 5 instances; anything returned beyond that is dropped.
  pool.putTable(tables[n]);
}
pool.closeTablePool("testtable");
Every instance of HTable requires a connection to the remote servers.
This is internally represented by the HConnection class, and more importantly managed process-wide by the shared HConnectionManager class. From a user perspective there is usually no immediate need to deal with either of these two classes; instead, you simply create a new Configuration instance and use that with your client API calls.
Internally the connections are keyed in a map, where the key is the Configuration instance you are using.
In other words, if you create a number of HTable instances while providing the same configuration reference they all share the same underlying HConnection instance.
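For example (a small sketch, assuming a table named "testtable" exists):

Configuration conf = HBaseConfiguration.create();
// Both HTable instances are handed the same Configuration reference, so they
// share one underlying HConnection, managed process-wide by HConnectionManager.
HTable table1 = new HTable(conf, "testtable");
HTable table2 = new HTable(conf, "testtable");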
Apart from the client API used to deal with data manipulation features, HBase also exposes a data definition like API. This is similar to the separation into DDL and DML found in RDBMSs.
Creating a table in HBase implicitly involves the definition of a table schema, as well as the schemas for all contained column families.
They define the pertinent characteristics of how, and when, the data inside the table and columns is ultimately stored.
HBase provides classes to define the various properties of tables and column families, for example:
HTableDescriptor(HTableDescriptor desc);
HColumnDescriptor(HColumnDescriptor desc);
Just as with the client API, you also have an API for administrative tasks at your disposal. Compare this to the Data Definition Language (DDL) found in RDBMSs, while the client API is more an analog of the Data Manipulation Language (DML).
It provides operations to create tables with specific column families, check for table existence, alter table and column family definitions, drop tables, and much more. The provided functions can be grouped into related operations, discussed separately below.
It provides interfaces to create, alter, and drop tables, among other operations.
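A minimal sketch of these administrative calls; the table name "testtable" and column family "colfam1" are placeholders:

Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("testtable");
desc.addFamily(new HColumnDescriptor("colfam1"));
admin.createTable(desc);                              // create the table with one family
System.out.println(admin.tableExists("testtable"));   // check for existence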
HBase comes with a variety of clients that can be used from various programming languages. This chapter is going to give you an overview of what is available.
Access to HBase is possible from virtually every popular programming language, and environment. You either use the client API directly, or access it through some sort of proxy that translates your request into an API call. These proxies wrap the native Java API into other protocol APIs so that clients can be written in any language the external API provides. Typically the external API is implemented in a dedicated Java based server that can internally use the provided HTable client API. This simplifies the implementation and maintenance of these gateway servers.
All access to HBase ultimately goes through HTable; every other access method is some kind of wrapper around it. So to access HBase from another language, you wrap the Java objects and expose a corresponding interface.
This raises the question of where the HTable instances should be created. There are two choices:
directly on the client, or on the gateway. Because creating an HTable instance is relatively expensive, you usually want to reuse it (for example via HTablePool), so it is typically created on the gateway (which can run on the same server as the database).
The remaining problem is how the client communicates with the gateway: the client sends a request to the gateway, and the gateway translates it into HTable calls that access the actual data.
The first choice that comes to mind is a RESTful approach.
The protocol between the gateways and the clients is then driven by the available choices and requirements of the remote client. An obvious choice is Representational State Transfer (abbreviated as REST)[67], which is based on existing web-based technologies. The actual transport is typically HTTP, the standard protocol for web applications. This makes REST ideal for communicating between heterogeneous systems: the protocol layer takes care of transporting the data in an interoperable format.
REST defines the semantics so that the protocol can be used in a generic way to address remote resources. By not changing the protocol, REST is compatible with existing technologies, such as web servers and proxies. Resources are uniquely specified as part of the request URI, which is the opposite of, for example, SOAP-based[68] services, which define a new protocol that conforms to a standard.
How the RESTful approach differs from SOAP is covered in blog posts about REST.
REST builds on the HTTP protocol; what changes is how the different resources are defined.
SOAP defines a protocol of its own.
The problem both REST and SOAP share is that they are text-based protocols, so the overhead is high; for huge server farms this leads to efficiency problems, such as bandwidth consumption.
Both REST and SOAP, though, suffer from the verbosity level of the protocol. Human-readable text, be it plain or XML-based, is used to communicate between client and server. Transparent compression of the data sent over the network can mitigate this problem to a certain extent.
Hence binary protocols are needed to reduce the overhead: Google developed Protocol Buffers but did not publish the implementation at first, so Facebook built its own version, Thrift, and the Hadoop project produced Avro.
Especially companies with very large server farms, extensive bandwidth usage, and many disjoint services felt the need to reduce the overhead and implemented their own RPC layers. One of them
was Google, implementing Protocol Buffers.[69] Since the implementation was initially not published, Facebook developed their own version, named Thrift.[70] The Hadoop project founders started a third project, Apache Avro[71], providing an alternative implementation.
All of them have similar feature sets, vary in the number of languages they support, and have (arguably) slightly better or worse levels of encoding efficiencies.
The key difference between Protocol Buffers and both Thrift and Avro is that Protocol Buffers has no RPC stack of its own; rather, it generates the RPC definitions, which have to be used with other RPC libraries subsequently.
The three differ only in the languages they support and in encoding efficiency; there is no fundamental difference. Also, Protocol Buffers has no RPC stack of its own and has to be combined with other RPC libraries.
HBase ships with auxiliary servers for REST, Thrift, and Avro. They are implemented as stand-alone gateway servers, which can run on shared or dedicated machines.
Since Thrift and Avro have their own RPC implementation, the gateway servers simply provide a wrapper around them.
For REST HBase has its own implementation, offering access to the stored data.
HBase also ships with REST, Thrift, and Avro auxiliary servers.
The first group of clients are the interactive ones, those that send client API calls on demand, such as get, put, or delete, to servers.
Based on the choice of the protocol you can use the supplied gateway servers to gain access from your applications.
We will not go into the details here; you can access HBase through the native Java, REST, Thrift, and Avro interfaces.
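For instance, here is a rough sketch using the Java client bundled with the REST gateway; it assumes the gateway runs on localhost port 8080 and that a table "testtable" with a column colfam1:qual1 already exists.

Cluster cluster = new Cluster();
cluster.add("localhost", 8080);                  // address of the REST gateway
Client client = new Client(cluster);
RemoteHTable table = new RemoteHTable(client, "testtable");
Get get = new Get(Bytes.toBytes("row-1"));
get.addColumn(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"));
Result result = table.get(get);                  // the call travels over HTTP/REST
System.out.println(Bytes.toString(
    result.getValue(Bytes.toBytes("colfam1"), Bytes.toBytes("qual1"))));
table.close();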
The opposite use-case of interactive clients is the batch access to the data. The difference is that these clients usually run asynchronously in the background, scanning large amounts of data to build, for example, search indexes, machine learning based mathematical models, or statistics needed for reporting.
Access is less user-driven, and therefore SLAs are geared more towards overall runtime, as opposed to per-request latencies. The majority of the batch frameworks reading from and writing to HBase are MapReduce-based.
The Hadoop MapReduce framework is built to process petabytes of data, in a reliable, deterministic, yet easy to program way.
There are a variety of ways to include HBase as a source and target for MapReduce jobs.
Native Java
The Java-based MapReduce API for HBase is discussed in Chapter 7, MapReduce Integration.
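As a taste of it, here is a rough sketch of a row-counting job over an assumed table "testtable"; the mapper and job setup are illustrative, not the book's example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCountSketch {
  static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    public enum Counters { ROWS }
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      // Count every row the scan delivers; nothing is emitted as map output.
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "count rows in testtable");
    job.setJarByClass(RowCountSketch.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // larger scanner caching suits batch reads
    scan.setCacheBlocks(false);  // avoid polluting the block cache on the servers
    TableMapReduceUtil.initTableMapperJob("testtable", scan, CountMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}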
Clojure
The HBase-Runner project offers support for HBase from the functional programming language Clojure. You can write MapReduce jobs in Clojure while accessing HBase tables.
Hive
The Apache Hive[75] project offers a data warehouse infrastructure atop Hadoop. It was initially developed at Facebook, but is now part of the open-source Hadoop ecosystem.
Hive offers an SQL-like query language, called HiveQL, which allows you to query the semi-structured data stored in Hadoop. The query is eventually turned into a MapReduce job, executed either locally, or on a distributed MapReduce cluster. The data is parsed at job execution time and Hive employs a storage handler[76] abstraction layer that allows for data not to just reside in HDFS, but other data sources as well. A storage handler transparently makes arbitrarily stored information available to the HiveQL based user queries.
Since version 0.6.0 Hive also comes with a handler for HBase.[77] You can define Hive tables that are backed by HBase tables, mapping columns as required. The row key can be exposed as another column when needed. In other words, through its storage handler abstraction Hive can process data sources other than HDFS, and as of version 0.6.0 that includes HBase.
Pig
The Apache Pig[78] project provides a platform to analyze large amounts of data. It has its own high-level query language, called Pig Latin, which uses an imperative programming style to formulate the steps involved in transforming the input data to the final output. This is the opposite of Hive's declarative approach to emulate SQL.
The nature of Pig Latin, in comparison to HiveQL, appeals to everyone with a procedural programming background, but also lends itself to significant parallelization. Combined with the power of Hadoop and the MapReduce framework you can process massive amounts of data in reasonable time frames.
Version 0.7.0 of Pig introduced the LoadFunc/StoreFunc classes and functionality, which allow loading and storing data from sources other than the usual HDFS. One of those sources is HBase, implemented in the HBaseStorage class.
Pig's support for HBase includes reading from and writing to existing tables. You can map table columns as Pig tuples, optionally including the row key as the first field for read operations. For writes, the first field is always used as the row key.
Cascading
Cascading is an alternative API to MapReduce. Under the covers it uses MapReduce during execution, but during development, users don't have to think in MapReduce to create solutions for execution on Hadoop.
The model used is similar to a real-world pipe assembly, where data sources are taps, and outputs are sinks. These are piped together to form the processing flow, where data passes through the pipe and is transformed in the process. Pipes can be connected to larger pipe assemblies to form more complex processing pipelines from existing pipes.
Data then streams through the pipeline and can be split, merged, grouped, or joined. The data is represented as tuples, forming a tuple stream through the assembly. This very visually oriented model makes building MapReduce jobs more like construction work, while abstracting the complexity of the actual work involved.
Cascading (as of version 1.0.1) has support for reading and writing data to and from an HBase cluster. Detailed information and access to the source code can be found on the Cascading Modules page.
It is similar to Pig and Hive but targets a different use case: it is well suited to workflows built from pipes.
The HBase Shell is the command-line interface to your HBase cluster(s). You can use it to connect to local or remote servers and interact with them. The shell provides both client and administrative operations, mirroring the APIs discussed in the earlier chapters of this book.
The HBase processes expose a web-based user interface (UI for short), which you can use to gain insight into the cluster's state, as well as the tables it hosts. The majority of the functionality is read-only, but there are a few select operations you can trigger through the UI.