Complex HBase queries in Python with HappyBase, the Thrift API, and filters

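1 Background

HappyBase is a developer-friendly Python library for interacting with Apache HBase. It gives application developers a Pythonic API for working with HBase, covering:

  • Connecting to HBase
  • Table management
  • Querying data
  • Data manipulation
  • Connection pooling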

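See the HappyBase documentation for details.
HappyBase's scan API also exposes the HBase Thrift filter interface, but it does not document the filter syntax in detail, and I could not find a thorough description elsewhere online.
So I went through the HBase documentation and translated its "Thrift API and Filter Language" section. Section 2.1 below introduces HappyBase's scan interface, and section 2.2 contains the translated HBase material.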

2 Complex HBase queries with filters

2.1 HappyBase's scan interface

scan(row_start=None, row_stop=None, row_prefix=None, columns=None, filter=None, timestamp=None, include_timestamp=False, batch_size=1000, scan_batching=None, limit=None, sorted_columns=False, reverse=False)
The filter parameter carries the HBase filter query. Here is a simple example:

import happybase

hbase_host = ''  # set this to your HBase Thrift server host
hbase_port = 9090
# Connect to HBase through the Thrift server
conn = happybase.Connection(host=hbase_host, port=hbase_port)
table = conn.table('test')
# Filter string written in the HBase filter language
scan_filter = "SingleColumnValueFilter('info', 'item_delivery_status', =, 'binary:1', true, true)"
# Run the scan
result = table.scan(filter=scan_filter)
# Print each row key and its columns
for row_key, item in result:
    print(row_key)
    print(item)
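In Python 3, HappyBase hands back row keys, column names, and values as bytes, so the printed output above shows up as b'...'. Below is a minimal sketch of a helper that decodes one scanned row into str. The name decode_row and the UTF-8 assumption are mine and not part of HappyBase; the sample row is made up so the snippet runs without a cluster.

```python
def decode_row(row_key, columns, encoding="utf-8"):
    """Decode one (row_key, columns) pair as yielded by table.scan().

    Assumes every key and value is text in the given encoding; binary
    values (for example packed integers) would need their own handling.
    """
    decoded = {name.decode(encoding): value.decode(encoding)
               for name, value in columns.items()}
    return row_key.decode(encoding), decoded


# Stand-in for one item yielded by table.scan(); the values are invented.
sample = (b"row-1", {b"info:item_delivery_status": b"1"})
key, data = decode_row(*sample)
print(key, data)
```

In a real scan the call would sit inside the loop, for example `decode_row(row_key, item)`.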

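2.2 Filter syntax

The text in this section is translated from the HBase documentation, "Thrift API and Filter Language". The code is my own; change the host, table name, column names, and so on to match your own setup before using it.

2.2.1 Basic query syntax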

"FilterName (argument, argument,... , argument)"

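Syntax guidelines:

  • Specify the filter name first, followed by a comma-separated argument list in parentheses.
  • If an argument is a string, enclose it in single quotes (').
  • If an argument is a boolean, an integer, or a comparison operator (such as <, >, or !=), do not enclose it in quotes.
  • The filter name must be a single word: any ASCII characters except spaces, single quotes, and parentheses.
  • Filter arguments can contain any ASCII character. If an argument contains a single quote, escape it with another single quote.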
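The quoting rules above are easy to get wrong when filter strings are assembled by hand. The sketch below shows one way to apply them in Python. The names quote_arg, build_filter, and Raw are hypothetical helpers of my own, not part of HappyBase or HBase.

```python
class Raw(str):
    """Marks an argument that must be emitted unquoted (a comparison operator)."""


def quote_arg(value):
    """Render one Python value as a filter-language argument."""
    if isinstance(value, bool):            # check bool first: bool is an int subclass
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)                  # integers are not quoted
    # Strings go in single quotes; an embedded single quote is doubled.
    return "'" + str(value).replace("'", "''") + "'"


def build_filter(name, *args):
    """Build "FilterName(arg, arg, ...)" from Python values."""
    rendered = [a if isinstance(a, Raw) else quote_arg(a) for a in args]
    return "{}({})".format(name, ", ".join(rendered))


print(build_filter("SingleColumnValueFilter", "info", "item_delivery_status",
                   Raw("="), "binary:1", True, True))
print(build_filter("PrefixFilter", "it's"))
```

The resulting string can be passed directly as the filter argument of table.scan().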

2.2.2 Compound filters and logical operators

Binary operators

  • AND
    Both conditions must be satisfied.
  • OR
    At least one of the conditions must be satisfied.

Unary operators

  • SKIP
    For a particular row, if any of the key-values fail the filter condition, the entire row is skipped.
  • WHILE
    For a particular row, key-values will be emitted until a key-value is reached that fails the filter condition.

Example

(Filter1 AND Filter2) OR (Filter3 AND Filter4)

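Order of evaluation

  • Parentheses have the highest precedence.
  • The unary operators SKIP and WHILE come next, and have the same precedence as each other.
  • The binary operators come last. AND has higher precedence than OR.

Example 1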

Filter1 AND Filter2 OR Filter3
is evaluated as
(Filter1 AND Filter2) OR Filter3

Example 2

Filter1 AND SKIP Filter2 OR Filter3
is evaluated as
(Filter1 AND (SKIP Filter2)) OR Filter3
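Because AND binds tighter than OR, it can be safer to spell out the grouping instead of relying on precedence. The sketch below wraps every group in parentheses. The helpers all_of and any_of are hypothetical names of my own, and the filters are placeholders borrowed from the examples in this article. The outermost pair of parentheses is redundant; I expect the filter parser to accept it, but have not verified this on every HBase version.

```python
def all_of(*filters):
    """Join filter strings with AND inside one pair of parentheses."""
    return "(" + " AND ".join(filters) + ")"


def any_of(*filters):
    """Join filter strings with OR inside one pair of parentheses."""
    return "(" + " OR ".join(filters) + ")"


# (Filter1 AND Filter2) OR (Filter3 AND Filter4), with explicit grouping.
expr = any_of(
    all_of("PrefixFilter('0047a')", "KeyOnlyFilter()"),
    all_of("ColumnPrefixFilter('box')", "PageFilter(5)"),
)
print(expr)
```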

2.2.3 Comparison operators

  • LESS (<)
  • LESS_OR_EQUAL (<=)
  • EQUAL (=)
  • NOT_EQUAL (!=)
  • GREATER_OR_EQUAL (>=)
  • GREATER (>)
  • NO_OP (no operation)

Use the symbols (<, <=, =, !=, >, >=) to express a comparison operator in a filter string.

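2.2.4 Comparators

  • BinaryComparator - compares lexicographically against the specified byte array, using Bytes.compareTo(byte[], byte[]).
  • BinaryPrefixComparator - a prefix comparison: compares lexicographically against the specified byte array, but only up to the length of that byte array.
  • RegexStringComparator - matches using a regular expression. Only the EQUAL and NOT_EQUAL operators are valid with this comparator.
  • SubStringComparator - tests whether the given substring appears in the data. According to the HBase documentation the comparison is case insensitive. Only the EQUAL and NOT_EQUAL operators are valid with this comparator.

The general syntax of a comparator is: ComparatorType:ComparatorValue

The ComparatorType for each comparator is: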

  • BinaryComparator - binary
  • BinaryPrefixComparator - binaryprefix
  • RegexStringComparator - regexstring
  • SubStringComparator - substring

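Examples

  1. binary:abc compares against the byte array abc; combined with the > operator, for instance, it matches everything lexicographically greater than abc.
  2. binaryprefix:abc matches everything whose first three characters are lexicographically equal to abc.
  3. regexstring:ab*yz matches using the regular expression ab*yz. The HBase documentation glosses this as "everything that doesn't begin with ab and ends with yz"; read as a regular expression, the pattern matches an a, then zero or more b characters, then yz.
  4. substring:abc123 matches everything that contains the substring abc123.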
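To get a feel for what each comparator type accepts, the four EQUAL tests can be approximated in plain Python. This is only a sketch of my understanding: the function matches is hypothetical, Python's re module is not identical to the Java regular expressions HBase uses, and I am assuming the regex comparator searches anywhere in the value and that the substring comparator ignores case, as the HBase reference states.

```python
import re


def matches(comparator, data):
    """Approximate an EQUAL test of `data` (bytes) against 'type:value'."""
    kind, _, value = comparator.partition(":")
    if kind == "binary":
        return data == value.encode()             # exact byte-for-byte match
    if kind == "binaryprefix":
        return data.startswith(value.encode())    # only the prefix is compared
    if kind == "regexstring":
        return re.search(value, data.decode()) is not None
    if kind == "substring":
        return value.lower() in data.decode().lower()
    raise ValueError("unknown comparator type: " + kind)


print(matches("binary:abc", b"abc"))
print(matches("binaryprefix:abc", b"abcdef"))
print(matches("regexstring:ab*yz", b"abbbyz"))
print(matches("substring:abc123", b"xxABC123yy"))
```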

2.2.5 Filters

  • KeyOnlyFilter
    This filter takes no arguments. It returns only the row key and the key part of each key-value, without the values.

Original text: This filter doesn’t take any arguments. It returns only the key component of each key-value.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "KeyOnlyFilter()"
result = table.scan(filter=scan_filter)
for item in result:
    print(item)
  • FirstKeyOnlyFilter
    This filter takes no arguments. It returns only the row key and the first key-value of each row.

Original text: This filter doesn’t take any arguments. It returns only the first key-value from each row.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "FirstKeyOnlyFilter()"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • PrefixFilter
    This filter takes a single argument, a row key prefix, and returns the rows whose key starts with that prefix.

Original text: This filter takes one argument – a prefix of a row key. It returns only those key-values present in a row that starts with the specified row prefix

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "PrefixFilter('0047a')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • ColumnPrefixFilter
    This filter takes a single argument, a column prefix, and returns only the columns whose name starts with that prefix.

Original text: This filter takes one argument – a column prefix. It returns only those key-values present in a column that starts with the specified column prefix. The column prefix must be of the form: “qualifier”.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ColumnPrefixFilter('box')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • MultipleColumnPrefixFilter
    This filter takes a list of column prefixes and returns only the columns whose name starts with any of them.

Original text: This filter takes a list of column prefixes. It returns key-values that are present in a column that starts with any of the specified column prefixes. Each of the column prefixes must be of the form: “qualifier”.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "MultipleColumnPrefixFilter('box', 'create')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • ColumnCountGetFilter
    This filter takes a single argument, a limit. It returns the first limit columns of the first row.

Original text: This filter takes one argument – a limit. It returns the first limit number of columns in the table.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ColumnCountGetFilter(6)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • PageFilter
    This filter takes a single argument, a page size, and returns that many rows.

Original text: This filter takes one argument – a page size. It returns page size number of rows from the table.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "PageFilter(5)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
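PageFilter on its own returns a single page. To walk a table page by page, one option is to restart the scan just after the last row key seen. The sketch below shows that loop; iter_pages and FakeTable are hypothetical names of my own, and FakeTable is an in-memory stand-in so the snippet runs without a cluster. As far as I know, HBase applies PageFilter separately on each region server, so a scan can return more rows than the page size, which is why the sketch also passes limit.

```python
def iter_pages(table, page_size):
    """Yield lists of (row_key, data) pairs, at most page_size rows each."""
    start = None
    while True:
        rows = list(table.scan(row_start=start,
                               filter="PageFilter({})".format(page_size),
                               limit=page_size))
        if not rows:
            return
        yield rows
        if len(rows) < page_size:
            return
        # Appending a zero byte gives the smallest key that sorts
        # after the last row key already returned.
        start = rows[-1][0] + b"\x00"


class FakeTable:
    """In-memory stand-in for a HappyBase table; it ignores the filter string."""

    def __init__(self, rows):
        self.rows = dict(rows)

    def scan(self, row_start=None, filter=None, limit=None):
        keys = sorted(k for k in self.rows
                      if row_start is None or k >= row_start)
        for k in keys[:limit]:
            yield k, self.rows[k]


fake = FakeTable({b"r%d" % i: {b"info:n": b"x"} for i in range(1, 6)})
pages = list(iter_pages(fake, 2))
print([len(page) for page in pages])
```

With a real connection, `iter_pages(conn.table('test'), 5)` would be the equivalent call.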
  • ColumnPaginationFilter
    This filter takes two arguments, a limit and an offset. For every row, it skips the first offset columns and returns the next limit columns.

Original text: This filter takes two arguments – a limit and offset. It returns limit number of columns after offset number of columns. It does this for all the rows.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ColumnPaginationFilter(3, 7)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • InclusiveStopFilter
    This filter takes a single argument, the row key at which to stop scanning. It returns all columns of every row up to and including that row key.

Original text: This filter takes one argument – a row key on which to stop scanning. It returns all key-values present in rows up to and including the specified row.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "InclusiveStopFilter('005c2_4530489164_10599261608')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • TimeStampsFilter
    This filter takes a list of timestamps and returns the key-values whose timestamp matches any of them.

Original text: This filter takes a list of timestamps. It returns those key-values whose timestamps matches any of the specified timestamps.
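This filter has no example above, so here is a hedged sketch of how the filter string could be built. The helper timestamps_filter and the millisecond timestamps are made up for illustration. One caveat: the HBase reference spells the heading TimeStampsFilter, but as far as I can tell its own example string uses TimestampsFilter. The sketch follows the latter, which is an assumption worth checking against your HBase version.

```python
def timestamps_filter(*timestamps):
    """Build a filter string for an explicit list of cell timestamps."""
    if not timestamps:
        raise ValueError("at least one timestamp is required")
    # Timestamps are integers, so they are written without quotes.
    return "TimestampsFilter({})".format(
        ", ".join(str(int(t)) for t in timestamps))


# Invented timestamps in milliseconds since the epoch.
scan_filter = timestamps_filter(1546300800000, 1546387200000)
print(scan_filter)
# With a live table, include_timestamp=True makes the timestamps visible:
# result = table.scan(filter=scan_filter, include_timestamp=True)
```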

  • RowFilter
    This filter takes a comparison operator (=, !=, >, <, >=, <=) and a comparator (binary, binaryprefix, regexstring, substring). It compares each row key with the comparator using the operator; if the comparison returns true, it returns that row's row key and all of its columns.

Original text: This filter takes a compare operator and a comparator. It compares each row key with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that row.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "RowFilter(=, 'binary:0047a_4530641731_102627717')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • FamilyFilter
    This filter takes a comparison operator and a comparator. It compares each column family name with the comparator using the operator; if the comparison returns true, it returns, for every row, the row key and the columns in that column family.

Original text: This filter takes a compare operator and a comparator. It compares each column family name with the comparator using the compare operator and if the comparison returns true, it returns all the Cells in that column family.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "FamilyFilter(=, 'binary:info')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • QualifierFilter
    This filter takes a comparison operator and a comparator. It compares each column name (qualifier) with the comparator using the operator; if the comparison returns true, it returns, for every row, the row key and all matching columns.

Original text: This filter takes a compare operator and a comparator. It compares each qualifier name with the comparator using the compare operator and if the comparison returns true, it returns all the key-values in that column.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "QualifierFilter(=, 'binary:item_delivery_status')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • ValueFilter
    This filter takes a comparison operator and a comparator. It compares each value with the comparator using the operator; if the comparison returns true, it returns, for every row, the row key and the matching key-values.

Original text: This filter takes a compare operator and a comparator. It compares each value with the comparator using the compare operator and if the comparison returns true, it returns that key-value.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "ValueFilter(=, 'binary:2')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • DependentColumnFilter
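    This filter takes two arguments, a column family and a qualifier. It looks for that column in each row and returns the key-values in the row that share its timestamp; rows that do not contain the column return nothing.

Original text: This filter takes two arguments – a family and a qualifier. It tries to locate this column in each row and returns all key-values in that row that have the same timestamp. If the row doesn’t contain the specified column – none of the key-values in that row will be returned.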

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "DependentColumnFilter('info', 'store_code')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • SingleColumnValueFilter
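    This filter takes a column family, a qualifier, a comparison operator, and a comparator. It compares the value of the column identified by the family and qualifier with the comparator; if the comparison returns true, the row is emitted with all of its columns. If the specified column does not exist in a row, all of that row's columns are emitted as well.

Original text: This filter takes a column family, a qualifier, a compare operator and a comparator. If the specified column is not found – all the columns of that row will be emitted. If the column is found and the comparison with the comparator returns true, all the columns of the row will be emitted. If the condition fails, the row will not be emitted.

Note ⚠️: the filter actually accepts two more arguments, which control whether rows missing the column are filtered out and whether only the latest version is checked.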

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "SingleColumnValueFilter('info', 'item_delivery_status', =, 'binary:2', true, true)"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
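The two extra boolean arguments are easy to mix up, so a small builder with named parameters can help. This is a sketch only: the function and parameter names are my own, and the defaults (keep rows that lack the column, check only the latest version) are what I believe HBase uses when the two arguments are omitted, which should be checked against your version.

```python
def single_column_value_filter(family, qualifier, op, comparator,
                               filter_if_missing=False,
                               latest_version_only=True):
    """Build the six-argument form of SingleColumnValueFilter."""
    def flag(value):
        # Booleans are written unquoted in the filter language.
        return "true" if value else "false"

    return "SingleColumnValueFilter('{}', '{}', {}, '{}', {}, {})".format(
        family, qualifier, op, comparator,
        flag(filter_if_missing), flag(latest_version_only))


# Drop rows that do not have info:item_delivery_status at all.
strict = single_column_value_filter(
    'info', 'item_delivery_status', '=', 'binary:2', filter_if_missing=True)
print(strict)
```

The value of strict is the same string used in the example above.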
  • SingleColumnValueExcludeFilter
    This filter takes the same arguments as SingleColumnValueFilter and behaves the same way, except that when the column is found and the condition passes, the row is emitted with all of its columns except the tested column.

Original text: This filter takes the same arguments and behaves same as SingleColumnValueFilter – however, if the column is found and the condition passes, all the columns of the row will be emitted except for the tested column value.

conn = happybase.Connection(host=TEST_HBASE_HOST)
table = conn.table('openapi:openapi_suning_purchase_order')
scan_filter = "SingleColumnValueExcludeFilter('info', 'item_delivery_status', =, 'binary:2')"
result = table.scan(filter=scan_filter)
for index, item in enumerate(result):
    print(item)
  • ColumnRangeFilter

Original text: This filter is used for selecting only those keys with columns that are between minColumn and maxColumn. It also takes two boolean variables to indicate whether to include the minColumn and maxColumn or not.
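There is no example for this filter above, so here is a minimal sketch of the filter string. It assumes the argument order minColumn, include-min flag, maxColumn, include-max flag, which is how I read the example in the HBase reference. The helper name and the qualifier bounds 'box' and 'create' are placeholders of my own.

```python
def column_range_filter(min_column, min_inclusive, max_column, max_inclusive):
    """Build a ColumnRangeFilter string for qualifiers between two bounds."""
    def flag(value):
        # Booleans are written unquoted in the filter language.
        return "true" if value else "false"

    return "ColumnRangeFilter('{}', {}, '{}', {})".format(
        min_column, flag(min_inclusive), max_column, flag(max_inclusive))


# Qualifiers from 'box' (included) up to 'create' (excluded).
scan_filter = column_range_filter('box', True, 'create', False)
print(scan_filter)
# With a live table: result = table.scan(filter=scan_filter)
```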

3 References

1. HappyBase documentation
2. HBase documentation: Thrift API and Filter Language
