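This post is part of the series 【大数据技术●降龙十八掌】 (Big Data Technology: Eighteen Dragon-Subduing Palms); the series table of contents lists the other posts.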
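A comparison filter takes a comparison operator and a comparator instance. Once a comparison filter is set on a scan, only the records that satisfy its condition are returned to the client.
● Comparison operators:
Operator | Description |
---|---|
CompareFilter.CompareOp.LESS | Matches values less than the given value |
CompareFilter.CompareOp.LESS_OR_EQUAL | Matches values less than or equal to the given value |
CompareFilter.CompareOp.EQUAL | Matches values equal to the given value |
CompareFilter.CompareOp.NOT_EQUAL | Matches values not equal to the given value |
CompareFilter.CompareOp.GREATER_OR_EQUAL | Matches values greater than or equal to the given value |
CompareFilter.CompareOp.GREATER | Matches values greater than the given value |
CompareFilter.CompareOp.NO_OP | Excludes everything |
● Comparators:
Comparator | Description |
---|---|
BinaryComparator | Compares the current value with the threshold using Bytes.compareTo() |
BinaryPrefixComparator | Also uses Bytes.compareTo(), but compares only the leading bytes, i.e. a prefix match from the left |
NullComparator | Performs no comparison; only checks whether the current value is null |
BitComparator | Performs a bit-level comparison using the AND, OR, and XOR operations of the BitwiseOp class |
RegexStringComparator | Selects the data that matches a regular expression |
SubstringComparator | Treats the threshold and the table data as String instances and matches with contains() |

BitComparator, RegexStringComparator, and SubstringComparator are meant to be used only with the EQUAL and NOT_EQUAL operators, because their compareTo() only distinguishes a match from a mismatch.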
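Because row keys, qualifiers, and values are plain byte arrays, a byte-wise comparison orders them lexicographically rather than numerically. The sketch below uses only the JDK (no HBase dependency) to mimic that ordering; the class `ByteOrderDemo` and its helper names are illustrative assumptions, and `Arrays.compareUnsigned` needs Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ByteOrderDemo {
    // Lexicographic comparison of unsigned bytes, in the spirit of BinaryComparator.
    static int compare(String left, String right) {
        return Arrays.compareUnsigned(
                left.getBytes(StandardCharsets.UTF_8),
                right.getBytes(StandardCharsets.UTF_8));
    }

    // Prefix comparison in the spirit of BinaryPrefixComparator:
    // only the first prefix.length bytes of the value take part.
    static int comparePrefix(String value, String prefix) {
        byte[] v = value.getBytes(StandardCharsets.UTF_8);
        byte[] p = prefix.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(v.length, p.length);
        return Arrays.compareUnsigned(v, 0, n, p, 0, p.length);
    }

    public static void main(String[] args) {
        // Fixed-width, zero-padded keys sort as expected.
        System.out.println(compare("000004_00201612_140000", "000004_00201612_150000") < 0);
        // Unpadded numbers do not: "10" sorts before "9" because '1' < '9'.
        System.out.println(compare("10", "9") < 0);
        // A value that starts with the prefix compares as equal.
        System.out.println(comparePrefix("000004_00201612_150000", "000004_") == 0);
    }
}
```

This is why the row keys in this post are zero-padded to a fixed width: the LESS and GREATER operators then agree with the numeric order of the key parts.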
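RowFilter filters on the row key.
// Imports shared by all examples in this post
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;
// Example 1: select every row whose row key sorts before 000004_00201612_150000.
// BinaryComparator compares the complete row key byte by byte.
private static void test1() {
Configuration conf = HBaseConfiguration.create();
// try-with-resources closes the connection, table, and scanner even if the scan fails
try (Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"))) {
Scan scan = new Scan();
Filter filter = new RowFilter(CompareFilter.CompareOp.LESS, new BinaryComparator(Bytes.toBytes("000004_00201612_150000")));
scan.setFilter(filter);
try (ResultScanner scanner = table.getScanner(scan)) {
for (Result res : scanner) {
System.out.println(res);
}
}
} catch (IOException e) {
e.printStackTrace();
}
}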
// Example 2: select row keys with a regular expression
private static void test2() {
Configuration conf= HBaseConfiguration.create();
try {
Connection connection=ConnectionFactory.createConnection(conf);
Table table=connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
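// "_*" would mean "zero or more underscores", not a wildcard; anchor with ^ and use ".*" for "anything"
Filter filter = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^000004_00201612_.*"));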
scan.setFilter(filter);
ResultScanner scanner=table.getScanner(scan);
for(Result res:scanner)
{
System.out.println(res);
}
} catch (IOException e) {
e.printStackTrace();
}
}
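A regular expression is not a shell glob: `_*` means "zero or more underscores". RegexStringComparator also searches for the pattern anywhere in the row key rather than requiring a full match. The JDK-only sketch below reproduces that behavior with `Matcher.find()`; the class name `RegexFindDemo` and the second sample key are made up for illustration.

```java
import java.util.regex.Pattern;

final class RegexFindDemo {
    // Substring-style matching (Matcher.find), which is how this sketch assumes
    // RegexStringComparator evaluates its pattern.
    static boolean found(String regex, String rowKey) {
        return Pattern.compile(regex).matcher(rowKey).find();
    }

    public static void main(String[] args) {
        String key = "000004_00201612_150000";
        String other = "990000_000004_00201612_1"; // hypothetical key with the text in the middle
        System.out.println(found("000004_00201612_*", key));   // matches
        System.out.println(found("000004_00201612_*", other)); // also matches: the pattern is unanchored
        System.out.println(found("^000004_00201612_", other)); // anchored at the start: no match
    }
}
```

Anchoring the pattern with `^` is what turns a "contains" test into a "starts with" test.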
// Example 3: match row keys by substring
private static void test3() {
Configuration conf= HBaseConfiguration.create();
try {
Connection connection=ConnectionFactory.createConnection(conf);
Table table=connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new RowFilter(CompareFilter.CompareOp.EQUAL, new SubstringComparator("_00201612_000000"));
scan.setFilter(filter);
ResultScanner scanner=table.getScanner(scan);
for(Result res:scanner)
{
System.out.println(res);
}
} catch (IOException e) {
e.printStackTrace();
}
}
FamilyFilter restricts the result to specific column families.
// Return only the data of the given column family
private static void test4() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
FilterList filterList = new FilterList();
Filter filter = new FamilyFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("cf")));
filterList.addFilter(filter);
// Anchored pattern: row keys that start with 000004_00201612_
Filter filter2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^000004_00201612_.*"));
filterList.addFilter(filter2);
scan.setFilter(filterList);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
} catch (IOException e) {
e.printStackTrace();
}
}
QualifierFilter restricts the result to the columns whose qualifier (column name) satisfies the condition.
private static void test5() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new QualifierFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("attention_uv")));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
} catch (IOException e) {
e.printStackTrace();
}
}
// To return several columns, match the qualifiers with a RegexStringComparator
private static void test6() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new QualifierFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("attention_uv|province_id"));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
} catch (IOException e) {
e.printStackTrace();
}
}
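ValueFilter keeps only the cells whose value satisfies the condition. It is usually combined with other filters.
// Select the cells whose value contains 6 (the pattern is unanchored, so it matches anywhere in the value)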
private static void test7() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new ValueFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("6"));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
} catch (IOException e) {
e.printStackTrace();
}
}
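SingleColumnValueFilter filters rows by the value of a single column. Its constructor takes the column family, the column qualifier, the comparison operator, and the comparator.
// Select the rows whose factory_id is 466. The printed result includes the tested column factory_id, whereas the result of SingleColumnValueExcludeFilter (shown later) leaves it out.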
private static void test8() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
SingleColumnValueFilter filter = new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("factory_id"), CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("466")));
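// Drop rows that do not contain the tested column (filterIfMissing defaults to false, which lets such rows through)
filter.setFilterIfMissing(true);
// Test only the latest version of the tested column (true is the default)
filter.setLatestVersionOnly(true);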
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
for (Cell cell : res.listCells()) {
String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
String val = Bytes.toString(CellUtil.cloneValue(cell));
if (qualifier.equals("factory_id")) {
System.out.println("factory_id:" + val);
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
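SingleColumnValueExcludeFilter extends SingleColumnValueFilter and is used in the same way. The only difference is that the tested column is left out of the returned Result.
// The tested column factory_id does not appear in the result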
private static void test9() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
SingleColumnValueExcludeFilter filter = new SingleColumnValueExcludeFilter(Bytes.toBytes("cf"), Bytes.toBytes("factory_id"), CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("466")));
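// Drop rows that do not contain the tested column (filterIfMissing defaults to false, which lets such rows through)
filter.setFilterIfMissing(true);
// Test only the latest version of the tested column (true is the default)
filter.setLatestVersionOnly(true);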
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
for (Cell cell : res.listCells()) {
String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
String val = Bytes.toString(CellUtil.cloneValue(cell));
if (qualifier.equals("factory_id")) {
System.out.println("factory_id:" + val);
}
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
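The difference between the two single-column filters, and the effect of `setFilterIfMissing`, can be modeled without HBase. The sketch below treats a row as a map from column name to value; the class `SingleColumnDemo`, its `apply` method, and the sample values other than factory_id = 466 are illustrative assumptions, not HBase API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SingleColumnDemo {
    // Returns the cells to emit for one row, or null when the whole row is filtered out.
    static Map<String, String> apply(Map<String, String> row, String column, String expected,
                                     boolean filterIfMissing, boolean excludeColumn) {
        String actual = row.get(column);
        if (actual == null) {
            // Row lacks the tested column: kept unless filterIfMissing is set.
            if (filterIfMissing) {
                return null;
            }
        } else if (!actual.equals(expected)) {
            return null; // tested column present but with a different value
        }
        Map<String, String> out = new LinkedHashMap<>(row);
        if (excludeColumn) {
            out.remove(column); // SingleColumnValueExcludeFilter-style output
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("factory_id", "466");
        row.put("province_id", "63"); // hypothetical value
        System.out.println(apply(row, "factory_id", "466", true, false));
        System.out.println(apply(row, "factory_id", "466", true, true));
    }
}
```

The first call keeps factory_id in the output and the second drops it, which mirrors what test8 and test9 print.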
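PrefixFilter takes a row-key prefix and returns every row whose key starts with it. Once the scan reaches a row key greater than the prefix, the filter ends the scan, so omitting the stop row does not hurt performance. The filter does not move the start of the scan, however: without a start row, the scan still begins at the first row of the table and rejects rows until it reaches the prefix, so on a large table it is worth setting the start row to the prefix as well.
// Query the records whose row key starts with 000466_00201612_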
private static void test10() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
} catch (IOException e) {
e.printStackTrace();
}
}
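A prefix scan can also be expressed as a plain start row and stop row. The JDK-only sketch below derives the stop row from a prefix by incrementing its last byte; the class `PrefixStopRow` and the method `stopRowFor` are illustrative names, not part of HBase.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class PrefixStopRow {
    // Smallest row key that is strictly greater than every key starting with prefix:
    // drop trailing 0xFF bytes, then increment the last remaining byte.
    // Returns an empty array when no such key exists (empty or all-0xFF prefix),
    // which this sketch uses to mean "scan to the end of the table".
    static byte[] stopRowFor(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        byte[] prefix = "000466_00201612_".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowFor(prefix);
        // '_' is 0x5F, so the stop row ends in '`' (0x60)
        System.out.println(new String(stop, StandardCharsets.UTF_8));
    }
}
```

With the prefix as the start row and this value as the exclusive stop row (for instance through the `setStartRow` and `setStopRow` calls used later in this post), the scan touches only the rows that carry the prefix.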
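PageFilter paginates the result by row; the page size is passed to the constructor. The limit is applied separately on each region server, so a scan that spans several regions can return more rows than the page size, and the client should stop reading once it has a full page.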
private static void test11() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new PageFilter(20);
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan);
int count=0;
for (Result res : scanner) {
count++;
}
System.out.println("结果行数:"+count);
} catch (IOException e) {
e.printStackTrace();
}
}
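Fetching the following pages means restarting the scan just after the last row of the previous page. The sketch below shows that loop against an in-memory sorted set instead of a table; `PagingDemo`, `nextPage`, and the row names are illustrative assumptions. Against HBase, a common way to get the same effect is to use the last row key plus a trailing zero byte as the next start row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class PagingDemo {
    // Returns up to pageSize row keys strictly after lastRow
    // (or from the beginning when lastRow is null).
    static List<String> nextPage(NavigableSet<String> rows, String lastRow, int pageSize) {
        NavigableSet<String> tail = (lastRow == null) ? rows : rows.tailSet(lastRow, false);
        List<String> page = new ArrayList<>();
        for (String row : tail) {
            if (page.size() == pageSize) {
                break; // the client enforces the page size itself
            }
            page.add(row);
        }
        return page;
    }

    public static void main(String[] args) {
        NavigableSet<String> rows = new TreeSet<>();
        for (int i = 0; i < 7; i++) {
            rows.add(String.format("row-%02d", i));
        }
        String last = null;
        List<String> page;
        // Keep requesting pages until one comes back empty.
        while (!(page = nextPage(rows, last, 3)).isEmpty()) {
            System.out.println(page);
            last = page.get(page.size() - 1);
        }
    }
}
```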
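KeyOnlyFilter returns only the key part of each cell (row key, column family, qualifier, timestamp) and drops the value. When the constructor argument is false, every returned cell has a zero-length value; when it is true, the value is replaced by a 4-byte integer that holds the length of the original value.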
private static void test12() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
// lenAsVal: false returns zero-length values; true returns the original value length encoded as a 4-byte int
Filter filter = new KeyOnlyFilter(false);
Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
FilterList list = new FilterList();
list.addFilter(filter);
list.addFilter(filter2);
scan.setFilter(list);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
for (Cell cell : res.listCells()) {
String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
String val = Bytes.toString(CellUtil.cloneValue(cell));
System.out.println("RowKey:" + Bytes.toString(res.getRow()) + "列:" + qualifier + ";值长度:" + CellUtil.cloneValue(cell).length);
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
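To make the two modes of the constructor argument concrete, the sketch below builds the value a key-only cell would carry in each mode, assuming the length is stored as a big-endian 4-byte int (the encoding `Bytes.toBytes(int)` uses). The class `KeyOnlyValueDemo` and its methods are illustrative, not HBase API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class KeyOnlyValueDemo {
    // lenAsVal == false: empty value.
    // lenAsVal == true: 4-byte big-endian int holding the original value length.
    static byte[] strippedValue(byte[] original, boolean lenAsVal) {
        return lenAsVal
                ? ByteBuffer.allocate(4).putInt(original.length).array()
                : new byte[0];
    }

    // Reads the original length back from a value produced with lenAsVal == true.
    static int decodeLength(byte[] stripped) {
        return ByteBuffer.wrap(stripped).getInt();
    }

    public static void main(String[] args) {
        byte[] original = "466".getBytes(StandardCharsets.UTF_8);
        System.out.println(strippedValue(original, false).length);
        System.out.println(strippedValue(original, true).length);
        System.out.println(decodeLength(strippedValue(original, true)));
    }
}
```

In test12, switching the argument to true should therefore make the printed value length 4 for every cell, and decoding those 4 bytes gives the size of the value that was dropped.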
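FirstKeyOnlyFilter finishes with a row as soon as it has seen the row's first cell and then jumps to the next row, so each returned row contains only its first column. This makes it a cheap way to count rows.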
private static void test13() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
// Keep only the first cell of each row
Filter filter = new FirstKeyOnlyFilter();
Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
FilterList list = new FilterList();
list.addFilter(filter);
list.addFilter(filter2);
scan.setFilter(list);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
System.out.println("列个数:"+res.listCells().size());
for (Cell cell : res.listCells()) {
String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
String val = Bytes.toString(CellUtil.cloneValue(cell));
System.out.println("RowKey:" + Bytes.toString(res.getRow()) + "列:" + qualifier + ";值长度:" + CellUtil.cloneValue(cell).length + ";值:" + val);
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
// TimestampsFilter: return only the cells whose timestamp equals one of the listed values
private static void test14() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
List<Long> ts = new ArrayList<>();
ts.add(5L);
ts.add(10L);
ts.add(15L);
// Timestamps filter that returns only the cells written with one of these three timestamps
Filter filter = new TimestampsFilter(ts);
Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
FilterList list = new FilterList();
list.addFilter(filter);
list.addFilter(filter2);
scan.setFilter(list);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
} catch (IOException e) {
e.printStackTrace();
}
}
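ColumnCountGetFilter limits how many columns are returned per row. Once a row reaches the configured maximum, the filter stops the entire scan, so it is a poor fit for Scan and better suited to Get.
// ColumnCountGetFilter in a Scan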
private static void test15() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new ColumnCountGetFilter(3);
Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
FilterList list = new FilterList();
list.addFilter(filter);
list.addFilter(filter2);
scan.setFilter(list);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
System.out.println("列数:" + res.listCells().size());
}
} catch (IOException e) {
e.printStackTrace();
}
}
// ColumnCountGetFilter in a Get
private static void test16() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Filter filter = new ColumnCountGetFilter(3);
Get get = new Get(Bytes.toBytes("000466_00201612_630000"));
get.setFilter(filter);
Result result = table.get(get);
System.out.println(result);
System.out.println("列数:"+result.listCells().size());
} catch (IOException e) {
e.printStackTrace();
}
}
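ColumnPaginationFilter returns only a slice of each row's columns. Its constructor is:
ColumnPaginationFilter(int limit, int offset)
It skips the first offset columns of each row (offset is 0-based) and then returns at most limit columns.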
private static void test17() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new ColumnPaginationFilter(5,0);
Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
FilterList list = new FilterList();
list.addFilter(filter);
list.addFilter(filter2);
scan.setFilter(list);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
System.out.println("列的数量为:"+res.listCells().size());
}
} catch (IOException e) {
e.printStackTrace();
}
}
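The limit and offset arguments behave like a slice over the sorted columns of each row. The sketch below shows that arithmetic on a plain list; `ColumnPageDemo`, `page`, and the column names a to e are placeholders chosen for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class ColumnPageDemo {
    // Skip the first offset columns of a row, then keep at most limit columns.
    static List<String> page(List<String> sortedQualifiers, int limit, int offset) {
        int from = Math.min(offset, sortedQualifiers.size());
        int to = Math.min(from + limit, sortedQualifiers.size());
        return sortedQualifiers.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("a", "b", "c", "d", "e"); // hypothetical qualifiers
        System.out.println(page(columns, 5, 0)); // same arguments as test17: first five columns
        System.out.println(page(columns, 2, 1)); // second and third column
        System.out.println(page(columns, 2, 9)); // offset past the end: nothing
    }
}
```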
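ColumnPrefixFilter selects columns by qualifier prefix: only the columns whose name starts with the given prefix appear in the result.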
private static void test18() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new ColumnPrefixFilter(Bytes.toBytes("attention_"));
Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
FilterList list = new FilterList();
list.addFilter(filter);
list.addFilter(filter2);
scan.setFilter(list);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
System.out.println("列的数量为:"+res.listCells().size());
}
} catch (IOException e) {
e.printStackTrace();
}
}
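RandomRowFilter returns a random sample of rows. The constructor argument is a number between 0.0 and 1.0, and the filter uses Java's Random.nextFloat() to decide whether a row is filtered out. A negative argument filters out every row; an argument greater than 1 includes every row.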
private static void test19() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
Filter filter = new RandomRowFilter((float) 0.6);
Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
FilterList list = new FilterList();
list.addFilter(filter);
list.addFilter(filter2);
scan.setFilter(list);
ResultScanner scanner = table.getScanner(scan);
int rowcount = 0;
for (Result res : scanner) {
System.out.println(res);
rowcount++;
}
System.out.println("行的数量为:" + rowcount);
} catch (IOException e) {
e.printStackTrace();
}
}
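The decision rule described above fits in a few lines. The following JDK-only sketch applies it to a number of simulated rows; `RandomRowDemo`, `includeRow`, and `countIncluded` are illustrative names, and the fixed seed is only there to make runs repeatable.

```java
import java.util.Random;

final class RandomRowDemo {
    // A row is kept when a fresh random float in [0, 1) falls below chance.
    static boolean includeRow(float chance, Random random) {
        if (chance < 0) {
            return false; // negative chance: every row is filtered out
        }
        if (chance > 1) {
            return true;  // chance above 1: every row is included
        }
        return random.nextFloat() < chance;
    }

    // Counts how many of the simulated rows survive the filter.
    static int countIncluded(float chance, int rows, long seed) {
        Random random = new Random(seed);
        int included = 0;
        for (int i = 0; i < rows; i++) {
            if (includeRow(chance, random)) {
                included++;
            }
        }
        return included;
    }

    public static void main(String[] args) {
        System.out.println(countIncluded(-0.5f, 1000, 42L));
        System.out.println(countIncluded(0.6f, 1000, 42L));
        System.out.println(countIncluded(1.5f, 1000, 42L));
    }
}
```

Because each row is decided independently, a chance of 0.6 returns roughly 60% of the rows, not exactly 60%, and the row count printed by test19 changes from run to run.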
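SkipFilter is normally used together with another filter, which it decorates. When the wrapped filter rejects any cell in a row, SkipFilter filters out the entire row.
// Query the rows in which no cell value equals 466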
private static void test20() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("000003_00201612_450000"));
scan.setStopRow(Bytes.toBytes("000003_00201612_530000"));
// ValueFilter keeps the cells whose value is not equal to 466
Filter filter = new ValueFilter(CompareFilter.CompareOp.NOT_EQUAL, new BinaryComparator(Bytes.toBytes("466")));
// Wrapping it in SkipFilter drops every row that contains a cell rejected by the ValueFilter
Filter filter1 = new SkipFilter(filter);
scan.setFilter(filter1);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
//System.out.println(res);
String str=Bytes.toString(res.getRow())+"--";
for (Cell cell : res.listCells()) {
String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
String val = Bytes.toString(CellUtil.cloneValue(cell));
str+=qualifier+":"+val+",";
}
System.out.println(str);
}
} catch (IOException e) {
e.printStackTrace();
}
}
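WhileMatchFilter is also a decorator around another filter: as soon as the wrapped filter rejects a row, WhileMatchFilter stops the whole scan. In the example below, the scan returns the rows that precede 000003_00201612_500000 and then ends.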
private static void test21() {
Configuration conf = HBaseConfiguration.create();
try {
Connection connection = ConnectionFactory.createConnection(conf);
Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
Scan scan = new Scan();
scan.setStartRow(Bytes.toBytes("000003_00201612_450000"));
scan.setStopRow(Bytes.toBytes("000003_00201612_530000"));
Filter filter = new RowFilter(CompareFilter.CompareOp.NOT_EQUAL, new BinaryComparator(Bytes.toBytes("000003_00201612_500000")));
Filter filter1=new WhileMatchFilter(filter);
scan.setFilter(filter1);
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
System.out.println(res);
}
} catch (IOException e) {
e.printStackTrace();
}
}
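The two decorators differ in what happens after a rejection: the skip style drops the offending row and carries on, while the while-match style ends the scan. The sketch below models both on in-memory data; `DecoratorDemo`, its methods, and the cell values are illustrative assumptions (only the row keys and the value 466 come from the examples), and `Stream.takeWhile` needs Java 9 or later.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class DecoratorDemo {
    // SkipFilter-like: a row survives only if every one of its cell values passes.
    static List<String> skipStyle(Map<String, List<String>> rows, Predicate<String> cellPasses) {
        return rows.entrySet().stream()
                .filter(e -> e.getValue().stream().allMatch(cellPasses))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // WhileMatchFilter-like: rows are returned until the first one that fails, then the scan ends.
    static List<String> whileMatchStyle(List<String> rowKeys, Predicate<String> rowPasses) {
        return rowKeys.stream().takeWhile(rowPasses).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("000003_00201612_450000", Arrays.asList("12", "7"));   // hypothetical values
        rows.put("000003_00201612_500000", Arrays.asList("466", "3"));
        rows.put("000003_00201612_520000", Arrays.asList("9"));
        // Skip style: only the middle row is dropped.
        System.out.println(skipStyle(rows, v -> !v.equals("466")));
        // While-match style: everything from the middle row onward is cut off.
        System.out.println(whileMatchStyle(
                Arrays.asList("000003_00201612_450000", "000003_00201612_500000", "000003_00201612_520000"),
                k -> !k.equals("000003_00201612_500000")));
    }
}
```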
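Several filters of different types can be placed in one FilterList to jointly restrict the data returned to the client. FilterList accepts an Operator argument that decides how the filters are combined; when no operator is given, as in the examples above, MUST_PASS_ALL is used.
FilterList.Operator values:
Value | Description |
---|---|
MUST_PASS_ALL | A value is included in the result only if it passes every filter. |
MUST_PASS_ONE | A value is included in the result as soon as it passes at least one filter. |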
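The two operators correspond to a logical AND and a logical OR over the member filters. The JDK-only sketch below models a filter as a `Predicate`; the class `FilterListDemo`, its local `Operator` enum, and `combine` are stand-ins written for this illustration and are not the HBase classes of the same name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class FilterListDemo {
    // Local stand-in for FilterList.Operator.
    enum Operator { MUST_PASS_ALL, MUST_PASS_ONE }

    // MUST_PASS_ALL behaves like AND over the filters, MUST_PASS_ONE like OR.
    static <T> Predicate<T> combine(Operator op, List<Predicate<T>> filters) {
        return item -> op == Operator.MUST_PASS_ALL
                ? filters.stream().allMatch(f -> f.test(item))
                : filters.stream().anyMatch(f -> f.test(item));
    }

    public static void main(String[] args) {
        Predicate<String> prefix = k -> k.startsWith("000466_00201612_");
        Predicate<String> suffix = k -> k.endsWith("_630000");
        List<Predicate<String>> filters = Arrays.asList(prefix, suffix);

        String key = "000466_00201612_450000"; // passes the prefix test only
        System.out.println(combine(Operator.MUST_PASS_ALL, filters).test(key));
        System.out.println(combine(Operator.MUST_PASS_ONE, filters).test(key));
    }
}
```

With HBase itself, the operator is passed to the FilterList constructor, for example `new FilterList(FilterList.Operator.MUST_PASS_ONE)`, before the filters are added.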