[Eighteen Palms: Martial Arts Volume] Palm 8: A Summary of HBase Filters

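This post is part of the [Big Data Technology: Eighteen Dragon-Subduing Palms] series. See the table of contents: Big Data Technology: Eighteen Dragon-Subduing Palms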


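Articles in this series:
[Eighteen Palms: Martial Arts Volume] Palm 8: HBase Basic Concepts
[Eighteen Palms: Martial Arts Volume] Palm 8: The HBase Shell
[Eighteen Palms: Martial Arts Volume] Palm 8: Basic Operations with the HBase Java API
[Eighteen Palms: Martial Arts Volume] Palm 8: A Summary of HBase Filters
[Eighteen Palms: Martial Arts Volume] Palm 8: HBase Performance Tuning
[Eighteen Palms: Martial Arts Volume] Palm 8: HBase Installation and Integration [draft]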

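I. Comparison Filters

A comparison filter takes a comparison operator and a comparator instance. Once a comparison filter is set on a request, only the records that satisfy its condition are returned to the client.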
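● Comparison operators:

Operator                                    Description
CompareFilter.CompareOp.LESS                Matches values less than the given value
CompareFilter.CompareOp.LESS_OR_EQUAL       Matches values less than or equal to the given value
CompareFilter.CompareOp.EQUAL               Matches values equal to the given value
CompareFilter.CompareOp.NOT_EQUAL           Matches values not equal to the given value
CompareFilter.CompareOp.GREATER_OR_EQUAL    Matches values greater than or equal to the given value
CompareFilter.CompareOp.GREATER             Matches values greater than the given value
CompareFilter.CompareOp.NO_OP               Excludes everything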

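● Comparator instances:

Comparator               Description
BinaryComparator         Compares the current value with the threshold using Bytes.compareTo()
BinaryPrefixComparator   Also uses Bytes.compareTo(), but compares only the leading bytes, up to the length of the threshold (a prefix match from the left)
NullComparator           Does no value comparison; only checks whether the current value is null
BitComparator            Performs a bit-level comparison using the AND, OR, and XOR operations of the BitwiseOp class
RegexStringComparator    Selects the data that matches a regular expression (use it with EQUAL or NOT_EQUAL only)
SubstringComparator      Treats the threshold and the stored data as String instances and matches with contains() (use it with EQUAL or NOT_EQUAL only)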
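
To make the interplay of operator and comparator concrete, here is a minimal stdlib-only sketch. It is not HBase code: the class `CompareSketch`, the enum `Op`, and the helpers `compareBytes` and `matches` are hypothetical names that only imitate an unsigned, byte-by-byte comparison followed by an operator check.

```java
import java.nio.charset.StandardCharsets;

final class CompareSketch {
    // Hypothetical stand-in for the comparison operators listed in the table above.
    enum Op { LESS, LESS_OR_EQUAL, EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, GREATER, NO_OP }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    // True when `current` satisfies `op` relative to `threshold`.
    static boolean matches(Op op, String current, String threshold) {
        int c = compareBytes(current.getBytes(StandardCharsets.UTF_8),
                             threshold.getBytes(StandardCharsets.UTF_8));
        switch (op) {
            case LESS: return c < 0;
            case LESS_OR_EQUAL: return c <= 0;
            case EQUAL: return c == 0;
            case NOT_EQUAL: return c != 0;
            case GREATER_OR_EQUAL: return c >= 0;
            case GREATER: return c > 0;
            default: return false; // NO_OP excludes everything
        }
    }

    public static void main(String[] args) {
        String threshold = "000004_00201612_150000";
        System.out.println(matches(Op.LESS, "000004_00201612_110000", threshold)); // prints true
        System.out.println(matches(Op.LESS, "000004_00201612_630000", threshold)); // prints false
    }
}
```

Because the comparison works on bytes, numbers stored as strings sort as text: "10" sorts after "1" but before "9", which is worth keeping in mind when choosing a threshold.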

1. RowFilter (row filter)

RowFilter filters on the RowKey.

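// Imports used by the examples in this post
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.filter.*;
import org.apache.hadoop.hbase.util.Bytes;

// Example 1: select every row whose RowKey is less than 000004_00201612_150000; BinaryComparator compares the whole RowKey byte by byte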
private static void test1() {
    Configuration conf= HBaseConfiguration.create();
    try {
        Connection connection=ConnectionFactory.createConnection(conf);
        Table table=connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new RowFilter(CompareFilter.CompareOp.LESS, new BinaryComparator(Bytes.toBytes("000004_00201612_150000")));
        scan.setFilter(filter);
        ResultScanner scanner=table.getScanner(scan);
        for(Result res:scanner)
        {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
// Example 2: filter RowKeys with a regular expression
private static void test2() {
    Configuration conf= HBaseConfiguration.create();
    try {
        Connection connection=ConnectionFactory.createConnection(conf);
        Table table=connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
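        // This is a regex, not a glob: "_*" would mean "zero or more underscores", so spell out ".*" and anchor the prefix with ^
        Filter filter = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^000004_00201612_.*"));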
        scan.setFilter(filter);
        ResultScanner scanner=table.getScanner(scan);
        for(Result res:scanner)
        {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Example 3: match RowKeys by substring
private static void test3() {
    Configuration conf= HBaseConfiguration.create();
    try {
        Connection connection=ConnectionFactory.createConnection(conf);
        Table table=connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new RowFilter(CompareFilter.CompareOp.EQUAL, new SubstringComparator("_00201612_000000"));
        scan.setFilter(filter);
        ResultScanner scanner=table.getScanner(scan);
        for(Result res:scanner)
        {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
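
A glob-style pattern such as "000004_00201612_*" is an easy mistake to make with a regex comparator. The sketch below uses only java.util.regex and assumes, as I understand RegexStringComparator to behave, that a pattern counts as matching when it is found anywhere in the value (Matcher.find() semantics). `RegexFindSketch` and `found` are hypothetical names.

```java
import java.util.regex.Pattern;

final class RegexFindSketch {
    // Reports a match when the pattern occurs anywhere in the row key.
    static boolean found(String regex, String rowKey) {
        return Pattern.compile(regex).matcher(rowKey).find();
    }

    public static void main(String[] args) {
        // "_*" means "zero or more underscores", so this is not a prefix test.
        System.out.println(found("000004_00201612_*", "999_000004_00201612"));      // prints true
        // Anchoring with ^ and writing ".*" expresses "starts with this prefix".
        System.out.println(found("^000004_00201612_.*", "999_000004_00201612"));    // prints false
        System.out.println(found("^000004_00201612_.*", "000004_00201612_150000")); // prints true
    }
}
```

When the intent is simply "RowKeys starting with this prefix", a prefix-based filter is usually the clearer choice than a regex.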

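2. FamilyFilter (column family filter)

FamilyFilter restricts the result to specific column families.

// Return only data from the given column family (combined here with a RowKey regex in a FilterList)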
private static void test4() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        FilterList filterList = new FilterList();

        Filter filter = new FamilyFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("cf")));
        filterList.addFilter(filter);
        // Regex, not a glob: anchor the prefix with ^ and use ".*" for "anything after it"
        Filter filter2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^000004_00201612_.*"));
        filterList.addFilter(filter2);

        scan.setFilter(filterList);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

3. QualifierFilter (column qualifier filter)

QualifierFilter restricts the result to specific columns.

private static void test5() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new QualifierFilter(CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("attention_uv")));
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
// To return several columns, filter the qualifier with a RegexStringComparator
private static void test6() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
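        // Anchored with ^...$ so that only these two exact column names match, not longer names that contain them
        Filter filter = new QualifierFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("^(attention_uv|province_id)$"));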
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

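4. ValueFilter (value filter)

ValueFilter keeps the cells whose value meets a condition. It is usually combined with other filters.

// Select the cells whose value contains the character 6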
private static void test7() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new ValueFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator("6"));
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

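5. DependentColumnFilter (dependent column filter)

II. Dedicated Filters

6. SingleColumnValueFilter (single-column value filter)

SingleColumnValueFilter filters rows by the value of one column. Its constructor takes, in order, the column family, the column qualifier, the comparison operator, and the comparator.

// Use SingleColumnValueFilter to select rows whose factory_id is 466. The printed results include the tested column factory_id, whereas the results of SingleColumnValueExcludeFilter further down do not.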
private static void test8() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        SingleColumnValueFilter filter = new SingleColumnValueFilter(Bytes.toBytes("cf"), Bytes.toBytes("factory_id"), CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("466")));
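        // Skip rows that lack the tested column (getFilterIfMissing() defaults to false, which keeps such rows in the result)
        filter.setFilterIfMissing(true);
        // Check only the latest version of the tested column (true is already the default)
        filter.setLatestVersionOnly(true);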
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
            for (Cell cell : res.listCells()) {
                String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
                String val = Bytes.toString(CellUtil.cloneValue(cell));
                if (qualifier.equals("factory_id")) {
                    System.out.println("factory_id:" + val);
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

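7. SingleColumnValueExcludeFilter (single-column value exclude filter)

SingleColumnValueExcludeFilter extends SingleColumnValueFilter. Usage and behavior are essentially the same; the only difference is that the tested column is left out of the returned Result.

// The tested column factory_id does not appear in the result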
private static void test9() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        SingleColumnValueExcludeFilter filter = new SingleColumnValueExcludeFilter(Bytes.toBytes("cf"), Bytes.toBytes("factory_id"), CompareFilter.CompareOp.EQUAL, new BinaryComparator(Bytes.toBytes("466")));
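        // Skip rows that lack the tested column (getFilterIfMissing() defaults to false, which keeps such rows in the result)
        filter.setFilterIfMissing(true);
        // Check only the latest version of the tested column (true is already the default)
        filter.setLatestVersionOnly(true);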
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
            for (Cell cell : res.listCells()) {
                String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
                String val = Bytes.toString(CellUtil.cloneValue(cell));
                if (qualifier.equals("factory_id")) {
                    System.out.println("factory_id:" + val);
                }
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
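
The difference between the two single-column filters, and the effect of setFilterIfMissing, can be pictured with plain maps. This is only a model under simplifying assumptions (one version per column, string values, equality as the only operator); `SingleColumnSketch` and `apply` are hypothetical names, not HBase API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SingleColumnSketch {
    // Returns the columns a row contributes to the result, or null when the row is filtered out.
    // filterIfMissing: drop rows that lack the tested column.
    // excludeTested: leave the tested column out of the output (the "Exclude" variant).
    static Map<String, String> apply(Map<String, String> row, String column, String expected,
                                     boolean filterIfMissing, boolean excludeTested) {
        String actual = row.get(column);
        if (actual == null) {
            return filterIfMissing ? null : new LinkedHashMap<>(row);
        }
        if (!actual.equals(expected)) {
            return null;
        }
        Map<String, String> out = new LinkedHashMap<>(row);
        if (excludeTested) {
            out.remove(column);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("factory_id", "466");
        row.put("attention_uv", "12");
        System.out.println(apply(row, "factory_id", "466", true, false)); // prints {factory_id=466, attention_uv=12}
        System.out.println(apply(row, "factory_id", "466", true, true));  // prints {attention_uv=12}
    }
}
```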

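8. PrefixFilter (prefix filter)

PrefixFilter takes a prefix and returns every row whose RowKey starts with it. During the scan, as soon as it meets a RowKey that sorts after the prefix, it ends the scan, so leaving the stop row unset does not hurt query performance. The filter does not jump ahead to the first matching row, however, so it is still worth setting the start row to the prefix; otherwise the scan reads every row that sorts before it.

// Query the records whose RowKey starts with 000466_00201612_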
private static void test10() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
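
To see why a prefix scan can stop early, and why where the scan starts still matters, here is a small simulation over a sorted set of keys. It is a stdlib-only model, not HBase code, and it assumes ASCII row keys so that String ordering agrees with byte ordering. `PrefixScanSketch`, `scan`, and `examined` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class PrefixScanSketch {
    // Number of row keys the most recent simulated scan had to look at.
    int examined;

    // Walks the sorted keys from startRow, keeps those that start with the prefix,
    // and stops at the first key that sorts after the whole prefix range.
    List<String> scan(TreeSet<String> keys, String startRow, String prefix) {
        examined = 0;
        List<String> hits = new ArrayList<>();
        for (String key : keys.tailSet(startRow, true)) {
            examined++;
            if (key.startsWith(prefix)) {
                hits.add(key);
            } else if (key.compareTo(prefix) > 0) {
                break; // past the prefix range: nothing further can match
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList(
                "000003_00201612_450000", "000004_00201612_150000",
                "000466_00201612_110000", "000466_00201612_630000",
                "000500_00201612_110000", "000600_00201612_110000"));
        PrefixScanSketch sketch = new PrefixScanSketch();
        String prefix = "000466_00201612_";
        // Starting from the beginning of the table.
        System.out.println(sketch.scan(keys, "", prefix).size() + " hits, " + sketch.examined + " keys examined");     // prints 2 hits, 5 keys examined
        // Starting from the prefix itself.
        System.out.println(sketch.scan(keys, prefix, prefix).size() + " hits, " + sketch.examined + " keys examined"); // prints 2 hits, 3 keys examined
    }
}
```

Both runs stop at the first key beyond the prefix, but only the second avoids reading the keys that sort before it.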

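9. PageFilter (page filter)

PageFilter paginates the result by row; the page size must be specified. The filter is applied separately on each region server, so a scan that spans several regions can return more rows than the page size, and the client should stop reading once it has enough.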

private static void test11() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new PageFilter(20);
        scan.setFilter(filter);
        ResultScanner scanner = table.getScanner(scan);
        int count=0;
        for (Result res : scanner) {
            count++;
        }
        System.out.println("Number of rows returned: " + count);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
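
A page filter only caps the size of one page. To walk through a table page by page, a common pattern is to remember the last RowKey of the previous page and begin the next scan just after it. The sketch below models that loop over a sorted set; it is not HBase code, and `PagingSketch` and `nextPage` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class PagingSketch {
    // Returns up to pageSize keys that sort strictly after lastSeen (null means "start of table").
    static List<String> nextPage(TreeSet<String> keys, String lastSeen, int pageSize) {
        NavigableSet<String> from = (lastSeen == null) ? keys : keys.tailSet(lastSeen, false);
        List<String> page = new ArrayList<>();
        for (String key : from) {
            if (page.size() == pageSize) {
                break;
            }
            page.add(key);
        }
        return page;
    }

    public static void main(String[] args) {
        TreeSet<String> keys = new TreeSet<>(Arrays.asList("a", "b", "c", "d", "e"));
        String lastSeen = null;
        List<String> page;
        // Keep fetching pages of two rows until an empty page signals the end.
        while (!(page = nextPage(keys, lastSeen, 2)).isEmpty()) {
            System.out.println(page);
            lastSeen = page.get(page.size() - 1);
        }
    }
}
```

With the five keys above the loop prints three pages: [a, b], [c, d], and [e].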

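10. KeyOnlyFilter (key-only filter)

KeyOnlyFilter returns only the key part of each cell (RowKey, column family, qualifier, timestamp) and not the stored value. When the constructor argument is false, every value in the Result is a zero-length byte array. When it is true, every value is a 4-byte array that holds the length of the original value as an int, not the original content.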

private static void test12() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        // false: values come back as empty byte arrays; true: each value is replaced by a 4-byte int holding the original value's length
        Filter filter = new KeyOnlyFilter(false);
        Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
        FilterList list = new FilterList();
        list.addFilter(filter);
        list.addFilter(filter2);
        scan.setFilter(list);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
            for (Cell cell : res.listCells()) {
                String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
                System.out.println("RowKey: " + Bytes.toString(res.getRow()) + ", column: " + qualifier + ", value length: " + CellUtil.cloneValue(cell).length);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
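
What the boolean constructor argument changes can be modeled in a few lines. This sketch assumes, to my understanding of the filter, that with the flag set the value becomes a 4-byte big-endian int carrying the original value's length. `KeyOnlySketch` and `keyOnlyValue` are hypothetical names, not HBase API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class KeyOnlySketch {
    // Models the value a key-only result carries for a cell whose stored value is `value`.
    static byte[] keyOnlyValue(byte[] value, boolean lenAsVal) {
        if (!lenAsVal) {
            return new byte[0]; // value dropped entirely
        }
        // Value replaced by its own length, encoded as a big-endian int.
        return ByteBuffer.allocate(4).putInt(value.length).array();
    }

    public static void main(String[] args) {
        byte[] original = "466".getBytes(StandardCharsets.UTF_8);
        System.out.println(keyOnlyValue(original, false).length); // prints 0
        byte[] asLength = keyOnlyValue(original, true);
        System.out.println(asLength.length + " bytes encoding " + ByteBuffer.wrap(asLength).getInt()); // prints 4 bytes encoding 3
    }
}
```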

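11. FirstKeyOnlyFilter (first-key-only filter)

FirstKeyOnlyFilter reads only the first column of a row, then ends the scan of that row and moves on to the next one, so each returned row contains just its first column.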

private static void test13() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        // Keep only the first cell of each row
        Filter filter = new FirstKeyOnlyFilter();
        Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
        FilterList list = new FilterList();
        list.addFilter(filter);
        list.addFilter(filter2);
        scan.setFilter(list);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
            System.out.println("Number of columns: " + res.listCells().size());
            for (Cell cell : res.listCells()) {
                String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
                String val = Bytes.toString(CellUtil.cloneValue(cell));
                System.out.println("RowKey: " + Bytes.toString(res.getRow()) + ", column: " + qualifier + ", value length: " + CellUtil.cloneValue(cell).length + ", value: " + val);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

12. InclusiveStopFilter (inclusive-stop filter)

13. TimestampsFilter (timestamps filter)

private static void test14() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();

        List<Long> ts = new ArrayList<>();
        ts.add(5L);
        ts.add(10L);
        ts.add(15L);
        // Define a timestamps filter that returns only the cells carrying one of these three timestamps
        Filter filter = new TimestampsFilter(ts);
        Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
        FilterList list = new FilterList();
        list.addFilter(filter);
        list.addFilter(filter2);
        scan.setFilter(list);

        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
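
A timestamps filter keeps a cell only when its timestamp is exactly one of the listed values; nearby timestamps do not count. The following stdlib-only sketch models that selection for the versions of a single column. `TimestampsSketch` and `select` are hypothetical names, and the data is made up for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

final class TimestampsSketch {
    // versions: timestamp -> value for one column; wanted: the exact timestamps to keep.
    static SortedMap<Long, String> select(SortedMap<Long, String> versions, Set<Long> wanted) {
        SortedMap<Long, String> out = new TreeMap<>();
        for (Map.Entry<Long, String> e : versions.entrySet()) {
            if (wanted.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        SortedMap<Long, String> versions = new TreeMap<>();
        versions.put(5L, "a");
        versions.put(7L, "b");
        versions.put(10L, "c");
        Set<Long> wanted = new HashSet<>(Arrays.asList(5L, 10L, 15L));
        System.out.println(select(versions, wanted)); // prints {5=a, 10=c}
    }
}
```

Timestamp 7 is dropped because it is not in the list, and timestamp 15 contributes nothing because no cell carries it.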

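14. ColumnCountGetFilter (column count filter)

ColumnCountGetFilter limits how many columns are fetched per row. Once a row reaches the configured maximum, the whole scan stops, so the filter is a poor fit for Scan and much better suited to Get.

// Using the column count filter in a Scan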
private static void test15() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new ColumnCountGetFilter(3);
        Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
        FilterList list = new FilterList();
        list.addFilter(filter);
        list.addFilter(filter2);
        scan.setFilter(list);

        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
            System.out.println("Number of columns: " + res.listCells().size());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

// Using the column count filter in a Get
private static void test16() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Filter filter = new ColumnCountGetFilter(3);
        Get get = new Get(Bytes.toBytes("000466_00201612_630000"));
        get.setFilter(filter);
        Result result = table.get(get);
        System.out.println(result);
        System.out.println("Number of columns: " + result.listCells().size());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

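15. ColumnPaginationFilter (column pagination filter)

ColumnPaginationFilter returns only a slice of each row's columns. Its constructor is:
ColumnPaginationFilter(int limit, int offset)
Columns are counted from 0. The filter skips the first offset columns and returns at most limit columns, starting with the column at position offset.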

private static void test17() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new ColumnPaginationFilter(5,0);
        Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
        FilterList list = new FilterList();
        list.addFilter(filter);
        list.addFilter(filter2);
        scan.setFilter(list);

        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
            System.out.println("Number of columns: " + res.listCells().size());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
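
The limit and offset arithmetic is easy to get off by one, so here is a small model of it. It treats a row as a sorted list of column names and returns positions offset through offset + limit - 1. `ColumnPageSketch` and `page` are hypothetical names, not HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ColumnPageSketch {
    // Returns the columns at positions [offset, offset + limit) of a row's sorted column list.
    static List<String> page(List<String> sortedColumns, int limit, int offset) {
        if (offset >= sortedColumns.size()) {
            return new ArrayList<>(); // the page lies past the last column
        }
        int end = Math.min(sortedColumns.size(), offset + limit);
        return new ArrayList<>(sortedColumns.subList(offset, end));
    }

    public static void main(String[] args) {
        List<String> columns = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        System.out.println(page(columns, 5, 0)); // prints [a, b, c, d, e]
        System.out.println(page(columns, 5, 5)); // prints [f, g]
    }
}
```

The second call shows that the last page may be shorter than limit.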

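16. ColumnPrefixFilter (column prefix filter)

ColumnPrefixFilter selects columns by the prefix of their name: only columns whose name starts with the given prefix appear in the result.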

private static void test18() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new ColumnPrefixFilter(Bytes.toBytes("attention_"));
        Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
        FilterList list = new FilterList();
        list.addFilter(filter);
        list.addFilter(filter2);
        scan.setFilter(list);

        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
            System.out.println("Number of columns: " + res.listCells().size());
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

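17. RandomRowFilter (random row filter)

RandomRowFilter returns a random sample of rows. Its constructor takes a number between 0.0 and 1.0, and internally it uses Java's Random.nextFloat() to decide whether each row is filtered out. A negative argument filters out every row; an argument greater than 1 includes every row.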

private static void test19() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        Filter filter = new RandomRowFilter((float) 0.6);
        Filter filter2 = new PrefixFilter(Bytes.toBytes("000466_00201612_"));
        FilterList list = new FilterList();
        list.addFilter(filter);
        list.addFilter(filter2);
        scan.setFilter(list);
        ResultScanner scanner = table.getScanner(scan);
        int rowcount = 0;
        for (Result res : scanner) {
            System.out.println(res);
            rowcount++;
        }
        System.out.println("Number of rows: " + rowcount);
    } catch (IOException e) {
        e.printStackTrace();
    }
}
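
The sampling rule can be sketched with java.util.Random alone. The model below assumes a row is kept when nextFloat() is less than the chance value, which is consistent with the edge cases described above (negative keeps nothing, greater than 1 keeps everything). `RandomRowSketch` and `countKept` are hypothetical names.

```java
import java.util.Random;

final class RandomRowSketch {
    // Counts how many of `rows` rows survive when each is kept with probability `chance`.
    static int countKept(int rows, float chance, Random random) {
        int kept = 0;
        for (int i = 0; i < rows; i++) {
            // nextFloat() is in [0.0, 1.0), so chance < 0 never passes and chance > 1 always passes.
            if (random.nextFloat() < chance) {
                kept++;
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(countKept(1000, -1f, random));  // prints 0
        System.out.println(countKept(1000, 2f, random));   // prints 1000
        System.out.println(countKept(1000, 0.6f, random)); // roughly 600, different on every run
    }
}
```

Since the sample is random, two scans with the same filter generally return different rows and different row counts.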

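III. Decorating Filters

18. SkipFilter (skip filter)

SkipFilter is normally used together with another filter, which it decorates. When the wrapped filter decides that any one cell of a row should be filtered out, SkipFilter drops the entire row.

// Find the rows in which no cell value equals 466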
private static void test20() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("000003_00201612_450000"));
        scan.setStopRow(Bytes.toBytes("000003_00201612_530000"));
        // ValueFilter keeps the cells whose value is not equal to 466
        Filter filter = new ValueFilter(CompareFilter.CompareOp.NOT_EQUAL, new BinaryComparator(Bytes.toBytes("466")));
        // Wrap it in SkipFilter so that any row containing a cell rejected by the ValueFilter is dropped entirely
        Filter filter1=new SkipFilter(filter);
        scan.setFilter(filter1);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            //System.out.println(res);
            String str=Bytes.toString(res.getRow())+"--";
            for (Cell cell : res.listCells()) {
                String qualifier = Bytes.toString(CellUtil.cloneQualifier(cell));
                String val = Bytes.toString(CellUtil.cloneValue(cell));
                str+=qualifier+":"+val+",";
            }
            System.out.println(str);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
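
The step from "filter a cell" to "filter the whole row" is the entire point of the skip decorator, and a small model makes it visible. The sketch represents each row as a list of cell values and is not HBase code; `SkipSketch` and `rowsWithoutValue` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SkipSketch {
    // Keeps a row only if none of its cell values equals `banned`.
    static List<String> rowsWithoutValue(Map<String, List<String>> rows, String banned) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            boolean allCellsPass = true;
            for (String value : row.getValue()) {
                if (value.equals(banned)) {
                    allCellsPass = false; // one rejected cell condemns the whole row
                    break;
                }
            }
            if (allCellsPass) {
                kept.add(row.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("000003_00201612_450000", Arrays.asList("12", "466"));
        rows.put("000003_00201612_460000", Arrays.asList("7", "8"));
        System.out.println(rowsWithoutValue(rows, "466")); // prints [000003_00201612_460000]
    }
}
```

Without the decorator, the first row would still be returned, just without its 466 cell.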

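19. WhileMatchFilter (while-match filter)

WhileMatchFilter is also used together with another filter, which it decorates. As soon as the wrapped filter meets a row that does not match, WhileMatchFilter stops the whole scan.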

private static void test21() {
    Configuration conf = HBaseConfiguration.create();
    try {
        Connection connection = ConnectionFactory.createConnection(conf);
        Table table = connection.getTable(TableName.valueOf("DLR:ft_fact_data_month_quarter"));
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes("000003_00201612_450000"));
        scan.setStopRow(Bytes.toBytes("000003_00201612_530000"));
        Filter filter = new RowFilter(CompareFilter.CompareOp.NOT_EQUAL, new BinaryComparator(Bytes.toBytes("000003_00201612_500000")));
        Filter filter1=new WhileMatchFilter(filter);
        scan.setFilter(filter1);
        ResultScanner scanner = table.getScanner(scan);
        for (Result res : scanner) {
            System.out.println(res);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
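
The contrast with an undecorated filter is that rows after the first mismatch are never reached. A minimal stdlib-only model of that behavior follows; `WhileMatchSketch` and `scanWhile` are hypothetical names, not HBase API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

final class WhileMatchSketch {
    // Emits rows in order until the first one the wrapped check rejects, then stops.
    static List<String> scanWhile(List<String> sortedRowKeys, Predicate<String> wrapped) {
        List<String> out = new ArrayList<>();
        for (String key : sortedRowKeys) {
            if (!wrapped.test(key)) {
                break; // first mismatch ends the scan; later rows are never examined
            }
            out.add(key);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
                "000003_00201612_450000", "000003_00201612_460000",
                "000003_00201612_500000", "000003_00201612_510000");
        // Same condition as the example above: RowKey NOT_EQUAL ..._500000.
        System.out.println(scanWhile(keys, k -> !k.equals("000003_00201612_500000")));
        // prints [000003_00201612_450000, 000003_00201612_460000]
    }
}
```

The row ending in 510000 satisfies the condition but is not returned, because the scan has already ended.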

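IV. FilterList

Several filters of different types can be placed in one FilterList to jointly restrict the data returned to the client. A FilterList accepts an Operator argument that decides how the filters are combined; when none is given, as in the examples above, the default is MUST_PASS_ALL.
FilterList.Operator values:

Value            Description
MUST_PASS_ALL    A value is included in the result only if it passes every filter.
MUST_PASS_ONE    A value is included in the result as soon as it passes at least one filter.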
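
The two operators behave like logical AND and OR over the member filters. The sketch below models that with java.util.function.Predicate; it is not HBase code, and `FilterListSketch`, its `Operator` enum, and `passes` are hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class FilterListSketch {
    // Hypothetical mirror of the two combination modes in the table above.
    enum Operator { MUST_PASS_ALL, MUST_PASS_ONE }

    // Combines the filters with AND (MUST_PASS_ALL) or OR (MUST_PASS_ONE).
    static <T> boolean passes(Operator op, List<Predicate<T>> filters, T value) {
        if (op == Operator.MUST_PASS_ALL) {
            return filters.stream().allMatch(f -> f.test(value));
        }
        return filters.stream().anyMatch(f -> f.test(value));
    }

    public static void main(String[] args) {
        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(key -> key.startsWith("000466_"));
        filters.add(key -> key.endsWith("_630000"));
        String rowKey = "000466_00201612_110000"; // passes the first filter only
        System.out.println(passes(Operator.MUST_PASS_ALL, filters, rowKey)); // prints false
        System.out.println(passes(Operator.MUST_PASS_ONE, filters, rowKey)); // prints true
    }
}
```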
