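Hyperscan: notes on matching performance, flag settings, and database generation

Hyperscan is an open-source, high-performance regular expression matching library. It can handle rule sets of several hundred thousand regular expressions and is fairly easy to use; see the official documentation for details.

Git repository: https://github.com/intel/hyperscan

Developer reference: http://intel.github.io/hyperscan/dev-reference/

My notes from using it: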

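Generating a Hyperscan database is fast for small rule sets: a few hundred rules compile in a few seconds to a few tens of seconds. In my tests, 10,000 rules took about 110 seconds, 20,000 rules about 664 seconds, and 200,000 rules roughly 8 hours. These figures probably depend on the machine and on the complexity of the regular expressions.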
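To make the growth in compile time easier to see, the figures above can be turned into an average cost per rule. This is only a small arithmetic sketch; the class and method names are made up for illustration, and the inputs are the measurements quoted above.

```java
// Hypothetical helper: converts the compile times quoted above into a per-rule cost.
final class CompileTimeScaling {
    // Average compile time per rule, in milliseconds.
    static double millisPerRule(long totalSeconds, long ruleCount) {
        return totalSeconds * 1000.0 / ruleCount;
    }

    public static void main(String[] args) {
        // {rule count, compile time in seconds}; 8 hours is written as 8 * 3600 seconds.
        long[][] samples = { {10_000, 110}, {20_000, 664}, {200_000, 8 * 3600} };
        for (long[] s : samples) {
            System.out.printf("%d rules: %.1f ms per rule%n", s[0], millisPerRule(s[1], s[0]));
        }
    }
}
```

The per-rule cost comes out at about 11 ms, 33 ms, and 144 ms, so on these measurements compile time grows faster than linearly with the number of rules.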

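Matching speed against the database depends on the flags set when the database is generated. I used 8Hi: 8 enables UTF-8 mode, H enables single-match mode (each expression is reported at most once; without it, a pattern such as .* keeps reporting matches all the way to the end of the input, which is very inefficient), and i makes matching case-insensitive.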
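As a rough illustration of how the 8Hi shorthand maps onto individual flags, here is a minimal sketch. The PatternFlags class and its Flag enum are hypothetical stand-ins written for this note so that the example compiles without the library; they are not part of Hyperscan or hyperscan-java, and only the three flag characters discussed here are handled.

```java
import java.util.EnumSet;

// Hypothetical helper, not part of Hyperscan or hyperscan-java.
final class PatternFlags {
    // Local stand-in enum so the sketch compiles without the library.
    enum Flag { UTF8, SINGLEMATCH, CASELESS }

    // Translates a flag string such as "8Hi" into a set of flags.
    static EnumSet<Flag> parse(String flags) {
        EnumSet<Flag> result = EnumSet.noneOf(Flag.class);
        for (char c : flags.toCharArray()) {
            switch (c) {
                case '8': result.add(Flag.UTF8); break;        // UTF-8 mode
                case 'H': result.add(Flag.SINGLEMATCH); break; // report each expression once
                case 'i': result.add(Flag.CASELESS); break;    // ignore case
                default:
                    throw new IllegalArgumentException("Unsupported flag character: " + c);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("8Hi")); // prints [UTF8, SINGLEMATCH, CASELESS]
    }
}
```

With the Java wrapper, the same three flags are passed as an EnumSet of ExpressionFlag values when each expression is created.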

A Java wrapper is available at https://github.com/gliwka/hyperscan-java

Code to generate the regex database:

in is the regex file, one expression per line with no ID prefix, for example:

(A|b|c)|(你好 Hyperscan)

(1|b|c)|(你好 Hyperscan.*?)

(1|b|c)|(你好 Hyperscan.*?)

out is the output path; the file extension is arbitrary, for example HyDB

// Needs java.io.*, java.nio.charset.StandardCharsets, java.util.*,
// and the wrapper classes from com.gliwka.hyperscan.wrapper.
public static void gen(String in, String out) {
    long start = System.currentTimeMillis();
    System.out.println("Loading regex file, start: " + start);
    List<Expression> expressions = new ArrayList<>(250000);

    // Read the regex file through a 5 MB buffer, one expression per line, skipping empty lines.
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(in), StandardCharsets.UTF_8), 5 * 1024 * 1024)) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue;
            }
            // 8Hi: UTF-8, single match, case-insensitive.
            expressions.add(new Expression(line,
                    EnumSet.of(ExpressionFlag.UTF8, ExpressionFlag.CASELESS, ExpressionFlag.SINGLEMATCH)));
        }
    } catch (IOException e) {
        // Do not compile an empty or partial rule set if the file could not be read.
        e.printStackTrace();
        return;
    }

    // Compile all expressions into one database and save it to disk.
    try (Database db = Database.compile(expressions);
         OutputStream outs = new FileOutputStream(out)) {
        db.save(outs);
        long end = System.currentTimeMillis();
        System.out.println("end   " + end);
        System.out.println("Elapsed: " + (end - start) / 1000 + " s");
    } catch (CompileErrorException ce) {
        // Report the expression that Hyperscan could not compile.
        System.out.println("error : " + ce.getFailedExpression());
    } catch (IOException ie) {
        ie.printStackTrace();
    }
}
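The file-reading step of gen can also be pulled out into a small helper that needs nothing beyond the JDK, which makes it easy to check on its own. The sketch below is one possible way to do it; PatternFileReader and readPatterns are hypothetical names, and the helper assumes the file is UTF-8 with one expression per line, as described above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: loads the regex file without depending on the Hyperscan wrapper.
final class PatternFileReader {
    // Returns every non-empty line of the file, in order.
    static List<String> readPatterns(Path file) throws IOException {
        List<String> patterns = new ArrayList<>();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (!line.isEmpty()) {
                patterns.add(line);
            }
        }
        return patterns;
    }

    public static void main(String[] args) throws IOException {
        // Write a small sample file, then read it back.
        Path tmp = Files.createTempFile("patterns", ".txt");
        Files.write(tmp, Arrays.asList("(A|b|c)", "", "(1|b|c)"), StandardCharsets.UTF_8);
        System.out.println(readPatterns(tmp)); // the empty line is skipped
        Files.delete(tmp);
    }
}
```

Each returned string can then be wrapped in an Expression with the flags shown in gen.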

Loading the database and scanning is simpler:

dbPath is the path of the generated database

input is the file containing the text to scan

// Needs java.io.*, java.nio.charset.StandardCharsets, java.nio.file.*, java.util.List,
// and the wrapper classes from com.gliwka.hyperscan.wrapper (its Scanner, not java.util.Scanner).
public static void regx(String input, String dbPath) {
    long startLoad = System.currentTimeMillis();
    System.out.println("Loading database, start: " + startLoad);
    try (FileInputStream fi = new FileInputStream(dbPath)) {
        // Read the whole input as UTF-8 in one step. Decoding fixed-size chunks separately
        // can split a multi-byte character across two chunks.
        String text = new String(Files.readAllBytes(Paths.get(input)), StandardCharsets.UTF_8);

        Database db = Database.load(fi);
        Scanner scanner = new Scanner();
        scanner.allocScratch(db);
        long endLoad = System.currentTimeMillis();
        System.out.println("Database loaded: " + endLoad + ", elapsed: " + (endLoad - startLoad) + " ms");

        // Time only the scan itself, in milliseconds (whole seconds would print 0 here).
        long regxStart = System.currentTimeMillis();
        List<Match> matches = scanner.scan(db, text);
        long regxEnd = System.currentTimeMillis();
        System.out.println("Matches: " + matches.size());
        System.out.println("Scan finished: " + regxEnd + ", elapsed: " + (regxEnd - regxStart) + " ms");
    } catch (IOException e) {
        e.printStackTrace();
    }
}
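Because the scans measured here finish in tens of milliseconds, the timing code has to report at millisecond resolution or finer. The following is a minimal sketch of a reusable timing wrapper based on System.nanoTime; the Timed class is a hypothetical helper for this note, and the task it times in main is a placeholder rather than a real Hyperscan scan.

```java
import java.util.function.Supplier;

// Hypothetical helper: runs a task once and records its result and elapsed time.
final class Timed<T> {
    final T value;       // whatever the task returned
    final double millis; // elapsed wall-clock time in milliseconds

    private Timed(T value, double millis) {
        this.value = value;
        this.millis = millis;
    }

    static <T> Timed<T> measure(Supplier<T> task) {
        long start = System.nanoTime();
        T value = task.get();
        long elapsedNanos = System.nanoTime() - start;
        return new Timed<>(value, elapsedNanos / 1_000_000.0);
    }

    public static void main(String[] args) {
        // Placeholder task; in regx this would be the call to scanner.scan(db, text).
        Timed<Integer> result = Timed.measure(() -> "hello hyperscan".length());
        System.out.printf("result=%d, elapsed=%.3f ms%n", result.value, result.millis);
    }
}
```

A single run like this is only a rough measurement; JIT warm-up and file caching can change the numbers noticeably between runs.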

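Add the pom dependency:

<dependency>
    <groupId>com.gliwka.hyperscan</groupId>
    <artifactId>hyperscan</artifactId>
    <version>1.0.0</version>
</dependency>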

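Demo code: https://download.csdn.net/download/airyearth/14890010

Matching is quite fast: with 10,000 rules, a scan that produced 33 matches took 74 ms; with 20,000 rules, a scan that produced 68 matches took 86 ms. To be continued.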
