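Hyperscan is an open-source, high-performance regular expression matching library. It can handle rule sets of several hundred thousand regular expressions and is fairly easy to use; see the official documentation for details.
Git repository: https://github.com/intel/hyperscan
Developer reference: http://intel.github.io/hyperscan/dev-reference/
My notes from using it are summarized below.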
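Generating the Hyperscan database: this is fast for small rule sets. A few hundred rules compile in anywhere from a few seconds to a few tens of seconds. 10,000 rules took about 110 seconds, 20,000 rules about 664 seconds, and 200,000 rules roughly 8 hours. These figures probably depend on the machine and on how complex the regular expressions are.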
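To make the growth easier to see, the figures above can be converted into an average cost per rule. This is only arithmetic on the numbers I measured (the class and method names are made up for illustration); it works out to about 11 ms, 33 ms and 144 ms per rule, so compile time grows faster than the rule count.

```java
public class CompileCost {
    // Average compile time per rule, in milliseconds.
    static double msPerRule(long rules, long totalSeconds) {
        return totalSeconds * 1000.0 / rules;
    }

    public static void main(String[] args) {
        // Measured figures: {rule count, total compile time in seconds}.
        // The last entry uses 8 hours = 8 * 3600 seconds.
        long[][] runs = { {10_000, 110}, {20_000, 664}, {200_000, 8 * 3600} };
        for (long[] run : runs) {
            System.out.printf("%d rules: %.1f ms per rule%n",
                    run[0], msPerRule(run[0], run[1]));
        }
    }
}
```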
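Matching speed against the database: this depends on a few flags that are set when the database is generated, namely 8Hi. 8 means UTF-8. H means single match: a pattern reports a match as soon as it is found and then stops; without it, matching keeps going (with .* for example, matches keep being produced until the end of the input, which is very inefficient). i means case-insensitive.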
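The three flag letters can be mapped to flag names with a small helper. This is a sketch only: the Flag enum below is a local stand-in for the wrapper's ExpressionFlag so that it runs without the native library, and the FlagLetters class is hypothetical.

```java
import java.util.EnumSet;

public class FlagLetters {
    // Local stand-in for the wrapper's ExpressionFlag enum (assumption for this sketch).
    enum Flag { UTF8, SINGLEMATCH, CASELESS }

    // Translates flag letters such as "8Hi" into a set of flags.
    static EnumSet<Flag> parse(String letters) {
        EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);
        for (char c : letters.toCharArray()) {
            switch (c) {
                case '8': flags.add(Flag.UTF8); break;        // treat pattern and input as UTF-8
                case 'H': flags.add(Flag.SINGLEMATCH); break; // report each pattern at most once
                case 'i': flags.add(Flag.CASELESS); break;    // ignore case
                default:
                    throw new IllegalArgumentException("unknown flag letter: " + c);
            }
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(parse("8Hi")); // prints [UTF8, SINGLEMATCH, CASELESS]
    }
}
```

Rejecting unknown letters keeps a typo in the flag string from silently dropping a flag.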
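There is a Java wrapper for it. Git repository: https://github.com/gliwka/hyperscan-java
Code for generating the regex database:
in is the regex file, one pattern per line, with no id prefix needed, for example: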
(A|b|c)|(你好 Hyperscan)
(1|b|c)|(你好 Hyperscan.*?)
(1|b|c)|(你好 Hyperscan.*?)
out is the output path; the file extension is arbitrary, for example: HyDB
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import com.gliwka.hyperscan.wrapper.CompileErrorException;
import com.gliwka.hyperscan.wrapper.Database;
import com.gliwka.hyperscan.wrapper.Expression;
import com.gliwka.hyperscan.wrapper.ExpressionFlag;

public static void gen(String in, String out) {
    long start = System.currentTimeMillis();
    System.out.println("Loading patterns, start: " + start);
    List<Expression> expressions = new ArrayList<>(250000);
    // try-with-resources closes the reader; the text file is read with a 5 MB buffer
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(new FileInputStream(in), StandardCharsets.UTF_8), 5 * 1024 * 1024)) {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            try {
                // 8Hi: UTF-8, single match, case-insensitive
                expressions.add(new Expression(line, EnumSet.of(
                        ExpressionFlag.UTF8, ExpressionFlag.CASELESS, ExpressionFlag.SINGLEMATCH)));
            } catch (Exception e) {
                // skip a pattern that cannot be constructed, but say which one
                System.out.println("skipped pattern: " + line);
            }
        }
    } catch (IOException e) {
        // also covers FileNotFoundException; do not compile a partial rule set
        e.printStackTrace();
        return;
    }
    // compile all expressions and write the database to the output file
    try (Database db = Database.compile(expressions);
         OutputStream outs = new FileOutputStream(out)) {
        db.save(outs);
        long end = System.currentTimeMillis();
        System.out.println("end " + end);
        System.out.println("took " + (end - start) / 1000 + " s");
    } catch (CompileErrorException ce) {
        Expression failedExpression = ce.getFailedExpression();
        System.out.println("error : " + failedExpression);
    } catch (IOException ie) {
        // writing the database failed
        ie.printStackTrace();
    }
}
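The sample pattern file above lists the same pattern twice, and since every non-empty line becomes its own expression, a duplicate is simply compiled again. A minimal sketch of cleaning the lines before building expressions; the class and method names are hypothetical, and the patterns in main are shortened ASCII stand-ins for the samples:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class PatternLines {
    // Drops blank lines and exact duplicates while keeping first-seen order.
    static List<String> clean(List<String> lines) {
        Set<String> unique = new LinkedHashSet<>();
        for (String line : lines) {
            if (!line.isEmpty()) {
                unique.add(line);
            }
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "(A|b|c)|(Hyperscan)",
                "(1|b|c)|(Hyperscan.*?)",
                "",
                "(1|b|c)|(Hyperscan.*?)");
        // Two distinct patterns remain out of four lines.
        System.out.println(clean(lines).size() + " of " + lines.size() + " lines kept");
    }
}
```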
Loading the database and scanning is simpler:
dbPath is the path of the generated database
input is the file containing the target text
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import com.gliwka.hyperscan.wrapper.Database;
import com.gliwka.hyperscan.wrapper.Match;
import com.gliwka.hyperscan.wrapper.Scanner; // not java.util.Scanner

public static void regx(String input, String dbPath) {
    long startLoad = System.currentTimeMillis();
    System.out.println("Loading regex database, start: " + startLoad);
    // try-with-resources releases the file handle, the database and the scanner
    try (InputStream fi = new FileInputStream(dbPath);
         Database db = Database.load(fi);
         Scanner scanner = new Scanner()) {
        // Decode the whole file in one pass (assumed to be UTF-8); decoding 1 KB
        // chunks separately can split a multi-byte character across two chunks.
        String text = new String(Files.readAllBytes(Paths.get(input)), StandardCharsets.UTF_8);
        // allocate scratch space for this database
        scanner.allocScratch(db);
        long endLoad = System.currentTimeMillis();
        System.out.println("Database loaded: " + endLoad + ", took " + (endLoad - startLoad) + " ms");
        long regxStart = System.currentTimeMillis();
        List<Match> matches = scanner.scan(db, text);
        long regxEnd = System.currentTimeMillis();
        System.out.println("Matches found: " + matches.size());
        // report milliseconds: whole seconds would round a fast scan down to 0
        System.out.println("Matching finished: " + regxEnd + ", took " + (regxEnd - regxStart) + " ms");
    } catch (IOException e) {
        // also covers FileNotFoundException
        e.printStackTrace();
    }
}
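One thing to watch when reading the target file: if the bytes are decoded one fixed-size chunk at a time, a multi-byte UTF-8 character can be cut in half at a chunk boundary and turn into replacement characters. A minimal sketch of the effect using only the JDK (the class and method names are hypothetical, and the chunk size of 4 is chosen just to force a split):

```java
import java.nio.charset.StandardCharsets;

public class ChunkDecoding {
    // Decodes the bytes chunk by chunk, the way a read loop with a fixed-size buffer would.
    static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            sb.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    // Decodes all bytes in a single pass.
    static String decodeWhole(byte[] data) {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u4F60\u597D" is the two-character greeting from the sample patterns (3 bytes each in UTF-8).
        String original = "\u4F60\u597D Hyperscan";
        byte[] data = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeWhole(data).equals(original));       // prints true
        System.out.println(decodeInChunks(data, 4).equals(original)); // prints false
    }
}
```

With text damaged this way, a pattern containing those characters can no longer match at that position, so it is safer to decode the whole input at once or use a Reader.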
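Add the pom dependency:
<dependency>
    <groupId>com.gliwka.hyperscan</groupId>
    <artifactId>hyperscan</artifactId>
    <version>1.0.0</version>
</dependency>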
Demo code: https://download.csdn.net/download/airyearth/14890010
Matching is quite fast: with 10,000 rules, 33 matches were found in 74 ms; with 20,000 rules, 68 matches in 86 ms. To be continued.
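Scan times at this scale are tens of milliseconds, so they need to be measured in milliseconds; a timer that divides by 1000 to get whole seconds would print 0. A minimal timing sketch with System.nanoTime (the ScanTimer class is hypothetical, and the loop is only a placeholder for the real scan call):

```java
import java.util.concurrent.TimeUnit;

public class ScanTimer {
    // Converts a difference between two System.nanoTime() readings to whole milliseconds.
    static long elapsedMillis(long startNanos, long endNanos) {
        return TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        // Placeholder workload; a real run would call scanner.scan(db, text) here.
        long sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        long end = System.nanoTime();
        System.out.println("sum=" + sum + ", elapsed " + elapsedMillis(start, end) + " ms");
        // Integer division turns a 74 ms scan into 0 seconds.
        System.out.println(74 / 1000); // prints 0
    }
}
```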