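To meet a business requirement, I set out to speed up regular-expression matching by using Hyperscan for text content extraction.
Hyperscan is a high-performance regular expression matching library from Intel.
Because the business code is written in Java, I followed the approach taken by hyperscan-java: build Hyperscan as a native C/C++ library, then call it through JNA.
Performance testing, however, showed that hyperscan-java did not deliver a noticeable speedup.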
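So I started investigating why hyperscan-java failed to improve performance.
Reading through the code, I found that when hyperscan-java calls into the .so it passes a Java callback, as shown here: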
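// The lambda is invoked from native code once per match,
// so every match crosses the native-to-Java boundary.
HyperscanLibraryDirect.hs_scan(dbPointer, input, bytesLength,
        0, pointer, (id, from, to, flags, context) -> {
            long[] tuple = {id, from, to};
            matchedIds.add(tuple);
            return 0;
        }, Pointer.NULL);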
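To avoid having C++ call back into Java, I decided to change the hs_scan interface: let the scan buffer its matches internally, then return the buffered results once the scan finishes.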
Following this idea, I modified the Hyperscan source (this example uses Hyperscan 5.3.0) and added a new interface, hs_scan_with_result.
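The actual body of hs_scan_with_result is not shown in this post, so the following is only a minimal sketch of the buffering idea. The names match_record, match_buffer, collect_match and match_buffer_free are hypothetical; only the callback parameter list follows Hyperscan's match_event_handler. The point is that matches are appended to a native-side buffer, so nothing has to cross into Java until the scan is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One match, mirroring the {id, from, to} tuple collected on the Java side. */
typedef struct {
    unsigned int id;
    unsigned long long from;
    unsigned long long to;
} match_record;

/* Hypothetical growable buffer that lives entirely in native memory. */
typedef struct {
    match_record *items;
    size_t count;
    size_t capacity;
} match_buffer;

/* Same parameter list as Hyperscan's match_event_handler.
 * The context pointer is assumed to point at a match_buffer. */
static int collect_match(unsigned int id, unsigned long long from,
                         unsigned long long to, unsigned int flags,
                         void *context) {
    match_buffer *buf = (match_buffer *)context;
    (void)flags; /* not needed for this sketch */
    assert(buf != NULL);

    if (buf->count == buf->capacity) {
        /* Double the capacity, starting from 16 entries. */
        size_t new_cap = buf->capacity ? buf->capacity * 2 : 16;
        match_record *grown = realloc(buf->items, new_cap * sizeof *grown);
        if (grown == NULL) {
            return 1; /* a non-zero return asks the scanner to stop */
        }
        buf->items = grown;
        buf->capacity = new_cap;
    }

    buf->items[buf->count].id = id;
    buf->items[buf->count].from = from;
    buf->items[buf->count].to = to;
    buf->count++;
    return 0; /* zero means "keep scanning" */
}

/* Release the buffer once the caller has copied the results out. */
static void match_buffer_free(match_buffer *buf) {
    free(buf->items);
    buf->items = NULL;
    buf->count = 0;
    buf->capacity = 0;
}
```

Under this assumption, a wrapper such as hs_scan_with_result could pass collect_match and a match_buffer to the existing scan routine and hand the filled buffer back to the caller in a single native call.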
After making the change I rebuilt Hyperscan 5.3.0 to produce libhs.so. When I tested the new build, however, the symbol hs_scan_with_result could not be found, and an undefined symbol error was reported:
Error looking up function 'hs_scan_with_result': /root/.cache/JNA/temp/jna7512264629104952875.tmp:
undefined symbol: hs_scan_with_result
Next, I inspected the symbol table of libhs.so with nm:
nm -A libhs.so | grep hs_scan_with_result
libhs.so:000000000053e3c0 T avx2_hs_scan_with_result
libhs.so:000000000031d000 T core2_hs_scan_with_result
libhs.so:0000000000431320 T corei7_hs_scan_with_result
There are prefixed symbols, but no plain hs_scan_with_result symbol.
I then looked at the symbols of the existing, working hs_scan:
nm -A libhs.so | grep hs_scan
libhs.so:000000000053dd90 T avx2_hs_scan
libhs.so:000000000053eb60 T avx2_hs_scan_stream
libhs.so:000000000053fee0 T avx2_hs_scan_vector
libhs.so:000000000053e3c0 T avx2_hs_scan_with_result
libhs.so:000000000031c9d0 T core2_hs_scan
libhs.so:000000000031d7d0 T core2_hs_scan_stream
libhs.so:000000000031eba0 T core2_hs_scan_vector
libhs.so:000000000031d000 T core2_hs_scan_with_result
libhs.so:0000000000430cf0 T corei7_hs_scan
libhs.so:0000000000431af0 T corei7_hs_scan_stream
libhs.so:0000000000432ec0 T corei7_hs_scan_vector
libhs.so:0000000000431320 T corei7_hs_scan_with_result
libhs.so:000000000031ae00 t error_hs_scan
libhs.so:000000000031ae60 t error_hs_scan_stream
libhs.so:000000000031ae80 t error_hs_scan_vector
libhs.so:000000000031af60 i hs_scan
libhs.so:000000000031b2c0 i hs_scan_stream
libhs.so:000000000031b3e0 i hs_scan_vector
libhs.so:000000000031af60 t resolve_hs_scan
libhs.so:000000000031b2c0 t resolve_hs_scan_stream
libhs.so:000000000031b3e0 t resolve_hs_scan_vector
hs_scan also has symbols prefixed with avx2_, core2_, and corei7_, but in addition it has an unprefixed hs_scan symbol of its own.
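The comparison made it clear that the new interface was missing some piece of configuration. Comparing the implementations, though, the keywords on the declaration and on the definition were the same as for hs_scan. I also added the corresponding entries to the surrounding configuration files, hs.def and hs_runtime.def, but that had no effect either.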
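I then noticed that the symbol type shown in front of hs_scan is i. In nm output, i marks an indirect function (a GNU ifunc), whose actual implementation is chosen by a resolver function when the library is loaded.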
Then I remembered that addr2line can map an address back to the source file that defines it:
addr2line 000000000031af60 -e libhs.so
dispatcher.c:?
dispatcher.c does contain the definitions for hs_scan, hs_scan_stream, and the related interfaces, so I added a statement in the same style:
CREATE_DISPATCH(hs_error_t, hs_scan_with_result, const hs_database_t *db,
                const char *data, unsigned length, unsigned flags,
                hs_scratch_t *scratch, void *userCtx);
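To illustrate why the unprefixed symbol only appears once a dispatch entry exists, here is a simplified, standalone sketch of the GNU ifunc pattern that the resolve_ and i symbols in the nm output point to. It is not the CREATE_DISPATCH macro itself: demo_add and its variants are hypothetical stand-ins, and the CPU check uses GCC/Clang builtins rather than Hyperscan's own detection code. It assumes a GNU toolchain on an ELF platform such as Linux.

```c
/* Two stand-in implementations. In Hyperscan these would be the
 * avx2_/corei7_/core2_ builds of the same runtime function. */
int avx2_demo_add(int a, int b) { return a + b; }
int generic_demo_add(int a, int b) { return a + b; }

typedef int demo_add_fn(int, int);

/* Resolver: called once by the dynamic loader to pick an implementation.
 * nm lists it as a local 't' symbol, like resolve_hs_scan. */
static demo_add_fn *resolve_demo_add(void) {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init(); /* must run before __builtin_cpu_supports here */
    if (__builtin_cpu_supports("avx2")) {
        return avx2_demo_add;
    }
#endif
    return generic_demo_add;
}

/* The unprefixed public symbol. nm lists it with type 'i', and this is
 * the name a caller looks up. Without this declaration only the
 * prefixed variants exist in the symbol table. */
int demo_add(int a, int b) __attribute__((ifunc("resolve_demo_add")));
```

Callers simply invoke demo_add(2, 3); the loader has already bound that name to whichever variant the resolver returned. This matches what was observed above: the prefixed hs_scan_with_result variants were compiled, but nothing defined the plain name until the dispatch entry was added.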
After adding it, I ran nm on libhs.so again and the symbol I wanted was now there:
nm -A libhs.so | grep hs_scan_with_result
libhs.so:000000000053e4c0 T avx2_hs_scan_with_result
libhs.so:000000000031d100 T core2_hs_scan_with_result
libhs.so:0000000000431420 T corei7_hs_scan_with_result
libhs.so:000000000031ae70 t error_hs_scan_with_result
libhs.so:000000000031b060 i hs_scan_with_result
libhs.so:000000000031b060 t resolve_hs_scan_with_result
After updating the .so in hyperscan-java, everything ran successfully, and performance improved significantly as well.