【正则表达式】pyahocorasick介绍

pyahocorasick是个python模块,由两种数据结构实现:trie和Aho-Corasick自动机。

Trie是一个字符串索引的词典,检索相关项时时间和字符串长度成正比。

AC自动机能够在一次运行中找到给定集合所有字符串。AC自动机其实就是在Trie树上实现KMP,可以完成多模式串的匹配。
(推荐学习资料:http://blog.csdn.net/niushuai666/article/details/7002823;http://www.cnblogs.com/kuangbin/p/3164106.html)

作者

Wojciech Muła, [email protected]
官方地址:
https://pypi.python.org/pypi/pyahocorasick/

安装

python setup.py install

API

模块ahocorasick包含几个常量和Automaton类

Unicode和bytes

Automaton接受和返回的字符串既可以是unicode,也可以是bytes,取决于编译阶段设置。用模块成员unicode表示选择的类型。

需要注意的是:如果选择使用unicode,trie存储每个字母时用2或者4个bytes;如果选择bytes,每个字母只需要一个byte。

constants

  • unicode:
  • STORE_ANY,STORE_INTS,STORE_LENGTH
  • EMPTY,TRIE,AHOCORASICK
  • MATCH_EXACT_LENGTH,MATCH_AT_MOST_PREFIX,MATCH_AT_LEAST_PREFIX

Antomaton类

略。有兴趣了解请看官网介绍。

AutomatonSearchIter类

略。有兴趣了解请看官网介绍。

示例

import ahocorasick
A = ahocorasick.Automaton()

# 向trie树中添加单词
for index,word in enumerate("he her hers she".split()):
    A.add_word(word, (index, word))
# 用法分析add_word(word,[value]) => bool
# 根据Automaton构造函数的参数store设置,value这样考虑:
# 1. 如果store设置为STORE_LENGTH,不能传递value,默认保存len(word)
# 2. 如果store设置为STORE_INTS,value可选,但必须是int类型,默认是len(automaton)
# 3. 如果store设置为STORE_ANY,value必须写,可以是任意类型

# 测试单词是否在树中
if "he" in A:
    print True
else:
    print False
A.get("he")
# (0,'he')
A.get("cat","")
# ''
A.get("dog")
# KeyError

# 将trie树转化为Aho-Corasick自动机
A.make_automaton()

# 找到所有匹配字符串
for item in A.iter("_hershe_"):
    print item
#(2,(0,'he'))
#(3,(1,'her'))
#(4, (2, 'hers'))
#(6, (3, 'she'))
#(6, (0, 'he'))

示例2

import ahocorasick
A = ahocorasick.Automaton()

# 添加单词
for index,word in enumerate("cat catastropha rat rate bat".split()):
    A.add_word(word, (index, word))

# prefix
list(A.keys("cat"))
## ["cat","catastropha"]

list(A.keys("?at","?",ahocprasick.MATCH_EXACT_LENGTH))
## ['bat','cat','rat']

list(A.keys("?at?", "?", ahocorasick.MATCH_AT_MOST_PREFIX))
## ["bat", "cat", "rat", "rate"]

list(A.keys("?at?", "?", ahocorasick.MATCH_AT_LEAST_PREFIX))
## ['rate']
## keys用法分析
## keys([prefix, [wildcard, [how]]]) => yield strings
## If prefix (a string) is given, then only words sharing this prefix are yielded.
## If wildcard (single character) is given, then prefix is treated as a simple pattern with selected wildcard. Optional parameter how controls which strings are matched:
## MATCH_EXACT_LENGTH [default]:Only strings with the same length as a pattern’s length are yielded. In other words, literally match a pattern.
## MATCH_AT_LEAST_PREFIX:Strings that have length greater or equal to a pattern’s length are yielded.
## MATCH_AT_MOST_PREFIX:Strings that have length less or equal to a pattern’s length are yielded.

你可能感兴趣的:(python)