Shallow parsing
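Shallow parsing, also called chunking, is a shallow analysis method that sits between part-of-speech (POS) tagging and constituency parsing. It identifies the minimal phrase chunks in a text, such as noun phrases (NP), verb phrases (VP), and prepositional phrases (PP).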
Introduction
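For example, in the figure above, the noun phrase chunks (NP-chunks) are extracted from the text "We saw the yellow dog", which yields the corresponding shallow syntactic structure.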
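In terms of method, chunking resembles named entity recognition (NER): both are sequence labeling problems. Common tagging schemes include BMES, BIO, and BIOE. Each tag is combined with the chunk type X; for example, B-NP marks the beginning of a noun phrase chunk.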
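To make the tagging scheme concrete, here is a minimal sketch of how chunk spans map to BIO tags. The function name `spans_to_bio` and the gold spans are assumptions made for illustration (the spans treat "We" and "the yellow dog" as the NP chunks of the example sentence); they are not part of any library.

```python
from typing import List, Tuple

def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, chunk_type) spans, end exclusive, into BIO tags."""
    tags = ["O"] * n_tokens              # tokens outside any chunk stay "O"
    for start, end, chunk_type in spans:
        tags[start] = f"B-{chunk_type}"  # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f"I-{chunk_type}"  # remaining tokens inside the chunk
    return tags

tokens = ["We", "saw", "the", "yellow", "dog"]
# Hypothetical gold spans for illustration only.
spans = [(0, 1, "NP"), (2, 5, "NP")]
bio_tags = spans_to_bio(len(tokens), spans)
for token, tag in zip(tokens, bio_tags):
    print(f"{token}\t{tag}")
```

With this encoding, a chunker only has to predict one tag per token, which is why the task can be handled by the same sequence labeling models used for NER.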
The image is taken from a blog post.
The phrase chunks in a sentence generally fall into the following types:
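However, existing tools (spaCy, TextBlob, and others) generally focus only on the NP-chunking task and extract just the noun phrase chunks from a text sequence. The CoNLL-2000 chunking task extracts NP, VP, and PP chunks, and a corresponding dataset is provided for it as well.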
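Practice
Chunking can be done with either a rule-based method or a machine-learning-based method.
Rule-based method
A rule-based method requires defining the chunking grammar by hand, and care must be taken with nested phrases.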
import nltk

def preprocess(doc):
    # Split into sentences, tokenize each sentence, then POS-tag the tokens.
    # Requires the NLTK tokenizer and tagger models (see nltk.download).
    sentences = nltk.sent_tokenize(doc)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    sentences = [nltk.pos_tag(sent) for sent in sentences]
    return sentences

sentence = "The blogger taught the reader to chunk"
sentence = preprocess(sentence)
print(sentence)

# Pattern: determiner (0 or 1) + adjectives (0 or more) + noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
NPChunker = nltk.RegexpParser(grammar)
result = NPChunker.parse(sentence[0])
print(result)
Output:
[[('The', 'DT'), ('blogger', 'NN'), ('taught', 'VBD'), ('the', 'DT'), ('reader', 'NN'), ('to', 'TO'), ('chunk', 'VB')]]
(S
  (NP The/DT blogger/NN)
  taught/VBD
  (NP the/DT reader/NN)
  to/TO
  chunk/VB)
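To show what such a tag pattern does under the hood, the following is a minimal sketch that applies the same rule (optional determiner, any number of adjectives, one noun) with Python's standard `re` module instead of NLTK. The helper name `np_chunks` and the trick of encoding the tag sequence as a string are assumptions made for this illustration; NLTK's own `RegexpParser` may be implemented differently.

```python
import re
from typing import List, Tuple

def np_chunks(tagged: List[Tuple[str, str]]) -> List[List[str]]:
    """Find chunks matching: optional DT, zero or more JJ, then one NN."""
    # Encode the POS sequence as "<DT><NN><VBD>..." so a regex can scan it.
    tag_string = "".join(f"<{pos}>" for _, pos in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Map character offsets back to token indices by counting "<".
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append([word for word, _ in tagged[start:end]])
    return chunks

# The POS-tagged sentence from the example above, typed in by hand.
tagged = [("The", "DT"), ("blogger", "NN"), ("taught", "VBD"),
          ("the", "DT"), ("reader", "NN"), ("to", "TO"), ("chunk", "VB")]
chunks = np_chunks(tagged)
print(chunks)
```

Because the pattern is flat, it cannot produce a chunk inside another chunk; handling nested phrases needs additional rules or several passes, which is the point where hand-written grammars become hard to maintain.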
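Machine-learning-based method (maximum entropy classifier)
The input can take two forms: the raw text alone, or the raw text plus POS tags (the latter gives much higher accuracy than the former).
Here we use the conll2000 corpus that ships with NLTK, which can be downloaded with the commands below, and train a maximum entropy classifier that automatically extracts noun phrase (NP), verb phrase (VP), and prepositional phrase (PP) chunks from text: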
import nltk
nltk.download("conll2000")
The code is as follows:
def tags_since_dt(sentence, i):
    # Collect the POS tags seen since the most recent determiner (DT).
    tags = set()
    for word, pos in sentence[:i]:
        if pos == 'DT':
            tags = set()
        else:
            tags.add(pos)
    return '+'.join(sorted(tags))

def npchunk_features(sentence, i, history):
    # Features for token i: its word and POS tag plus the neighboring context.
    word, pos = sentence[i]
    if i == 0:
        prevword, prevpos = "<START>", "<START>"
    else:
        prevword, prevpos = sentence[i - 1]
    if i == len(sentence) - 1:
        nextword, nextpos = "<END>", "<END>"
    else:
        nextword, nextpos = sentence[i + 1]
    return {"pos": pos,
            "word": word,
            "prevpos": prevpos,
            "nextpos": nextpos,
            "prevword": prevword,
            "nextword": nextword,
            "prevpos+pos": "%s+%s" % (prevpos, pos),
            "pos+nextpos": "%s+%s" % (pos, nextpos),
            "prevpos+pos+nextpos": "%s+%s+%s" % (prevpos, pos, nextpos),
            "prevword+word+nextword": "%s+%s+%s" % (prevword, word, nextword),
            "tags-since-dt": tags_since_dt(sentence, i)}

class ConsecutiveNPChunkTagger(nltk.TaggerI):
    def __init__(self, train_sents):
        train_set = []
        for tagged_sent in train_sents:
            untagged_sent = nltk.tag.untag(tagged_sent)
            history = []
            for i, (word, tag) in enumerate(tagged_sent):
                featureset = npchunk_features(untagged_sent, i, history)
                train_set.append((featureset, tag))
                history.append(tag)
        self.classifier = nltk.MaxentClassifier.train(
            train_set, algorithm='IIS', trace=0)

    def tag(self, sentence):
        history = []
        for i, word in enumerate(sentence):
            featureset = npchunk_features(sentence, i, history)
            tag = self.classifier.classify(featureset)
            history.append(tag)
        # Return a list so the result can be iterated more than once.
        return list(zip(sentence, history))

# Model and feature construction
class ConsecutiveNPChunker(nltk.ChunkParserI):
    def __init__(self, train_sents):
        tagged_sents = [[((w, t), c) for (w, t, c) in
                         nltk.chunk.tree2conlltags(sent)]
                        for sent in train_sents]
        # word -> POS tag -> chunk tag
        # iob_tagged = tree2conlltags(chunked_sentence)
        # chunk_tree = conlltags2tree(iob_tagged)
        # len(conll2000.chunked_sents())  # 10948
        # len(conll2000.chunked_words())  # 166433
        self.tagger = ConsecutiveNPChunkTagger(tagged_sents)

    def parse(self, sentence):
        tagged_sents = self.tagger.tag(sentence)
        conlltags = [(w, t, c) for ((w, t), c) in tagged_sents]
        return nltk.chunk.conlltags2tree(conlltags)

from nltk.corpus import conll2000

# Load the training and test data
train_sents = conll2000.chunked_sents('train.txt')
chunked_sentence = conll2000.chunked_sents()[0]
test_sents = conll2000.chunked_sents('test.txt')

# Train the model
chunker = ConsecutiveNPChunker(train_sents)

# Evaluate on the test set
print(chunker.evaluate(test_sents))

# Save the model
import pickle
with open("chunker.bin", "wb") as f:
    pickle.dump(chunker, f)

# Load the model
with open("chunker.bin", "rb") as f:
    chunker = pickle.load(f)

# Test example
sentence = 'It is the 2019 novel coronavirus that has breaks out worldwide.'
test_sent_words = nltk.word_tokenize(sentence)
test_sent_pos = nltk.pos_tag(test_sent_words)
print(chunker.parse(test_sent_pos))
Output:
ChunkParse score:
IOB Accuracy: 93.9%%
Precision: 89.0%%
Recall: 92.1%%
F-Measure: 90.5%%
(S
  (NP It/PRP)
  (VP is/VBZ)
  (NP the/DT 2019/CD novel/NN coronavirus/NN)
  (NP that/WDT)
  (VP has/VBZ breaks/VBN)
  out/RP
  (NP worldwide/NN)
  ./.)
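To help read the precision, recall, and F-measure figures in the score report, here is a minimal sketch of how chunk-level scores are commonly computed from BIO tag sequences: a predicted chunk counts as correct only if its boundaries and type both match a gold chunk. The helper names `bio_to_chunks` and `chunk_prf` and the gold/predicted sequences are hypothetical, and NLTK's own scorer may differ in details.

```python
from typing import List, Set, Tuple

Chunk = Tuple[int, int, str]  # (start, end_exclusive, chunk_type)

def bio_to_chunks(tags: List[str]) -> Set[Chunk]:
    """Decode a BIO tag sequence into a set of typed chunk spans."""
    chunks: Set[Chunk] = set()
    start, ctype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes the last chunk
        # The open chunk continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and ctype == tag[2:]
        if ctype is not None and not continues:
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # Open a chunk on B-, or leniently on an I- with no open chunk.
        if tag.startswith("B-") or (tag.startswith("I-") and ctype is None):
            start, ctype = i, tag[2:]
    return chunks

def chunk_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Chunk-level precision, recall, and F-measure (harmonic mean)."""
    g, p = bio_to_chunks(gold), bio_to_chunks(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical sequences: the prediction splits the last NP into two chunks.
gold = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "O"]
pred = ["B-NP", "B-VP", "B-NP", "I-NP", "B-NP", "O"]
precision, recall, f_measure = chunk_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")

# The reported F-measure is consistent with the reported precision and recall.
f_reported = 2 * 0.890 * 0.921 / (0.890 + 0.921)
print(f"{f_reported:.3f}")
```

Scoring whole chunks is stricter than per-token IOB accuracy, which is why the accuracy figure in the report is higher than the precision and recall.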