Running the last example of Chapter 5 of Natural Language Processing with Python under Python 3.7:
from nltk.tbl import demo as brill_demo
brill_demo.demo()
print(open("errors.out").read())
produces the following error:
Traceback (most recent call last):
  File "E:/Python Practice/NLP/Chapter5.py", line 332, in <module>
    print(open("errors.out").read())
FileNotFoundError: [Errno 2] No such file or directory: 'errors.out'
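Taken literally, the file does not exist, and a look at the current directory confirmed that it was not there. Searching turned up no ready-made solution, so I asked for help on StackOverflow. My suspicion was a version difference in the nltk.tbl.demo module: does the newer module have some other function that generates a file like errors.out?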
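Before digging into the library, it can help to confirm where Python is actually looking for the file. The sketch below is only an illustration: `read_if_exists` is a hypothetical helper, not part of NLTK, and it assumes the file, if present, is UTF-8 text.

```python
from pathlib import Path


def read_if_exists(path):
    """Return the text of `path`, or None when no such file exists."""
    p = Path(path)
    if not p.is_file():
        return None
    return p.read_text(encoding="utf-8")


text = read_if_exists("errors.out")
if text is None:
    # Relative paths are resolved against the current working directory.
    print("errors.out not found in", Path.cwd())
else:
    print(text)
```

Printing the working directory rules out the simple case where the file was written somewhere else.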
So I read the source of the nltk/tbl/demo module, and sure enough there is a similar function:
def demo_error_analysis():
    """
    Writes a file with context for each erroneous word after tagging testing data
    """
    postag(error_output="errors.txt")
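A quicker way to look for candidate functions, without opening the source file, is to list what a module defines. This is a rough sketch: `public_functions` is a hypothetical helper, and the fallback to `json` is only there so the snippet still runs where NLTK is not installed.

```python
import importlib
import inspect


def public_functions(module_name):
    """List public functions defined directly in the named module."""
    module = importlib.import_module(module_name)
    return sorted(
        name
        for name, obj in inspect.getmembers(module, inspect.isfunction)
        if not name.startswith("_") and obj.__module__ == module.__name__
    )


try:
    print(public_functions("nltk.tbl.demo"))
except ImportError:
    # NLTK is not installed here; any importable module works for a dry run.
    print(public_functions("json"))
```

Calling `help()` on any name that looks promising then shows its docstring.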
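According to the docstring, this function does exactly what is needed: it writes a file like errors.out (here named errors.txt). So the natural idea is to call demo_error_analysis() first and then read the file it generates: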
brill_demo.demo_error_analysis()
But things are rarely that simple... Running it fails with:
Traceback (most recent call last):
  File "E:/Python Practice/NLP/Chapter5.py", line 331, in <module>
    brill_demo.demo_error_analysis()
  File "D:\Anaconda3\lib\site-packages\nltk\tbl\demo.py", line 124, in demo_error_analysis
    postag(error_output="errors.txt")
  File "D:\Anaconda3\lib\site-packages\nltk\tbl\demo.py", line 322, in postag
    u"\n".join(error_list(gold_data, taggedtest)).encode("utf-8") + "\n" #
TypeError: can't concat str to bytes
Following the path given in the traceback leads to the offending code in the source file:
# writing error analysis to file
if error_output is not None:
    with open(error_output, "w") as f:
        f.write("Errors for Brill Tagger %r\n\n" % serialize_output)
        f.write(
            u"\n".join(error_list(gold_data, taggedtest)).encode("utf-8") + "\n"
        )
    print("Wrote tagger errors including context to {0}".format(error_output))
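In other words, the error is raised on the line below: a type mismatch occurs when the encoded output of error_list is concatenated with "\n".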
u"\n".join(error_list(gold_data, taggedtest)).encode("utf-8") + "\n"
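After consulting an article on this error, the cause became clear: in Python 3, encode() returns a bytes object, which cannot be concatenated directly with a str. Calling decode() on it converts the bytes back to a str.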
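The behaviour is easy to reproduce in isolation. In this minimal sketch the two strings in `lines` are made-up stand-ins for whatever error_list(...) returns.

```python
# Hypothetical stand-ins for the strings returned by error_list(...).
lines = ["first error line", "second error line"]

joined = "\n".join(lines)         # str
encoded = joined.encode("utf-8")  # bytes

try:
    encoded + "\n"                # bytes + str is not allowed in Python 3
except TypeError as exc:
    print("TypeError:", exc)

# Three ways to make both operands the same type:
fixed_roundtrip = encoded.decode("utf-8") + "\n"  # decode back to str
fixed_plain = joined + "\n"                       # never encode at all
fixed_bytes = encoded + b"\n"                     # stay in bytes

print(fixed_roundtrip == fixed_plain)
```

Encoding and then immediately decoding is a round trip, so dropping the encode() call gives the same str. The bytes variant would only suit a file opened in binary mode ("wb"), which is not how the NLTK code opens it.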
So the fix is simple: change that line to
u"\n".join(error_list(gold_data, taggedtest)).encode("utf-8").decode() + "\n" #add .decode()
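(Your editor may ask you to confirm that you really want to modify a library file. Go ahead, friends. If that makes you nervous, add a comment after the change noting what you modified, as I did above.)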
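For comparison, here is how the same kind of write could look when written directly for Python 3, where a file opened in text mode takes str and the encoding is passed to open(). This is a standalone sketch, not NLTK's code: `write_error_report`, the sample lines, and the temporary file name are all hypothetical.

```python
import os
import tempfile


def write_error_report(path, header, lines):
    """Write a header and the error lines as UTF-8 text."""
    # Text mode plus an explicit encoding: write() receives str, never bytes.
    with open(path, "w", encoding="utf-8") as f:
        f.write(header + "\n\n")
        f.write("\n".join(lines) + "\n")


sample_lines = ["left | word/NN->JJ | right", "left | word/VBP->VB | right"]
report_path = os.path.join(tempfile.gettempdir(), "errors_sketch.txt")
write_error_report(report_path, "Errors for Brill Tagger None", sample_lines)

with open(report_path, encoding="utf-8") as f:
    print(f.read())
```

Letting open() handle the encoding keeps every value passed to write() a str, so the mismatch cannot arise.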
After this small change, run it again:
brill_demo.demo_error_analysis()
This time it works!
Loading tagged data from treebank...
Read testing data (200 sents/5251 wds)
Read training data (800 sents/19933 wds)
Read baseline data (800 sents/19933 wds) [reused the training set]
Trained baseline tagger
Accuracy on test set: 0.8366
Training tbl tagger...
TBL train (fast) (seqs: 800; tokens: 19933; tpls: 24; min score: 3; min acc: None)
Finding initial useful rules...
Found 12799 useful rules.
           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  23  23   0   0  | POS->VBZ if Pos:PRP@[-2,-1]
  18  19   1   0  | NN->VB if Pos:-NONE-@[-2] & Pos:TO@[-1]
  14  14   0   0  | VBP->VB if Pos:MD@[-2,-1]
  12  12   0   0  | VBP->VB if Pos:TO@[-1]
  11  11   0   0  | VBD->VBN if Pos:VBD@[-1]
  11  11   0   0  | IN->WDT if Pos:-NONE-@[1] & Pos:VBP@[2]
  10  11   1   0  | VBN->VBD if Pos:PRP@[-1]
   9  10   1   0  | VBD->VBN if Pos:VBZ@[-1]
   8   8   0   0  | NN->VB if Pos:MD@[-1]
   7   7   0   1  | VB->NN if Pos:DT@[-1]
   7   7   0   0  | VB->VBP if Pos:PRP@[-1]
   7   7   0   0  | IN->WDT if Pos:-NONE-@[1] & Pos:VBZ@[2]
   7   8   1   0  | IN->RB if Word:as@[2]
   6   6   0   0  | VBD->VBN if Pos:VBP@[-2,-1]
   6   6   0   1  | IN->WDT if Pos:-NONE-@[1] & Pos:VBD@[2]
   5   5   0   0  | POS->VBZ if Pos:-NONE-@[-1]
   5   5   0   0  | VB->VBP if Pos:NNS@[-1]
   5   5   0   0  | VBD->VBN if Word:be@[-2,-1]
   4   4   0   0  | POS->VBZ if Pos:``@[-2]
   4   4   0   0  | VBP->VB if Pos:VBD@[-2,-1]
   4   6   2   3  | RP->RB if Pos:CD@[1,2]
   4   4   0   0  | RB->JJ if Pos:DT@[-1] & Pos:NN@[1]
   4   4   0   0  | NN->VBP if Pos:NNS@[-2] & Pos:RB@[-1]
   4   5   1   0  | VBN->VBD if Pos:NNP@[-2] & Pos:NNP@[-1]
   4   4   0   0  | IN->WDT if Pos:-NONE-@[1] & Pos:MD@[2]
   4   8   4   0  | VBD->VBN if Word:*@[1]
   4   4   0   0  | JJS->RBS if Word:most@[0] & Word:the@[-1] & Pos:DT@[-1]
   3   3   0   0  | VBD->VBN if Pos:VBN@[-1]
   3   4   1   0  | VBN->VB if Pos:TO@[-1]
   3   4   1   1  | IN->RB if Pos:.@[1]
   3   3   0   0  | JJ->RB if Pos:VBD@[1]
   3   3   0   0  | PRP$->PRP if Pos:TO@[1]
   3   3   0   0  | NN->VBP if Pos:NNS@[-1] & Pos:DT@[1]
   3   3   0   0  | VBP->VB if Word:n't@[-2,-1]
Trained tbl tagger in 2.45 seconds
Accuracy on test set: 0.8572
Tagging the test data
Wrote tagger errors including context to errors.txt
A new file, errors.txt, now appears in the current directory.
The last step is to read the file and print it:
print(open("errors.txt").read())
The output looks like this (excerpt):
Errors for Brill Tagger None

             left context |    word/test->gold     | right context
--------------------------+------------------------+--------------------------
                          |      Soon/NN->RB       | ,/, T-shirts/NNS *ICH*-1/
n/IN the/DT corridors/NNS |      that/IN->WDT      | *T*-2/-NONE- carried/VBD
NNS that/WDT *T*-2/-NONE- |    carried/VBN->VBD    | the/DT school/NN 's/POS f
D the/DT school/NN 's/POS |    familiar/NN->JJ     | red-and-white/JJ GHS/NNP
ool/NN 's/POS familiar/JJ |  red-and-white/NN->JJ  | GHS/NNP logo/NN on/IN the
iliar/JJ red-and-white/JJ |      GHS/NN->NNP       | logo/NN on/IN the/DT fron
/NN ,/, the/DT shirts/NNS |     read/VBP->VBD      | ,/, ``/`` We/PRP have/VBP
,/, ``/`` We/PRP have/VBP |      all/DT->PDT       | the/DT answers/NNS ./. ''
JJ colleagues/NNS are/VBP |      angry/NN->JJ      | at/IN Mrs./NNP Yeargin/NN
n/NNP Rice/NNP ,/, who/WP |   *T*-100/NN->-NONE-   | had/VBD discovered/VBN th
VBD discovered/VBN the/DT |      crib/JJ->NN       | notes/NNS ./.
             ``/`` We/PRP |      work/NN->VBP      | damn/RB hard/RB at/IN wha
    ``/`` We/PRP work/VBP |      damn/NN->RB       | hard/RB at/IN what/WP we/
/IN what/WP we/PRP do/VBP |   *T*-101/NN->-NONE-   | for/IN damn/RB little/JJ
VBP *T*-101/-NONE- for/IN |      damn/NN->RB       | little/JJ pay/NN ,/, and/
...
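Each line of the report carries the mistake in its middle column as word/test->gold, so the most frequent tag confusions can be tallied with a few lines of code. The sketch below is one possible way to do that, not something from the book or from NLTK: `count_confusions` and the regular expression are my own assumptions about the format, and the sample is the first few rows of the excerpt above.

```python
import re
from collections import Counter

# Matches the middle column, "word/test->gold", between the two "|" separators.
MIDDLE = re.compile(r"\|\s*(?P<word>\S+)/(?P<test>[^\s/]+?)->(?P<gold>\S+)\s*\|")


def count_confusions(lines):
    """Count (test_tag, gold_tag) pairs over the error lines."""
    counts = Counter()
    for line in lines:
        if "word/test->gold" in line:  # skip the column header
            continue
        match = MIDDLE.search(line)
        if match:
            counts[(match["test"], match["gold"])] += 1
    return counts


sample = """\
| Soon/NN->RB | ,/, T-shirts/NNS *ICH*-1/
n/IN the/DT corridors/NNS | that/IN->WDT | *T*-2/-NONE- carried/VBD
NNS that/WDT *T*-2/-NONE- | carried/VBN->VBD | the/DT school/NN 's/POS f
D the/DT school/NN 's/POS | familiar/NN->JJ | red-and-white/JJ GHS/NNP
ool/NN 's/POS familiar/JJ | red-and-white/NN->JJ | GHS/NNP logo/NN on/IN the
iliar/JJ red-and-white/JJ | GHS/NN->NNP | logo/NN on/IN the/DT fron
/NN ,/, the/DT shirts/NNS | read/VBP->VBD | ,/, ``/`` We/PRP have/VBP
,/, ``/`` We/PRP have/VBP | all/DT->PDT | the/DT answers/NNS ./. ''
JJ colleagues/NNS are/VBP | angry/NN->JJ | at/IN Mrs./NNP Yeargin/NN
"""

for (test_tag, gold_tag), n in count_confusions(sample.splitlines()).most_common(3):
    print(f"{test_tag} -> {gold_tag}: {n}")
```

To run it on the real file, pass the lines of errors.txt instead of the sample, for example `count_confusions(open("errors.txt", encoding="utf-8"))`.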
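And with that, the original problem is solved~
I'm writing this up at the tail end of Singles' Day (November 11) to summarize a problem that cost me two or three hours. I hope it helps whoever comes next~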