Rasa's preset pipelines do not support extracting Chinese entities
Data source
nlu.md
## intent:greet
- 你好
- 你好啊
- 你好吗
- hello
- hi
- 早上好
- 晚上好
- 嗨
## intent:affirm
- 是的
- 是
- 对的
- 确实
- 好
- ok
- 好的
- 好的,谢谢你
- 好滴
- 好啊
## intent:restaurant_search
- 我想找地方吃饭
- 我想吃[火锅](food)啊
- 找个吃[拉面](food)的店
- 这附近哪里有吃[麻辣烫](food)的地方
- 附近有什么好吃的地方吗
- 肚子饿了,推荐一家吃饭的地儿呗
- 带老婆孩子去哪里吃饭比较好
- 想去一家有情调的餐厅
## intent:goodbye
- bye
- 再见
- 886
- 拜拜
- 下次见
## intent:medical
- [感冒](disease)了怎么办
- 我[便秘](disease)了,该吃什么药
- 我[胃痛](disease),该吃什么药?
- 一直[打喷嚏](disease)怎么办
- 父母都有[高血压](disease),我会遗传吗
- 我生病了
- 头上烫烫的,感觉[发烧](disease)了
- 头很疼该怎么办
- 减肥有什么好方法吗?
- 怎样良好的生活习惯才能预防生病呢?
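The [value](entity) markup in the examples above is how the Markdown training format marks entity spans. As a rough illustration of what that annotation encodes, the sketch below strips the markup and recovers character offsets. It is not Rasa's own parser; `parse_example` and `ENTITY_PATTERN` are hypothetical names introduced here.

```python
import re

# Matches the [value](entity) markup used in the nlu.md examples above.
ENTITY_PATTERN = re.compile(r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)")


def parse_example(line):
    """Return (plain_text, entities) for one '- ...' training example."""
    raw = line.strip()
    if raw.startswith("- "):
        raw = raw[2:]
    plain_parts = []
    entities = []
    cursor = 0      # position in the raw (annotated) line
    plain_len = 0   # length of the plain text built so far
    for match in ENTITY_PATTERN.finditer(raw):
        before = raw[cursor:match.start()]
        plain_parts.append(before)
        plain_len += len(before)
        value = match.group("value")
        entities.append({
            "start": plain_len,
            "end": plain_len + len(value),
            "value": value,
            "entity": match.group("entity"),
        })
        plain_parts.append(value)
        plain_len += len(value)
        cursor = match.end()
    plain_parts.append(raw[cursor:])
    return "".join(plain_parts), entities


text, entities = parse_example("- 我想吃[火锅](food)啊")
print(text)
print(entities)
```

The offsets count characters in the plain text, which is the same convention the start and end fields use in the parse results shown later.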
config.yml
language: zh
pipeline: supervised_embeddings
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy
Training
rasa train nlu
Prediction
rasa shell nlu
Result
Next message:
我想吃麻辣烫
{
"intent": {
"name": "restaurant_search",
"confidence": 0.8509110808372498
},
"entities": [],
"intent_ranking": [
{
"name": "restaurant_search",
"confidence": 0.8509110808372498
},
{
"name": "greet",
"confidence": 0.07149744778871536
},
{
"name": "goodbye",
"confidence": 0.030554646626114845
},
{
"name": "medical",
"confidence": 0.02477618120610714
},
{
"name": "affirm",
"confidence": 0.022260669618844986
}
],
"text": "我想吃麻辣烫"
}
The intent is recognized correctly, but entity extraction fails: the entities list is empty.
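One plausible cause, stated here as an assumption rather than something the output proves: the supervised_embeddings preset tokenizes on whitespace, and Chinese text has no spaces, so the whole sentence becomes one token and the entity span never lines up with a token boundary. A minimal sketch of that mismatch in plain Python follows. `token_spans` is a hypothetical helper, `str.split` merely stands in for a whitespace tokenizer, and the segmented token list is written by hand.

```python
def token_spans(text, tokens):
    """Return (start, end) character offsets for each token, in order."""
    spans = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)
        end = start + len(token)
        spans.append((start, end))
        cursor = end
    return spans


text = "我想吃麻辣烫"
entity_span = (3, 6)  # 麻辣烫, the span an extractor would need to return

# Whitespace splitting leaves the whole sentence as a single token.
whitespace_tokens = text.split()
whitespace_spans = token_spans(text, whitespace_tokens)

# A hand-written segmentation, standing in for a Chinese word segmenter.
segmented_tokens = ["我", "想", "吃", "麻辣烫"]
segmented_spans = token_spans(text, segmented_tokens)

print(whitespace_spans, entity_span in whitespace_spans)
print(segmented_spans, entity_span in segmented_spans)
```

With a token-level extractor, an entity can only be labeled if some run of tokens covers exactly its span, which is why a tokenizer that segments Chinese matters here.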
Option 1: tensorflow_embedding
Option 2: MITIE pre-trained Chinese word-vector model
config.yml (Option 1)
language: zh
pipeline:
- name: JiebaTokenizer
- name: CRFEntityExtractor
- name: CountVectorsFeaturizer
  OOV_token: oov
  token_pattern: '(?u)\b\w+\b'
- name: EmbeddingIntentClassifier
Result
Next message:
我想吃麻辣烫啊
{
"intent": {
"name": "restaurant_search",
"confidence": 0.9744491577148438
},
"entities": [
{
"start": 3,
"end": 6,
"value": "麻辣烫",
"entity": "food",
"confidence": 0.6385026355026224,
"extractor": "CRFEntityExtractor"
}
],
"intent_ranking": [
{
"name": "restaurant_search",
"confidence": 0.9744491577148438
},
{
"name": "medical",
"confidence": 0.010666999034583569
},
{
"name": "goodbye",
"confidence": 0.00956057757139206
},
{
"name": "affirm",
"confidence": 0.002777603454887867
},
{
"name": "greet",
"confidence": 0.002545646158978343
}
],
"text": "我想吃麻辣烫啊"
}
PS: with the input 我想吃麻辣烫 (without the trailing 啊), it no longer works.
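To sanity-check an extraction like the one in the output above, the start and end offsets can be applied to text and compared with value. A small sketch using only fields that appear in that output (the variable names are illustrative):

```python
import json

# A trimmed copy of the parse result above, keeping only the fields used here.
result = json.loads("""
{
  "text": "我想吃麻辣烫啊",
  "entities": [
    {"start": 3, "end": 6, "value": "麻辣烫", "entity": "food",
     "extractor": "CRFEntityExtractor"}
  ]
}
""")

checks = []
for entity in result["entities"]:
    # Offsets are character positions into the original text.
    span_text = result["text"][entity["start"]:entity["end"]]
    checks.append((entity["entity"], span_text, span_text == entity["value"]))

print(checks)
```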
1. Install MITIE
python setup.py install
2. Install Rasa_NLU_Chi
python setup.py install
3. Modify the configuration
config.yml (Option 2)
language: zh
pipeline:
- name: MitieNLP
  model: data/total_word_feature_extractor_zh.dat
- name: JiebaTokenizer
- name: MitieEntityExtractor
- name: EntitySynonymMapper
- name: RegexFeaturizer
- name: MitieFeaturizer
- name: SklearnIntentClassifier
Result
Next message:
我想吃麻辣烫
{
"intent": {
"name": "restaurant_search",
"confidence": 0.2884605946762254
},
"entities": [
{
"entity": "food",
"value": "麻辣烫",
"start": 3,
"end": 6,
"confidence": null,
"extractor": "MitieEntityExtractor"
}
],
"intent_ranking": [
{
"name": "restaurant_search",
"confidence": 0.2884605946762254
},
{
"name": "medical",
"confidence": 0.2303384672073249
},
{
"name": "affirm",
"confidence": 0.22970967542957596
},
{
"name": "goodbye",
"confidence": 0.1354964350169286
},
{
"name": "greet",
"confidence": 0.11599482766994498
}
],
"text": "我想吃麻辣烫"
}
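The entity 麻辣烫 is extracted correctly, but the intent confidence is low (about 0.29, compared with 0.85 under supervised_embeddings).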
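One way to quantify how weak this prediction is: look at the gap between the top two entries in intent_ranking. The sketch below uses the confidences from the output above, rounded to four places. The 0.1 margin threshold is an arbitrary illustrative value, not a Rasa default.

```python
# Confidences copied from the intent_ranking above (rounded to 4 places).
intent_ranking = [
    {"name": "restaurant_search", "confidence": 0.2885},
    {"name": "medical", "confidence": 0.2303},
    {"name": "affirm", "confidence": 0.2297},
    {"name": "goodbye", "confidence": 0.1355},
    {"name": "greet", "confidence": 0.1160},
]

# Sort defensively, then compare the best intent with the runner-up.
ranked = sorted(intent_ranking, key=lambda item: item["confidence"], reverse=True)
top, runner_up = ranked[0], ranked[1]
margin = top["confidence"] - runner_up["confidence"]

MIN_MARGIN = 0.1  # illustrative threshold, chosen arbitrarily
is_ambiguous = margin < MIN_MARGIN

print(top["name"], round(margin, 4), is_ambiguous)
```

A small margin means the classifier barely prefers the winning intent over the next one, which is a more useful signal than the top confidence alone.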
List every word in the MITIE model; it contains about 200,000 entries.
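from mitie import total_word_feature_extractor

# Load the pre-trained Chinese feature extractor
twfe = total_word_feature_extractor("total_word_feature_extractor_zh.dat")
words = twfe.get_words_in_dictionary()
words = [w.decode("utf-8") for w in words]  # decode bytes to str
for word in words:
    print(word)
print(len(words))  # number of dictionary entries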
How Option 1 works: text is converted to embedding vectors, and intents are distinguished by cosine similarity.
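As a toy illustration of the idea (not Rasa's implementation), the following sketch picks the intent whose vector has the highest cosine similarity to a message vector. All vectors here are made-up numbers, and `cosine_similarity` is a helper defined only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up intent vectors; a real model would learn these from training data.
intent_vectors = {
    "greet": [1.0, 0.0, 0.0],
    "restaurant_search": [0.0, 1.0, 0.2],
    "medical": [0.0, 0.1, 1.0],
}
message_vector = [0.1, 0.9, 0.3]  # made-up embedding of a user message

scores = {
    intent: cosine_similarity(message_vector, vector)
    for intent, vector in intent_vectors.items()
}
best_intent = max(scores, key=scores.get)
print(best_intent)
```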
Drawback of Option 1: it suffers from the out-of-vocabulary (OOV) word problem.
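The out-of-vocabulary issue can be pictured with a toy vocabulary: any token not seen in training is mapped to a placeholder. This is only a sketch of the general idea behind a setting like the OOV_token in the Option 1 config, not the featurizer's actual code. The token lists are hand-segmented for illustration, and `replace_unseen` is a hypothetical helper.

```python
OOV_TOKEN = "oov"

# Hand-segmented training sentences (illustrative only).
training_tokens = [
    ["我", "想", "吃", "火锅", "啊"],
    ["找", "个", "吃", "拉面", "的", "店"],
]
vocabulary = {token for sentence in training_tokens for token in sentence}


def replace_unseen(tokens, vocabulary, oov_token=OOV_TOKEN):
    """Map every token outside the training vocabulary to the OOV placeholder."""
    return [token if token in vocabulary else oov_token for token in tokens]


message_tokens = ["我", "想", "吃", "烧烤"]  # 烧烤 never appears in training
print(replace_unseen(message_tokens, vocabulary))
```

Once a word collapses to the placeholder, the model can no longer tell it apart from any other unseen word, so whatever that word meant is lost to the classifier.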
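How Option 2 works: it relies on a model pre-trained on Wikipedia data.
Drawback of Option 2: it is not suited to training on large datasets.
The official documentation recommends Option 2 when there are fewer than 1,000 training examples, and Option 1 otherwise.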
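Summary: this style of configuration has no future.
For more advanced configuration, see rasa_nlu_gq, which supports Chinese and defines a number of custom models for different scenarios and tasks.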