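Note:
For clarity, the predefined pipelines were renamed as of Rasa NLU 0.15 to reflect what they do rather than which libraries they use. The tensorflow_embedding pipeline is now called supervised_embeddings, and spacy_sklearn is now called pretrained_embeddings_spacy. If you use either of them, please update your configuration.
This article is a reference for the configuration options of every built-in component in Rasa NLU. If you want to build a custom component, see the documentation on custom NLU components.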
Contents
- 1. Word Vector Sources
- 1.1 MitieNLP
- 1.2 SpacyNLP
- 2. Featurizers
- 2.1 MitieFeaturizer
- 2.2 SpacyFeaturizer
- 2.3 ConveRTFeaturizer
- 2.4 RegexFeaturizer
- 2.5 CountVectorsFeaturizer
- 3. Intent Classifiers
- 3.1 MitieIntentClassifier
- 3.2 SklearnIntentClassifier
- 3.3 EmbeddingIntentClassifier
- 3.4 KeywordIntentClassifier
- 4. Selectors
- 5. Tokenizers
- 5.1 WhitespaceTokenizer
- 5.2 JiebaTokenizer
- 5.3 MitieTokenizer
- 5.4 SpacyTokenizer
- 6. Entity Extractors
- 6.1 MitieEntityExtractor
- 6.2 SpacyEntityExtractor
- 6.3 EntitySynonymMapper
- 6.4 CRFEntityExtractor
- 6.5 DucklingHTTPExtractor
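1. Word Vector Sources
1.1 MitieNLP
Short: MITIE initializer
Outputs: nothing
Requires: nothing
Description: Initializes MITIE structures. Every MITIE component relies on this, so it should be placed at the beginning of every pipeline that uses any MITIE components.
Configuration: The MITIE library needs a language model file, which must be specified in the configuration:
pipeline:
- name: "MitieNLP"
  # language model to load
  model: "data/total_word_feature_extractor.dat"
For more information about MITIE, see the MITIE project documentation.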
1.2 SpacyNLP
Short: spaCy language initializer
Outputs: nothing
Requires: nothing
Description: Initializes spaCy structures. Every spaCy component relies on this, so it should be placed at the beginning of every pipeline that uses any spaCy components.
Configuration: By default, the language model for the configured language is loaded. If the spaCy model to be used has a name that differs from the language tag ("en", "de", etc.), the model name can be specified with the model configuration variable. The name is passed to spacy.load(name).
pipeline:
- name: "SpacyNLP"
  # language model to load
  model: "en_core_web_md"
  # when retrieving word vectors, this will decide if the casing
  # of the word is relevant. E.g. `hello` and `Hello` will
  # retrieve the same vector, if set to `false`. For some
  # applications and models it makes sense to differentiate
  # between these two words, therefore setting this to `true`.
  case_sensitive: false
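2. Featurizers
Text featurizers fall into two categories: sparse featurizers and dense featurizers. Sparse featurizers return feature vectors in which most values are missing, i.e. zero. Because such vectors would normally take up a lot of memory, they are stored as sparse features, which keep only the non-zero values and their positions in the vector. This saves a lot of memory and makes it possible to train on larger datasets.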
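To make the idea of sparse storage concrete, here is a minimal sketch in plain Python. It is not how Rasa stores features internally (the helper names `to_sparse` and `to_dense` are made up for illustration); it only shows that keeping the non-zero values together with their positions is enough to recover the full vector.

```python
from typing import Dict, List


def to_sparse(dense: List[float]) -> Dict[int, float]:
    """Keep only the non-zero values, keyed by their position in the vector."""
    return {index: value for index, value in enumerate(dense) if value != 0}


def to_dense(sparse: Dict[int, float], size: int) -> List[float]:
    """Rebuild the full vector; every position that was not stored is zero."""
    return [sparse.get(index, 0.0) for index in range(size)]


dense = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 1.0, 0.0]
sparse = to_sparse(dense)
print(sparse)                         # two stored entries instead of eight
print(to_dense(sparse, len(dense)))   # the original vector again
```

The saving grows with the vocabulary: a bag-of-words vector over thousands of words typically has only a handful of non-zero entries per message.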
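By default, all featurizers return a matrix of size (1 x feature-dimension). All featurizers except ConveRTFeaturizer can optionally return a sequence instead. If the flag "return_sequence" is set to True, the featurizer returns a matrix of size (token-length x feature-dimension), so the matrix has one entry for every token. Otherwise, the matrix has a single entry for the complete utterance. If you want to pass custom features to the CRFEntityExtractor, set "return_sequence" to True. For more details, see the section on passing custom features to the CRFEntityExtractor.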
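The difference between the two output shapes can be sketched with a toy featurizer. Everything here is illustrative: `featurize` is a hypothetical one-hot featurizer, and summing the token rows into one sentence row is an assumption made for the example, not a statement about how any Rasa featurizer pools its features.

```python
from typing import List


def featurize(
    tokens: List[str], vocab: List[str], return_sequence: bool = False
) -> List[List[int]]:
    # One one-hot row per token: shape (token-length x feature-dimension).
    rows = [[1 if token == word else 0 for word in vocab] for token in tokens]
    if return_sequence:
        return rows
    # Assumption for this sketch: collapse the rows into a single
    # sentence-level row by summing each column: shape (1 x feature-dimension).
    return [[sum(column) for column in zip(*rows)]]


vocab = ["hello", "world"]
tokens = ["hello", "hello", "world"]

sequence = featurize(tokens, vocab, return_sequence=True)
sentence = featurize(tokens, vocab)
print(len(sequence), "x", len(sequence[0]))  # token-length x feature-dimension
print(len(sentence), "x", len(sentence[0]))  # 1 x feature-dimension
```

An entity extractor needs one row per token to tag each token, which is why the sequence form is required when features are passed to the CRFEntityExtractor.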
2.1 MitieFeaturizer
Short: MITIE intent featurizer
Outputs: nothing, used as an input to intent classifiers that need intent features (e.g. SklearnIntentClassifier)
Requires: MitieNLP
Type: Dense featurizer
Description: Creates features for intent classification using the MITIE featurizer. Note that it is NOT used by the MitieIntentClassifier component. Currently, only SklearnIntentClassifier is able to use precomputed features.
Configuration:
pipeline:
- name: "MitieFeaturizer"
2.2 SpacyFeaturizer
Short: spaCy intent featurizer
Outputs: nothing, used as an input to intent classifiers that need intent features (e.g. SklearnIntentClassifier)
Requires: SpacyNLP
Type: Dense featurizer
Description: Creates features for intent classification using the spaCy featurizer.
Configuration:
pipeline:
- name: "SpacyFeaturizer"
2.3 ConveRTFeaturizer
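Short: Creates a vector representation of the user message and the response (if specified) using the ConveRT model
Outputs: nothing, used as an input to intent classifiers and response selectors that need intent features and response features respectively (e.g. EmbeddingIntentClassifier and ResponseSelector)
Requires: nothing
Type: Dense featurizer
Description: Creates features for intent classification and response selection. Uses the default signature to compute vector representations of the input text. Notes: (1) Since the ConveRT model is trained only on English data, this featurizer should only be used if your training data is in English. (2) Before using it, you need to install tensorflow_text and tensorflow_hub, which can be done with pip install rasa[convert]. (3) If you set return_sequence to True, Rasa raises an error stating that this option is currently not supported. Do not combine this featurizer with any other featurizer that has "return_sequence" set to True, or training will fail. You can, however, use it together with any other featurizer as long as "return_sequence" is set to False there.
Configuration:
pipeline:
- name: "ConveRTFeaturizer"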
2.4 RegexFeaturizer
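Short: Creates regex features to support intent and entity classification
Outputs: text_features and tokens.pattern
Requires: nothing
Type: Sparse featurizer
Description: Creates features for entity extraction and intent classification. During training, the regex intent featurizer creates a list of regular expressions defined in the training data format. For each regex, a feature is set that marks whether the expression was found in the input. This feature is later fed into the intent classifier / entity extractor to simplify classification (assuming the classifier has learned during training that this set of features indicates a certain intent). Regex features for entity extraction are currently only supported by the CRFEntityExtractor component!
Note:
There needs to be a tokenizer before this featurizer in the pipeline!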
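The "one feature per regex" idea can be sketched in a few lines. This is an illustration only, not the RegexFeaturizer implementation: the function name `regex_features` and the two patterns are made up for the example.

```python
import re
from typing import Dict, List


def regex_features(text: str, patterns: List[Dict[str, str]]) -> List[int]:
    """Return one binary feature per pattern: 1 if it matches the text, else 0."""
    return [1 if re.search(p["pattern"], text) else 0 for p in patterns]


# Hypothetical patterns, in the spirit of regex entries in NLU training data.
patterns = [
    {"name": "zipcode", "pattern": r"\b[0-9]{5}\b"},
    {"name": "greet", "pattern": r"\bhey\b"},
]

print(regex_features("hey, I live in 10115", patterns))
print(regex_features("send it to 1011", patterns))
```

A downstream classifier then sees these flags next to its other features and can learn, for example, that a matching zipcode pattern makes a location entity more likely.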
2.5 CountVectorsFeaturizer
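Short: Creates a bag-of-words representation of user messages and labels (intents and responses)
Outputs: nothing, used as an input to intent classifiers that take a bag-of-words representation of intent features (e.g. EmbeddingIntentClassifier)
Requires: nothing
Type: Sparse featurizer
Description: Creates features for intent classification and response selection. Builds a bag-of-words representation of user message and label features using sklearn's CountVectorizer. All tokens that consist only of digits (e.g. 123 and 99, but not a123d) are assigned to the same feature.
Configuration: The analyzer parameter configures the featurizer to use word or character n-grams. By default, analyzer is set to word, so word token counts are used as features. To use character n-grams, set analyzer to char or char_wb. With char_wb, character n-grams are created only from text inside word boundaries; n-grams at the edges of words are padded with a space. This option can be used to create Subword Semantic Hashing. For character n-grams, do not forget to increase the min_ngram and max_ngram parameters; otherwise the vocabulary will contain only single letters.
Handling out-of-vocabulary (OOV) words: Since training is performed on limited vocabulary data, there is no guarantee that the algorithm will not encounter an unknown word (a word not seen during training) at prediction time. To teach the algorithm how to treat unknown words, some words in the training data can be replaced by the generic word OOV_token. During prediction, all unknown words are then treated as this generic word OOV_token.
For example, you can create a separate intent outofscope in the training data that contains messages with different numbers of OOV_token and possibly some additional general words. The algorithm will then likely classify a message containing unknown words as the intent outofscope.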
pipeline:
- name: "CountVectorsFeaturizer"
  # whether to use a shared vocab
  use_shared_vocab: False
  # whether to use word or character n-grams
  # 'char_wb' creates character n-grams only inside word boundaries
  # n-grams at the edges of words are padded with space.
  analyzer: 'word'  # use 'char' or 'char_wb' for character
  # the parameters are taken from
  # sklearn's CountVectorizer
  # regular expression for tokens
  token_pattern: '(?u)\b\w\w+\b'
  # remove accents during the preprocessing step
  strip_accents: null  # {'ascii', 'unicode', null}
  # list of stop words
  stop_words: null  # string {'english'}, list, or null (default)
  # min document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  min_df: 1  # float in range [0.0, 1.0] or int
  # max document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  max_df: 1.0  # float in range [0.0, 1.0] or int
  # set ngram range
  min_ngram: 1  # int
  max_ngram: 1  # int
  # limit vocabulary size
  max_features: null  # int or null
  # if convert all characters to lowercase
  lowercase: true  # bool
  # handling Out-Of-Vocabulary (OOV) words
  # will be converted to lowercase if lowercase is true
  OOV_token: null  # string or null
  OOV_words: []  # list of strings
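Note:
If the words in the model language cannot be split by whitespace, a language-specific tokenizer is required in the pipeline before this component (e.g. JiebaTokenizer for Chinese).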
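The two normalisation rules described for this featurizer (digit-only tokens share one feature, unknown words collapse into the OOV token) can be sketched with the standard library. This is not the CountVectorsFeaturizer code: the placeholder `__NUMBER__`, the OOV value `oov` and the helper names are assumptions chosen for the example.

```python
from collections import Counter
from typing import List, Optional, Set

OOV_TOKEN = "oov"            # hypothetical value for the OOV_token option
NUMBER_TOKEN = "__NUMBER__"  # hypothetical feature shared by digit-only tokens


def normalize(tokens: List[str], vocab: Optional[Set[str]] = None) -> List[str]:
    """Lowercase tokens, merge digit-only tokens, map unknown words to OOV_TOKEN."""
    result = []
    for token in tokens:
        token = token.lower()
        if token.isdigit():
            token = NUMBER_TOKEN
        elif vocab is not None and token not in vocab:
            token = OOV_TOKEN
        result.append(token)
    return result


def build_vocab(messages: List[str]) -> List[str]:
    """Collect the training vocabulary and reserve a slot for OOV_TOKEN."""
    vocab = {OOV_TOKEN}
    for message in messages:
        vocab.update(normalize(message.split()))
    return sorted(vocab)


def bag_of_words(message: str, vocab: List[str]) -> List[int]:
    """Count how often each vocabulary entry occurs in the message."""
    counts = Counter(normalize(message.split(), set(vocab)))
    return [counts[word] for word in vocab]


vocab = build_vocab(["book a table for 2", "book a flight"])
print(vocab)
print(bag_of_words("book a spaceship for 99", vocab))
```

Here "spaceship" was never seen in training, so it is counted under `oov`, and "99" lands on the same feature as the "2" seen during training.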
3. Intent Classifiers
3.1 MitieIntentClassifier
Short: MITIE intent classifier (using a text categorizer)
Outputs: intent
Requires: a tokenizer and a featurizer
Output-Example: {"intent": {"name": "greet", "confidence": 0.98343}}
Description: This classifier uses MITIE to perform intent classification. The underlying classifier is a multi-class linear SVM with a sparse linear kernel (see the MITIE trainer code).
Configuration:
pipeline:
- name: "MitieIntentClassifier"
3.2 SklearnIntentClassifier
Short: sklearn intent classifier
Outputs: intent and intent_ranking
Requires: a featurizer
Output-Example: {"intent": {"name": "greet", "confidence": 0.78343}, "intent_ranking": [{"confidence": 0.1485910906220309, "name": "goodbye"}, {"confidence": 0.08161531595656784, "name": "restaurant_search"}]}
Description: The sklearn intent classifier trains a linear SVM that is optimized using a grid search. In addition to the predicted intent, it also provides a ranking of the labels that did not "win". The sklearn intent classifier needs to be preceded by a featurizer in the pipeline. This featurizer creates the features used for classification.
Configuration: During training of the SVM, a hyperparameter search is run to find the best parameter set. In the configuration, you can specify the parameters that will be tried:
pipeline:
- name: "SklearnIntentClassifier"
  # Specifies the list of regularization values to
  # cross-validate over for C-SVM.
  # This is used with the ``kernel`` hyperparameter in GridSearchCV.
  C: [1, 2, 5, 10, 20, 100]
  # Specifies the kernel to use with C-SVM.
  # This is used with the ``C`` hyperparameter in GridSearchCV.
  kernels: ["linear"]
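To see what the search space implied by this configuration looks like, the following sketch enumerates every combination of `C` and kernel using only the standard library. It does not train anything and is not the classifier's own code; it only spells out the candidates that a grid search over these two lists would try.

```python
from itertools import product

# Values copied from the configuration example above.
config = {"C": [1, 2, 5, 10, 20, 100], "kernels": ["linear"]}

# A grid search evaluates every (C, kernel) pair and keeps the best one.
grid = [
    {"C": c, "kernel": kernel}
    for c, kernel in product(config["C"], config["kernels"])
]

for candidate in grid:
    print(candidate)
print(len(grid), "candidates")
```

Adding a second kernel doubles the number of candidates, so training time grows with the product of the list lengths.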
3.3 EmbeddingIntentClassifier
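Short: Embedding intent classifier
Outputs: intent and intent_ranking
Requires: a featurizer
Description: The embedding intent classifier embeds user inputs and intent labels into the same space. Supervised embeddings are trained by maximizing the similarity between them. The algorithm is based on StarSpace. In this implementation, however, the loss function is slightly different, and additional hidden layers and dropout are added. The algorithm also provides similarity rankings of the labels that did not "win". The embedding intent classifier needs to be preceded by a featurizer in the pipeline. This featurizer creates the features used for the embeddings. It is recommended to use CountVectorsFeaturizer, which can optionally be preceded by SpacyNLP and SpacyTokenizer.
Configuration: The algorithm has a large number of hyperparameters, too many to describe one by one here. They can be specified in the configuration. The default values are defined in EmbeddingIntentClassifier.defaults: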
defaults = {
    # nn architecture
    # sizes of hidden layers before the embedding layer for input words
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_a": [256, 128],
    # sizes of hidden layers before the embedding layer for intent labels
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_b": [],
    # Whether to share the hidden layer weights between input words and labels
    "share_hidden_layers": False,
    # training parameters
    # initial and final batch sizes - batch size will be
    # linearly increased for each epoch
    "batch_size": [64, 256],
    # how to create batches
    "batch_strategy": "balanced",  # string 'sequence' or 'balanced'
    # number of epochs
    "epochs": 300,
    # set random seed to any int to get reproducible results
    "random_seed": None,
    # embedding parameters
    # default dense dimension used if no dense features are present
    "dense_dim": {"text": 512, "label": 20},
    # dimension size of embedding vectors
    "embed_dim": 20,
    # the number of incorrect labels; the algorithm will minimize
    # their similarity to the user input during training
    "num_neg": 20,
    # the type of the similarity
    "similarity_type": "auto",  # string 'auto' or 'cosine' or 'inner'
    # the type of the loss function
    "loss_type": "softmax",  # string 'softmax' or 'margin'
    # how similar the algorithm should try
    # to make embedding vectors for correct labels
    "mu_pos": 0.8,  # should be 0.0 < ... < 1.0 for 'cosine'
    # maximum negative similarity for incorrect labels
    "mu_neg": -0.4,  # should be -1.0 < ... < 1.0 for 'cosine'
    # flag: if true, only minimize the maximum similarity for incorrect labels
    "use_max_sim_neg": True,
    # scale loss inverse proportionally to confidence of correct prediction
    "scale_loss": True,
    # regularization parameters
    # the scale of L2 regularization
    "C2": 0.002,
    # the scale of how critical the algorithm should be of minimizing the
    # maximum similarity between embeddings of different labels
    "C_emb": 0.8,
    # dropout rate for rnn
    "droprate": 0.2,
    # visualization of accuracy
    # how often to calculate training accuracy
    "evaluate_every_num_epochs": 20,  # small values may hurt performance
    # how many examples to use for calculation of training accuracy
    "evaluate_on_num_examples": 0,  # large values may hurt performance
}
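The comment on "batch_size" says the batch size is increased linearly for each epoch. One plausible reading of that, sketched below, is a straight-line interpolation from the first value to the second across all epochs. The exact formula and rounding are assumptions for illustration; the function name `batch_size_for_epoch` is made up and is not part of the Rasa API.

```python
from typing import List


def batch_size_for_epoch(epoch: int, epochs: int, batch_size: List[int]) -> int:
    """Linearly interpolate between the initial and final batch size.

    `epoch` is zero-based; `batch_size` is [initial, final] as in the defaults.
    """
    start, end = batch_size
    if epochs <= 1:
        return start
    return int(start + epoch * (end - start) / (epochs - 1))


# Values taken from the defaults above: "batch_size": [64, 256], "epochs": 300.
sizes = [batch_size_for_epoch(epoch, 300, [64, 256]) for epoch in range(300)]
print(sizes[0], sizes[150], sizes[-1])
```

Starting with small batches and ending with large ones gives noisy but frequent updates early on and more stable updates towards the end of training.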
Output-Example:
{
    "intent": {"name": "greet", "confidence": 0.8343},
    "intent_ranking": [
        {
            "confidence": 0.385910906220309,
            "name": "goodbye"
        },
        {
            "confidence": 0.28161531595656784,
            "name": "restaurant_search"
        }
    ]
}
Note:
If, during prediction, a message contains only words that were not seen during training and no out-of-vocabulary preprocessor was used, the empty intent None is predicted with confidence 0.0.
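The idea of ranking labels by similarity in a shared embedding space can be sketched without any neural network. The vectors below are hand-written stand-ins for learned embeddings, and `rank_labels` is a hypothetical helper; the real classifier learns the embeddings and turns similarities into confidences in its own way.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_labels(
    message_vec: List[float], label_vecs: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Sort labels by similarity to the message embedding, most similar first."""
    scored = [(name, cosine(message_vec, vec)) for name, vec in label_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings, invented for this example.
label_vecs = {
    "greet": [1.0, 0.0],
    "goodbye": [0.0, 1.0],
    "restaurant_search": [1.0, 1.0],
}
ranking = rank_labels([1.0, 0.0], label_vecs)
print(ranking)
```

The first entry plays the role of the predicted intent, and the remaining entries correspond to the intent_ranking of labels that did not "win".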
3.4 KeywordIntentClassifier
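Short: Simple keyword-matching intent classifier, intended for small, short-term projects
Outputs: intent
Requires: nothing
Output-Example: {"intent": {"name": "greet", "confidence": 1.0}}
Description: This classifier works by searching a message for keywords. By default, the matching is case sensitive and searches only for exact matches of the keyword string in the user message. The keywords for an intent are the examples of that intent in the NLU training data. This means the entire example is the keyword, not the individual words in the example. Note: this classifier is intended only for small projects or for getting started. If you have only a little NLU training data, take a look at the recommended pipelines in the guide on choosing a pipeline.
Configuration:
pipeline:
- name: "KeywordIntentClassifier"
  case_sensitive: True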
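The matching rule described above (the whole training example is the keyword, optionally case sensitive) can be sketched as follows. The word-boundary regex and the shape of the "no match" result are assumptions for illustration; `classify` is a made-up helper, not the component's API.

```python
import re
from typing import Dict, List


def classify(
    message: str,
    examples_by_intent: Dict[str, List[str]],
    case_sensitive: bool = True,
) -> dict:
    """Return the first intent whose full example text occurs in the message."""
    flags = 0 if case_sensitive else re.IGNORECASE
    for intent, examples in examples_by_intent.items():
        for example in examples:
            # The entire example is the keyword, matched on word boundaries.
            if re.search(r"\b" + re.escape(example) + r"\b", message, flags):
                return {"intent": {"name": intent, "confidence": 1.0}}
    # Assumed placeholder result when nothing matches.
    return {"intent": {"name": None, "confidence": 0.0}}


training = {"greet": ["hello", "hi"], "goodbye": ["see you later"]}
print(classify("hello there", training))
print(classify("Hello there", training))
print(classify("Hello there", training, case_sensitive=False))
```

Because the whole example is the keyword, "see you" alone does not trigger the goodbye intent; only a message containing "see you later" does.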
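4. Selectors
Response Selector
Short: Response Selector
Outputs: A dictionary with the key direct_response_intent and a value containing response and ranking
Requires: a featurizer
Description: The Response Selector component can be used to build a response retrieval model that directly predicts a bot response from a set of candidate responses. The prediction of this model is used by Retrieval Actions. It embeds user inputs and response labels into the same space and follows the same neural network architecture and optimization as the EmbeddingIntentClassifier. The response selector needs to be preceded by a featurizer in the pipeline. This featurizer creates the features used for the embeddings. It is recommended to use CountVectorsFeaturizer, which can optionally be preceded by SpacyNLP.
Configuration: The component accepts all hyperparameters used by the EmbeddingIntentClassifier. In addition, it can be configured to train a response selector for a particular retrieval intent. The default values can be found in ResponseSelector.defaults: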
defaults = {
    # nn architecture
    # sizes of hidden layers before the embedding layer for input words
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_a": [256, 128],
    # sizes of hidden layers before the embedding layer for intent labels
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_b": [256, 128],
    # Whether to share the hidden layer weights between input words and intent labels
    "share_hidden_layers": False,
    # training parameters
    # initial and final batch sizes - batch size will be
    # linearly increased for each epoch
    "batch_size": [64, 256],
    # how to create batches
    "batch_strategy": "balanced",  # string 'sequence' or 'balanced'
    # number of epochs
    "epochs": 300,
    # set random seed to any int to get reproducible results
    "random_seed": None,
    # embedding parameters
    # default dense dimension used if no dense features are present
    "dense_dim": {"text": 512, "label": 20},
    # dimension size of embedding vectors
    "embed_dim": 20,
    # the number of incorrect labels; the algorithm will minimize
    # their similarity to the user input during training
    "num_neg": 20,
    # the type of the similarity
    "similarity_type": "auto",  # string 'auto' or 'cosine' or 'inner'
    # the type of the loss function
    "loss_type": "softmax",  # string 'softmax' or 'margin'
    # how similar the algorithm should try
    # to make embedding vectors for correct intent labels
    "mu_pos": 0.8,  # should be 0.0 < ... < 1.0 for 'cosine'
    # maximum negative similarity for incorrect intent labels
    "mu_neg": -0.4,  # should be -1.0 < ... < 1.0 for 'cosine'
    # flag: if true, only minimize the maximum similarity for
    # incorrect intent labels
    "use_max_sim_neg": True,
    # scale loss inverse proportionally to confidence of correct prediction
    "scale_loss": True,
    # regularization parameters
    # the scale of L2 regularization
    "C2": 0.002,
    # the scale of how critical the algorithm should be of minimizing the
    # maximum similarity between embeddings of different intent labels
    "C_emb": 0.8,
    # dropout rate for rnn
    "droprate": 0.2,
    # visualization of accuracy
    # how often to calculate training accuracy
    "evaluate_every_num_epochs": 20,  # small values may hurt performance
    # how many examples to use for calculation of training accuracy
    "evaluate_on_num_examples": 0,  # large values may hurt performance
    # selector config
    # name of the intent for which this response selector is to be trained
    "retrieval_intent": None,
}
Here, retrieval_intent sets the name of the intent for which this response selector model is trained. The default is None.
Output-Example:
{
    "text": "What is the recommend python version to install?",
    "entities": [],
    "intent": {"confidence": 0.6485910906220309, "name": "faq"},
    "intent_ranking": [
        {"confidence": 0.6485910906220309, "name": "faq"},
        {"confidence": 0.1416153159565678, "name": "greet"}
    ],
    "response_selector": {
        "faq": {
            "response": {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
            "ranking": [
                {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
                {"confidence": 0.2134543431, "name": "You can ask me about how to get started"}
            ]
        }
    }
}
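Reading the selected response out of this structure is a matter of following the nested keys. The sketch below uses a shortened copy of the output example; `best_response` is a hypothetical helper written for this article, not a Rasa function.

```python
from typing import Optional

# Shortened copy of the output example above.
parsed = {
    "intent": {"confidence": 0.6485910906220309, "name": "faq"},
    "response_selector": {
        "faq": {
            "response": {
                "confidence": 0.7356462617,
                "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6",
            },
            "ranking": [
                {
                    "confidence": 0.7356462617,
                    "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6",
                },
                {
                    "confidence": 0.2134543431,
                    "name": "You can ask me about how to get started",
                },
            ],
        }
    },
}


def best_response(parsed: dict, retrieval_intent: str) -> Optional[str]:
    """Return the top response text for the given retrieval intent, if any."""
    selector = parsed.get("response_selector", {}).get(retrieval_intent)
    if selector is None:
        return None
    return selector["response"]["name"]


print(best_response(parsed, parsed["intent"]["name"]))
```

The outer key ("faq" here) is the retrieval intent, so a pipeline with several response selectors yields one entry per retrieval intent.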
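5. Tokenizers
5.1 WhitespaceTokenizer
Short: Tokenizer using whitespaces as a separator
Outputs: nothing
Requires: nothing
Description: Creates a token for every whitespace-separated character sequence. Can be used to define tokens for the MITIE entity extractor.
Configuration: If you want to split intents into multiple labels, e.g. for predicting multiple intents or for modeling a hierarchical intent structure, use the intent_split_symbol flag. Case sensitivity can be set with case_sensitive.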
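Both behaviours described above can be sketched with plain string operations. The helpers `tokenize` and `split_intent` are illustrative only, the "+" separator is just an example value for intent_split_symbol, and lowercasing when case_sensitive is False is an assumption about what the flag does.

```python
from typing import List, Tuple


def tokenize(text: str, case_sensitive: bool = True) -> List[Tuple[str, int]]:
    """Split on whitespace and record the start offset of every token."""
    if not case_sensitive:
        text = text.lower()
    tokens = []
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        tokens.append((word, start))
        offset = start + len(word)
    return tokens


def split_intent(intent: str, intent_split_symbol: str) -> List[str]:
    """Split a combined intent name into its individual labels."""
    return intent.split(intent_split_symbol)


print(tokenize("Hello  big world"))
print(tokenize("Hello World", case_sensitive=False))
print(split_intent("feedback+positive", "+"))
```

Keeping the start offsets matters for entity extraction, because entities are reported as character spans in the original message.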
5.2 JiebaTokenizer
Short: Tokenizer using Jieba
Outputs: nothing
Requires: nothing
Description: Tokenizer built for Chinese. For other languages, Jieba works like the WhitespaceTokenizer. Can be used to define tokens for the MITIE entity extractor.
Configuration: Custom user dictionary files can be loaded automatically by specifying the path of the directory that contains them via dictionary_path. For example:
pipeline:
- name: "JiebaTokenizer"
  dictionary_path: "path/to/custom/dictionary/dir"
5.3 MitieTokenizer
Short: Tokenizer using MITIE
Outputs: nothing
Requires: MitieNLP
Description: Creates tokens using the MITIE tokenizer. Can be used to define tokens for the MITIE entity extractor.
Configuration:
pipeline:
- name: "MitieTokenizer"
5.4 SpacyTokenizer
Short: Tokenizer using spaCy
Outputs: nothing
Requires: SpacyNLP
Description: Creates tokens using the spaCy tokenizer. Can be used to define tokens for the MITIE entity extractor.
6. Entity Extractors
6.1 MitieEntityExtractor
Short: MITIE entity extraction (using a MITIE NER trainer)
Outputs: entities
Requires: MitieNLP
Description: Uses MITIE entity extraction to find entities in a message. The underlying classifier is a multi-class linear SVM with a sparse linear kernel and custom features. The MITIE component does not provide entity confidence values.
Configuration:
pipeline:
- name: "MitieEntityExtractor"
Output-Example:
{
    "entities": [{"value": "New York City",
                  "start": 20,
                  "end": 33,
                  "confidence": null,
                  "entity": "city",
                  "extractor": "MitieEntityExtractor"}]
}
6.2 SpacyEntityExtractor
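Short: spaCy entity extraction
Outputs: entities
Requires: SpacyNLP
Description: This component uses spaCy to predict the entities of a message. spaCy uses a statistical BILOU transition model. As of now, this component can only use the built-in spaCy entity extraction models and cannot be retrained. This extractor does not provide any confidence scores.
Configuration: Configure which dimensions, i.e. entity types, the spaCy component should extract. A full list of available dimensions can be found in the spaCy documentation. Leaving the dimensions option unspecified extracts all available dimensions. For example:
pipeline:
- name: "SpacyEntityExtractor"
  # dimensions to extract
  dimensions: ["PERSON", "LOC", "ORG", "PRODUCT"]
Output-Example:
{
    "entities": [{"value": "New York City",
                  "start": 20,
                  "end": 33,
                  "entity": "city",
                  "confidence": null,
                  "extractor": "SpacyEntityExtractor"}]
}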
6.3 EntitySynonymMapper
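Short: Maps synonymous entity values to the same value
Outputs: modifies existing entities found by previous entity extraction components
Requires: nothing
Description: If the training data contains defined synonyms (by using the value attribute on the entity examples), this component makes sure that detected entity values are mapped to the same value. For example, if the training data contains the following examples:
[{
    "text": "I moved to New York City",
    "intent": "inform_relocation",
    "entities": [{"value": "nyc",
                  "start": 11,
                  "end": 24,
                  "entity": "city"
                 }]
},
{
    "text": "I got a new flat in NYC.",
    "intent": "inform_relocation",
    "entities": [{"value": "nyc",
                  "start": 20,
                  "end": 23,
                  "entity": "city"
                 }]
}]
This component maps the entities New York City and NYC to nyc. The entity extraction returns nyc even though the message contains NYC. When this component changes an existing entity, it appends itself to the processor list of that entity.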
6.4 CRFEntityExtractor
Short: Conditional random field entity extraction
Outputs: entities
Requires: a tokenizer
Description: This component uses conditional random fields to do named entity recognition. CRFs can be thought of as an undirected Markov chain where the time steps are words and the states are entity classes. Features of the words (capitalisation, POS tagging, etc.) give probabilities to certain entity classes, as do the transitions between neighbouring entity tags; the most likely set of tags is then calculated and returned. If POS features are used (pos or pos2), spaCy has to be installed. If you want to use additional features, such as pre-trained word embeddings from a dense featurizer, use "text_dense_features". Make sure to set "return_sequence" to True in the corresponding featurizer.
Configuration:
pipeline:
- name: "CRFEntityExtractor"
  # The features are a ``[before, word, after]`` array with
  # before, word, after holding keys about which
  # features to use for each word, for example, ``"title"``
  # in array before will have the feature
  # "is the preceding word in title case?".
  # Available features are:
  # ``low``, ``title``, ``suffix5``, ``suffix3``, ``suffix2``,
  # ``suffix1``, ``pos``, ``pos2``, ``prefix5``, ``prefix2``,
  # ``bias``, ``upper``, ``digit``, ``pattern``, and ``text_dense_features``
  features: [["low", "title"], ["bias", "suffix3"], ["upper", "pos", "pos2"]]
  # The flag determines whether to use BILOU tagging or not. BILOU
  # tagging is more rigorous however
  # requires more examples per entity. Rule of thumb: use only
  # if more than 100 examples per entity.
  BILOU_flag: true
  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  max_iterations: 50
  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  # Specifies the L1 regularization coefficient.
  L1_c: 0.1
  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  # Specifies the L2 regularization coefficient.
  L2_c: 0.1
Output-Example:
{
    "entities": [{"value": "New York City",
                  "start": 20,
                  "end": 33,
                  "entity": "city",
                  "confidence": 0.874,
                  "extractor": "CRFEntityExtractor"}]
}
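The ``[before, word, after]`` feature layout can be illustrated with a small sketch that builds the feature dictionary for one token. It covers only a few string-based features (the POS features need spaCy and are left out), and the key prefixes "-1", "0" and "+1" as well as the helper `token_features` are naming choices made for this example rather than the extractor's actual feature names.

```python
from typing import Callable, Dict, List

# A subset of the available features, implemented as plain string functions.
FEATURE_FUNCTIONS: Dict[str, Callable[[str], object]] = {
    "low": lambda word: word.lower(),
    "title": lambda word: word.istitle(),
    "upper": lambda word: word.isupper(),
    "digit": lambda word: word.isdigit(),
    "suffix3": lambda word: word[-3:],
    "bias": lambda word: "bias",
}


def token_features(tokens: List[str], i: int, features: List[List[str]]) -> dict:
    """Build features for token i from the previous, current and next token."""
    result = {}
    for offset, prefix, names in zip((-1, 0, 1), ("-1", "0", "+1"), features):
        j = i + offset
        if 0 <= j < len(tokens):  # skip neighbours outside the sentence
            for name in names:
                result[f"{prefix}:{name}"] = FEATURE_FUNCTIONS[name](tokens[j])
    return result


features = [["low", "title"], ["bias", "suffix3"], ["upper", "digit"]]
tokens = ["moved", "to", "Berlin"]
print(token_features(tokens, 1, features))
print(token_features(tokens, 0, features))
```

One such dictionary per token, in sentence order, is the usual input shape for a CRF tagger, which then scores tag sequences using both these features and the transitions between neighbouring tags.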
6.5 DucklingHTTPExtractor
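Short: Duckling lets you extract common entities such as dates, amounts of money and distances in a number of languages.
Outputs: entities
Requires: nothing
Description: To use this component you need to run a Duckling server. The easiest option is to spin up a docker container with docker run -p 8000:8000 rasa/duckling. Alternatively, you can install Duckling directly on your machine and start the server. Duckling recognizes dates, numbers, distances and other structured entities and normalizes them. Please be aware that Duckling tries to extract as many entity types as possible without providing a ranking. For example, for the text I will be there in 10 minutes, if you specify both number and time as dimensions for the Duckling component, the component extracts two entities: 10 as a number and in 10 minutes as a time. In such a situation, your application has to decide which entity type is the correct one. The extractor always returns a confidence of 1.0, as it is a rule-based system.
Configuration: Configure which dimensions, i.e. entity types, the Duckling component should extract. A full list of available dimensions can be found in the Duckling documentation. Leaving the dimensions option unspecified extracts all available dimensions. For example:
pipeline:
- name: "DucklingHTTPExtractor"
  # url of the running duckling server
  url: "http://localhost:8000"
  # dimensions to extract
  dimensions: ["time", "number", "amount-of-money", "distance"]
  # allows you to configure the locale, by default the language is
  # used
  locale: "de_DE"
  # if not set the default timezone of Duckling is going to be used
  # needed to calculate dates from relative expressions like "tomorrow"
  timezone: "Europe/Berlin"
  # Timeout for receiving response from http url of the running duckling server
  # if not set the default timeout of duckling http url is set to 3 seconds.
  timeout: 3
Output-Example:
{
    "entities": [{"end": 53,
                  "entity": "time",
                  "start": 48,
                  "value": "2017-04-10T00:00:00.000+02:00",
                  "confidence": 1.0,
                  "extractor": "DucklingHTTPExtractor"}]
}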
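Since Duckling returns overlapping entities without a ranking, the application has to pick one. One possible policy, sketched below, is to keep the longest span among overlapping candidates. This is only one choice among many, `keep_longest` is a hypothetical helper, and the entity values in the example are invented for illustration rather than real Duckling output.

```python
from typing import List


def overlaps(a: dict, b: dict) -> bool:
    """True if the character spans of two entities intersect."""
    return a["start"] < b["end"] and b["start"] < a["end"]


def keep_longest(entities: List[dict]) -> List[dict]:
    """Among overlapping entities keep the longest span; return in text order."""
    kept: List[dict] = []
    by_length = sorted(entities, key=lambda e: e["end"] - e["start"], reverse=True)
    for entity in by_length:
        if not any(overlaps(entity, other) for other in kept):
            kept.append(entity)
    return sorted(kept, key=lambda e: e["start"])


text = "I will be there in 10 minutes"
entities = [
    # Illustrative values; real Duckling values have a different shape.
    {"entity": "number", "start": 19, "end": 21, "value": 10},
    {"entity": "time", "start": 19, "end": 29, "value": "10 minutes"},
]
resolved = keep_longest(entities)
print([(e["entity"], text[e["start"]:e["end"]]) for e in resolved])
```

Other reasonable policies include preferring a fixed order of dimensions or using the predicted intent to decide which entity type the conversation expects.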