Rasa Beginner Tutorial: NLU Series (Part 6)

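The Rasa Beginner Tutorial NLU series has six parts. Following the Rasa Beginner Tutorial NLU Series (Part 5), this article covers the sixth part of the NLU series in the Rasa framework: components. It is a reference for the configuration options of every built-in component in Rasa NLU. If you want to build a custom component, see Rasa Explained: Custom NLU Components.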

Contents of this article:

Word Vectors
Featurizers
Intent Classifiers
Selectors
Tokenizers
Entity Extractors

1. Word Vectors

1.1 MitieNLP

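Field	Description
Short:	MITIE initializer
Outputs:	Nothing
Requires:	Nothing
Description:	Initializes the MITIE structures. Every MITIE component relies on this, so it should be placed at the beginning of any pipeline that uses MITIE components.
Configuration:	The MITIE library needs a language model file, which must be specified in the configuration: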

pipeline:
- name: "MitieNLP"
  # language model to load
  model: "data/total_word_feature_extractor.dat"

For more information about this file, see the MITIE installation instructions.

1.2 SpacyNLP

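Field	Description
Short:	spaCy language initializer
Outputs:	Nothing
Requires:	Nothing
Description:	Initializes the spaCy structures. Every spaCy component relies on this, so it should be placed at the beginning of any pipeline that uses spaCy components.
Configuration:	The language model; by default, the model matching the configured language is used. If the spaCy model you want to use has a name that differs from the language tag (for example "en" or "de"), set the model name with the `model` configuration variable. The name is passed to `spacy.load(name)`. Configuration: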

pipeline:
- name: "SpacyNLP"
  # language model to load
  model: "en_core_web_md"

  # when retrieving word vectors, this will decide if the casing
  # of the word is relevant. E.g. `hello` and `Hello` will
  # retrieve the same vector, if set to `false`. For some
  # applications and models it makes sense to differentiate
  # between these two words, therefore setting this to `true`.
  case_sensitive: false

2. Featurizers

2.1 MitieFeaturizer

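Field	Description
Short:	MITIE intent featurizer
Outputs:	Nothing; used as input to intent classifiers that need intent features (e.g. `SklearnIntentClassifier`)
Requires:	MitieNLP
Description:	Creates features for intent classification using the MITIE featurizer.
Configuration:	Configuration: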

pipeline:
- name: "MitieFeaturizer"

2.2 SpacyFeaturizer

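Field	Description
Short:	spaCy intent featurizer
Outputs:	Nothing; used as input to intent classifiers that need intent features (e.g. `SklearnIntentClassifier`)
Requires:	SpacyNLP
Description:	Creates features for intent classification using the spaCy featurizer. Optionally adds the word vector of each `token` to `ner_features`, which can be referenced in `CRFEntityExtractor`.
Configuration:	Configuration: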

pipeline:
- name: "SpacyFeaturizer"
  # Whether to add word vectors to ``ner_features`` (default: False)
  ner_feature_vectors: True

2.3 NGramFeaturizer

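Field	Description
Short:	Appends character n-gram features to the feature vector
Outputs:	Nothing; appends its features to an existing feature vector generated by another intent featurizer
Requires:	SpacyNLP
Description:	This featurizer appends character n-gram features to a feature vector. During training, the component looks for the most common character sequences (e.g. app or ing). Each added feature is a boolean: True if the character sequence is present in the word sequence, False otherwise.
Configuration:	Configuration: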

pipeline:
- name: "NGramFeaturizer"
  # Maximum number of ngrams to use when augmenting
  # feature vectors with character ngrams
  max_number_of_ngrams: 10
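
The idea behind the character n-gram features can be sketched in a few lines of plain Python. This is only an illustration of the description above, not Rasa's implementation; the function names, the fixed n-gram length `n`, and the sample texts are assumptions made for the example.

```python
from collections import Counter


def most_common_char_ngrams(texts, n=3, max_number_of_ngrams=10):
    """Return the most frequent character n-grams of length n found inside words."""
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return [ngram for ngram, _ in counts.most_common(max_number_of_ngrams)]


def ngram_presence_features(text, ngrams):
    """One boolean per n-gram: True if it occurs in any word of the text."""
    words = text.lower().split()
    return [any(ngram in word for word in words) for ngram in ngrams]


# "Training": find the two most common 3-grams in a toy corpus (hypothetical data).
training_texts = ["booking an apple", "opening the app", "apply for parking"]
vocab = most_common_char_ngrams(training_texts, n=3, max_number_of_ngrams=2)

# "Prediction": mark which of those n-grams occur in a new message.
features = ngram_presence_features("stopping by", vocab)
```

In this toy corpus `ing` and `app` each occur three times, so they form the n-gram vocabulary, and the message `stopping by` contains `ing` but not `app`.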

2.4 RegexFeaturizer

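Field	Description
Short:	Creates regular expression features to support intent classification and entity extraction
Outputs:	`text_features` and `tokens.pattern`
Requires:	Nothing
Description:	During training, the regex featurizer creates a list of the regular expressions defined in the training data. For each regex, a feature is set that marks whether the expression was found in the input. The feature is later fed into the intent classifier / entity extractor to simplify classification (assuming the classifier has learned during training that this feature indicates a certain intent). Regex features for entity extraction are currently supported only by the `CRFEntityExtractor` component. A tokenizer must come before this featurizer in the pipeline.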
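
A minimal sketch of what "one feature per regex" can look like, using only the standard `re` module. The pattern list, the function names, and the per-token output format are hypothetical and chosen for illustration; Rasa's own data structures may differ.

```python
import re

# Hypothetical patterns; in Rasa they would come from the training data.
REGEX_FEATURES = [
    {"name": "zipcode", "pattern": r"\b[0-9]{5}\b"},
    {"name": "greet", "pattern": r"\bhey\b"},
]


def regex_features(text, patterns=REGEX_FEATURES):
    """Return one 0/1 flag per pattern, set when the pattern matches the text."""
    return [1 if re.search(p["pattern"], text) else 0 for p in patterns]


def token_patterns(text, patterns=REGEX_FEATURES):
    """For each whitespace token, record which patterns overlap that token."""
    result = []
    for token in re.finditer(r"\S+", text):
        flags = {}
        for p in patterns:
            flags[p["name"]] = any(
                m.start() < token.end() and m.end() > token.start()
                for m in re.finditer(p["pattern"], text)
            )
        result.append((token.group(), flags))
    return result
```

The message-level flags correspond to the features an intent classifier would consume, while the per-token flags are the kind of information an entity extractor such as `CRFEntityExtractor` could use through its `pattern` feature.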

2.5 CountVectorsFeaturizer

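Field	Description
Short:	Creates a bag-of-words representation of user messages and labels (intents and responses)
Outputs:	Nothing; used as input to intent classifiers that need a bag-of-words representation of intent features (e.g. `EmbeddingIntentClassifier`)
Requires:	Nothing
Description:	Creates a bag-of-words representation of user message and label features using sklearn's CountVectorizer. All tokens that consist only of digits (e.g. 123 and 99, but not a123d) are assigned to the same feature.
Configuration:	See sklearn's CountVectorizer docs for a detailed description of the configuration parameters. Because training is performed on a limited vocabulary, the algorithm may encounter unknown words (words not seen during training) at prediction time. To handle unknown words, some words in the training data can be replaced by the generic word `OOV_token` (Out-Of-Vocabulary). During prediction, all unknown words are then treated as this generic word `OOV_token`. For example, you can create an intent `outofscope` that contains messages with varying numbers of `OOV_token`s and possibly some other general words. The algorithm will then likely classify a message with unknown words as the `outofscope` intent. When `use_shared_vocab` is True, user messages and labels share one vocabulary. Configuration: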

pipeline:
- name: "CountVectorsFeaturizer"
  # whether to use a shared vocab
  use_shared_vocab: false
  # whether to use word or character n-grams
  # 'char_wb' creates character n-grams only inside word boundaries
  # n-grams at the edges of words are padded with space.
  analyzer: 'word'  # use 'char' or 'char_wb' for character
  # the parameters are taken from
  # sklearn's CountVectorizer
  # regular expression for tokens
  token_pattern: '(?u)\b\w\w+\b'
  # remove accents during the preprocessing step
  strip_accents: None  # {'ascii', 'unicode', None}
  # list of stop words
  stop_words: None  # string {'english'}, list, or None (default)
  # min document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  min_df: 1  # float in range [0.0, 1.0] or int
  # max document frequency of a word to add to vocabulary
  # float - the parameter represents a proportion of documents
  # integer - absolute counts
  max_df: 1.0  # float in range [0.0, 1.0] or int
  # set ngram range
  min_ngram: 1  # int
  max_ngram: 1  # int
  # limit vocabulary size
  max_features: None  # int or None
  # if convert all characters to lowercase
  lowercase: true  # bool
  # handling Out-Of-Vocabulary (OOV) words
  # will be converted to lowercase if lowercase is true
  OOV_token: None  # string or None
  OOV_words: []  # list of strings
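
To make the OOV handling and the shared digit feature concrete, here is a small pure-Python sketch. It does not use sklearn and is not Rasa's code; the `__NUMBER__` placeholder name, the function names, and the sample sentences are assumptions for illustration. The tokenizer mirrors the `token_pattern` shown in the configuration, so single-character tokens are dropped.

```python
import re

NUMBER_FEATURE = "__NUMBER__"  # hypothetical name for the shared digit feature


def tokenize(text, lowercase=True):
    """Split text into tokens of two or more word characters; collapse digit-only tokens."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"(?u)\b\w\w+\b", text)
    return [NUMBER_FEATURE if t.isdigit() else t for t in tokens]


def build_vocabulary(texts):
    """Sorted list of all tokens seen in the training texts."""
    return sorted({tok for text in texts for tok in tokenize(text)})


def bag_of_words(text, vocabulary, oov_token=None):
    """Count vector over the vocabulary; unknown words are counted as oov_token."""
    index = {tok: i for i, tok in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for tok in tokenize(text):
        if tok not in index and oov_token in index:
            tok = oov_token
        if tok in index:
            vector[index[tok]] += 1
    return vector


# The training data contains the generic word "oov" so that it ends up in the vocabulary.
train = ["book a table for 4 people at 19", "oov oov please", "table for 123"]
vocab = build_vocabulary(train)
vector = bag_of_words("table for 99 unicorns", vocab, oov_token="oov")
```

Here `99` lands on the same feature as `19` and `123`, and the unseen word `unicorns` is counted under `oov`.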

3. Intent Classifiers

3.1 KeywordIntentClassifier

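Field	Description
Short:	Simple keyword-matching intent classifier; not intended for real use.
Outputs:	`intent`
Requires:	Nothing
Description:	This classifier is mostly used as a placeholder. It recognizes intents such as hello and goodbye by searching for these keywords in the passed message.
Output-Example:	See the example below: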

{
    "intent": {"name": "greet", "confidence": 0.98343}
}
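
Keyword matching of this kind can be sketched as follows. The keyword lists, the function name, and the fixed confidence values are hypothetical; the source only names hello and goodbye as example intents.

```python
import re

# Hypothetical keyword lists for two example intents.
INTENT_KEYWORDS = {
    "greet": ["hello", "hi", "hey"],
    "goodbye": ["goodbye", "bye"],
}


def classify_by_keyword(text, intent_keywords=INTENT_KEYWORDS):
    """Return the first intent whose keyword appears as a whole word in the text."""
    for intent, keywords in intent_keywords.items():
        for keyword in keywords:
            pattern = r"\b" + re.escape(keyword) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                return {"intent": {"name": intent, "confidence": 1.0}}
    # No keyword found: report no intent.
    return {"intent": {"name": None, "confidence": 0.0}}
```

Matching on word boundaries keeps `hi` from firing inside words such as `this` or `high`, which is one of the reasons a plain substring search makes a poor classifier.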

3.2 MitieIntentClassifier

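Field	Description
Short:	MITIE intent classifier (uses a text categorizer)
Outputs:	`intent`
Requires:	A tokenizer and a featurizer
Description:	This classifier uses MITIE to perform intent classification.
Configuration:	See the configuration below.
Output-Example:	See the output example below.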

# Configuration
pipeline:
- name: "MitieIntentClassifier"
    
# Output Example
{
    "intent": {"name": "greet", "confidence": 0.98343}
}

3.3 SklearnIntentClassifier

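Field	Description
Short:	sklearn intent classifier
Outputs:	`intent` and `intent_ranking`
Requires:	A featurizer
Description:	The sklearn intent classifier trains a linear SVM that is optimized using a grid search. Like some of the other classifiers, it also provides a ranking of the labels that did not "win". The sklearn intent classifier must be preceded by a featurizer in the pipeline; the featurizer creates the features used for classification.
Configuration:	During SVM training, a hyperparameter search is run to find the best parameter set. In the configuration you can specify the parameters to try. See the configuration below.
Output-Example:	See the output example below.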

# Configuration
pipeline:
- name: "SklearnIntentClassifier"
  # Specifies the list of regularization values to
  # cross-validate over for C-SVM.
  # This is used with the ``kernel`` hyperparameter in GridSearchCV.
  C: [1, 2, 5, 10, 20, 100]
  # Specifies the kernel to use with C-SVM.
  # This is used with the ``C`` hyperparameter in GridSearchCV.
  kernels: ["linear"]

# Output Example
{
    "intent": {"name": "greet", "confidence": 0.78343},
    "intent_ranking": [
        {
            "confidence": 0.1485910906220309,
            "name": "goodbye"
        },
        {
            "confidence": 0.08161531595656784,
            "name": "restaurant_search"
        }
    ]
}

3.4 EmbeddingIntentClassifier

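Field	Description
Short:	Embedding intent classifier
Outputs:	`intent` and `intent_ranking`
Requires:	A featurizer
Description:	The embedding intent classifier embeds user inputs and intent labels into the same space. Supervised embeddings are trained by maximizing the similarity between them. The algorithm is based on StarSpace. In this implementation, however, the loss function is slightly different, and additional hidden layers are added together with dropout. The algorithm also provides similarity rankings of the labels that did not "win". The embedding intent classifier must be preceded by a featurizer in the pipeline; the featurizer creates the features used for the embeddings. `CountVectorsFeaturizer` is recommended.
Configuration:	The following hyperparameters can be set in the configuration; their default values are defined in `EmbeddingIntentClassifier.defaults`.
	Neural network architecture:
	1. `hidden_layers_sizes_a`: list of hidden layer sizes before the embedding layer for user inputs; the number of hidden layers equals the length of the list.
	2. `hidden_layers_sizes_b`: list of hidden layer sizes before the embedding layer for intent labels; the number of hidden layers equals the length of the list.
	3. `share_hidden_layers`: if True, the hidden layers are shared between user inputs and intent labels.
	Training:
	1. `batch_size`: number of training examples in one forward/backward pass; the larger the batch size, the more memory is needed.
	2. `batch_strategy`: type of batching strategy, either `sequence` or `balanced`.
	3. `epochs`: number of passes over the training data, where one `epoch` is one forward and one backward pass over all training examples.
	4. `random_seed`: if set to any int, training gives reproducible results for the same inputs.
	Embedding:
	1. `embed_dim`: dimension of the embedding space.
	2. `num_neg`: number of incorrect intent labels whose similarity to the user input the algorithm minimizes during training.
	3. `similarity_type`: type of the similarity, one of `auto`, `cosine`, `inner`. With `auto`, the similarity is chosen based on `loss_type`: `inner` for `softmax` and `cosine` for `margin`.
	4. `loss_type`: type of the loss function, either `softmax` or `margin`.
	5. `mu_pos`: controls how similar the algorithm should try to make the embedding vectors for correct intent labels; used only if `loss_type` is `margin`.
	6. `mu_neg`: controls the maximum negative similarity for incorrect intent labels; used only if `loss_type` is `margin`.
	7. `use_max_sim_neg`: if True, the algorithm only minimizes the maximum similarity over incorrect intent labels; used only if `loss_type` is `margin`.
	8. `scale_loss`: if True, the loss is scaled down for examples where the correct label is predicted with high confidence; used only if `loss_type` is `softmax`.
	Regularization:
	1. `C2`: scale of the L2 regularization.
	2. `C_emb`: scale of how important it is to minimize the maximum similarity between embeddings of different intent labels.
	3. `droprate`: dropout rate, between 0 and 1; for example, `droprate=0.1` drops 10% of the input units.
	See the configuration below.
Output-Example:	See the output example below.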

# Configuration
defaults = {
    # nn architecture
    # sizes of hidden layers before the embedding layer for input words
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_a": [256, 128],
    # sizes of hidden layers before the embedding layer for intent labels
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_b": [],
    # Whether to share the hidden layer weights between input words and labels
    "share_hidden_layers": False,
    # training parameters
    # initial and final batch sizes - batch size will be
    # linearly increased for each epoch
    "batch_size": [64, 256],
    # how to create batches
    "batch_strategy": "balanced",  # string 'sequence' or 'balanced'
    # number of epochs
    "epochs": 300,
    # set random seed to any int to get reproducible results
    "random_seed": None,
    # embedding parameters
    # dimension size of embedding vectors
    "embed_dim": 20,
    # the number of incorrect labels used as negative examples
    "num_neg": 20,
    # the type of the similarity
    "similarity_type": "auto",  # string 'auto' or 'cosine' or 'inner'
    # the type of the loss function
    "loss_type": "softmax",  # string 'softmax' or 'margin'
    # how similar the algorithm should try
    # to make embedding vectors for correct labels
    "mu_pos": 0.8,  # should be 0.0 < ... < 1.0 for 'cosine'
    # maximum negative similarity for incorrect labels
    "mu_neg": -0.4,  # should be -1.0 < ... < 1.0 for 'cosine'
    # flag: if true, only minimize the maximum similarity for incorrect labels
    "use_max_sim_neg": True,
    # scale loss inverse proportionally to confidence of correct prediction
    "scale_loss": True,
    # regularization parameters
    # the scale of L2 regularization
    "C2": 0.002,
    # the scale of how critical the algorithm should be of minimizing the
    # maximum similarity between embeddings of different labels
    "C_emb": 0.8,
    # dropout rate for rnn
    "droprate": 0.2,
    # visualization of accuracy
    # how often to calculate training accuracy
    "evaluate_every_num_epochs": 20,  # small values may hurt performance
    # how many examples to use for calculation of training accuracy
    "evaluate_on_num_examples": 0,  # large values may hurt performance
}

# Output Example
{
    "intent": {"name": "greet", "confidence": 0.8343},
    "intent_ranking": [
        {
            "confidence": 0.385910906220309,
            "name": "goodbye"
        },
        {
            "confidence": 0.28161531595656784,
            "name": "restaurant_search"
        }
    ]
}
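
The interaction between `similarity_type`, `loss_type`, `mu_pos`, `mu_neg` and `use_max_sim_neg` can be illustrated with a simplified, framework-free sketch. This is an assumption-laden approximation for a single training example, not Rasa's TensorFlow implementation: the function names are hypothetical, the exact form of the hinge terms is assumed, and the `C_emb` and L2 regularization terms are left out.

```python
def resolve_similarity_type(similarity_type, loss_type):
    """Resolve 'auto' to 'inner' for softmax loss and 'cosine' for margin loss."""
    if similarity_type != "auto":
        return similarity_type
    return "inner" if loss_type == "softmax" else "cosine"


def margin_loss(sim_pos, sim_negs, mu_pos=0.8, mu_neg=-0.4, use_max_sim_neg=True):
    """Hinge-style loss for one example, given cosine similarities.

    sim_pos:  similarity between the input and the correct label
    sim_negs: similarities between the input and the sampled incorrect labels
    """
    # Penalize the correct label for being less similar than mu_pos.
    loss = max(0.0, mu_pos - sim_pos)
    if use_max_sim_neg:
        # Only the most similar incorrect label contributes.
        loss += max(0.0, mu_neg + max(sim_negs))
    else:
        # Every incorrect label that is too similar contributes.
        loss += sum(max(0.0, mu_neg + s) for s in sim_negs)
    return loss
```

With the default `mu_neg` of -0.4, an incorrect label in this sketch starts to be penalized once its similarity to the input exceeds 0.4.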

4. Selectors

4.1 Response Selector

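Field	Description
Short:	Response selector
Outputs:	A dictionary keyed by `direct_response_intent` whose value contains `response` and `ranking`
Requires:	A featurizer
Description:	The response selector component can be used to build a response retrieval model that predicts a bot response directly from a set of candidate responses. The prediction of this model is used by the retrieval action. It embeds user inputs and response labels into the same space and follows exactly the same neural network architecture and optimization as `EmbeddingIntentClassifier`. The response selector must be preceded by a featurizer in the pipeline; the featurizer creates the features used for the embeddings. `CountVectorsFeaturizer` is recommended, optionally preceded by `SpacyNLP`.
Configuration:	The algorithm includes all the hyperparameters that `EmbeddingIntentClassifier` uses. In addition, the component can be configured to train a response selector for one particular retrieval intent: `retrieval_intent` sets the name of the intent for which this response selector model is trained; the default is `None`. These parameters can be set in the configuration; their defaults are defined in `ResponseSelector.defaults`. See the configuration below.
Output-Example:	See the output example below.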

# Configuration
defaults = {
    # nn architecture
    # sizes of hidden layers before the embedding layer for input words
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_a": [256, 128],
    # sizes of hidden layers before the embedding layer for intent labels
    # the number of hidden layers is thus equal to the length of this list
    "hidden_layers_sizes_b": [256, 128],
    # Whether to share the hidden layer weights between input words and intent labels
    "share_hidden_layers": False,
    # training parameters
    # initial and final batch sizes - batch size will be
    # linearly increased for each epoch
    "batch_size": [64, 256],
    # how to create batches
    "batch_strategy": "balanced",  # string 'sequence' or 'balanced'
    # number of epochs
    "epochs": 300,
    # set random seed to any int to get reproducible results
    "random_seed": None,
    # embedding parameters
    # dimension size of embedding vectors
    "embed_dim": 20,
    # the number of incorrect labels used as negative examples
    "num_neg": 20,
    # the type of the similarity
    "similarity_type": "auto",  # string 'auto' or 'cosine' or 'inner'
    # the type of the loss function
    "loss_type": "softmax",  # string 'softmax' or 'margin'
    # how similar the algorithm should try
    # to make embedding vectors for correct intent labels
    "mu_pos": 0.8,  # should be 0.0 < ... < 1.0 for 'cosine'
    # maximum negative similarity for incorrect intent labels
    "mu_neg": -0.4,  # should be -1.0 < ... < 1.0 for 'cosine'
    # flag: if true, only minimize the maximum similarity for
    # incorrect intent labels
    "use_max_sim_neg": True,
    # scale loss inverse proportionally to confidence of correct prediction
    "scale_loss": True,
    # regularization parameters
    # the scale of L2 regularization
    "C2": 0.002,
    # the scale of how critical the algorithm should be of minimizing the
    # maximum similarity between embeddings of different intent labels
    "C_emb": 0.8,
    # dropout rate for rnn
    "droprate": 0.2,
    # visualization of accuracy
    # how often to calculate training accuracy
    "evaluate_every_num_epochs": 20,  # small values may hurt performance
    # how many examples to use for calculation of training accuracy
    "evaluate_on_num_examples": 0,  # large values may hurt performance
    # selector config
    # name of the intent for which this response selector is to be trained
    "retrieval_intent": None,
}

# Output Example
{
    "text": "What is the recommend python version to install?",
    "entities": [],
    "intent": {"confidence": 0.6485910906220309, "name": "faq"},
    "intent_ranking": [
        {"confidence": 0.6485910906220309, "name": "faq"},
        {"confidence": 0.1416153159565678, "name": "greet"}
    ],
    "response_selector": {
      "faq": {
        "response": {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
        "ranking": [
            {"confidence": 0.7356462617, "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6"},
            {"confidence": 0.2134543431, "name": "You can ask me about how to get started"}
        ]
      }
    }
}
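
Reading the selected response out of a parse result shaped like the example above is plain dictionary access. The helper name `best_response` is hypothetical, and the sample dictionary is a trimmed copy of the output example.

```python
def best_response(parse_result, retrieval_intent):
    """Return (response_text, confidence) for a retrieval intent, or None if absent."""
    selector_output = parse_result.get("response_selector", {})
    prediction = selector_output.get(retrieval_intent)
    if prediction is None:
        return None
    response = prediction["response"]
    return response["name"], response["confidence"]


# Trimmed copy of the output example above.
parse_result = {
    "intent": {"confidence": 0.6485910906220309, "name": "faq"},
    "response_selector": {
        "faq": {
            "response": {
                "confidence": 0.7356462617,
                "name": "Supports 3.5, 3.6 and 3.7, recommended version is 3.6",
            },
            "ranking": [],
        }
    },
}

answer = best_response(parse_result, "faq")
```

Returning `None` for an unknown retrieval intent lets the caller decide on a fallback instead of raising a `KeyError`.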

5. Tokenizers

5.1 WhitespaceTokenizer

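Field	Description
Short:	Tokenizer that uses whitespace as the separator
Outputs:	Nothing
Requires:	Nothing
Description:	Creates a token for every whitespace-separated character sequence. Can be used to define tokens for the MITIE entity extractor.
Configuration:	If you want to split intents into multiple labels, e.g. to predict multiple intents or to model a hierarchical intent structure, use `intent_split_symbol`. It sets the delimiter string used to split the intent and response labels; the default is whitespace. To make the tokenizer case-insensitive, add the `case_sensitive: false` option; the default is `case_sensitive: true`. Configuration: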

pipeline:
- name: "WhitespaceTokenizer"
  case_sensitive: false
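
As a rough illustration of both options, the sketch below tokenizes on whitespace while keeping character offsets, and splits an intent label on a delimiter. The function names and the `+` delimiter in the usage note are assumptions for the example, not Rasa's API.

```python
import re


def whitespace_tokenize(text, case_sensitive=True):
    """Split on whitespace and keep each token's starting character offset."""
    if not case_sensitive:
        text = text.lower()
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def split_intent(intent, intent_split_symbol=" "):
    """Split an intent label into sub-labels on the given delimiter."""
    return intent.split(intent_split_symbol)
```

For example, `split_intent("feedback+positive", "+")` yields the two labels `feedback` and `positive`, while a label without the delimiter stays a single label.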

5.2 JiebaTokenizer

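Field	Description
Short:	Jieba tokenizer for Chinese
Outputs:	Nothing
Requires:	Nothing
Description:	Creates tokens using the Jieba tokenizer, which is designed specifically for Chinese. For languages other than Chinese, Jieba works like `WhitespaceTokenizer`. Can be used to define tokens for the MITIE entity extractor. Install Jieba with `pip install jieba`.
Configuration:	Custom dictionary files are loaded automatically when the path of the directory that contains them is set in `dictionary_path`. Configuration: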

pipeline:
- name: "JiebaTokenizer"
  dictionary_path: "path/to/custom/dictionary/dir"

5.3 MitieTokenizer

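Field	Description
Short:	Tokenizer that uses MITIE
Outputs:	Nothing
Requires:	MitieNLP
Description:	Creates tokens using the MITIE tokenizer. Can be used to define tokens for the MITIE entity extractor.
Configuration:	Configuration: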

pipeline:
- name: "MitieTokenizer"

5.4 SpacyTokenizer

Field	Description
Short:	Tokenizer that uses spaCy
Outputs:	Nothing
Requires:	SpacyNLP
Description:	Creates tokens using the spaCy tokenizer. Can be used to define tokens for the MITIE entity extractor.

6. Entity Extractors

6.1 MitieEntityExtractor

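Field	Description
Short:	MITIE entity extractor (uses MITIE NER training)
Outputs:	Appends `entities`
Requires:	MitieNLP
Description:	Uses the MITIE entity extractor to find entities in a message. The underlying classifier is a multi-class linear SVM with a sparse linear kernel and custom features. This extractor does not provide entity confidence values.
Configuration:	See the configuration below.
Output-Example:	See the output example below.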

# Configuration
pipeline:
- name: "MitieEntityExtractor"

# Output Example
{
    "entities": [{"value": "New York City",
                  "start": 20,
                  "end": 33,
                  "confidence": null,
                  "entity": "city",
                  "extractor": "MitieEntityExtractor"}]
}

6.2 SpacyEntityExtractor

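Field	Description
Short:	spaCy entity extractor
Outputs:	Appends `entities`
Requires:	SpacyNLP
Description:	Using spaCy, this component extracts the entities of a message. spaCy uses a statistical BILOU transition model. At present the component can only use spaCy's built-in entity extraction models and cannot be retrained. This extractor does not provide entity confidence values.
Configuration:	Configure which dimensions (i.e. entity types) the spaCy component should extract. A full list of the available dimensions can be found in the spaCy documentation. If the dimensions option is not specified, all available dimensions are extracted. See the configuration below.
Output-Example:	See the output example below.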

# Configuration
pipeline:
- name: "SpacyEntityExtractor"
  # dimensions to extract
  dimensions: ["PERSON", "LOC", "ORG", "PRODUCT"]

# Output Example
{
    "entities": [{"value": "New York City",
                  "start": 20,
                  "end": 33,
                  "entity": "city",
                  "confidence": null,
                  "extractor": "SpacyEntityExtractor"}]
}

6.3 EntitySynonymMapper

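Field	Description
Short:	Maps synonymous entity values to the same value
Outputs:	Modifies existing entities found by previous entity extractors
Requires:	Nothing
Description:	If the training data contains defined synonyms (set with the `value` attribute on the entity examples), this component makes sure that detected entity values are mapped to the same value. For example, with the training examples listed below, this component maps the entities `New York City` and `NYC` to `nyc`. The entity extraction returns `nyc` even though the message contains `NYC`. When this component changes an existing entity, it appends itself to the processor list of that entity.
Output-Example:	The training examples referred to in the description: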

[{
  "text": "I moved to New York City",
  "intent": "inform_relocation",
  "entities": [{"value": "nyc",
                "start": 11,
                "end": 24,
                "entity": "city"
               }]
},
{
  "text": "I got a new flat in NYC.",
  "intent": "inform_relocation",
  "entities": [{"value": "nyc",
                "start": 20,
                "end": 23,
                "entity": "city"
               }]
}]
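
The mapping step can be sketched as two small functions: one collects synonyms from training examples like the ones above, the other rewrites extracted entities. The function names, the lowercased lookup key, and the `processors` key are assumptions for illustration rather than Rasa's exact implementation.

```python
def collect_synonyms(training_examples):
    """Map each lowercased surface form to the canonical value from the training data."""
    synonyms = {}
    for example in training_examples:
        for entity in example.get("entities", []):
            surface = example["text"][entity["start"]:entity["end"]]
            if surface != entity["value"]:
                synonyms[surface.lower()] = entity["value"]
    return synonyms


def apply_synonyms(entities, synonyms, component_name="EntitySynonymMapper"):
    """Replace known surface forms in place and record the component as a processor."""
    for entity in entities:
        key = str(entity["value"]).lower()
        if key in synonyms:
            entity["value"] = synonyms[key]
            entity.setdefault("processors", []).append(component_name)
    return entities


training_examples = [
    {"text": "I moved to New York City",
     "entities": [{"value": "nyc", "start": 11, "end": 24, "entity": "city"}]},
    {"text": "I got a new flat in NYC.",
     "entities": [{"value": "nyc", "start": 20, "end": 23, "entity": "city"}]},
]

synonyms = collect_synonyms(training_examples)
mapped = apply_synonyms([{"value": "NYC", "entity": "city"}], synonyms)
```

After the call, the extracted entity carries the value `nyc` and lists `EntitySynonymMapper` among its processors.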

6.4 CRFEntityExtractor

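Field	Description
Short:	Conditional random field entity extractor
Outputs:	Appends `entities`
Requires:	A tokenizer
Description:	This component performs named entity recognition with conditional random fields (CRF). A CRF can be thought of as an undirected Markov chain in which the time steps are words and the states are entity classes. Features of the words (capitalization, POS tags, etc.) give probabilities to certain entity classes, as do transitions between neighbouring entity tags; the most likely set of tags is then calculated and returned. If POS features are used (`pos` or `pos2`), spaCy must be installed. To use custom features provided by featurizers, use `"ner_features"`.
Configuration:	See the configuration below.
Output-Example:	See the output example below.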

# Configuration
pipeline:
- name: "CRFEntityExtractor"
  # The features are a ``[before, word, after]`` array with
  # before, word, after holding keys about which
  # features to use for each word, for example, ``"title"``
  # in array before will have the feature
  # "is the preceding word in title case?".
  # Available features are:
  # ``low``, ``title``, ``suffix5``, ``suffix3``, ``suffix2``,
  # ``suffix1``, ``pos``, ``pos2``, ``prefix5``, ``prefix2``,
  # ``bias``, ``upper``, ``digit``, ``pattern``, and ``ner_features``
  features: [["low", "title"], ["bias", "suffix3"], ["upper", "pos", "pos2"]]

  # The flag determines whether to use BILOU tagging or not. BILOU
  # tagging is more rigorous however
  # requires more examples per entity. Rule of thumb: use only
  # if more than 100 examples per entity.
  BILOU_flag: true

  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  max_iterations: 50

  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  # Specifies the L1 regularization coefficient.
  L1_c: 0.1

  # This is the value given to sklearn_crfsuite.CRF tagger before training.
  # Specifies the L2 regularization coefficient.
  L2_c: 0.1

# Output Example
{
    "entities": [{"value":"New York City",
                  "start": 20,
                  "end": 33,
                  "entity": "city",
                  "confidence": 0.874,
                  "extractor": "CRFEntityExtractor"}]
}
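
How a `[before, word, after]` feature specification turns into per-token features can be sketched without any CRF library. The feature key format (`-1:low`, `0:bias`, `+1:upper`), the `BOS`/`EOS` markers, and the function names are assumptions for illustration; only a subset of the listed features is implemented, and the POS features are omitted because they need spaCy.

```python
def word_features(word, names, prefix):
    """Compute the requested features for one word, keyed as '<prefix>:<name>'."""
    available = {
        "low": lambda w: w.lower(),
        "title": lambda w: w.istitle(),
        "upper": lambda w: w.isupper(),
        "digit": lambda w: w.isdigit(),
        "suffix3": lambda w: w[-3:],
        "prefix2": lambda w: w[:2],
        "bias": lambda w: "bias",
    }
    return {f"{prefix}:{name}": available[name](word) for name in names}


def sentence_features(tokens, features):
    """Build one feature dict per token from a [before, word, after] specification."""
    before, current, after = features
    rows = []
    for i, token in enumerate(tokens):
        row = word_features(token, current, "0")
        if i > 0:
            row.update(word_features(tokens[i - 1], before, "-1"))
        else:
            row["BOS"] = True  # beginning of sentence: no preceding word
        if i < len(tokens) - 1:
            row.update(word_features(tokens[i + 1], after, "+1"))
        else:
            row["EOS"] = True  # end of sentence: no following word
        rows.append(row)
    return rows


spec = [["low", "title"], ["bias", "suffix3"], ["upper"]]
rows = sentence_features(["Fly", "to", "NYC"], spec)
```

A list of such dictionaries per sentence is the kind of input a CRF tagger is trained on, together with one entity tag per token.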

6.5 DucklingHTTPExtractor

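Field	Description
Short:	Duckling entity extractor, commonly used to extract dates, amounts of money, distances, and other entities in many languages.
Outputs:	Appends `entities`
Requires:	Nothing
Description:	To use this component you need to run a duckling server. The easiest way is to start it in a Docker container with `docker run -p 8000:8000 rasa/duckling`. Alternatively, install duckling directly and then start the server. Duckling recognizes dates, numbers, distances, and other structured entities and normalizes them. Duckling tries to extract as many entity types as possible without providing a ranking. For example, if you specify both `number` and `time` as dimensions for the duckling component, the component extracts two entities from the text `I will be there in 10 minutes`: `10` as a number and `in 10 minutes` as a time. In that case your application has to decide which entity type is the correct one. The confidence value of this extractor is always 1.0 because it is a rule-based system.
Configuration:	Configure which dimensions (i.e. entity types) the Duckling component should extract. A full list of the available dimensions can be found in the duckling documentation. If the dimensions option is not specified, all available dimensions are extracted. See the configuration below.
Output-Example:	See the output example below.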

# Configuration
pipeline:
- name: "DucklingHTTPExtractor"
  # url of the running duckling server
  url: "http://localhost:8000"
  # dimensions to extract
  dimensions: ["time", "number", "amount-of-money", "distance"]
  # allows you to configure the locale, by default the language is
  # used
  locale: "de_DE"
  # if not set the default timezone of Duckling is going to be used
  # needed to calculate dates from relative expressions like "tomorrow"
  timezone: "Europe/Berlin"
  # Timeout for receiving response from http url of the running duckling server
  # if not set the default timeout of duckling http url is set to 3 seconds.
  timeout: 3

# Output Example
{
    "entities": [{"end": 53,
                  "entity": "time",
                  "start": 48,
                  "value": "2019-11-24T00:00:00.000+02:00",
                  "confidence": 1.0,
                  "extractor": "DucklingHTTPExtractor"}]
}
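
Because Duckling returns overlapping entities without a ranking, the application has to pick one. One possible policy, shown here purely as an assumption and not as something Rasa or Duckling prescribes, is to keep the longest span and drop everything that overlaps it. The function name and the simplified entity values are hypothetical.

```python
def keep_longest_spans(entities):
    """Keep the longest entities and drop any entity that overlaps one already kept."""
    ordered = sorted(entities, key=lambda e: e["end"] - e["start"], reverse=True)
    kept = []
    for entity in ordered:
        overlaps = any(
            entity["start"] < other["end"] and other["start"] < entity["end"]
            for other in kept
        )
        if not overlaps:
            kept.append(entity)
    # Return the survivors in reading order.
    return sorted(kept, key=lambda e: e["start"])


text = "I will be there in 10 minutes"
entities = [
    {"entity": "number", "start": 19, "end": 21, "value": "10"},
    {"entity": "time", "start": 16, "end": 29, "value": "in 10 minutes"},
]
resolved = keep_longest_spans(entities)
```

With this policy only the `time` entity survives for the example sentence; an application that cares about the raw number would need a different rule.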

Note: please credit the source when reposting.