999 - Elasticsearch Analysis 02 - Analyzer

Standard Analyzer

  • The default analyzer, suitable for most languages.
  • Splits text into tokens as defined by the Unicode Text Segmentation algorithm.
  • Removes most punctuation, lowercases tokens, and can optionally remove stop words.

The standard analyzer consists of:

  • Tokenizer
    • Standard Tokenizer
  • Token Filters
    • Standard Token Filter
    • Lower Case Token Filter
    • Stop Token Filter (disabled by default)

Standard Analyzer Example

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

Standard Analyzer Configuration

  • max_token_length: The maximum token length. If a token exceeds this length it is split at max_token_length intervals. Defaults to 255.
  • stopwords: A pre-defined stop words list such as _english_, or an array of stop words. Defaults to _none_.
  • stopwords_path: The path to a file containing stop words.

Examples

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]. Here jumped is split into jumpe and d because it exceeds the max_token_length of 5, and the is removed as an English stop word.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_array_analyzer": {
          "type": "standard",
          "stopwords": ["the","2","quick","brown","foxes","jumped","over","dog's","bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_array_analyzer", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ lazy ]

Simple Analyzer

  • The simple analyzer splits text on any non-letter character and lowercases the tokens.

The simple analyzer consists of:

  • Tokenizer
    • Lower Case Tokenizer

Simple Analyzer Example

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Whitespace Analyzer

  • Splits the text whenever it encounters a whitespace character.

The whitespace analyzer consists of:

  • Tokenizer
    • Whitespace Tokenizer

Whitespace Analyzer Example

POST _analyze
{
  "analyzer": "whitespace"
  , "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

Stop Analyzer

  • Similar to the simple analyzer, but also supports removing stop words. Uses the _english_ stop words list by default.

The stop analyzer consists of:

  • Tokenizer
    • Lower Case Tokenizer
  • Token Filters
    • Stop Token Filter

Stop Analyzer Example

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

Stop Analyzer Configuration

  • stopwords: A pre-defined stop words list such as _english_, or an array of stop words. Defaults to _english_.
  • stopwords_path: The path to a file containing stop words, relative to the Elasticsearch config directory (a sketch using this option follows the example below).

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer":{
          "type": "stop",
          "stopwords":  ["the","2","quick","brown","foxes","jumped","over","dog","s","bone"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ lazy ]
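
The stopwords_path option is not demonstrated above. As a minimal sketch, assuming a file analysis/my_stopwords.txt exists under the Elasticsearch config directory with one stop word per line (both the index name and the file name here are made up for illustration):

PUT my_file_stop_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "type": "stop",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}

POST my_file_stop_index/_analyze
{
  "analyzer": "my_file_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Any token listed in the file is removed after lowercasing, exactly as with an inline stopwords array.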

Keyword Analyzer

  • Does not split the text; the entire input is output as a single token.

The keyword analyzer consists of:

  • Tokenizer
    • Keyword Tokenizer

Keyword Analyzer Example

POST _analyze
{
  "analyzer": "keyword", 
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]

Pattern Analyzer

  • Splits the text with a regular expression; the default pattern is \W+.

The pattern analyzer consists of:

  • Tokenizer
    • Pattern Tokenizer
  • Token Filters
    • Lower Case Token Filter
    • Stop Token Filter (disabled by default)

Pattern Analyzer Example

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Produces [ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Pattern Analyzer Configuration

  • pattern: A Java regular expression. Defaults to \W+.
  • flags: Java regular expression flags; separate multiple flags with |, for example "CASE_INSENSITIVE|COMMENTS" (see the sketch after the example below).
  • lowercase: Whether to lowercase the tokens. Defaults to true.
  • stopwords: A pre-defined stop words list such as _english_, or an array of stop words. Defaults to _none_.
  • stopwords_path: The path to a file containing stop words.

Example

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_analyzer": {
          "type": "pattern",
          "pattern": "\\W|_",
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_pattern_analyzer", 
  "text": "[email protected]"
}

Produces [ john, smith, foo, bar, com ]
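
The flags option is not used above. As a rough sketch (the index name, pattern, and sample text are made up for illustration), CASE_INSENSITIVE lets the split pattern match regardless of letter case; here the analyzer splits on the literal words "and" / "or":

PUT my_flags_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_flags_analyzer": {
          "type": "pattern",
          "pattern": "\\s+(?:and|or)\\s+",
          "flags": "CASE_INSENSITIVE"
        }
      }
    }
  }
}

POST my_flags_index/_analyze
{
  "analyzer": "my_flags_analyzer",
  "text": "Apples AND Oranges or Pears"
}

This should produce [ apples, oranges, pears ]; without the flag, the uppercase AND would not match the pattern, so "apples and oranges" would stay together as one token.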

Fingerprint Analyzer

  • Lowercases the text, normalizes it by removing extended characters (ASCII folding), sorts the tokens, removes duplicates, and concatenates them into a single token.
  • Stop words can also be configured.

The fingerprint analyzer consists of:

  • Tokenizer
    • Standard Tokenizer
  • Token Filters (applied in this order)
    • Lower Case Token Filter
    • ASCII Folding Token Filter
    • Stop Token Filter (disabled by default)
    • Fingerprint Token Filter

Fingerprint Analyzer Example

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces [ and consistent godel is said sentence this yes ]

Fingerprint Analyzer Configuration

  • separator: The character used to join the terms. Defaults to a space.
  • max_output_size: The maximum output token size. If the joined output exceeds this size, the whole output is discarded (it is not truncated). Defaults to 255.
  • stopwords: A pre-defined stop words list such as _english_, or an array of stop words. Defaults to _none_.
  • stopwords_path: The path to a file containing stop words.

Examples

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces [ consistent godel said sentence yes ]

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint",
          "stopwords": "_english_",
          "separator": "-"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces [ consistent-godel-said-sentence-yes ]

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer":{
          "type": "fingerprint",
          "max_output_size": 30
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

Produces nothing: the joined output exceeds max_output_size (30), so the whole output is discarded.

Additional Notes

  • The whitespace analyzer splits only on whitespace, while the standard analyzer extracts words. For example, Brown-Foxes stays as Brown-Foxes after whitespace analysis, but becomes brown and foxes after standard analysis.
  • The simple analyzer splits on any non-letter character, while the standard analyzer does not always do so. For example, dog's becomes dog and s after simple analysis, but stays as dog's after standard analysis.
  • In short: whitespace splits on whitespace, simple splits on any non-letter character, and standard extracts words (possessive forms included), as the side-by-side calls below show.
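
To see these differences side by side, the same short text can be run through all three built-in analyzers:

POST _analyze
{
  "analyzer": "whitespace",
  "text": "Brown-Foxes dog's"
}

Produces [ Brown-Foxes, dog's ]

POST _analyze
{
  "analyzer": "standard",
  "text": "Brown-Foxes dog's"
}

Produces [ brown, foxes, dog's ]

POST _analyze
{
  "analyzer": "simple",
  "text": "Brown-Foxes dog's"
}

Produces [ brown, foxes, dog, s ]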

Custom Analyzer

  • Zero or more character filters
  • Exactly one tokenizer
  • Zero or more token filters

Custom Analyzer Configuration

  • tokenizer: A built-in or custom tokenizer.
  • char_filter: An optional array of built-in or custom character filters.
  • filter: An optional array of built-in or custom token filters.
  • position_increment_gap: When a field value is an array with multiple values, increases the position gap between values to prevent matches from spanning values. Defaults to 100. For example, [ "John Abraham", "Lincoln Smith" ] is tokenized with positions 1, 2, 103, 104 instead of 1, 2, 3, 4, so a phrase cannot match across the two values (see the sketch after this list). See the position_increment_gap part of the Mapping article for more detail.
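
A sketch of position_increment_gap in action (the index name, field name, and document below are hypothetical, and the mapping syntax assumes Elasticsearch 7+, where mappings no longer take a type name). With the default gap of 100 the phrase query would not match, because "Abraham" and "Lincoln" belong to different array values; setting the gap to 0 on the custom analyzer makes the positions contiguous, so the phrase matches across the value boundary:

PUT my_gap_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_gap_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase" ],
          "position_increment_gap": 0
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "names": {
        "type": "text",
        "analyzer": "my_gap_analyzer"
      }
    }
  }
}

PUT my_gap_index/_doc/1?refresh
{
  "names": [ "John Abraham", "Lincoln Smith" ]
}

GET my_gap_index/_search
{
  "query": {
    "match_phrase": {
      "names": "Abraham Lincoln"
    }
  }
}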

Example 1:

  • Character Filter
    • HTML Strip Character Filter
  • Tokenizer
    • Standard Tokenizer
  • Token Filters
    • Lowercase Token Filter
    • ASCII-Folding Token Filter

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter":[
            "html_strip"
            ],
          "filter": [
            "lowercase",
            "asciifolding"
            ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this déjà vu?"
}

Produces [ is, this, deja, vu ]

Example 2

  • Character Filter
    • Mapping Character Filter: replaces :) with _happy_ and :( with _sad_
  • Tokenizer
    • Pattern Tokenizer
  • Token Filters
    • Lowercase Token Filter
    • Stop Token Filter

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [
              "emoticons"
            ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
            ]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
            ]
        }
      },
      "filter": {
        "english_stop":{
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text":     "I'm a :) person, and you?"
}

Produces [ i'm, _happy_, person, you ]
