(3) Getting Started with ES Basic Concepts, Part 2

1. Analysis and Analyzers

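Analysis (text analysis) is the process of converting full text into a series of terms (tokens); it is also called tokenization.
Analysis is carried out by an Analyzer. You can use one of the analyzers built into ES or define a custom analyzer as needed.
Terms are produced when data is written, and the query string must be analyzed with the same analyzer at search time so that the query terms match the indexed terms.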

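Analyzer components

An analyzer is made up of three parts, applied in order: Character Filters (preprocess the raw text), a Tokenizer (splits the text into terms), and Token Filters (add, remove, or modify the terms).

[Figure: Analyzer]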

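Built-in ES analyzers

Standard Analyzer
- The default analyzer
- Splits text on word boundaries
- Lowercases tokens
Simple Analyzer
- Splits on any non-letter character and discards the non-letter characters
- Lowercases tokens
Whitespace Analyzer
- Splits on whitespace only; does not lowercase
Stop Analyzer
- Behaves like the Simple Analyzer and additionally removes stop words (the, a, is, ...)
Keyword Analyzer
- Does not tokenize; the whole input is emitted as a single term
Chinese analysis (some well-regarded analyzer plugins)
- ICU Analyzer
- IK
  - Supports custom dictionaries, with hot reloading of the dictionary
- THULAC
  - From the natural language processing lab at Tsinghua University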
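
To make the differences between the built-in analyzers concrete, here is a rough Python approximation of four of them. It is only a sketch: the regular expression handles ASCII letters only, the stop word set is a small illustrative subset, and the sample sentence is made up. The real analyzers are implemented in Lucene and handle Unicode properly.

```python
import re

# Illustrative subset only; the real English stop word list is longer.
STOP_WORDS = {"a", "an", "the", "in", "is", "are", "and", "of", "to"}

def simple_analyzer(text):
    # Split on anything that is not an (ASCII) letter, then lowercase.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]

def whitespace_analyzer(text):
    # Split on whitespace only; case and punctuation are preserved.
    return text.split()

def stop_analyzer(text):
    # Simple analyzer plus stop word removal.
    return [t for t in simple_analyzer(text) if t not in STOP_WORDS]

def keyword_analyzer(text):
    # No tokenization: the whole input is a single term.
    return [text]

sample = "2 Quick-Foxes in the Park"
for analyzer in (simple_analyzer, whitespace_analyzer, stop_analyzer, keyword_analyzer):
    print(analyzer.__name__, analyzer(sample))
```

Note how the digit "2" and the hyphen disappear under the simple analyzer but survive under the whitespace analyzer.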

2. Search API

A search request specifies which index (or indices) to query, as in the example in the figure below.

[Figure: Search API]

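URI Search
- Passes the query as URL parameters
- Usage:
  - Use the "q" parameter to specify the query string. The request below searches the movies index for documents whose title is GoldenEye.

GET /movies/_search?q=title:GoldenEye

Request Body Search
- Uses the JSON-based Query Domain Specific Language (DSL) provided by Elasticsearch
- Usage (equivalent to the example above)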

GET /movies/_search
{
 "query":{
   "match": {
     "title": "GoldenEye"
   }
 }
}
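
As a small illustration of how the two styles differ on the wire, the sketch below builds both requests in Python without sending them. The base URL `http://localhost:9200` and the helper names `uri_search` and `body_search` are assumptions made for this example, not part of any Elasticsearch client.

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # assumption: a local single-node cluster

def uri_search(index, field, value):
    # URI Search: the whole query travels in the "q" URL parameter.
    query = urlencode({"q": f"{field}:{value}"})
    return f"{BASE_URL}/{index}/_search?{query}"

def body_search(index, field, value):
    # Request Body Search: the query is a JSON document sent as the body.
    url = f"{BASE_URL}/{index}/_search"
    body = json.dumps({"query": {"match": {field: value}}})
    return url, body

print(uri_search("movies", "title", "GoldenEye"))
print(body_search("movies", "title", "GoldenEye"))
```

The URI form is convenient for quick checks, while the request body form can express everything the Query DSL offers.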

Common kinds of searches

Script fields

GET /movies/_search
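{
  "profile": "true",        // include profiling information in the response
  "size": 20,
  "from": 0,                // from and size together implement pagination
  "script_fields": {        // defines a field named "new field" whose painless script appends 'hello' to year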
    "new field": {
      "script": {
        "lang": "painless",
        "source": "doc['year'].value+'hello'"
      }
    }
  }
}

One of the documents in the response

{
        "_index" : "movies",
        "_type" : "_doc",
        "_id" : "791",
        "_score" : 1.0,
        "fields" : {
          "new field" : [
            "1994hello"
          ]
        }
      }

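match_phrase: phrase search
Searches for a phrase. The query text is still analyzed, but the resulting terms must all appear in the document in the same order. "slop" is the allowed positional offset, i.e. how many extra words may sit between the terms.

GET /movies/_search
{
  "profile":"true",
  "query":{
    "match_phrase": {
      "title": {
        "query": "His Life Music",                // without slop, only the exact phrase "His Life Music" matches;
        "slop": 1                                 // with slop 1, "His Life is Music" matches as well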
      }
    }
  }
}
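
The effect of slop can be sketched with token positions. The function below is a simplified model, not Lucene's actual algorithm: it only considers in-order matches and counts the extra positions between consecutive query terms (Lucene's sloppy phrase matching can also accept reordered terms at a higher cost). The names `tokens` and `phrase_match` are made up for this example.

```python
def tokens(text):
    # Stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()

def phrase_match(doc_tokens, query_tokens, slop=0):
    """True if the query terms occur in order with at most `slop` extra positions in total."""
    def search(term_idx, prev_pos, budget):
        if term_idx == len(query_tokens):
            return True
        for pos, token in enumerate(doc_tokens):
            if token != query_tokens[term_idx]:
                continue
            if prev_pos is None:
                gap = 0                      # the first term may start anywhere
            elif pos > prev_pos:
                gap = pos - prev_pos - 1     # words sitting between the two terms
            else:
                continue                     # out of order: not handled in this sketch
            if gap <= budget and search(term_idx + 1, pos, budget - gap):
                return True
        return False
    return search(0, None, slop)

doc = tokens("His Life is Music")
print(phrase_match(doc, tokens("His Life Music"), slop=0))
print(phrase_match(doc, tokens("His Life Music"), slop=1))
```

With slop 0 the extra word "is" breaks the phrase; with slop 1 the single gap is tolerated.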

match search

GET /movies/_search
{
  "profile": "true",
  "from":0,
  "size": 20, 
  "sort":[{"year":"desc"}],
  "query":{
    "match": {
      "title":{
        "query":"last christmas",              // analyzed into two terms: "last" and "christmas"
        "operator": "or"                       // a document matches if it contains either term
      }   
    }
  }
}

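The shorthand form below gives the same result, because the default operator is or. To control how the terms are combined (for example, to require all of them), use the full form above and set "operator" explicitly.

GET /movies/_search
{
  "profile": "true",
  "from":0,
  "size": 20, 
  "sort":[{"year":"desc"}],
  "query":{
    "match": {
      "title":"last christmas"                // the two terms are combined with or (the default)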
    }
  }
}
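
The difference between the or and and operators is easy to see on a toy inverted index. The following is only a sketch: the three titles are made-up sample data, `analyze` is a stand-in for a real analyzer, and scoring and sorting are ignored entirely.

```python
from collections import defaultdict

def analyze(text):
    # Stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()

def build_index(docs):
    # Inverted index: term -> set of ids of the documents containing it.
    index = defaultdict(set)
    for doc_id, title in docs.items():
        for term in analyze(title):
            index[term].add(doc_id)
    return index

def match(index, query, operator="or"):
    postings = [index.get(term, set()) for term in analyze(query)]
    if not postings:
        return set()
    if operator == "and":
        return set.intersection(*postings)   # every term must be present
    return set.union(*postings)              # any term is enough

docs = {1: "Last Christmas", 2: "The Last Samurai", 3: "White Christmas"}  # hypothetical data
index = build_index(docs)
print(match(index, "last christmas"))                  # or: documents 1, 2 and 3
print(match(index, "last christmas", operator="and"))  # and: document 1 only
```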

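3. Mapping

A Mapping is similar to a schema definition in a database. It is used to:
- Define the names of the fields in an index
- Define the data type of each field
  [Figure: data types]
- Configure how each field is indexed (inverted index settings)
A Mapping maps JSON documents into the flat format that Lucene requires.
A Mapping belongs to a Type of an index:
- Every document belongs to a Type
- Each Type has one Mapping definition
- Since 7.0, the Type no longer needs to be specified in the Mapping definition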

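4. Dynamic Mapping

When a document is written to an index that does not exist, the index is created automatically. Thanks to the dynamic mapping mechanism we do not have to define Mappings by hand: Elasticsearch infers each field's type from the document content. The inference is sometimes wrong, for example for geolocation data.

[Figure: automatic type detection]

The dynamic setting controls how new fields are handled. It has the following three values:

true: new fields can be added and are searchable, because the Mapping is updated as soon as a document containing a new field is written;
false: new fields cannot be searched. The Mapping is not updated, so their data is not indexed, but the values still appear in _source;
strict: writing a document that contains a new field fails with HTTP 400.

The definition of an existing field cannot be changed once data has been written to it, because the inverted index built by Lucene cannot be modified after it is generated. Changing a field definition requires a reindex.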
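
The three dynamic settings can be modeled with a small toy class. This is a simplified sketch, not how Elasticsearch is implemented: `ToyIndex`, `infer_type` and `StrictMappingError` are hypothetical names, and the type inference covers only a few JSON value kinds (real dynamic mapping also handles dates, objects, and gives strings a keyword sub-field).

```python
class StrictMappingError(Exception):
    """Stands in for the HTTP 400 returned when dynamic is strict."""

def infer_type(value):
    # Very rough version of dynamic type detection.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    return "text"

class ToyIndex:
    def __init__(self, dynamic="true", mapping=None):
        self.dynamic = dynamic
        self.mapping = dict(mapping or {})   # field name -> type; only mapped fields are searchable
        self.sources = []                    # every accepted document is kept here, like _source

    def index_doc(self, doc):
        new_fields = {k: infer_type(v) for k, v in doc.items() if k not in self.mapping}
        if new_fields and self.dynamic == "strict":
            raise StrictMappingError(f"unmapped fields: {sorted(new_fields)}")
        if self.dynamic == "true":
            self.mapping.update(new_fields)  # the Mapping grows with the document
        self.sources.append(doc)             # with "false" the value is stored but not indexed

for mode in ("true", "false", "strict"):
    idx = ToyIndex(dynamic=mode, mapping={"title": "text"})
    try:
        idx.index_doc({"title": "GoldenEye", "year": 1995})
        print(mode, "mapping:", idx.mapping, "stored:", len(idx.sources))
    except StrictMappingError as err:
        print(mode, "rejected:", err)
```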

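5. Mapping settings

Let's go straight to the code. The first example configures null_value.

PUT users
{
    "mappings" : {
      "properties" : {
        "firstName" : {
          "type" : "text"
        },
        "lastName" : {
          "type" : "text"
        },
        "mobile" : {
          "type" : "keyword",           // null_value is not supported on text fields, so mobile is a keyword
          "null_value": "NULL"          // index explicit nulls as "NULL" so that they can be searched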
        }
      }
    }
}

PUT users/_doc/1                        // write a document whose mobile is an explicit JSON null
{
  "firstName":"Ruan",
  "lastName": "Yiming",
  "mobile": null
}

GET  users/_search                     // search for the null_value placeholder
{
  "query": {
    "match": {
      "mobile": "NULL"
    }
  }
}

Setting copy_to

PUT users
{
  "mappings": {
    "properties": {
      "firstName":{
        "type": "text",
        "copy_to": "fullName"         // copy the value into fullName; fullName is searchable but does not appear in the document's _source
      },
      "lastName":{
        "type": "text",
        "copy_to": "fullName"
      }
    }
  }
}

PUT users/_doc/1                                 
{
  "firstName":"Ruan",
  "lastName": "Yiming"
}

POST users/_search
{
  "query": {
    "match": {
       "fullName":{
        "query": "Ruan Yiming",
        "operator": "and"
      }
    }
  }
}
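
What copy_to does at index time can be pictured with a few lines of Python. This is a toy model under simple assumptions: `analyze` is a stand-in for the field's analyzer, and `index_with_copy_to` and `match_and` are hypothetical helper names. The point is that the copied terms exist only in the index, while the stored source document is left untouched.

```python
def analyze(text):
    # Stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()

def index_with_copy_to(source, copy_to):
    """Return the indexed terms per field; fields listed in copy_to also feed their target field."""
    indexed = {}
    for field, value in source.items():
        terms = analyze(value)
        indexed.setdefault(field, []).extend(terms)
        target = copy_to.get(field)
        if target:
            indexed.setdefault(target, []).extend(terms)
    return indexed

def match_and(indexed, field, query):
    # A match query with operator "and": every query term must be indexed in the field.
    return all(term in indexed.get(field, []) for term in analyze(query))

source = {"firstName": "Ruan", "lastName": "Yiming"}
indexed = index_with_copy_to(source, {"firstName": "fullName", "lastName": "fullName"})

print(indexed["fullName"])                           # terms from both fields
print(match_and(indexed, "fullName", "Ruan Yiming"))
print("fullName" in source)                          # the source document has no fullName
```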

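Custom Analyzers in a Mapping

Character Filters
Process the text before the Tokenizer by adding, removing, or replacing characters. Some built-in Character Filters:
- HTML strip: removes HTML tags
- mapping: string replacement
- pattern replace: regex-based replacement
Tokenizer
Splits the text into terms (tokens) according to a set of rules. Some built-in Tokenizers:
- whitespace: splits on whitespace, no lowercasing
- standard: splits on word boundaries (lowercasing is done by the lowercase token filter, not by the tokenizer itself)
- uax_url_email: like standard, but keeps URLs and email addresses as single tokens
- pattern: splits on a regular expression
- keyword: no tokenization, emits the input as is
- path hierarchy: splits a path into its hierarchy levels
  You can also implement your own Tokenizer in Java
Token Filters
Add, remove, or modify the terms emitted by the Tokenizer. A few Token Filters:
- lowercase
- stop
- synonym (adds synonyms)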

// Because lowercase runs before stop, "The" becomes "the" and is removed as a stop word
GET _analyze
{
  "char_filter": [                                               // preprocess the text first: remove, replace, or add characters
      {
        "type" : "mapping",
        "mappings" : [ "- => _"]
      }
    ],
  "tokenizer": "whitespace",                                     // split on whitespace
  "filter": ["lowercase","stop","snowball"],                     // lowercase, then remove stop words such as a, an, the, then stem with snowball
  "text": ["The girls in China are playing this game!"]
}

Custom analyzer

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer":{                                      // define a custom analyzer named "my_analyzer"
          "type":"custom",
          "char_filter":["emoticons"],                      // char_filter to use; emoticons is defined below
          "tokenizer":"my_tokenizer",                       // tokenizer to use; my_tokenizer is defined below
          "filter":["lowercase","english_stop"]             // token filters: lowercase is built into ES, english_stop is a custom filter defined below
        }
      },
      "tokenizer": {
        "my_tokenizer":{                                   // define a tokenizer named my_tokenizer of type pattern (regex)
          "type":"pattern",
          "pattern":"[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons":{                                     // define a char_filter named emoticons of type mapping (replacement)
          "type":"mapping",
          "mappings":[":)=>_happy_",":(=>_sad_"]
        }
      },
      "filter": {
        "english_stop":{                                 // define a filter named english_stop of type stop (removes stop words)
          "type":"stop",
          "stopwords":"_english_"
        }
      }
    }
  }
}

Try the custom analyzer on some text

POST  /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I am a :) person,and you?"
}

Output (the stop words "a" and "and" have been removed)

{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "am",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "_happy_",
      "start_offset" : 7,
      "end_offset" : 9,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "person",
      "start_offset" : 10,
      "end_offset" : 16,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "you",
      "start_offset" : 21,
      "end_offset" : 24,
      "type" : "word",
      "position" : 6
    }
  ]
}
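
To see why the output looks this way, the three stages of my_analyzer can be replayed in plain Python. This is an approximation for illustration only: the stop word set below is written out by hand on the assumption that it mirrors the _english_ list, and character offsets are not reproduced, only tokens and positions.

```python
import re

CHAR_MAPPINGS = {":)": "_happy_", ":(": "_sad_"}   # the emoticons char_filter
TOKEN_PATTERN = r"[ .,!?]"                          # the my_tokenizer pattern
# Assumption: hand-written copy of the _english_ stop word list.
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}

def my_analyzer(text):
    # 1. Character filter: replace emoticons before tokenizing.
    for src, dst in CHAR_MAPPINGS.items():
        text = text.replace(src, dst)
    # 2. Tokenizer: split on the pattern and drop empty pieces.
    raw_tokens = [t for t in re.split(TOKEN_PATTERN, text) if t]
    # 3. Token filters: lowercase, then drop stop words. Positions are assigned
    #    before filtering, which is why the surviving positions have gaps.
    result = []
    for position, token in enumerate(raw_tokens):
        token = token.lower()
        if token not in STOPWORDS:
            result.append((token, position))
    return result

print(my_analyzer("I am a :) person,and you?"))
```

The gaps at positions 2 and 5 correspond to the removed stop words "a" and "and", matching the positions in the response above.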