Elasticsearch from Basics to Depth (11): Index Management

Basic index operations

  • Create an index
    PUT /{index}
    {
      "settings": {},
      "mappings": {
        "properties": {
        }
      }
    }

    Example of creating an index:

    PUT /my_index
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      },
      "mappings": {
        "my_type":{
          "properties": {
            "my_field":{
              "type": "text"
            }
          }
        }
      }
    }
  • Update index settings
    PUT /{index}/_settings
    {
        "settings": {}
    }
    
    PUT /my_index/_settings
    {
      "settings": {
        "number_of_replicas": 1
      }
    }
  • Delete an index
    DELETE /{index}

    Examples

    DELETE /my_index
    DELETE /index_one,index_two
    DELETE /index_*
    DELETE /_all

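    The delete index API can also be applied to several indices at once by passing a comma-separated list, or to every index by using _all or * as the index name (be careful!).
    To disable deleting indices via wildcards or _all, set action.destructive_requires_name to true in elasticsearch.yml. The setting can also be changed through the cluster update settings API.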
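
As a rough illustration of which DELETE targets action.destructive_requires_name is meant to block, the Python sketch below classifies a target as explicit or broad. The helper name is_broad_delete is hypothetical and the check is a simplification, not Elasticsearch's actual implementation.

```python
def is_broad_delete(target: str) -> bool:
    """Return True when a DELETE target uses a wildcard or _all.

    Hypothetical helper: a simplified stand-in for the kind of request
    that action.destructive_requires_name=true is meant to reject.
    """
    names = [name.strip() for name in target.split(",")]
    return any(name == "_all" or "*" in name for name in names)


# The four DELETE examples from the text above.
for target in ["my_index", "index_one,index_two", "index_*", "_all"]:
    kind = "broad" if is_broad_delete(target) else "explicit"
    print(f"DELETE /{target} -> {kind}")
```

A check like this could run in a deployment script before any DELETE request is sent, as an extra guard alongside the cluster setting.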

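Changing the analyzer and defining your own

Elasticsearch ships with a variety of built-in analyzers that can be used in any index without further configuration: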

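standard analyzer:
The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
simple analyzer:
The simple analyzer divides text into terms whenever it encounters a character that is not a letter, then lowercases all terms.
whitespace analyzer:
The whitespace analyzer divides text into terms whenever it encounters a whitespace character. It does not lowercase terms.
stop analyzer:
The stop analyzer is like the simple analyzer, but also supports removing stop words.
keyword analyzer:
The keyword analyzer is a "noop" analyzer that accepts whatever text it is given and outputs exactly the same text as a single term. In other words, the text is not tokenized, which suits exact matching.
pattern analyzer:
The pattern analyzer uses a regular expression to split text into terms. It supports lowercasing and stop words.
language analyzers:
Elasticsearch provides many language-specific analyzers, such as english or french.
fingerprint analyzer:
The fingerprint analyzer is a specialist analyzer that creates a fingerprint which can be used for duplicate detection.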
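
To make the differences between a few of these analyzers easier to see, here is a small Python sketch that imitates the simple, whitespace and keyword analyzers. The function names are made up for illustration, and the regular expression only approximates Lucene's notion of a "letter".

```python
import re


def simple_analyze(text):
    # Split on anything that is not a letter, then lowercase every term.
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def whitespace_analyze(text):
    # Split on whitespace only; case and punctuation are left untouched.
    return text.split()


def keyword_analyze(text):
    # No tokenization at all: the whole input becomes one term.
    return [text]


sample = "The 2 QUICK Brown-Foxes"
print(simple_analyze(sample))
print(whitespace_analyze(sample))
print(keyword_analyze(sample))
```

The same input yields four lowercase terms, four case-preserving terms, and one single term respectively, which is the practical difference when choosing between them.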

Changing analyzer settings

  • Enable the English stop-word token filter
    PUT /my_index
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "es_std":{
              "type":"standard",
              "stopwords":"_english_"
            }
          }
        }
      }
    }
  • Test the analyzer
    With the stock standard analyzer
    # standard analyzer
    GET /my_index/_analyze
    {
      "analyzer": "standard",
      "text": "a dog is in the house"
    }
    {
      "tokens": [
        {
          "token": "a",
          "start_offset": 0,
          "end_offset": 1,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "dog",
          "start_offset": 2,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "is",
          "start_offset": 6,
          "end_offset": 8,
          "type": "<ALPHANUM>",
          "position": 2
        },
        {
          "token": "in",
          "start_offset": 9,
          "end_offset": 11,
          "type": "<ALPHANUM>",
          "position": 3
        },
        {
          "token": "the",
          "start_offset": 12,
          "end_offset": 15,
          "type": "<ALPHANUM>",
          "position": 4
        },
        {
          "token": "house",
          "start_offset": 16,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 5
        }
      ]
    }

    With the es_std analyzer, which has English stop words enabled

    # es_std analyzer (English stop words)
    GET /my_index/_analyze
    {
      "analyzer": "es_std",
      "text": "a dog is in the house"
    }
    {
      "tokens": [
        {
          "token": "dog",
          "start_offset": 2,
          "end_offset": 5,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "house",
          "start_offset": 16,
          "end_offset": 21,
          "type": "<ALPHANUM>",
          "position": 5
        }
      ]
    }
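
The gaps in the position values above (1 and 5) come from the stop filter dropping terms while keeping the positions of the survivors. The Python sketch below imitates that behaviour. The tokenizer is a plain \w+ regex rather than the real standard tokenizer, and the stop-word set is assumed to match Lucene's default English list.

```python
import re

# Assumed to mirror the default _english_ stop-word list.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def analyze_with_stop_words(text, stop_words=ENGLISH_STOP_WORDS):
    tokens = []
    # Positions are assigned before filtering, so removed terms leave gaps.
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        if term in stop_words:
            continue
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for token in analyze_with_stop_words("a dog is in the house"):
    print(token)
```

Keeping the original positions is what lets phrase queries still account for the distance between the remaining terms.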

Defining your own analyzer

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and":{
          "type":"mapping",
          "mappings":["&=>and"]
        }
      },
      "filter": {
        "my_stopwords":{
          "type":"stop",
          "stopwords":["the","a"]
        }
      },
      "analyzer": {
        "my_analyzer":{
          "type": "custom",
          "char_filter":["html_strip","&_to_and"],
          "tokenizer":"standard",
          "filter":["lowercase", "my_stopwords"]
        }
      }
    }
  }
}

Test the analyzer

GET /my_index/_analyze
{
  "text": "tom&jerry are a friend in the house, , HAHA!!",
  "analyzer": "my_analyzer"
}
{
  "tokens": [
    {
      "token": "tomandjerry",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "are",
      "start_offset": 10,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "friend",
      "start_offset": 16,
      "end_offset": 22,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "in",
      "start_offset": 23,
      "end_offset": 25,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "house",
      "start_offset": 30,
      "end_offset": 35,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "haha",
      "start_offset": 42,
      "end_offset": 46,
      "type": "<ALPHANUM>",
      "position": 7
    }
  ]
}
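
The order in which a custom analyzer runs (char filters, then the tokenizer, then token filters) can be sketched in plain Python. This is only an approximation of my_analyzer: the HTML stripping and the tokenizer are crude regex stand-ins for html_strip and the standard tokenizer, and the function name is made up for illustration.

```python
import re

MY_STOPWORDS = {"the", "a"}


def my_analyzer_sketch(text):
    # 1. Char filters run on the raw text, in the order they are listed.
    text = re.sub(r"<[^>]+>", "", text)  # rough stand-in for html_strip
    text = text.replace("&", "and")      # the "&_to_and" mapping filter
    # 2. The tokenizer splits the filtered text into terms.
    terms = re.findall(r"\w+", text)     # rough stand-in for "standard"
    # 3. Token filters run on each term, again in listed order.
    terms = [term.lower() for term in terms]
    return [term for term in terms if term not in MY_STOPWORDS]


print(my_analyzer_sketch("tom&jerry are a friend in the house, , HAHA!!"))
```

Because the & replacement happens before tokenization, "tom&jerry" becomes the single term "tomandjerry" instead of being split at the ampersand.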

Apply the custom analyzer to a field

PUT /my_index/_mapping/my_type
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

A closer look at the mapping root object

  • root object
    This is the mapping JSON for a given type. It contains properties, metadata (_id, _source, _type), settings (such as analyzer), and other settings (such as include_in_all).
    PUT /my_index
    {
      "mappings": {
        "my_type": {
          "properties": {}
        }
      }
    }
  • properties

    PUT /my_index/_mapping/my_type
    {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
  • _source

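    Benefits

    (1) A query returns the complete document directly; there is no need to fetch the document id first and then send a second request for the document
    (2) Partial update is built on _source
    (3) Reindexing works directly from _source, with no need to query the data from a database (or other external store) and modify it
    (4) The returned fields can be customized based on _source
    (5) Debugging queries is easier because _source is directly visible

    If none of these benefits are needed, _source can be disabled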

    PUT /my_index/_mapping/my_type2
    {
      "_source": {"enabled": false}
    }
  • _all
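    Bundles the values of all fields into a single _all field and indexes it. A search that does not specify any field runs against the _all field. It can be disabled per type: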
    PUT /my_index/_mapping/my_type3
    {
      "_all": {"enabled": false}
    }

    include_in_all can also be set at the field level to control whether that field's value is included in the _all field

    PUT /my_index/_mapping/my_type4
    {
      "properties": {
        "my_field": {
          "type": "text",
          "include_in_all": false
        }
      }
    }
  • Identity metadata
    _index,_type,_id
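
To make the _all behaviour described above more concrete, here is a conceptual Python sketch of how the _all value could be assembled from a document while honouring include_in_all. The function build_all_field is hypothetical. Elasticsearch builds _all internally at index time, so this only models the idea.

```python
def build_all_field(document, mapping_properties):
    """Concatenate field values, skipping fields with include_in_all=false."""
    values = []
    for field, value in document.items():
        field_mapping = mapping_properties.get(field, {})
        # Fields are included unless the mapping explicitly opts out.
        if field_mapping.get("include_in_all", True):
            values.append(str(value))
    return " ".join(values)


properties = {"my_field": {"type": "text", "include_in_all": False}}
doc = {"title": "my article", "my_field": "internal note"}
print(build_all_field(doc, properties))
```

A search without an explicit field would then match "article" but not "internal", since my_field never reaches the combined value.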

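Customizing your own dynamic mapping strategy

The dynamic parameter

  • true: dynamically map unknown fields when they are encountered
  • false: ignore unknown fields
  • strict: raise an error when an unknown field is encountered

Example: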

PUT my_index
{
  "mappings": {
    "my_type":{
      "dynamic": "strict",
      "properties": {
        "title":{
          "type": "text"
        },
        "address":{
          "type": "object",
          "dynamic":"true"
        }
      }
    }
  }
}
PUT /my_index/my_type/1
{
  "title": "my article",
  "content": "this is my article",
  "address": {
    "province": "guangdong",
    "city": "guangzhou"
  }
}

{
  "error": {
    "root_cause": [
      {
        "type": "strict_dynamic_mapping_exception",
        "reason": "mapping set to strict, dynamic introduction of [content] within [my_type] is not allowed"
      }
    ],
    "type": "strict_dynamic_mapping_exception",
    "reason": "mapping set to strict, dynamic introduction of [content] within [my_type] is not allowed"
  },
  "status": 400
}
PUT /my_index/my_type/1
{
  "title": "my article",
  "address": {
    "province": "guangdong",
    "city": "guangzhou"
  }
}

GET /my_index/_mapping/my_type

{
  "my_index": {
    "mappings": {
      "my_type": {
        "dynamic": "strict",
        "properties": {
          "address": {
            "dynamic": "true",
            "properties": {
              "city": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "province": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          },
          "title": {
            "type": "text"
          }
        }
      }
    }
  }
}
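
The behaviour shown above (a strict root with a dynamic address object) can be modelled with a short Python sketch. It is a simplified illustration, not Elasticsearch code: apply_dynamic and StrictDynamicMappingError are made-up names, and every new leaf field is mapped as text regardless of its value.

```python
class StrictDynamicMappingError(Exception):
    pass


def apply_dynamic(mapping, document, inherited="true", path=""):
    """Update `mapping` in place according to its `dynamic` setting."""
    # An object without its own setting inherits the parent's.
    dynamic = str(mapping.get("dynamic", inherited)).lower()
    properties = mapping.setdefault("properties", {})
    for field, value in document.items():
        full_name = f"{path}{field}"
        if field not in properties:
            if dynamic == "strict":
                raise StrictDynamicMappingError(
                    f"dynamic introduction of [{full_name}] is not allowed")
            if dynamic == "false":
                continue  # unknown field is silently ignored
            # dynamic == "true": add the field (simplified: always text).
            properties[field] = {} if isinstance(value, dict) else {"type": "text"}
        if isinstance(value, dict):
            apply_dynamic(properties[field], value, dynamic, f"{full_name}.")
    return mapping


mapping = {
    "dynamic": "strict",
    "properties": {
        "title": {"type": "text"},
        "address": {"type": "object", "dynamic": "true"},
    },
}
apply_dynamic(mapping, {"title": "my article",
                        "address": {"province": "guangdong", "city": "guangzhou"}})
print(sorted(mapping["properties"]["address"]["properties"]))
try:
    apply_dynamic(mapping, {"title": "my article", "content": "this is my article"})
except StrictDynamicMappingError as err:
    print("rejected:", err)
```

The nested address object accepts new fields because its own dynamic setting overrides the strict setting on the root.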

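Customizing the dynamic mapping strategy

  1. date_detection
    By default Elasticsearch recognizes dates in certain formats, such as yyyy-MM-dd. If the first value that arrives for a field is something like 2017-01-01, the field is dynamically mapped as date, and a later value such as "hello world" then causes an error. The fix is to turn off date_detection for the type and, where needed, map specific fields as date by hand.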
    PUT /my_index/_mapping/my_type
    {
        "date_detection": false
    }
  2. dynamic template
    PUT my_index
    {
      "mappings": {
        "my_type":{
          "dynamic_templates": [
            {
              "en":{
                "match":"*_en",
                "match_mapping_type": "string",
                "mapping": {
                  "type":"text",
                  "analyzer":"english"
                  }
              }
            }
          ]
        }
      }
    }

    Index some sample data

    PUT /my_index/my_type/1
    {
      "title": "this is my first article"
    }
    
    PUT /my_index/my_type/2
    {
      "title_en": "this is my first article"
    }

    Field that matches no template

    GET /my_index/my_type/_search
    {
      "query": {
        "match": {
          "title":"is"
        }
      }
    }
    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 1,
        "max_score": 0.2824934,
        "hits": [
          {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "1",
            "_score": 0.2824934,
            "_source": {
              "title": "this is my first article"
            }
          }
        ]
      }
    }

    Field that matches the template

    GET /my_index/my_type/_search
    {
      "query": {
        "match": {
          "title_en":"is"
        }
      }
    }
    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 0,
        "max_score": null,
        "hits": []
      }
    }

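    Here title matches no dynamic template, so it gets the default standard analyzer, which does not remove stop words: "is" goes into the inverted index and a search for "is" finds the document. title_en matches the dynamic template and is analyzed with the english analyzer, which removes stop words: a stop word like "is" is filtered out, so a search for "is" finds nothing.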
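
How a dynamic template gets picked for a new field can be sketched in Python as follows. This is an illustrative model only: mapping_for and json_type are hypothetical helpers, only the match and match_mapping_type conditions are handled, and the fallback mapping is simplified to a plain text field.

```python
from fnmatch import fnmatchcase

DYNAMIC_TEMPLATES = [
    {"en": {"match": "*_en",
            "match_mapping_type": "string",
            "mapping": {"type": "text", "analyzer": "english"}}},
]


def json_type(value):
    # Rough equivalent of the type Elasticsearch detects from the JSON value.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, str):
        return "string"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "object"


def mapping_for(field, value, templates=DYNAMIC_TEMPLATES):
    # Templates are checked in order; the first match wins.
    for template in templates:
        for _name, rule in template.items():
            if "match" in rule and not fnmatchcase(field, rule["match"]):
                continue
            if "match_mapping_type" in rule and rule["match_mapping_type"] != json_type(value):
                continue
            return dict(rule["mapping"])
    # No template matched: simplified default (standard analyzer applies).
    return {"type": "text"}


print(mapping_for("title", "this is my first article"))
print(mapping_for("title_en", "this is my first article"))
```

Only title_en picks up the english analyzer, which is why the two searches above behave differently for the stop word "is".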

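Zero-downtime reindexing with scroll + bulk + index aliases

  1. Preparing the data
    The mapping of an existing field cannot be changed. To change a field, create a new index with the new mapping, read the data out in batches, and write it into the new index with the bulk API. The scroll API is recommended for the batch reads, and the reindex can run on multiple threads in parallel, with each scroll covering the data for one date range and handled by one thread.
    Suppose the data was first inserted relying on dynamic mapping, and some values happened to be in a date format such as 2017-01-01, so the title field was automatically mapped as date when it should really be a string (text) field.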
    DELETE /my_index
    PUT /my_index/my_type/1
    {
      "title": "2017-01-01"
    }
    
    PUT /my_index/my_type/2
    {
      "title": "2017-01-02"
    }
    
    PUT /my_index/my_type/3
    {
      "title": "2017-01-03"
    }
    GET /my_index/my_type/_search
    {
      "query": {
        "match_all": {}
      }
    }
    
    
    
    {
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": 1,
        "hits": [
          {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "2",
            "_score": 1,
            "_source": {
              "title": "2017-01-02"
            }
          },
          {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "1",
            "_score": 1,
            "_source": {
              "title": "2017-01-01"
            }
          },
          {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "3",
            "_score": 1,
            "_source": {
              "title": "2017-01-03"
            }
          }
        ]
      }
    }
  2. Later, indexing a title value that is a plain string fails
    PUT /my_index/my_type/4
    {
      "title": "my first article"
    }
    {
      "error": {
        "root_cause": [
          {
            "type": "mapper_parsing_exception",
            "reason": "failed to parse [title]"
          }
        ],
        "type": "mapper_parsing_exception",
        "reason": "failed to parse [title]",
        "caused_by": {
          "type": "illegal_argument_exception",
          "reason": "Invalid format: \"my first article\""
        }
      },
      "status": 400
    }
  3. Changing the type of title at this point is not possible
    PUT /my_index/_mapping/my_type
    {
      "properties": {
        "title": {
          "type": "text"
        }
      }
    }
    {
      "error": {
        "root_cause": [
          {
            "type": "illegal_argument_exception",
            "reason": "mapper [title] of different type, current_type [date], merged_type [text]"
          }
        ],
        "type": "illegal_argument_exception",
        "reason": "mapper [title] of different type, current_type [date], merged_type [text]"
      },
      "status": 400
    } 
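  4. The only option now is to reindex: create a new index, read the data out of the old index, and load it into the new one

  5. Suppose the old index is named old_index and the new one new_index, and the client Java application is already using old_index. Should the Java application be stopped, reconfigured to use new_index, and restarted? That would take the application down and reduce availability

  6. So give the Java application an alias that points at the old index. The application works against the goods_index alias, which for now actually points at the old my_index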
    PUT /my_index/_alias/goods_index
  7. Create a new index with title mapped as a string (text) field
    PUT my_index_new
    {
      "mappings": {
        "my_type":{
          "properties": {
            "title":{
              "type": "text"
            }
          }
        }
      }
    }
  8. Use the scroll API to read the data out in batches
    GET my_index/_search?scroll=1m
    {
      "query": {
        "match_all": {}
      },
      "sort": [
        "_doc"
      ],
      "size": 1
    }
    {
      "_scroll_id": "DnF1ZXJ5VGhlbkZldGNoBQAAAAAAARhWFjFMZHFMRnF4UVFxNHhnMk1waElfZ3cAAAAAAAEYWBYxTGRxTEZxeFFRcTR4ZzJNcGhJX2d3AAAAAAABGFoWMUxkcUxGcXhRUXE0eGcyTXBoSV9ndwAAAAAAARhXFjFMZHFMRnF4UVFxNHhnMk1waElfZ3cAAAAAAAEYWRYxTGRxTEZxeFFRcTR4ZzJNcGhJX2d3",
      "took": 1,
      "timed_out": false,
      "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
      },
      "hits": {
        "total": 3,
        "max_score": null,
        "hits": [
          {
            "_index": "my_index",
            "_type": "my_type",
            "_id": "2",
            "_score": null,
            "_source": {
              "title": "2017-01-02"
            },
            "sort": [
              0
            ]
          }
        ]
      }
    }
  9. Use the bulk API to write each batch returned by scroll into the new index
    POST _bulk
    {"index":{"_index": "my_index_new", "_type": "my_type", "_id": "2"}}
    {"title":"2017-01-02"}
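  10. Repeat steps 8 and 9, reading batch after batch and writing each one into the new index with the bulk API
  11. Switch the goods_index alias over to my_index_new. The Java application then uses the data in the new index through the alias without being stopped: zero downtime, high availability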
    POST _aliases
    {
      "actions": [
        {
          "remove": {
            "index": "my_index",
            "alias": "goods_index"
          }
        },
        {
          "add": {
            "index": "my_index_new",
            "alias": "goods_index"
          }
        }
      ]
    }
  12. Check that querying through the goods_index alias works
    GET /goods_index/my_type/_search
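
The scroll-then-bulk loop from steps 8 to 10 can be sketched in Python without tying it to a particular client library. In the sketch below, pages stands for any iterable of scroll batches (each a list of hits) and send_bulk for whatever function posts a body to the _bulk endpoint. Both are assumptions made for illustration, as are the function names.

```python
import json


def hits_to_bulk_body(hits, target_index):
    """Turn one scroll batch into an NDJSON body for POST _bulk."""
    lines = []
    for hit in hits:
        action = {"index": {"_index": target_index,
                            "_type": hit["_type"],
                            "_id": hit["_id"]}}
        lines.append(json.dumps(action))          # action line
        lines.append(json.dumps(hit["_source"]))  # document line
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def reindex(pages, target_index, send_bulk):
    """Copy every batch into target_index; stop at the first empty batch."""
    total = 0
    for hits in pages:
        if not hits:
            break
        send_bulk(hits_to_bulk_body(hits, target_index))
        total += len(hits)
    return total


# In-memory stand-ins for the scroll results and the bulk request.
pages = [
    [{"_type": "my_type", "_id": "2", "_source": {"title": "2017-01-02"}}],
    [{"_type": "my_type", "_id": "1", "_source": {"title": "2017-01-01"}}],
    [],
]
sent = []
print(reindex(pages, "my_index_new", sent.append), "documents copied")
print(sent[0], end="")
```

Passing the batch source and the sender in as arguments keeps the loop easy to test and makes it straightforward to run several such loops in parallel, one per date range, as suggested in step 1.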

Switching indices transparently to clients with an alias

Format:

POST /_aliases
{
    "actions" : [
        { "remove" : { "index" : "test1", "alias" : "alias1" } },
        { "add" : { "index" : "test2", "alias" : "alias1" } }
    ]
}
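
For applications that build this request in code, a small helper can generate the body. The Python sketch below is one way to do it. build_alias_swap is a hypothetical name, and the resulting dictionary still has to be sent to POST /_aliases by whatever HTTP client is in use.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Build an _aliases body that moves `alias` from old_index to new_index."""
    # Both actions travel in one request so the alias is never left dangling.
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


body = build_alias_swap("goods_index", "my_index", "my_index_new")
print(json.dumps(body, indent=2))
```

Putting the remove and add actions in a single request is what makes the switch invisible to clients that only ever address the alias.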

 
