Elasticsearch Study Notes (35): Elasticsearch Index Management

Basic index operations

Creating an index

PUT /{index}
{
  "settings": {},
  "mappings": {
    "properties": {
    }
  }
}

Example of creating an index:

PUT /test_index
{
  "settings": {
    "number_of_replicas": 1,
    "number_of_shards": 5
  },
  "mappings": {
    "properties": {
      "field1": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "ctime": {
        "type": "date"
      }
    }
  }
}

Updating index settings

PUT /{index}/_settings
{
    "settings": {}
}

PUT /test_index/_settings
{
  "settings": {
    "number_of_replicas": 2
  }
}

Deleting an index

DELETE /{index}

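The delete index API can also be applied to several indices at once by passing a comma-separated list, or to all indices by using _all or * as the index name (be careful!).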

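To disallow deleting indices through wildcards or _all, set action.destructive_requires_name to true in the configuration. This setting can also be changed through the cluster update settings API.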
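
As a rough sketch of the second option, the request for the cluster update settings API can be assembled with the Python standard library. The node address http://localhost:9200 and the helper name build_destructive_guard_request are assumptions made for illustration; the request is only built here, not sent.

```python
import json
import urllib.request

# Assumption: a node reachable at this address; adjust for your cluster.
ES_URL = "http://localhost:9200"


def build_destructive_guard_request(enabled=True):
    """Build (without sending) a cluster update settings request."""
    body = {"persistent": {"action.destructive_requires_name": enabled}}
    return urllib.request.Request(
        url=ES_URL + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_destructive_guard_request()
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending is left to the caller: urllib.request.urlopen(request)
```

Using "persistent" keeps the setting across cluster restarts; whether that is the right scope depends on your environment.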

Modifying analyzers and defining your own analyzer

Elasticsearch ships with a variety of built-in analyzers that can be used in any index without further configuration:

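standard analyzer:
The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
simple analyzer:
The simple analyzer divides text into terms whenever it encounters a character that is not a letter, then lowercases every term.
whitespace analyzer:
The whitespace analyzer divides text into terms whenever it encounters a whitespace character. It does not lowercase terms.
stop analyzer:
The stop analyzer is like the simple analyzer, but also supports removing stop words.
keyword analyzer:
The keyword analyzer is a "noop" analyzer that accepts whatever text it is given and outputs exactly the same text as a single term. In other words, the text is not tokenized, which suits exact matching.
pattern analyzer:
The pattern analyzer uses a regular expression to split text into terms. It supports lowercasing and stop words.
language analyzers:
Elasticsearch provides many language-specific analyzers, such as english or french.
fingerprint analyzer:
The fingerprint analyzer is a specialist analyzer that creates a fingerprint which can be used for duplicate detection.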
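
To make the differences concrete, here is a small Python imitation of four of these analyzers. It is only a sketch of the behavior described above, not Elasticsearch's implementation: the function names, the sample sentence and the tiny stop word list are all made up for illustration.

```python
import re

# Tiny illustrative stop word list; Elasticsearch's own lists are longer.
STOP_WORDS = {"a", "an", "the", "is", "in"}


def simple_analyzer(text):
    # Break on every character that is not a letter, then lowercase.
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


def whitespace_analyzer(text):
    # Break on whitespace only; case and punctuation are left untouched.
    return text.split()


def stop_analyzer(text):
    # Same as simple_analyzer, followed by stop word removal.
    return [term for term in simple_analyzer(text) if term not in STOP_WORDS]


def keyword_analyzer(text):
    # No tokenization at all: the whole input is one term.
    return [text]


sample = "The QUICK fox-2 is in the Box"
for analyzer in (simple_analyzer, whitespace_analyzer, stop_analyzer, keyword_analyzer):
    print(analyzer.__name__, analyzer(sample))
```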

Changing analyzer settings

Enable English stop word removal on top of the standard analyzer

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "es_std": {
          "type": "standard",
          "stopwords": "_english_"
        }
      }
    }
  }
}
GET /my_index/_analyze
{
  "analyzer": "standard",
  "text": "a dog is in the house"
}
{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "is",
      "start_offset" : 6,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "in",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "the",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
GET /my_index/_analyze
{
  "analyzer": "es_std",
  "text": "a dog is in the house"
}
{
  "tokens" : [
    {
      "token" : "dog",
      "start_offset" : 2,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "house",
      "start_offset" : 16,
      "end_offset" : 21,
      "type" : "<ALPHANUM>",
      "position" : 5
    }
  ]
}
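
Note that in the second response the surviving tokens keep positions 1 and 5: removed stop words still consume a position. A minimal Python sketch of that bookkeeping follows, assuming a crude regex tokenizer in place of the real standard tokenizer and a hand-picked subset of English stop words; the name analyze is hypothetical.

```python
import re

# Assumption: a small subset of English stop words, enough for this sentence.
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "the"})


def analyze(text, stop_words=frozenset()):
    """Tokenize on word characters, lowercase, and drop stop words."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        term = match.group().lower()
        if term in stop_words:
            continue  # the term is dropped, but its position number is used up
        tokens.append({
            "token": term,
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


for token in analyze("a dog is in the house", STOP_WORDS):
    print(token)
```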

Defining a custom analyzer

PUT /test_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "&_to_and": {
          "type": "mapping",
          "mappings": ["&=>and"]
        }
      },
      "filter": {
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["the", "a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip", "&_to_and"],
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stopwords"]
        }
      }
    }
  }
}
GET /test_index/_analyze
{
  "text": "tom&jerry are a friend in the house, <a>, HAHA!!",
  "analyzer": "my_analyzer"
}
{
  "tokens" : [
    {
      "token" : "tomandjerry",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "are",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "friend",
      "start_offset" : 16,
      "end_offset" : 22,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "in",
      "start_offset" : 23,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "house",
      "start_offset" : 30,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "haha",
      "start_offset" : 42,
      "end_offset" : 46,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}
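
The order of the three stages (character filters, then the tokenizer, then token filters) can be imitated in a few lines of Python. This is a simplified sketch rather than what Elasticsearch actually runs: the regexes are crude stand-ins for html_strip and the standard tokenizer, and the sample text is assumed to contain an <a> tag so that html_strip has something to remove.

```python
import re


def html_strip(text):
    # Crude stand-in for the html_strip character filter.
    return re.sub(r"<[^>]+>", "", text)


def ampersand_to_and(text):
    # Mirrors the "&=>and" mapping character filter.
    return text.replace("&", "and")


def tokenize(text):
    # Rough approximation of the standard tokenizer.
    return re.findall(r"\w+", text)


def my_analyzer(text):
    # 1. character filters run on the raw string, in the configured order
    for char_filter in (html_strip, ampersand_to_and):
        text = char_filter(text)
    # 2. the tokenizer splits the filtered string into terms
    terms = tokenize(text)
    # 3. token filters run on the terms: lowercase, then the custom stop list
    terms = [term.lower() for term in terms]
    return [term for term in terms if term not in {"the", "a"}]


print(my_analyzer("tom&jerry are a friend in the house, <a>, HAHA!!"))
```

Because the "&" replacement happens before tokenization, "tom&jerry" reaches the tokenizer as the single word "tomandjerry".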

Customizing your own dynamic mapping strategy

The dynamic parameter

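true: unknown fields are dynamically mapped
false: unknown fields are ignored (they are not indexed, but they remain in _source)
strict: unknown fields cause an error
Example: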

PUT /test_index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": {
        "type": "text"
      },
      "address": {
        "type": "object",
        "dynamic": "true"
      }
    }
  }
}
PUT /test_index/_doc/1
{
  "title": "my article",
  "content": "this is my article",
  "address": {
    "province": "guangdong",
    "city": "guangzhou"
  }
}
{
  "error": {
    "root_cause": [
      {
        "type": "strict_dynamic_mapping_exception",
        "reason": "mapping set to strict, dynamic introduction of [content] within [_doc] is not allowed"
      }
    ],
    "type": "strict_dynamic_mapping_exception",
    "reason": "mapping set to strict, dynamic introduction of [content] within [_doc] is not allowed"
  },
  "status": 400
}
PUT /test_index/_doc/1
{
  "title": "my article",
  "address": {
    "province": "guangdong",
    "city": "guangzhou"
  }
}
{
  "_index" : "test_index",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 0,
  "_primary_term" : 1
}
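
The example shows that the setting is inherited from the parent object and can be overridden per object: the root is strict, while address accepts new fields. Below is a toy Python model of that rule. It is a sketch under simplifying assumptions (every new non-object value is mapped as text) and the names apply_dynamic and StrictDynamicMappingError are hypothetical.

```python
import copy


class StrictDynamicMappingError(Exception):
    """Raised when a 'strict' object receives a field that is not mapped."""


def apply_dynamic(mapping, doc, inherited="true", path=""):
    """Return a copy of mapping, updated the way the dynamic setting dictates."""
    mapping = copy.deepcopy(mapping)
    dynamic = str(mapping.get("dynamic", inherited)).lower()
    properties = mapping.setdefault("properties", {})
    for field, value in doc.items():
        if field not in properties:
            if dynamic == "strict":
                raise StrictDynamicMappingError(path + field)
            if dynamic == "false":
                continue  # the field is not mapped; its sub-fields are skipped too
            # Simplified type guess: objects stay objects, everything else is text.
            properties[field] = {"type": "object" if isinstance(value, dict) else "text"}
        if isinstance(value, dict):
            properties[field] = apply_dynamic(
                properties[field], value, dynamic, path + field + "."
            )
    return mapping


MAPPING = {
    "dynamic": "strict",
    "properties": {
        "title": {"type": "text"},
        "address": {"type": "object", "dynamic": "true"},
    },
}
GOOD_DOC = {"title": "my article",
            "address": {"province": "guangdong", "city": "guangzhou"}}

updated = apply_dynamic(MAPPING, GOOD_DOC)
print(updated["properties"]["address"]["properties"])
```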

date_detection

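By default, Elasticsearch recognizes dates written in certain formats, such as yyyy-MM-dd. If the first value that arrives for a field looks like 2017-01-01, the field is dynamically mapped as date, and a later value such as "hello world" then causes an error. The fix is to turn off date_detection in the index mapping and, where needed, explicitly map individual fields as date.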

PUT /{index}
{
  "mappings": {
    "date_detection": false
  }
}
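
The effect of the flag can be pictured with a small Python function. This is a simplification: it checks only the yyyy-MM-dd format, whereas Elasticsearch tries its configured dynamic date formats, and the names infer_string_type and looks_like_date are hypothetical.

```python
from datetime import datetime


def looks_like_date(value):
    # Simplification: only the yyyy-MM-dd format is checked here.
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def infer_string_type(value, date_detection=True):
    """Type that dynamic mapping would pick for a new string field."""
    if date_detection and looks_like_date(value):
        return "date"
    return "text"


print(infer_string_type("2017-01-01"))
print(infer_string_type("2017-01-01", date_detection=False))
print(infer_string_type("hello world"))
```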

dynamic template

"dynamic_templates": [
    {
        "my_template_name": {
            ... match conditions ...
            "mapping": {...}
        }
    }
]

Example:

PUT /test_index
{
  "mappings": {
    "dynamic_templates": [
      {
        "en": {
          "match": "*_en",
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "analyzer": "english"
          }
        }
      }  
    ]
  }
}
PUT /test_index/_doc/1
{
  "title": "this is my first article"
}
PUT /test_index/_doc/2
{
  "title_en": "this is my first article"
}
GET /test_index/_mapping
{
  "test_index" : {
    "mappings" : {
      "dynamic_templates" : [
        {
          "en" : {
            "match" : "*_en",
            "match_mapping_type" : "string",
            "mapping" : {
              "analyzer" : "english",
              "type" : "text"
            }
          }
        }
      ],
      "properties" : {
        "title" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "title_en" : {
          "type" : "text",
          "analyzer" : "english"
        }
      }
    }
  }
}
GET /test_index/_search?q=is
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "this is my first article"
        }
      }
    ]
  }
}

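Here title does not match any dynamic template, so it gets the default standard analyzer, which does not filter stop words: "is" goes into the inverted index, and a search for "is" finds the document. title_en matches the dynamic template and therefore gets the english analyzer, which does filter stop words: "is" is dropped, so a search for "is" does not find that document.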
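
The template lookup itself can be sketched in Python. The assumptions here: only string fields are handled, fnmatchcase stands in for the simple wildcard matching of "match" (fnmatch accepts a slightly richer pattern syntax), and the function name mapping_for_string_field is made up.

```python
from fnmatch import fnmatchcase

DYNAMIC_TEMPLATES = [
    {"en": {"match": "*_en", "match_mapping_type": "string",
            "mapping": {"type": "text", "analyzer": "english"}}}
]

# What dynamic mapping produces for a string when no template matches.
DEFAULT_STRING_MAPPING = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}


def mapping_for_string_field(field_name, templates=DYNAMIC_TEMPLATES):
    for entry in templates:  # templates are tried in order
        for template in entry.values():
            type_ok = template.get("match_mapping_type", "string") == "string"
            name_ok = fnmatchcase(field_name, template.get("match", "*"))
            if type_ok and name_ok:
                return template["mapping"]  # the first matching template wins
    return DEFAULT_STRING_MAPPING


print(mapping_for_string_field("title"))
print(mapping_for_string_field("title_en"))
```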

Zero-downtime reindexing with scroll + bulk + index aliases

1. Reindexing

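The mapping of an existing field cannot be modified. To change a field, create a new index with the new mapping, read the data out of the old index in batches, and write it into the new index with the bulk API. Use the scroll API for the batch reads, and reindex with several threads concurrently: let each scroll query cover one date range and hand it to one thread.
(1) At first, data is inserted relying on dynamic mapping. Unfortunately some of the values look like dates, such as 2017-01-01, so the title field is automatically mapped as date when it should really be a string type.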

PUT /test_index/_doc/1
{
  "title": "2017-01-01"
}

GET /test_index/_mapping

{
  "test_index" : {
    "mappings" : {
      "properties" : {
        "title" : {
          "type" : "date"
        }
      }
    }
  }
}

(2) Later, indexing a document whose title is an ordinary string fails

PUT /test_index/_doc/2
{
  "title": "my first article"
}

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [title] of type [date] in document with id '2'"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [title] of type [date] in document with id '2'",
    "caused_by": {
      "type": "illegal_argument_exception",
      "reason": "failed to parse date field [my first article] with format [strict_date_optional_time||epoch_millis]",
      "caused_by": {
        "type": "date_time_parse_exception",
        "reason": "Failed to parse with all enclosed parsers"
      }
    }
  },
  "status": 400
}

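(3) At this point the type of title can no longer be changed. Re-sending the index creation request with the desired mapping fails because the index already exists, and the type of an existing field cannot be changed in place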

PUT /test_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      }
    }
  }
}
{
  "error": {
    "root_cause": [
      {
        "type": "resource_already_exists_exception",
        "reason": "index [test_index/mZALkQ8IQV67SjCVqkhq4g] already exists",
        "index_uuid": "mZALkQ8IQV67SjCVqkhq4g",
        "index": "test_index"
      }
    ],
    "type": "resource_already_exists_exception",
    "reason": "index [test_index/mZALkQ8IQV67SjCVqkhq4g] already exists",
    "index_uuid": "mZALkQ8IQV67SjCVqkhq4g",
    "index": "test_index"
  },
  "status": 400
}

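(4) The only option now is to reindex: create a new index, read the data out of the old index, and import it into the new one.
(5) Suppose the old index is named old_index and the new one new_index, and the application is already working against old_index. Do we really have to stop the application, change the index it uses to new_index, and restart it?
(6) This is where aliases come in: give the application an alias that points to the old index and let the application use the alias. For now the alias still resolves to the old index.
Format: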

POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test",
        "alias": "alias1"
      }
    }
  ]
}
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "test_index",
        "alias": "test_index_alias"
      }
    }
  ]
}

(7) Create a new index in which title is mapped as a string type (text)

PUT /test_index_new
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text"
      }
    }
  }
}

(8) Use the scroll API to read the data out in batches

GET /test_index/_search?scroll=1m
{
  "query": {
    "match_all": {}
  },
  "sort": [
    "_doc"
  ],
  "size": 1
}
{
  "_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAACz3UWUC1iLVRFdnlRT3lsTXlFY01FaEFwUQ==",
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : null,
        "_source" : {
          "title" : "2017-01-01"
        },
        "sort" : [
          0
        ]
      }
    ]
  }
}
POST /_search/scroll
{
  "scroll_id": "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAC0GYWUC1iLVRFdnlRT3lsTXlFY01FaEFwUQ==",
  "scroll": "1m"
}
{
  "_scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAC0GYWUC1iLVRFdnlRT3lsTXlFY01FaEFwUQ==",
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

(9) Use the bulk API to write each batch returned by scroll into the new index

POST /_bulk
{"index": {"_index": "test_index_new", "_id": "1"}}
{"title": "2017-01-01"}

GET /test_index_new/_search
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_index_new",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "2017-01-01"
        }
      }
    ]
  }
}
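
Turning one batch of scroll hits into a bulk request body is mechanical: each document becomes an action line followed by a source line, each on a single line and each ending with a newline. A minimal Python sketch, where the helper name hits_to_bulk_body is made up and the hit is the one returned by the scroll request above:

```python
import json


def hits_to_bulk_body(hits, target_index):
    """Build a newline-delimited _bulk body that indexes hits into target_index."""
    lines = []
    for hit in hits:
        # Action line: keep the original _id so that re-running is idempotent.
        lines.append(json.dumps({"index": {"_index": target_index, "_id": hit["_id"]}}))
        # Source line: the document itself, on one line.
        lines.append(json.dumps(hit["_source"]))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)


scroll_hits = [{"_index": "test_index", "_id": "1", "_source": {"title": "2017-01-01"}}]
print(hits_to_bulk_body(scroll_hits, "test_index_new"), end="")
```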

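(10) Repeat the loop: read batch after batch and bulk-write each one into the new index
(11) Switch test_index_alias over to test_index_new. The application immediately uses the data in the new index through the alias, with no downtime, so availability is preserved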

POST /_aliases
{
    "actions" : [
        { "remove" : { "index" : "test_index", "alias" : "test_index_alias" } },
        { "add" : { "index" : "test_index_new", "alias" : "test_index_alias" } }
    ]
}

GET /test_index_alias/_search

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_index_new",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "2017-01-01"
        }
      }
    ]
  }
}
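
Steps (8) to (11) can be rehearsed without a cluster. The sketch below uses plain Python lists and dictionaries as stand-ins for the indices and the alias table; the function names and the second sample document are hypothetical, and a real client would issue scroll, _bulk and _aliases requests instead.

```python
import copy


def scroll_batches(documents, size):
    """Yield fixed-size batches in order, the way repeated scroll calls would."""
    for start in range(0, len(documents), size):
        yield documents[start:start + size]


def reindex(source_docs, batch_size=1):
    """Copy every document into a new in-memory 'index', one batch at a time."""
    target = {}
    for batch in scroll_batches(source_docs, batch_size):
        for hit in batch:  # in a real client, one batch is one _bulk request
            target[hit["_id"]] = hit["_source"]
    return target


def apply_alias_actions(aliases, actions):
    """Apply remove/add actions to a copy, so a failure leaves the input untouched."""
    updated = copy.deepcopy(aliases)
    for action in actions:
        (kind, spec), = action.items()
        members = updated.setdefault(spec["alias"], set())
        if kind == "remove":
            members.remove(spec["index"])  # KeyError if the alias did not point there
        elif kind == "add":
            members.add(spec["index"])
        else:
            raise ValueError("unsupported action: " + kind)
    return updated


# Hypothetical sample data standing in for the contents of test_index.
old_docs = [
    {"_id": "1", "_source": {"title": "2017-01-01"}},
    {"_id": "2", "_source": {"title": "2017-01-02"}},
]
new_index = reindex(old_docs, batch_size=1)

aliases = {"test_index_alias": {"test_index"}}
aliases = apply_alias_actions(aliases, [
    {"remove": {"index": "test_index", "alias": "test_index_alias"}},
    {"add": {"index": "test_index_new", "alias": "test_index_alias"}},
])
print(new_index)
print(aliases)
```

Working on a copy of the alias table mirrors the point of sending remove and add in one _aliases request: readers see either the old target or the new one, never a state in which the alias points nowhere.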

2. Switching indices transparently to clients with aliases

Format:

POST /_aliases
{
    "actions" : [
        { "remove" : { "index" : "test1", "alias" : "alias1" } },
        { "add" : { "index" : "test2", "alias" : "alias1" } }
    ]
}

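Note: the _aliases request body is ordinary JSON, so the entries in actions may span several lines, as in the earlier alias examples. The rule that JSON must not be broken across lines applies to the _bulk API, whose body is newline-delimited: each action line and each document must sit on a single line, otherwise the request cannot be parsed and an error is returned.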
