Mapping: when creating an index, you can define field types and related attributes in advance.
Elasticsearch guesses the field mapping you want from the basic JSON types of the source data and turns the input into searchable index entries. A mapping is our own definition of each field's data type; it also tells Elasticsearch how to index the data and whether it can be searched.
Purpose: it makes the resulting index more precise and complete.
Two kinds: static (explicit) mapping, written by hand when the index is created, and dynamic mapping, inferred by Elasticsearch as documents arrive; both are demonstrated below.
| Category | Field types |
| --- | --- |
| String | `text` and `keyword`. A `text` field is analyzed: tokenized, stemmed, and put into the inverted index. A `keyword` field is a plain string indexed as a single term, found only by an exact match |
| Numeric | `long`, `integer`, `short`, `byte`, `double`, `float` |
| Date | `date` |
| Boolean | `boolean` |
| Binary | `binary` |
| Complex | `object`, `nested` |
| Geo | `geo_point`, `geo_shape` |
| Specialized | `ip`, `completion` |
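For example, a `keyword` field is indexed as one unanalyzed term, so a partial query finds nothing: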
PUT my-index-000002
{
"mappings": {
"properties": {
"tags": {
"type": "keyword"
}
}
}
}
PUT my-index-000002/_doc/1
{
"tags":"yangb yan ping"
}
GET my-index-000002/_search?q=tags:yang
# Query output
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 0,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
}
}
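Zero hits: because `tags` is a `keyword` field, the whole string "yangb yan ping" was indexed as a single term, and `yang` does not equal that term. Only the complete value matches; as a sketch (this follow-up query is not part of the original example), a term query against the same index does hit:
GET my-index-000002/_search
{
  "query": {
    "term": { "tags": "yangb yan ping" }
  }
}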
| Parameter | Description | Applies to |
| --- | --- | --- |
| `store` | `true` stores the field value separately, `false` does not; default `false` | all |
| `index` | `true` means the field is indexed and searchable, `false` means it is not; default `true` | string |
| `null_value` | a substitute value to index when the field is null, e.g. "NA" | all |
| `analyzer` | the analyzer used at index and search time; defaults to `standard`, alternatives include `whitespace`, `simple`, `english` | all |
| `include_in_all` | ES used to define a special `_all` field so every field could be searched together; set `false` to keep a field out of it (the `_all` field was removed in 7.0; listed here for older versions) | all |
| `format` | pattern for date strings | date |
More parameters: Mapping parameters | Elasticsearch Guide [8.2] | Elastic
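A minimal sketch combining a few of these parameters (the index and field names here are made up for illustration):
PUT my-index-params
{
  "mappings": {
    "properties": {
      "status": { "type": "keyword", "null_value": "NA" },
      "body":   { "type": "text", "analyzer": "whitespace", "store": true }
    }
  }
}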
By default, field values are indexed to make them searchable, but they are not stored. This means the field can be queried, but the original value cannot be retrieved from it.
Usually this doesn't matter, because the value is already part of the `_source` field, which is stored by default. If you only want to retrieve one or a few field values rather than the whole source, source filtering achieves that.
In some situations storing a field does make sense. For example, if a document has a title, a date, and a very large content field, you may want to retrieve just the title and date without having to extract them from a large `_source` field:
PUT my-index-000001
{
"mappings": {
"properties": {
"title": {
"type": "text",
"store": true
},
"date": {
"type": "date",
"store": true
},
"content": {
"type": "text"
}
}
}
}
PUT my-index-000001/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
GET my-index-000001/_search?q=content:long
{
"stored_fields": [ "title", "date" ]
}
# Query output
{
"took" : 11,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my-index-000001",
"_type" : "_doc",
"_id" : "1",
"_score" : 0.2876821,
"fields" : {
"date" : [
"2015-01-01T00:00:00.000Z"
],
"title" : [
"Some short title"
]
}
}
]
}
}
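Note that the response carries a `fields` object instead of `_source`: when `stored_fields` is specified, only the explicitly stored `title` and `date` come back, and the large `content` value is never shipped over the wire.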
# Create an index, setting field types explicitly
PUT jobbole                      # create the index, named jobbole
{
  "mappings": {                  # mappings: define the field types
    "job": {                     # mapping type (legacy syntax, see the note below)
      "properties": {            # field definitions
        "title":{
          "type": "text"         # text: analyzed and put into the inverted index
        },
        "salary_min":{
          "type": "integer"      # integer: numeric type
        },
        "city":{
          "type": "keyword"      # keyword: plain string, exact match only
        },
        "company":{              # company is a nested object field
          "properties":{         # types of the inner fields
            "name":{
              "type":"text"
            },
            "company_addr":{
              "type":"text"
            },
            "employee_count":{
              "type":"integer"
            }
          }
        },
        "publish_date":{
          "type": "date",        # date type
          "format":"yyyy-MM-dd"  # accepted date pattern
        },
        "comments":{
          "type": "integer"
        }
      }
    }
  }
}
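Note: this example uses the legacy pre-7.0 syntax, where a custom mapping type (`job` here) sits between `mappings` and `properties`. From Elasticsearch 7.0 on, mapping types are gone and `properties` goes directly under `mappings`, as in the earlier examples.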
# Index a document (the equivalent of inserting a row into a database)
PUT jobbole/job/1                # index/type/id
{
  "title":"python分布式爬虫开发",
  "salary_min":15000,
  "city":"北京",
  "company":{                    # nested field
    "name":"百度",
    "company_addr":"北京市软件园",
    "employee_count":50
  },
  "publish_date":"2017-04-16",   # zero-padded to satisfy the yyyy-MM-dd format
  "comments":15
}
Insert a few documents and ES automatically creates an index and a corresponding mapping for us; the mapping records each field's data type along with settings such as how it is analyzed.
// Create-document request
PUT localhost:9200/blog/_doc/1
{
"title":"内蒙古科右中旗:沃野千里织锦绣---修改操作",
"description":"内蒙古兴安盟科右中旗巴彦呼舒镇乌逊嘎查整洁的村容村貌。近年来,内蒙古自治区兴安盟科尔沁右翼中旗按照“产业兴旺、生态宜居、乡风文明、治理有效、生活富裕”的总要求,坚持科学规划、合理布...国际在线",
"publish_time":"2020-07-08"
}
View the dynamic mapping
GET http://localhost:9200/blog/_mapping
{
"blog" : {
"mappings" : {
"properties" : {
"description" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"id" : {
"type" : "long"
},
"publish_time" : {
"type" : "date"
},
"title" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
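This is the default dynamic mapping at work: string values become `text` fields with a `keyword` sub-field capped at 256 characters, and "2020-07-08" was recognized as a `date` through date detection. The `id` entry presumably came from an earlier document (not shown here) that carried a numeric id field.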
Create an index
PUT http://localhost:9200/book
{
"acknowledged" : true
}
Create a mapping
PUT localhost:9200/book/_mapping
{
"properties":{
"name":{
"type":"text"
},
"description":{
"type":"text",
"analyzer":"english",
"search_analyzer":"english"
},
"pic":{
"type":"text",
"index":"false"
},
"publish_time":{
"type":"date"
}
}
}
Insert a document
PUT localhost:9200/book/_doc/1
{
"name":"Java核心技术",
"description":"本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic":"item.jd.com",
"publish_time":"2022-04-19"
}
{
"_index" : "book",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
Test query: localhost:9200/book/_search?q=name:java
GET localhost:9200/book/_search?q=name:java
{
"took": 1126,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "book",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "Java核心技术",
"description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic": "item.jd.com",
"publish_time": "2022-04-19"
}
}
]
}
}
Test query: localhost:9200/book/_search?q=description:java
GET localhost:9200/book/_search?q=description:java
{
"took": 15,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.45390707,
"hits": [
{
"_index": "book",
"_type": "_doc",
"_id": "1",
"_score": 0.45390707,
"_source": {
"name": "Java核心技术",
"description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic": "item.jd.com",
"publish_time": "2022-04-19"
}
}
]
}
}
Test query: localhost:9200/book/_search?q=pic:item.jd.com
GET localhost:9200/book/_search?q=pic:item.jd.com
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
The tests show that name and description both support full-text search, while pic cannot be used as a query condition.
A mapping can only be established by hand when the index is created, and field mappings can be added afterwards, but an existing field mapping cannot be updated:
the data already indexed was analyzed and stored according to the old mapping, and there is no way to re-handle that existing data in place. The usual way out is to create a new index with the desired mapping and reindex into it, as sketched below.
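A minimal sketch of that workaround (the new index name `book_v2` and its mapping are made up for illustration):
PUT book_v2
{
  "mappings": {
    "properties": {
      "pic": { "type": "keyword" }
    }
  }
}
POST _reindex
{
  "source": { "index": "book" },
  "dest":   { "index": "book_v2" }
}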
Add a mapping for a new field
PUT localhost:9200/book/_mapping
{
"properties":{
"ISBN":{
"type":"text",
"fields":{
"raw":{
"type":"keyword"
}
}
}
}
}
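The `fields` block defines a multi-field: `ISBN` itself is analyzed as `text`, while `ISBN.raw` keeps the exact string as a `keyword`, so a query such as `q=ISBN.raw:12800420` matches only the complete value.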
Update the document
PUT localhost:9200/book/_doc/1
{
"name":"Java核心技术",
"description":"本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic":"item.jd.com",
"publish_time":"2022-04-19",
"ISBN":"12800420"
}
Search by ISBN
GET localhost:9200/book/_search?q=ISBN:12800420
{
"took": 949,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 0.2876821,
"hits": [
{
"_index": "book",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"name": "Java核心技术",
"description": "本书由拥有20多年教学与研究经验的资深Java技术专家撰写(获Jolt大奖),是程序员的优选Java指南。本版针对Java SE 9、10和 11全面更新。",
"pic": "item.jd.com",
"publish_time": "2022-04-19",
"ISBN": "12800420"
}
}
]
}
}
A tokenizer accepts a string as input, splits it into individual words, or tokens (possibly discarding some characters such as punctuation), and outputs a token stream.
What is interesting is the algorithm used to identify words. The whitespace tokenizer simply splits on whitespace characters (spaces, tabs, newlines and so on) and assumes that contiguous non-whitespace characters make up one token.
In short, an analyzer is a tool that splits a piece of user input into words according to some logic. Commonly used built-in analyzers:
standard analyzer, simple analyzer, whitespace analyzer, stop analyzer, language analyzer, pattern analyzer
The standard analyzer is the default; it is used when no analyzer is specified.
POST http://127.0.0.1:9200/_analyze
{
"analyzer":"standard",
"text":"我是程序员"
}
{
"tokens": [
{
"token": "我",
"start_offset": 0,
"end_offset": 1,
"type": "",
"position": 0
},
{
"token": "是",
"start_offset": 1,
"end_offset": 2,
"type": "",
"position": 1
},
{
"token": "程",
"start_offset": 2,
"end_offset": 3,
"type": "",
"position": 2
},
{
"token": "序",
"start_offset": 3,
"end_offset": 4,
"type": "",
"position": 3
},
{
"token": "员",
"start_offset": 4,
"end_offset": 5,
"type": "",
"position": 4
}
]
}
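The standard analyzer has no notion of Chinese words, so it splits the sentence into single characters (unigrams). For real Chinese search you normally install a Chinese-aware analyzer such as IK, covered below.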
The simple analyzer breaks text into terms whenever it meets a character that is not a letter, and lowercases every term.
POST http://127.0.0.1:9200/_analyze
{
"analyzer":"simple",
"text":"this is a book"
}
{
"tokens": [
{
"token": "this",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "word",
"position": 2
},
{
"token": "book",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 3
}
]
}
The whitespace analyzer breaks text into terms whenever it meets a whitespace character.
POST http://127.0.0.1:9200/_analyze
{
"analyzer":"whitespace",
"text":"this is a book"
}
{
"tokens": [
{
"token": "this",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "is",
"start_offset": 5,
"end_offset": 7,
"type": "word",
"position": 1
},
{
"token": "a",
"start_offset": 8,
"end_offset": 9,
"type": "word",
"position": 2
},
{
"token": "book",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 3
}
]
}
The stop analyzer is much like the simple analyzer; the only difference is that it also removes stop words, using the english stop word list by default.
stopwords: a predefined list of stop words, e.g. (the, a, an, this, of, at) and so on.
POST http://127.0.0.1:9200/_analyze
{
"analyzer":"stop",
"text":"this is a book"
}
{
"tokens": [
{
"token": "book",
"start_offset": 10,
"end_offset": 14,
"type": "word",
"position": 3
}
]
}
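Only book survives the stop list, and note that it keeps its original position (3), so position-aware queries still line up across the removed stop words.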
Installation: download from https://github.com/medcl/elasticsearch-analysis-ik/releases and unzip it into es/plugins/ik.
Using the IK analyzer
Configuration files
| File | Description |
| --- | --- |
| IKAnalyzer.cfg.xml | configures custom dictionaries |
| main.dic | IK's built-in Chinese dictionary, over 270,000 entries; any word in it is kept together as one token |
| preposition.dic | prepositions |
| quantifier.dic | measure words and other unit-related words |
| suffix.dic | suffixes |
| surname.dic | Chinese surnames |
| stopword.dic | English stop words |
The default IKAnalyzer.cfg.xml:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <!-- custom extension dictionaries go here -->
    <entry key="ext_dict"></entry>
    <!-- custom extension stop word dictionaries go here -->
    <entry key="ext_stopwords"></entry>
</properties>
Tokenize with ik_smart:
POST localhost:9200/_analyze
{
"analyzer":"ik_smart",
"text":"中华人民共和国国歌"
}
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "国歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 1
}
]
}
Tokenize with ik_max_word:
POST localhost:9200/_analyze
{
"analyzer":"ik_max_word",
"text":"中华人民共和国国歌"
}
{
"tokens": [
{
"token": "中华人民共和国",
"start_offset": 0,
"end_offset": 7,
"type": "CN_WORD",
"position": 0
},
{
"token": "中华人民",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
},
{
"token": "中华",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 2
},
{
"token": "华人",
"start_offset": 1,
"end_offset": 3,
"type": "CN_WORD",
"position": 3
},
{
"token": "人民共和国",
"start_offset": 2,
"end_offset": 7,
"type": "CN_WORD",
"position": 4
},
{
"token": "人民",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 5
},
{
"token": "共和国",
"start_offset": 4,
"end_offset": 7,
"type": "CN_WORD",
"position": 6
},
{
"token": "共和",
"start_offset": 4,
"end_offset": 6,
"type": "CN_WORD",
"position": 7
},
{
"token": "国",
"start_offset": 6,
"end_offset": 7,
"type": "CN_CHAR",
"position": 8
},
{
"token": "国歌",
"start_offset": 7,
"end_offset": 9,
"type": "CN_WORD",
"position": 9
}
]
}
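ik_smart produces the coarsest segmentation (roughly speaking, its tokens do not overlap), while ik_max_word exhaustively emits every word it can recognize, overlaps included. A common convention is to index with ik_max_word and search with ik_smart; a sketch, with a made-up index name:
PUT my-ik-index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}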
Custom dictionary
POST localhost:9200/_analyze
{
"analyzer":"ik_smart",
"text":"魔兽世界"
}
// tokens
{
"tokens": [
{
"token": "魔兽",
"start_offset": 0,
"end_offset": 2,
"type": "CN_WORD",
"position": 0
},
{
"token": "世界",
"start_offset": 2,
"end_offset": 4,
"type": "CN_WORD",
"position": 1
}
]
}
Create a mydic.dic file under /plugins/ik/config/ and add the entry 魔兽世界 to it.
Point IKAnalyzer.cfg.xml at it as below, restart ES, then test the segmentation again:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer 扩展配置</comment>
    <entry key="ext_dict">mydic.dic</entry>
</properties>
POST localhost:9200/_analyze
{
"analyzer":"ik_smart",
"text":"魔兽世界"
}
// tokens
{
"tokens": [
{
"token": "魔兽世界",
"start_offset": 0,
"end_offset": 4,
"type": "CN_WORD",
"position": 0
}
]
}
Reference: esmapping映射管理 · Elasticsearch · 看云