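Elasticsearch Optimization

Contents

1. Index Optimization

1.1 Refresh Interval

1.2 Tuning Per-Field Similarity

2. JVM Optimization

3. Query Optimization

3.1 multi_match Optimization

3.2 Filtering

3.3 Business Sorting

3.4 Avoiding Deep Pagination

3.5 boost

3.6 minimum_should_match

4. Deployment Optimization

4.1 Raising the File Handle Limit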


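1. Index Optimization

1.1 Refresh Interval

To improve indexing throughput, Elasticsearch writes lazily: new documents first go into an in-memory buffer, and every index.refresh_interval (1 second by default) a refresh writes that buffer out as a new segment in the operating system's file system cache. Only after the refresh do the documents become searchable. For indices that do not need near-real-time visibility, this interval can be raised; 20s is a reasonable starting point.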

PUT traded_item_e-local-alias/_settings
{
   "index.refresh_interval" : "20s"
}
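The same settings update can be issued from a script. The sketch below only builds the request and does not send it; the base URL http://localhost:9200 and the helper name build_refresh_request are assumptions for illustration, and sending it (for example with urllib.request.urlopen) needs a reachable cluster.

```python
import json
import urllib.request

def build_refresh_request(index, interval, base_url="http://localhost:9200"):
    """Build (but do not send) a PUT <index>/_settings request.

    base_url is an assumed local address; replace it with your cluster's.
    """
    body = json.dumps({"index.refresh_interval": interval}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

req = build_refresh_request("traded_item_e-local-alias", "20s")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```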

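1.2 Tuning Per-Field Similarity

Some text fields do not need high-quality relevance ranking, for example userId, phone, hotElement and picUrl. (This applies only to fields of type text; keyword, integer and date fields are not affected.) For such fields it is enough to know whether the analyzed search terms occur in the field; elaborate relevance scoring is unnecessary. Mapping them as keyword is not always an option, because a keyword field must match the whole value exactly. Instead, replace the field's default similarity algorithm (BM25) with boolean, which saves disk and memory and improves indexing performance. For details, see the post "Elasticsearch评分(score)及算法调节" on leadseczgw01's CSDN blog.

At one company, the fields of the traded_item_e-prod index were optimized this way (the optimized copy is traded_item_e-prod01 below). With nearly the same document count, the primary shard store size dropped to 25.1/40.9 = 61% of the original, and searches became faster as well.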

index                shard prirep state    docs  store ip           node
traded_item_e-prod   0     r      STARTED 18865 41.2mb 172.18.46.41 node-5
traded_item_e-prod   0     p      STARTED 18865 40.9mb 172.18.46.40 node-6
traded_item_e-prod01 0     r      STARTED 18868 25.1mb 172.18.55.79 node-3
traded_item_e-prod01 0     p      STARTED 18868 25.1mb 172.18.46.40 node-6
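As a rough illustration, a field mapping that opts into the boolean similarity might be built as below. The similarity mapping parameter and its built-in boolean value come from Elasticsearch; the helper name and the choice of hotElement as the example field are assumptions made here. The last lines simply recompute the store-size ratio quoted above.

```python
import json

def text_field_with_boolean_similarity(analyzer="ik_max_word"):
    """Mapping for a text field scored with the built-in boolean similarity."""
    return {"type": "text", "analyzer": analyzer, "similarity": "boolean"}

# hotElement is used only as an example field name from the list above.
mapping = {"properties": {"hotElement": text_field_with_boolean_similarity()}}
print(json.dumps(mapping, indent=2))

# Store-size ratio quoted in the text: optimized primary / original primary.
ratio = 25.1 / 40.9
print(f"{ratio:.0%}")
```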

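2. JVM Optimization

Elasticsearch is written in Java and runs on the JVM, so the JVM heap can be tuned to reduce full GCs (FGC), heap usage and young GCs (YGC) as far as possible.

Inspecting the Elasticsearch process shows the flags -Xms512m -Xmx512m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly. The minimum and maximum heap sizes are both 512 MB, the CMS garbage collector is in use, and the resident set size (RSS) is about 938 MB.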

elk      12716  0.4  5.9 4299760 960792 ?      Sl   Nov15 265:30 /usr/jdk1.8/bin/java -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=COMPAT -Xms512m -Xmx512m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/tmp/elasticsearch-4333137265122664372 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=data -XX:ErrorFile=logs/hs_err_pid%p.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -XX:+PrintGCApplicationStoppedTime -Xloggc:logs/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=32 -XX:GCLogFileSize=64m -XX:MaxDirectMemorySize=268435456 -Des.path.home=/home/elk/es/cluster/app/elasticsearch-7.5.2 -Des.path.conf=/home/elk/es/cluster/app/elasticsearch-7.5.2/config -Des.distribution.flavor=default -Des.distribution.type=tar -Des.bundled_jdk=true -cp /home/elk/es/cluster/app/elasticsearch-7.5.2/lib/* org.elasticsearch.bootstrap.Elasticsearch

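CMS is a multi-threaded old-generation collector based on the mark-sweep algorithm, and it prioritizes response time. Unlike other collectors, it runs concurrently with the application: while CMS is collecting the old generation, application threads keep running most of the time and are paused only briefly. Because GC threads and application threads run side by side, CMS consumes a fair amount of CPU.

-XX:CMSInitiatingOccupancyFraction=75 makes CMS start a collection cycle once old-generation occupancy reaches 75%. CMS leaves floating garbage behind, so the cycle is usually started early.

-XX:+UseCMSInitiatingOccupancyOnly tells the JVM to use only the configured threshold (the 75% above). Without it, the JVM honors the configured value only for the first cycle and adjusts the threshold automatically afterwards.

The heap layout is as follows.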

JVM version is 25.131-b11

using parallel threads in the new generation.
using thread-local object allocation.
Concurrent Mark-Sweep GC

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 536870912 (512.0MB)
   NewSize                  = 178913280 (170.625MB)
   MaxNewSize               = 178913280 (170.625MB)
   OldSize                  = 357957632 (341.375MB)
   NewRatio                 = 2
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 161021952 (153.5625MB)
   used     = 37167336 (35.445533752441406MB)
   free     = 123854616 (118.1169662475586MB)
   23.082154661744507% used
Eden Space:
   capacity = 143130624 (136.5MB)
   used     = 36188288 (34.5118408203125MB)
   free     = 106942336 (101.9881591796875MB)
   25.283399868360807% used
From Space:
   capacity = 17891328 (17.0625MB)
   used     = 979048 (0.9336929321289062MB)
   free     = 16912280 (16.128807067871094MB)
   5.472193008814102% used
To Space:
   capacity = 17891328 (17.0625MB)
   used     = 0 (0.0MB)
   free     = 17891328 (17.0625MB)
   0.0% used
concurrent mark-sweep generation:
   capacity = 357957632 (341.375MB)
   used     = 259770000 (247.73597717285156MB)
   free     = 98187632 (93.63902282714844MB)
   72.57004091478625% used

38658 interned Strings occupying 4858088 bytes.
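The young-generation figures in this output follow from SurvivorRatio. A small sketch, assuming the usual meaning of the flag (Eden is SurvivorRatio times one survivor space) and taking NewSize and MaxHeapSize directly from the output above:

```python
def young_gen_layout(new_size_mb, survivor_ratio):
    """Split the young generation into Eden and two survivor spaces.

    SurvivorRatio=N means Eden is N times the size of one survivor space,
    so the young generation is divided into N + 2 equal parts.
    """
    part = new_size_mb / (survivor_ratio + 2)
    return {"eden": part * survivor_ratio, "survivor": part}

layout = young_gen_layout(new_size_mb=170.625, survivor_ratio=8)
print(layout)

# NewRatio=2 means old:new is about 2:1; the old generation is the rest of the heap.
old_mb = 512.0 - 170.625
print(old_mb)
```

The results line up with the Eden Space, From Space and concurrent mark-sweep generation capacities reported above.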

Sampling heap usage once per second shows how it grows:

[root@iZwz982lz6444cwmn40t61Z ~]# jstat -gcutil 12716 1000
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
  0.00   2.46  90.48  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  90.57  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  91.68  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  91.88  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  91.89  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  92.35  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  92.40  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  93.58  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  93.70  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  93.80  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  93.91  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  94.42  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  95.22  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  95.51  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  96.57  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  96.71  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  97.42  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  97.74  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  97.93  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  98.49  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  98.96  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  0.00   2.46  99.73  58.71  88.74  79.80  16961   84.173    38    1.382   85.554
  1.92   0.00   1.18  58.71  88.74  79.80  16962   84.177    38    1.382   85.559
  1.92   0.00   1.73  58.71  88.74  79.80  16962   84.177    38    1.382   85.559
  1.92   0.00   1.73  58.71  88.74  79.80  16962   84.177    38    1.382   85.559
  1.92   0.00   1.86  58.71  88.74  79.80  16962   84.177    38    1.382   85.559

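From the samples above, YGCT is about 84 seconds and FGCT about 1 second in total. Over the first 10 samples, Eden occupancy grows from 90.48% to 93.80%, roughly 0.332 percentage points per second, so Eden fills up and triggers a YGC about every 301 seconds (5 minutes). Each YGC adds about 4 ms to YGCT (84.173 to 84.177), so the JVM pauses for roughly 4 ms every 301 seconds, which adds up to about 34.445 seconds of pause per month.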
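The estimate can be reproduced with a few lines of arithmetic. This is a sketch under the text's own approximation of a 10-second window; ten samples taken one second apart strictly span nine seconds, which would give a slightly higher fill rate and a somewhat shorter interval. The function name is hypothetical.

```python
def estimate_ygc(eden_start_pct, eden_end_pct, seconds, pause_ms):
    """Estimate YGC interval and monthly pause time from jstat samples."""
    rate = (eden_end_pct - eden_start_pct) / seconds   # percentage points per second
    interval_s = 100.0 / rate                          # time to fill Eden from empty
    collections_per_month = 30 * 24 * 3600 / interval_s
    monthly_pause_s = collections_per_month * pause_ms / 1000.0
    return rate, interval_s, monthly_pause_s

rate, interval_s, monthly_pause_s = estimate_ygc(90.48, 93.80, seconds=10, pause_ms=4)
print(f"{rate:.3f} %/s, YGC every {interval_s:.0f} s, "
      f"{monthly_pause_s:.1f} s of pause per month")
```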

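Note also that when YGC goes from 16961 to 16962, FGC does not increase, and the survivor space in use after the collection (S0) is only 1.92% full, which is low. This indicates that there is no premature promotion: most objects in Eden were simply collected. There are two ways to optimize from here:

1. To make better use of the heap, shrink the old generation, enlarge the young generation, and raise the ratio of Eden to S0/S1. A larger Eden takes longer to fill, which lowers the YGC frequency and indirectly the accumulated YGCT. (Prefer this option when memory is tight.) For details, see the post "JVM配置、监控、调优" on leadseczgw01's CSDN blog.

2. Switch the garbage collector to reduce YGCT. Elasticsearch currently uses CMS; G1 can be used instead to shorten YGCT. G1 takes full advantage of multi-CPU, multi-core hardware, using several CPUs or cores to shorten stop-the-world pauses, and it can perform concurrently some GC work for which other collectors have to pause application threads. (Prefer this option when the Elasticsearch heap is 6 GB or larger.) For details, see "关于elasticsearch使用G1垃圾回收替换CMS" on the Elastic 中文社区 (Elastic Chinese community).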
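To get a feel for option 1, the sketch below estimates how the YGC interval would change with a larger Eden. It assumes the allocation rate stays constant at the observed level; the 256 MB young generation and SurvivorRatio=10 are hypothetical values chosen for illustration, not recommendations.

```python
def ygc_interval_s(eden_mb, fill_mb_per_s):
    """Seconds needed to fill Eden at a constant allocation rate."""
    return eden_mb / fill_mb_per_s

# Observed: Eden is 136.5 MB and fills at about 0.332% of its size per second.
fill_mb_per_s = 136.5 * 0.332 / 100

current = ygc_interval_s(136.5, fill_mb_per_s)

# Hypothetical retuning: 256 MB young generation with SurvivorRatio=10,
# i.e. Eden takes 10 of 12 equal parts.
bigger_eden_mb = 256 * 10 / 12
retuned = ygc_interval_s(bigger_eden_mb, fill_mb_per_s)

print(f"current: {current:.0f} s, retuned: {retuned:.0f} s")
```

A longer interval means fewer young collections, though each one may take a little longer if more objects survive, so the effect should be confirmed with jstat after the change.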

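3. Query Optimization

3.1 multi_match Optimization

The more fields a multi_match or query_string query targets, the slower it runs. At mapping time, the copy_to parameter can copy the values of several fields into one additional field, and queries can then target that single field instead. In the product index below, productName, companyName, productCode, productStyle, material, hotElement and other fields are all copied into the default field.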

PUT /traded_item_e-local/_mapping
{
	"dynamic": "false",
	"properties": {
		"goodsClass": {
			"type": "keyword"
		},
		"goodsClassName": {
			"type": "text",
			"analyzer": "ik_max_word",
			"copy_to": "default"
		},
		"payAmount": {
			"type": "integer"
		},
		"productSales": {
			"type": "integer"
		},
		"ageRange": {
			"type": "text",
			"fields": {
				"keyword": {
					"type": "keyword",
					"ignore_above": 256
				}
			}
		},
		"itemType": {
			"type": "byte"
		},
		"productType": {
			"type": "byte"
		},
		"supplierType": {
			"type": "byte"
		},
		"supplierLevel": {
			"type": "byte"
		},
		"priceMode": {
			"type": "byte"
		},
		"showInventory": {
			"type": "byte"
		},
		"platformType": {
			"type": "byte"
		},
		"sendCycle": {
			"type": "integer"
		},
		"material": {
			"type": "text",
			"analyzer": "ik_max_word",
			"copy_to": "default"
		},
		"userId": {
			"type": "keyword"
		},
		"sellerPhone": {
			"type": "keyword"
		},
		"productSupplier": {
			"type": "byte"
		},
		"productId": {
			"type": "keyword"
		},
		"activeId": {
			"type": "keyword"
		},
		"season": {
			"type": "text",
			"fields": {
				"keyword": {
					"type": "keyword",
					"ignore_above": 256
				}
			},
			"copy_to": "default"
		},
		"model": {
			"type": "byte"
		},
		"modelName": {
			"type": "text",
			"analyzer": "ik_max_word",
			"copy_to": "default"
		},
		"productStyle": {
			"type": "text",
			"analyzer": "ik_max_word",
			"fields": {
				"keyword": {
					"type": "keyword",
					"ignore_above": 256
				}
			},
			"copy_to": "default"
		},
		"hotElement": {
			"type": "text",
			"analyzer": "ik_max_word",
			"copy_to": "default"
		},
		"productCode": {
			"type": "keyword",
			"copy_to": "default"
		},
		"supplierCode": {
			"type": "keyword",
			"copy_to": "default"
		},
		"supplierName": {
			"type": "text",
			"analyzer": "ik_max_word",
			"copy_to": "default"
		},
		"companyName": {
			"type": "text",
			"analyzer": "ik_max_word",
			"fields": {
				"keyword": {
					"type": "keyword",
					"ignore_above": 256
				}
			},
			"copy_to": "default"
		},
		"productName": {
			"type": "text",
			"analyzer": "ik_max_word",
			"fields": {
				"keyword": {
					"type": "keyword",
					"ignore_above": 256
				}
			},
			"copy_to": "default"
		},
		"province": {
			"type": "keyword",
			"copy_to": "default"
		},
		"city": {
			"type": "keyword",
			"copy_to": "default"
		},
		"activeName": {
			"type": "text",
			"analyzer": "ik_max_word",
			"fields": {
				"keyword": {
					"type": "keyword",
					"ignore_above": 256
				}
			},
			"copy_to": "default"
		},
		"inventoryLimit": {
			"type": "byte"
		},
		"endMode": {
			"type": "byte"
		},
		"supplierDisplay": {
			"type": "byte"
		},
		"isContract": {
			"type": "byte"
		},
		"isActive": {
			"type": "byte"
		},
		"companyId": {
			"type": "keyword"
		},
		"aimedNumber": {
			"type": "integer"
		},
		"itemId": {
			"type": "keyword"
		},
		"itemState": {
			"type": "byte"
		},
		"showPrice": {
			"type": "scaled_float",
			"scaling_factor": 100
		},
		"finalPrice": {
			"type": "scaled_float",
			"scaling_factor": 100
		},
		"sales": {
			"type": "integer"
		},
		"realSales": {
			"type": "integer"
		},
		"sevenSales": {
			"type": "integer"
		},
		"firstPayAmount": {
			"type": "integer"
		},
		"scanNum": {
			"type": "integer"
		},
		"concernNum": {
			"type": "integer"
		},
		"minAmount": {
			"type": "integer"
		},
		"minAimedNumber": {
			"type": "integer"
		},
		"closeDate": {
			"type": "date",
			"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
		},
		"passDate": {
			"type": "date",
			"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
		},
		"startDate": {
			"type": "date",
			"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
		},
		"endDate": {
			"type": "date",
			"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
		},
		"displayStartDate": {
			"type": "date",
			"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
		},
		"displayEndDate": {
			"type": "date",
			"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
		},
		"shelvesDate": {
			"type": "date",
			"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
		},
		"modifyDate": {
			"type": "date",
			"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
		},
		"createDate": {
			"type": "date",
			"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
		},
		"default": {
			"type": "text",
			"analyzer": "ik_max_word"
		}
	}
}
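The effect on the query side can be sketched as follows: before, every source field is listed in a multi_match; after, a single match targets the copy_to field. The helper names are hypothetical, and the field list is the one named in the text.

```python
def multi_field_query(text, fields):
    """Query every source field separately via multi_match."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}

def combined_field_query(text, field="default"):
    """Query only the single copy_to target field."""
    return {"query": {"match": {field: {"query": text}}}}

fields = ["productName", "companyName", "productCode",
          "productStyle", "material", "hotElement"]
before = multi_field_query("泡泡袖", fields)
after = combined_field_query("泡泡袖")
print(before)
print(after)
```

The two queries are not score-equivalent: the combined field is scored as one field, so per-field weighting is lost unless it is added back separately.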

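3.2 Filtering

When a condition can be expressed either as a scoring match or as a filter, prefer the filter. Filter clauses skip relevance scoring, whereas every additional match clause adds scoring work and slows the search down.

Elasticsearch also provides a filter cache: it caches the results of some filter clauses, and later queries that use the same filter condition can reuse the cached result, which speeds up queries in some scenarios.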
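A minimal sketch of this split, using the default and itemState fields that appear elsewhere in this article; build_item_query is a hypothetical helper. The full-text condition stays in must so that it is scored, while the exact-value condition goes into filter.

```python
def build_item_query(keyword, item_state):
    """Full-text part goes in must (scored); exact conditions go in filter."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"default": {"query": keyword}}}],
                "filter": [{"term": {"itemState": {"value": item_state}}}],
            }
        }
    }

query = build_item_query("泡泡袖", 6)
print(query)
```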

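3.3 Business Sorting

When the business requires a specific sort order for search results, put a sort on the relevance score (_score) ahead of the business sort fields; otherwise the results can be very disappointing.

For example, in traded_item_e-test02 we searched for 泡泡袖 (puff sleeves) among items whose display start time is earlier than xx, whose display end time is later than xx or empty, and whose state is 6, sorted by payAmount descending and then shelvesDate descending. Some items whose titles contain 泡泡袖 ended up behind items whose titles do not, which is not what users want. This is where the search score _score comes in, along with a few other optimizations.

POST /traded_item_e-test02/_search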
{
	"from": 0,
	"size": 10,
	"query": {
		"bool": {
			"must": [{
				"match": {
					"default": {
						"query": "泡泡袖",
						"operator": "OR",
						"prefix_length": 0,
						"max_expansions": 50,
						"fuzzy_transpositions": true,
						"lenient": false,
						"zero_terms_query": "NONE",
						"auto_generate_synonyms_phrase_query": true,
						"boost": 1.0
					}
				}
			}],
			"filter": [{
				"range": {
					"displayStartDate": {
						"from": null,
						"to": "1640274097447",
						"include_lower": true,
						"include_upper": true,
						"boost": 1.0
					}
				}
			}, {
				"bool": {
					"should": [{
						"range": {
							"displayEndDate": {
								"from": "1640274097447",
								"to": null,
								"include_lower": true,
								"include_upper": true,
								"boost": 1.0
							}
						}
					}, {
						"bool": {
							"must_not": [{
								"exists": {
									"field": "displayEndDate",
									"boost": 1.0
								}
							}],
							"adjust_pure_negative": true,
							"boost": 1.0
						}
					}],
					"adjust_pure_negative": true,
					"boost": 1.0
				}
			}, {
				"term": {
					"itemState": {
						"value": "6",
						"boost": 1.0
					}
				}
			}],
			"adjust_pure_negative": true,
			"boost": 1.0
		}
	},
	"_source": {
		"includes": [],
		"excludes": []
	},
	"sort": [{
		"payAmount": {
			"order": "desc"
		}
	}, {
		"shelvesDate": {
			"order": "desc"
		}
	}]
}

Optimization: add _score in descending order as the first sort key, as follows.

POST /traded_item_e-test02/_search
{
	"from": 0,
	"size": 10,
	"query": {
		"bool": {
			"must": [{
				"match": {
					"default": {
						"query": "泡泡袖",
						"operator": "OR",
						"prefix_length": 0,
						"max_expansions": 50,
						"fuzzy_transpositions": true,
						"lenient": false,
						"zero_terms_query": "NONE",
						"auto_generate_synonyms_phrase_query": true,
						"boost": 1.0
					}
				}
			}],
			"filter": [{
				"range": {
					"displayStartDate": {
						"from": null,
						"to": "1640274097447",
						"include_lower": true,
						"include_upper": true,
						"boost": 1.0
					}
				}
			}, {
				"bool": {
					"should": [{
						"range": {
							"displayEndDate": {
								"from": "1640274097447",
								"to": null,
								"include_lower": true,
								"include_upper": true,
								"boost": 1.0
							}
						}
					}, {
						"bool": {
							"must_not": [{
								"exists": {
									"field": "displayEndDate",
									"boost": 1.0
								}
							}],
							"adjust_pure_negative": true,
							"boost": 1.0
						}
					}],
					"adjust_pure_negative": true,
					"boost": 1.0
				}
			}, {
				"term": {
					"itemState": {
						"value": "6",
						"boost": 1.0
					}
				}
			}],
			"adjust_pure_negative": true,
			"boost": 1.0
		}
	},
	"_source": {
		"includes": [],
		"excludes": []
	},
	"sort": [{
		"_score": {
			"order": "desc"
		}
	},{
		"payAmount": {
			"order": "desc"
		}
	}, {
		"shelvesDate": {
			"order": "desc"
		}
	}]
}
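The difference between the two sort orders can be simulated locally. The three documents below are hypothetical and only illustrate the ordering logic; they are not taken from the index.

```python
docs = [
    {"name": "puff-sleeve dress", "score": 5.2, "payAmount": 3},
    {"name": "plain sleeve top",  "score": 1.1, "payAmount": 90},
    {"name": "puff-sleeve shirt", "score": 5.2, "payAmount": 40},
]

def business_only(hits):
    """Sort by payAmount descending, ignoring relevance."""
    return sorted(hits, key=lambda d: -d["payAmount"])

def score_then_business(hits):
    """Sort by score descending, then payAmount descending as a tie-breaker."""
    return sorted(hits, key=lambda d: (-d["score"], -d["payAmount"]))

print([d["name"] for d in business_only(docs)])
print([d["name"] for d in score_then_business(docs)])
```

With _score first, the business fields only decide the order among documents with equal scores, so the weakly matching item no longer jumps to the top because of its payAmount.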

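3.4 Avoiding Deep Pagination

By default Elasticsearch only allows paging through the first 10,000 sorted results, and requests for pages far down the ranking are usually slow. search_after is a lighter-weight alternative: if each page needs only 10 results, each shard only has to return the 10 results that come after the search_after values. The amount of data returned depends only on the number of shards and the page size, not on how many results have already been read.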

# search_after query example.
# The two search_after values are the sort values (score, then id) of the
# last document returned by the previous response.
curl -XGET "http://localhost:9200/twitter/_search" -H 'Content-Type: application/json' \
-d'{
    "size": 10,
    "query": {
        "match": {
            "message": "Elasticsearch"
        }
    },
    "sort": [
        {
            "_score": {
                "order": "desc"
            }
        },
        {
            "_id": {
                "order": "asc"
            }
        }
    ],
    "search_after": [
        0.84290016,
        "1024"
    ]
}'
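In application code, the next page's request body is usually derived from the last hit of the previous page. A minimal sketch, assuming hits carry a sort array as they do when the request specifies a sort; next_page_body and the sample hit are hypothetical.

```python
import copy

def next_page_body(base_body, last_hit):
    """Return a copy of base_body that continues after last_hit.

    last_hit is the final hit of the previous page; its "sort" array holds
    one value per sort key, in the same order as base_body["sort"].
    """
    body = copy.deepcopy(base_body)
    body.pop("from", None)  # drop any offset; paging is driven by search_after
    body["search_after"] = last_hit["sort"]
    return body

base = {
    "size": 10,
    "query": {"match": {"message": "Elasticsearch"}},
    "sort": [{"_score": {"order": "desc"}}, {"_id": {"order": "asc"}}],
}
# Hypothetical last hit of the previous page.
last_hit = {"_id": "1024", "_score": 0.84290016, "sort": [0.84290016, "1024"]}
page2 = next_page_body(base, last_hit)
print(page2)
```

The sort should end with a key that is unique per document (here _id) so that no hits are skipped or repeated between pages.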

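3.5 boost

In full-text search, the fields that matter most for display can be given a higher weight. In the query below, which searches the default field for 娃娃领 (Peter Pan collar), the key display field productName is given a boost of 2 (the default is 1).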

{
	"from": 0,
	"size": 10,
	"query": {
		"bool": {
			"must": [{
				"match": {
					"default": {
						"query": "娃娃领",
						"operator": "OR",
						"prefix_length": 0,
						"max_expansions": 50,
						"fuzzy_transpositions": true,
						"lenient": false,
						"zero_terms_query": "NONE",
						"auto_generate_synonyms_phrase_query": true,
						"boost": 1.0
					}
				}
			}, {
				"match": {
					"productName": {
						"query": "娃娃领",
						"operator": "OR",
						"prefix_length": 0,
						"max_expansions": 50,
						"fuzzy_transpositions": true,
						"lenient": false,
						"zero_terms_query": "NONE",
						"auto_generate_synonyms_phrase_query": true,
						"boost": 2.0
					}
				}
			}],
			"filter": [{
				"range": {
					"displayStartDate": {
						"from": null,
						"to": "1641289904822",
						"include_lower": true,
						"include_upper": true,
						"boost": 1.0
					}
				}
			}, {
				"bool": {
					"should": [{
						"range": {
							"displayEndDate": {
								"from": "1641289904822",
								"to": null,
								"include_lower": true,
								"include_upper": true,
								"boost": 1.0
							}
						}
					}, {
						"bool": {
							"must_not": [{
								"exists": {
									"field": "displayEndDate",
									"boost": 1.0
								}
							}],
							"adjust_pure_negative": true,
							"boost": 1.0
						}
					}],
					"adjust_pure_negative": true,
					"boost": 1.0
				}
			}, {
				"term": {
					"itemState": {
						"value": "6",
						"boost": 1.0
					}
				}
			}],
			"adjust_pure_negative": true,
			"boost": 1.0
		}
	},
	"_source": {
		"includes": ["platformType", "productId", "itemId", "productName", "firstPayAmount", "scanNum", "priceMode", "showPrice", "finalPrice", "itemState", "briefPath", "introducePath", "finePaths", "goodsClass", "goodsClassName"],
		"excludes": []
	},
	"sort": [{
		"_score": {
			"order": "desc"
		}
	}, {
		"payAmount": {
			"order": "desc"
		}
	}, {
		"shelvesDate": {
			"order": "desc"
		}
	}]
}
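Most options in the query above (prefix_length, max_expansions and so on) look like default values written out by the client, so the part that carries the boost is small. A minimal sketch, assuming those options can be left at their defaults; boosted_match is a hypothetical helper.

```python
def boosted_match(field, text, boost=1.0):
    """Build a match clause, adding boost only when it differs from 1."""
    clause = {"query": text}
    if boost != 1.0:
        clause["boost"] = boost
    return {"match": {field: clause}}

must = [
    boosted_match("default", "娃娃领"),
    boosted_match("productName", "娃娃领", boost=2.0),
]
print(must)
```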

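3.6 minimum_should_match

Understanding minimum_should_match requires a little background on inverted indexes, because Elasticsearch searches by matching terms. Take the product name "【JL】2191#韩国chic撞色娃娃领喇叭袖菱格纹短款娃娃连衣裙LL 12.15". A database only has to store this string as is. Elasticsearch, in order to build its inverted index, must first analyze it into terms; it stores not just the full string (the doc) but also the terms and the mapping between terms and docs. When Elasticsearch searches for 【娃娃 短款 连衣裙 韩国】, it analyzes the search text as well and then looks up the documents that match the resulting terms. The question is how many of the search terms a document must match to count as a hit. That number is minimum_should_match (in version 7.5.2 the default is 1).

The _termvectors API shows which terms a document field was analyzed into. For example, to list the terms of the productName field (【JL】2191#韩国chic撞色娃娃领喇叭袖菱格纹短款娃娃连衣裙LL 12.15) of the document with id 100302016395366502040000 in the traded_item_e-prod01 index, send a GET request to Elasticsearch:

http://www.ymhcnet.com:9200/traded_item_e-prod01/_doc/100302016395366502040000/_termvectors?fields=productName

The result: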

{
    "_index": "traded_item_e-prod01",
    "_type": "_doc",
    "_id": "100302016395366502040000",
    "_version": 14,
    "found": true,
    "took": 1,
    "term_vectors": {
        "productName": {
            "field_statistics": {
                "sum_doc_freq": 449137,
                "doc_count": 25711,
                "sum_ttf": 454805
            },
            "terms": {
                "12.15": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 19,
                            "start_offset": 36,
                            "end_offset": 41
                        }
                    ]
                },
                "2191": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 2,
                            "start_offset": 4,
                            "end_offset": 8
                        }
                    ]
                },
                "2191#": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 1,
                            "start_offset": 4,
                            "end_offset": 9
                        }
                    ]
                },
                "chic": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 4,
                            "start_offset": 11,
                            "end_offset": 15
                        }
                    ]
                },
                "jl": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 0,
                            "start_offset": 1,
                            "end_offset": 3
                        }
                    ]
                },
                "ll": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 18,
                            "start_offset": 33,
                            "end_offset": 35
                        }
                    ]
                },
                "喇叭": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 9,
                            "start_offset": 20,
                            "end_offset": 22
                        }
                    ]
                },
                "娃娃": {
                    "term_freq": 2,
                    "tokens": [
                        {
                            "position": 7,
                            "start_offset": 17,
                            "end_offset": 19
                        },
                        {
                            "position": 15,
                            "start_offset": 28,
                            "end_offset": 30
                        }
                    ]
                },
                "撞": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 5,
                            "start_offset": 15,
                            "end_offset": 16
                        }
                    ]
                },
                "格": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 12,
                            "start_offset": 24,
                            "end_offset": 25
                        }
                    ]
                },
                "短款": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 14,
                            "start_offset": 26,
                            "end_offset": 28
                        }
                    ]
                },
                "纹": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 13,
                            "start_offset": 25,
                            "end_offset": 26
                        }
                    ]
                },
                "色": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 6,
                            "start_offset": 16,
                            "end_offset": 17
                        }
                    ]
                },
                "菱": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 11,
                            "start_offset": 23,
                            "end_offset": 24
                        }
                    ]
                },
                "衣裙": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 17,
                            "start_offset": 31,
                            "end_offset": 33
                        }
                    ]
                },
                "袖": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 10,
                            "start_offset": 22,
                            "end_offset": 23
                        }
                    ]
                },
                "连衣裙": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 16,
                            "start_offset": 30,
                            "end_offset": 33
                        }
                    ]
                },
                "韩国": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 3,
                            "start_offset": 9,
                            "end_offset": 11
                        }
                    ]
                },
                "领": {
                    "term_freq": 1,
                    "tokens": [
                        {
                            "position": 8,
                            "start_offset": 19,
                            "end_offset": 20
                        }
                    ]
                }
            }
        }
    }
}

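Viewing the response in the HiJson tool shows that "【JL】2191#韩国chic撞色娃娃领喇叭袖菱格纹短款娃娃连衣裙LL 12.15" was analyzed into 19 distinct terms (20 tokens in total, since 娃娃 occurs twice).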
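Counting terms by eye in a long response is error-prone, so it can help to count them programmatically. The sketch below works on the term_vectors structure shown above; count_terms is a hypothetical helper, and the sample is an abbreviated stand-in with only three of the terms.

```python
def count_terms(termvectors_response, field):
    """Return (distinct terms, total tokens) for one field of a _termvectors response."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    distinct = len(terms)
    tokens = sum(info["term_freq"] for info in terms.values())
    return distinct, tokens

# Abbreviated sample in the shape of the response above (three of its terms).
sample = {
    "term_vectors": {
        "productName": {
            "terms": {
                "娃娃": {"term_freq": 2},
                "连衣裙": {"term_freq": 1},
                "衣裙": {"term_freq": 1},
            }
        }
    }
}
print(count_terms(sample, "productName"))
```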


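Next, use the terms 【娃娃 短款 连衣裙 韩国】 as the search text on productName, sort by score ascending (to see the lowest-scoring match and how few terms it matched), leave minimum_should_match at its default, and run the search.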

{
	"from": 0,
	"size": 1,
	"query": {
		"bool": {
			"must": [{
				"match": {
					"productName": {
						"query": "娃娃 短款 连衣裙 韩国",
						"operator": "OR",
						"prefix_length": 0,
						"max_expansions": 50,
						"fuzzy_transpositions": true,
						"lenient": false,
						"zero_terms_query": "NONE",
						"auto_generate_synonyms_phrase_query": true,
						"boost": 1.0
					}
				}
			}],
			"filter": [{
				"range": {
					"displayStartDate": {
						"from": null,
						"to": "1641289904822",
						"include_lower": true,
						"include_upper": true,
						"boost": 1.0
					}
				}
			}, {
				"bool": {
					"should": [{
						"range": {
							"displayEndDate": {
								"from": "1641289904822",
								"to": null,
								"include_lower": true,
								"include_upper": true,
								"boost": 1.0
							}
						}
					}, {
						"bool": {
							"must_not": [{
								"exists": {
									"field": "displayEndDate",
									"boost": 1.0
								}
							}],
							"adjust_pure_negative": true,
							"boost": 1.0
						}
					}],
					"adjust_pure_negative": true,
					"boost": 1.0
				}
			}, {
				"term": {
					"itemState": {
						"value": "6",
						"boost": 1.0
					}
				}
			}],
			"adjust_pure_negative": true,
			"boost": 1.0
		}
	},
	"_source": {
		"includes": ["platformType", "productId", "itemId", "productName", "firstPayAmount", "scanNum", "priceMode", "showPrice", "finalPrice", "itemState", "briefPath", "introducePath", "finePaths", "goodsClass", "goodsClassName"],
		"excludes": []
	},
	"sort": [{
		"_score": {
			"order": "asc"
		}
	}, {
		"payAmount": {
			"order": "desc"
		}
	}, {
		"shelvesDate": {
			"order": "desc"
		}
	}]
}

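The result is shown below. The lowest-scoring match has the productName "【XX】4196#加绒加厚双层帽连帽卫衣裙女长款宽松慵懒裙潮咸甜zz12.25". At first glance it matches none of the search keywords, but analyzing this productName yields terms such as 【衣裙】 and 【连】, so it does in fact match the keyword 【连衣裙】. From the user's point of view this is not a result they want to see, and raising minimum_should_match is the way to exclude it.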

{
    "took": 5,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3731,
            "relation": "eq"
        },
        "max_score": null,
        "hits": [
            {
                "_index": "traded_item_e-prod01",
                "_type": "_doc",
                "_id": "100125016404174365510000",
                "_score": 1.6476705,
                "_source": {
                    "introducePath": "ymhc-studio/020303015619455933670000/2/16404174235702zaH.jpg,ymhc-studio/020303015619455933670000/2/1640417423570xs58.jpg,ymhc-studio/020303015619455933670000/2/1640417423570JESh.jpg,ymhc-studio/020303015619455933670000/2/1640417423570Nta7.jpg,ymhc-studio/020303015619455933670000/2/1640417423570H623.jpg,ymhc-studio/020303015619455933670000/2/1640417423570wixX.jpg,ymhc-studio/020303015619455933670000/2/16404174235703GFz.jpg,ymhc-studio/020303015619455933670000/2/1640417423570hi2S.jpg,ymhc-studio/020303015619455933670000/2/1640417423571WSAn.jpg,ymhc-studio/020303015619455933670000/2/1640417423571Myra.jpg,ymhc-studio/020303015619455933670000/2/1640417423571BweH.jpg,ymhc-studio/020303015619455933670000/2/1640417423571sXC3.jpg,ymhc-studio/020303015619455933670000/2/1640417423571zQzf.jpg,ymhc-studio/020303015619455933670000/2/1640417423571FaNQ.jpg,ymhc-studio/020303015619455933670000/2/1640417423571rK2F.jpg,ymhc-studio/020303015619455933670000/2/16404174235733WWP.jpg,ymhc-studio/020303015619455933670000/2/1640417423573XSGi.jpg,ymhc-studio/020303015619455933670000/2/1640417423573HpDR.jpg,ymhc-studio/020303015619455933670000/2/1640417423573wfwd.jpg,ymhc-studio/020303015619455933670000/2/1640417423573yJCA.jpg,ymhc-studio/020303015619455933670000/2/1640417423573wk2C.jpg,ymhc-studio/020303015619455933670000/2/1640417423574pdnJ.jpg",
                    "productId": "090125016404174329640000",
                    "priceMode": 3,
                    "platformType": 1,
                    "goodsClass": [
                        "110000",
                        "1",
                        "112"
                    ],
                    "itemState": 6,
                    "finalPrice": 39.0,
                    "goodsClassName": "女装,上装,长款卫衣",
                    "productName": "【XX】4196#加绒加厚双层帽连帽卫衣裙女长款宽松慵懒裙潮咸甜zz12.25",
                    "finePaths": "ymhc-studio/020303015619455933670000/2/16404174180716b2J.jpg,ymhc-studio/020303015619455933670000/2/1640417418071z66s.jpg,ymhc-studio/020303015619455933670000/2/1640417418071pcwF.jpg,ymhc-studio/020303015619455933670000/2/1640417418072fnZr.jpg,ymhc-studio/020303015619455933670000/2/1640417418072EaEs.jpg",
                    "scanNum": 0,
                    "itemId": "100125016404174365510000",
                    "firstPayAmount": 46,
                    "showPrice": 39.0,
                    "briefPath": "ymhc-studio/020303015619455933670000/2/16404174180716b2J.jpg"
                },
                "sort": [
                    1.6476705,
                    0,
                    1640446506000
                ]
            }
        ]
    }
}


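Searching the product index for 【娃娃 短款 连衣裙 韩国】 with different minimum_should_match values gives the results below. With minimum_should_match set to 1, the most documents match and the query is relatively slow; with 5, the fewest documents match and the query is fastest. In practice the value should be chosen based on the business data and user needs. One good setting is "minimum_should_match":"2<50%": when the query has 2 terms or fewer, all of them must match; when it has more than 2, 50% of them must match. Note that because of analysis the number of terms is usually much larger than the number of keywords the user typed. Even the single keyword 【连衣裙】 becomes 2 terms (连衣裙 and 衣裙) after analysis with IK's ik_max_word.

minimum_should_match  matched docs  time
1                     3734          10ms
2                     2471          12ms
3                     166           6ms
4                     7             5ms
5                     1             4ms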
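4. Deployment Optimization

4.1 Raising the File Handle Limit

Lucene uses a large number of files, and Elasticsearch also uses many sockets for communication between nodes and with HTTP clients. All of this requires enough file descriptors (also referred to as file handles on Linux).

Check the number of file handles:

cat /proc/sys/fs/file-nr

To raise the limit, edit /etc/security/limits.conf. For reference, see the post "Linux:如何获取打开文件和文件描述符数量" on 赵民勇's CSDN blog.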
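The three numbers in /proc/sys/fs/file-nr can be turned into a readable summary with a short script. A minimal sketch; parse_file_nr is a hypothetical helper, and the sample string stands in for the real file content so the example runs on any platform.

```python
def parse_file_nr(text):
    """Parse the three counters in /proc/sys/fs/file-nr.

    Fields: allocated file handles, allocated-but-unused handles,
    and the system-wide maximum.
    """
    allocated, unused, maximum = (int(x) for x in text.split())
    return {"allocated": allocated, "unused": unused, "max": maximum}

# Hypothetical sample content; on Linux read the real file instead:
#   with open("/proc/sys/fs/file-nr") as f: text = f.read()
sample = "2112\t0\t791284\n"
stats = parse_file_nr(sample)
print(stats, f"{stats['allocated'] / stats['max']:.2%} in use")
```

These counters are system-wide. The per-process limit configured through limits.conf (the nofile entry) is a separate value and can be checked with ulimit -n as the user that runs Elasticsearch.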
