In order for aggregations (or any operation that requires access to field values) to be fast, access to fielddata must be fast, which is why it is loaded into memory. But loading too much data into memory will cause slow garbage collections as the JVM tries to find extra space in the heap, or possibly even an OutOfMemory exception.
It may surprise you to find that Elasticsearch does not load into fielddata just the values for the documents that match your query. It loads the values for all documents in your index, even documents with a different _type!
The logic is: if you need access to documents X, Y, and Z for this query, you will probably need access to other documents in the next query. It is cheaper to load all values once, and to keep them in memory, than to have to scan the inverted index on every request.
The indices.fielddata.cache.size setting controls how much heap space is allocated to fielddata. When you run a query that requires access to new field values, Elasticsearch loads the values into memory and then tries to add them to fielddata. If the resulting fielddata size would exceed the specified limit, other values are evicted to make space.
By default, this setting is unbounded: Elasticsearch will never evict data from fielddata. (A real gotcha: by default, data only goes in and never comes out.)
This default was chosen deliberately: fielddata is not a transient cache (understandably; it is not meant to be short-lived). It is an in-memory data structure that must be accessible for fast execution, and it is expensive to build. If you have to reload data for every request, performance is going to be awful.
A bounded size forces the data structure to evict data. We will look at when to set this value, but first a warning:
(That is: set a value so fielddata can evict itself automatically, subject to the warning below.)
This setting is a safeguard, not a solution for insufficient memory.
If you don’t have enough memory to keep your fielddata resident in memory, Elasticsearch will constantly have to reload data from disk, and evict other data to make space. Evictions cause heavy disk I/O and generate a large amount of garbage in memory, which must be garbage collected later on. In other words, with too little memory, eviction becomes a constant cycle of removing data and loading it back in: heavy I/O pressure and piles of garbage in the heap.
Imagine that you are indexing logs, using a new index every day. Normally you are interested in data from only the last day or two. Although you keep older indices around, you seldom need to query them. However, with the default settings, the fielddata from the old indices is never evicted! Fielddata will just keep on growing until you trip the fielddata circuit breaker (see Circuit Breaker), which will prevent you from loading any more fielddata.
So it appears that if you never set this size, data just keeps flowing into fielddata, and once the circuit breaker's threshold is reached, an exception is thrown; under these defaults, Elasticsearch never evicts fielddata at all.
If we do set a size, fielddata is automatically evicted once it reaches that size, from which it follows that the size should be smaller than the circuit breaker's threshold.
At that point, you’re stuck. While you can still run queries that access fielddata from the old indices, you can’t load any new values. Instead, we should evict old values to make space for the new values.
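If you do land in that stuck state before a cache size is in place, one possible stopgap (a sketch only; the daily index name below is hypothetical) is to drop fielddata for indices you no longer query, using the standard clear-cache API:

POST /logs-2014.10.01/_cache/clear?fielddata=true

This frees the memory immediately, but the real fix is the size limit described next.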
To prevent this scenario, place an upper limit on the fielddata by adding this setting to the config/elasticsearch.yml file:
indices.fielddata.cache.size: 40%

This can be set to a percentage of the heap size, or to a concrete value such as 5gb.
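Note that indices.fielddata.cache.size is a static node setting: it is read from elasticsearch.yml at startup, so a node restart is needed for it to take effect. One way to confirm what a running node actually picked up is the nodes-info API, which returns each node's effective settings:

GET /_nodes/settings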
In Fielddata Size, we spoke about adding a limit to the size of fielddata, to ensure that old unused fielddata can be evicted. The relationship between indices.fielddata.cache.size and indices.breaker.fielddata.limit is an important one. If the circuit-breaker limit is lower than the cache size, no data will ever be evicted. In order for it to work properly, the circuit breaker limit must be higher than the cache size.
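To make the relationship concrete, here is a sketch of the two settings side by side in config/elasticsearch.yml (the values are illustrative; 60% is the documented default for the breaker, so pick numbers that suit your own heap):

indices.fielddata.cache.size: 40%
indices.breaker.fielddata.limit: 60%

Unlike the cache size, the breaker limit can also be updated dynamically on a running cluster through the cluster-settings API:

PUT /_cluster/settings
{
  "persistent" : {
    "indices.breaker.fielddata.limit" : "60%"
  }
}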
It is important to keep a close watch on how much memory is being used by fielddata, and whether any data is being evicted. High eviction counts can indicate a serious resource issue and a reason for poor performance.
Fielddata usage can be monitored:
per-index using the indices-stats API:
GET /_stats/fielddata?fields=*
per-node using the nodes-stats API:
GET /_nodes/stats/indices/fielddata?fields=*
or per-index per-node:
GET /_nodes/stats/indices/fielddata?level=indices&fields=*
By setting ?fields=*, the memory usage is broken down for each field.
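The response breaks fielddata memory down per field. Abbreviated, it looks roughly like this (the values and the field name my_field are illustrative, and the exact layout varies by version):

{
  "_all" : {
    "total" : {
      "fielddata" : {
        "memory_size_in_bytes" : 524288,
        "evictions" : 0,
        "fields" : {
          "my_field" : {
            "memory_size_in_bytes" : 524288
          }
        }
      }
    }
  }
}

Per the warning above, the two numbers to watch are memory_size_in_bytes and evictions.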
One question remains: even with a size set, what if the data a single request needs to load would still push fielddata past the breaker's limit? As noted above, the breaker then aborts the request with an exception rather than loading the data.