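A term vector exposes statistics about each term (an indivisible token) in a given field of a document. These statistics fall into three groups: term information (the term's frequency in the field, plus its positions, offsets, and payloads), term statistics (ttf and doc_freq), and field statistics (doc_count, sum_doc_freq, and sum_ttf).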
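The Elasticsearch documentation notes that term statistics and field statistics are not exact: the counts do not account for deletions, so documents that have already been deleted may still be included. This is because, on receiving a delete request, Elasticsearch only marks the document as deleted and does not remove the data immediately.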
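The effect of mark-only deletes on the counts can be illustrated with a toy model. This is a minimal sketch, not Elasticsearch code: the ToySegment class and its methods are hypothetical names invented for illustration, and the merge step is an assumption standing in for the point at which deleted data is physically removed.

```python
class ToySegment:
    """Toy model of an index segment (hypothetical): deletes only flip a flag."""

    def __init__(self):
        self.docs = {}        # doc_id -> list of terms
        self.deleted = set()  # ids marked as deleted; their data is still present

    def index(self, doc_id, text):
        self.docs[doc_id] = text.lower().split()

    def delete(self, doc_id):
        # Mark only: the stored terms are left untouched.
        self.deleted.add(doc_id)

    def doc_freq(self, term):
        # Counts every stored document, including ones marked as deleted.
        return sum(1 for terms in self.docs.values() if term in terms)

    def merge(self):
        # Rewrites the segment without the documents marked as deleted.
        for doc_id in self.deleted:
            del self.docs[doc_id]
        self.deleted.clear()


segment = ToySegment()
segment.index(1, "hello test test test")
segment.index(2, "other hello test ...")
segment.delete(2)
before_merge = segment.doc_freq("hello")  # document 2 is still counted
segment.merge()
after_merge = segment.doc_freq("hello")   # document 2 is gone now
print(before_merge, after_merge)
```

Until the deleted data is actually dropped, the statistic keeps reporting the old count, which is why the numbers should be treated as approximate.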
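Term vectors are rarely needed in practice; they are mostly used for data exploration. For example, Meituan can look up the words its customers search for most often and use them for search suggestions and term recommendations.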
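Term vectors involve many term-level and field-level statistics, and Elasticsearch offers two ways to collect them: at index time, by setting term_vector in the field mapping so that the data is stored with the index, or at query time, where the data is computed on the fly when it is requested.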
For example, compare the mappings of my_text and fullname carefully:
PUT index_name
{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"analysis": {
"analyzer": {
"fulltext_analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filter": [
"lowercase",
"type_as_payload"
]
}
}
}
},
"mappings": {
"properties": {
"my_text": { # index time
"type": "text",
"term_vector": "with_positions_offsets_payloads", # term_vector also accepts values such as no, yes, with_offsets, and with_positions
"store": true,
"analyzer": "fulltext_analyzer"
},
"fullname": { # query time
"type": "text",
"analyzer": "fulltext_analyzer"
}
}
}
}
Test data:
POST /index_name/_doc/1
{
"fullname" : "Kerwin Kim",
"my_text" : "hello test test test "
}
PUT /index_name/_doc/2
{
"fullname" : "Kerwin Kim",
"my_text" : "other hello test ..."
}
Use the termvectors API to inspect the term vector statistics of a single document.
GET /index_name/_termvectors/1
{
"fields" : ["my_text"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
Result:
{
"_index" : "index_name",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"found" : true,
"took" : 2,
"term_vectors" : {
"my_text" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 8
},
"terms" : {
"hello" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 5,
"payload" : "d29yZA=="
}
]
},
"test" : {
"doc_freq" : 2,
"ttf" : 4,
"term_freq" : 3,
"tokens" : [
{
"position" : 1,
"start_offset" : 6,
"end_offset" : 10,
"payload" : "d29yZA=="
},
{
"position" : 2,
"start_offset" : 11,
"end_offset" : 15,
"payload" : "d29yZA=="
},
{
"position" : 3,
"start_offset" : 16,
"end_offset" : 20,
"payload" : "d29yZA=="
}
]
}
}
}
}
}
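The position and offset values in this response can be reproduced by hand. The following is a minimal sketch that assumes the whitespace tokenizer simply splits on whitespace and that the lowercase filter only changes case; the function name whitespace_lowercase_tokens is hypothetical. It also decodes the payload, which is Base64-encoded in the response.

```python
import base64


def whitespace_lowercase_tokens(text):
    """Split on whitespace, lowercase, and record position and character offsets."""
    tokens = []
    start = None
    for i, ch in enumerate(text + " "):  # trailing sentinel closes the last token
        if ch.isspace():
            if start is not None:
                tokens.append({
                    "term": text[start:i].lower(),
                    "position": len(tokens),
                    "start_offset": start,
                    "end_offset": i,
                })
                start = None
        elif start is None:
            start = i
    return tokens


tokens = whitespace_lowercase_tokens("hello test test test ")
test_tokens = [t for t in tokens if t["term"] == "test"]
term_freq = len(test_tokens)

# The payload in the response is Base64; decoding it yields the token type.
payload = base64.b64decode("d29yZA==").decode("utf-8")

for t in tokens:
    print(t)
print(term_freq, payload)
```

The three test tokens come out at positions 1, 2, 3 with offsets 6-10, 11-15, and 16-20, and the payload decodes to the string word, which is the token type that type_as_payload stored for each token.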
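In real projects, looking at the term vectors of a single stored document is too narrow. More often we want statistics for a handful of terms across the whole index.
To do that, pass an artificial document in "doc" that contains the terms to inspect.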
GET /index_name/_termvectors
{
"doc": {
"fullname": "Kerwin Kim",
"my_text": "hello test"
},
"fields" : ["my_text", "fullname"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
Result:
{
"_index" : "index_name",
"_type" : "_doc",
"_version" : 0,
"found" : true,
"took" : 8,
"term_vectors" : {
"fullname" : {
"field_statistics" : {
"sum_doc_freq" : 4,
"doc_count" : 2,
"sum_ttf" : 4
},
"terms" : {
"kerwin" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 6
}
]
},
"kim" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 7,
"end_offset" : 10
}
]
}
}
},
"my_text" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 8
},
"terms" : {
"hello" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 5
}
]
},
"test" : {
"doc_freq" : 2,
"ttf" : 4,
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 6,
"end_offset" : 10
}
]
}
}
}
}
}
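The numbers in this response follow directly from the two indexed test documents. The sketch below recomputes them locally, assuming whitespace splitting plus lowercasing as a stand-in for fulltext_analyzer; the helper name analyze is hypothetical. Note that term_freq comes from the artificial document, while doc_freq, ttf, and the field statistics come from the index.

```python
from collections import Counter

indexed_docs = ["hello test test test ", "other hello test ..."]  # my_text of documents 1 and 2
artificial_doc = "hello test"                                     # my_text passed in "doc"


def analyze(text):
    # Stand-in for fulltext_analyzer: whitespace split plus lowercase.
    return text.lower().split()


ttf = Counter()       # total term frequency across all documents
doc_freq = Counter()  # number of documents containing the term
for text in indexed_docs:
    terms = analyze(text)
    ttf.update(terms)
    doc_freq.update(set(terms))

field_statistics = {
    "doc_count": len(indexed_docs),          # documents that have this field
    "sum_doc_freq": sum(doc_freq.values()),  # sum of doc_freq over all terms
    "sum_ttf": sum(ttf.values()),            # sum of ttf over all terms
}

term_freq = Counter(analyze(artificial_doc))  # frequency inside the artificial document
for term in sorted(term_freq):
    print(term, doc_freq[term], ttf[term], term_freq[term])
print(field_statistics)
```

This reproduces sum_doc_freq 6, doc_count 2, and sum_ttf 8 for my_text, and it shows why test has a ttf of 4 but a term_freq of only 1 in this request.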
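If the terms in doc should not be analyzed with the analyzer configured when the index was created, use per_field_analyzer to specify an analyzer for each field in doc.
In the request below, my_text is analyzed with the "english" analyzer instead of the "fulltext_analyzer" from the index mapping, and fullname with "standard". (The english analyzer stems words, so testing becomes test.)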
GET /index_name/_termvectors
{
"doc": {
"fullname": "Kerwin Kim",
"my_text": "hello testing"
},
"fields" : ["my_text", "fullname"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true,
"per_field_analyzer": {
"my_text": "english",
"fullname": "standard"
}
}
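After exploring an index, not everything in the result is data we want, and Elasticsearch can filter the returned terms for us.
The request below uses the filter parameter with three options: max_doc_freq (ignore terms that appear in more than this many documents), min_term_freq (ignore terms that occur fewer than this many times in the source document), and max_num_terms (the maximum number of terms returned per field).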
GET /index_name/_termvectors
{
"doc": {
"fullname": "Kerwin Kim",
"my_text": "hello test"
},
"fields" : ["my_text", "fullname"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true,
"per_field_analyzer": {
"my_text": "english"
},
"filter": {
"max_doc_freq": 1,
"min_term_freq": 2,
"max_num_terms": 3
}
}
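To see what these thresholds do to the artificial document above, the filter can be approximated locally. This is a simplified sketch under stated assumptions: filter_terms is a hypothetical helper, and ordering the surviving terms by term_freq is only a stand-in for the relevance score Elasticsearch uses to pick the top terms.

```python
def filter_terms(terms, max_doc_freq=None, min_term_freq=None, max_num_terms=None):
    """Apply the threshold checks, then keep at most max_num_terms terms."""
    kept = []
    for term, stats in terms.items():
        if min_term_freq is not None and stats["term_freq"] < min_term_freq:
            continue  # too rare inside the source document
        if max_doc_freq is not None and stats["doc_freq"] > max_doc_freq:
            continue  # appears in too many documents
        kept.append(term)
    # Assumption: rank by term_freq as a stand-in for the real scoring.
    kept.sort(key=lambda t: terms[t]["term_freq"], reverse=True)
    return kept[:max_num_terms]  # a None limit keeps everything


# Statistics for my_text of the artificial document "hello test".
my_text_terms = {
    "hello": {"term_freq": 1, "doc_freq": 2},
    "test": {"term_freq": 1, "doc_freq": 2},
}

strict = filter_terms(my_text_terms, max_doc_freq=1, min_term_freq=2, max_num_terms=3)
relaxed = filter_terms(my_text_terms, max_doc_freq=2, min_term_freq=1, max_num_terms=3)
print(strict, relaxed)
```

With the values from the request, both terms fail the thresholds (each occurs once in the document and appears in two indexed documents), so under this approximation no terms would survive for my_text; relaxing the limits to max_doc_freq 2 and min_term_freq 1 keeps both.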
The multi termvectors API inspects several documents in a single request; it can be seen as a complement to Section 3.1.
GET _mtermvectors
{
"docs": [
{
"_index": "index_name",
"_id": 1,
"term_statistics": true,
"offsets": false
},
{
"_index": "index_name",
"_id": 2,
"fields": [
"my_text"
],
"offsets": true
}
]
}
Result:
{
"docs" : [
{
"_index" : "index_name",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"found" : true,
"took" : 0,
"term_vectors" : {
"my_text" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 8
},
"terms" : {
"hello" : {
"doc_freq" : 2,
"ttf" : 2,
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"payload" : "d29yZA=="
}
]
},
"test" : {
"doc_freq" : 2,
"ttf" : 4,
"term_freq" : 3,
"tokens" : [
{
"position" : 1,
"payload" : "d29yZA=="
},
{
"position" : 2,
"payload" : "d29yZA=="
},
{
"position" : 3,
"payload" : "d29yZA=="
}
]
}
}
}
}
},
{
"_index" : "index_name",
"_type" : "_doc",
"_id" : "2",
"_version" : 1,
"found" : true,
"took" : 0,
"term_vectors" : {
"my_text" : {
"field_statistics" : {
"sum_doc_freq" : 6,
"doc_count" : 2,
"sum_ttf" : 8
},
"terms" : {
"..." : {
"term_freq" : 1,
"tokens" : [
{
"position" : 3,
"start_offset" : 17,
"end_offset" : 20,
"payload" : "d29yZA=="
}
]
},
"hello" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 1,
"start_offset" : 6,
"end_offset" : 11,
"payload" : "d29yZA=="
}
]
},
"other" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 0,
"start_offset" : 0,
"end_offset" : 5,
"payload" : "d29yZA=="
}
]
},
"test" : {
"term_freq" : 1,
"tokens" : [
{
"position" : 2,
"start_offset" : 12,
"end_offset" : 16,
"payload" : "d29yZA=="
}
]
}
}
}
}
}
]
}
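A response with this shape is easy to post-process on the client side. The sketch below walks the docs array and collects term_freq per document and field; the function name term_freq_by_doc is hypothetical, and the sample dictionary is a trimmed copy of the response above rather than live output.

```python
def term_freq_by_doc(response):
    """Map each document id to {field: {term: term_freq}}."""
    summary = {}
    for doc in response["docs"]:
        fields = {}
        for field, vector in doc.get("term_vectors", {}).items():
            fields[field] = {
                term: info["term_freq"] for term, info in vector["terms"].items()
            }
        summary[doc["_id"]] = fields
    return summary


# Trimmed copy of the _mtermvectors response shown above.
sample = {
    "docs": [
        {"_id": "1", "found": True, "term_vectors": {"my_text": {"terms": {
            "hello": {"term_freq": 1},
            "test": {"term_freq": 3},
        }}}},
        {"_id": "2", "found": True, "term_vectors": {"my_text": {"terms": {
            "...": {"term_freq": 1},
            "hello": {"term_freq": 1},
            "other": {"term_freq": 1},
            "test": {"term_freq": 1},
        }}}},
    ]
}

summary = term_freq_by_doc(sample)
print(summary)
```

Because term_freq is always present for each returned term, this works even for the second document, where term_statistics was not requested and doc_freq and ttf are absent.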