Full-Text Search with ElasticSearch

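Full-text search runs queries made of one or more words or phrases against a set of documents and returns the documents that match.
ElasticSearch is a search engine built on Apache Lucene, a free, open-source information retrieval library. It provides a distributed full-text search engine with an HTTP web interface and schema-free JSON documents.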

This article explores the ElasticSearch REST API and demonstrates basic query operations using HTTP requests.

Setting Up the Environment

To install ElasticSearch, see the official installation guide.

The REST API listens on port 9200. Use the following curl command to check that the instance is running correctly:

curl -XGET 'http://localhost:9200/'

If you see a response like the following, the instance has started successfully:

{
  "name": "USER-20170915LA",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "7k8ij9iPT1uAkVqJxanAYQ",
  "version": {
    "number": "7.1.1",
    "build_flavor": "oss",
    "build_type": "zip",
    "build_hash": "7a013de",
    "build_date": "2019-05-23T14:04:00.380842Z",
    "build_snapshot": false,
    "lucene_version": "8.0.0",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}
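
The same check can be scripted. The following is a minimal sketch using only the Python standard library. The function names fetch_cluster_info and summarize are made up for this illustration, and the sketch assumes the instance is reachable at http://localhost:9200 without authentication.

```python
import json
from urllib.request import urlopen


def fetch_cluster_info(base_url="http://localhost:9200/"):
    """GET the root endpoint and decode its JSON body (needs a running instance)."""
    with urlopen(base_url, timeout=5) as response:
        return json.load(response)


def summarize(info):
    """Build a one-line summary from the root endpoint's JSON response."""
    version = info["version"]
    return "{} (cluster {}): Elasticsearch {}, Lucene {}".format(
        info["name"],
        info["cluster_name"],
        version["number"],
        version["lucene_version"],
    )
```

With a running instance, print(summarize(fetch_cluster_info())) prints the node name, cluster name, and version numbers on one line. If the instance is not running, urlopen raises an error instead.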

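Installing Elasticsearch Head
To run the requests in this article we can install Elasticsearch Head. It is also available as a Chrome extension, which can be found by searching the Chrome Web Store. Start ElasticSearch first, then open the Head extension.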

Indexing Documents

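ElasticSearch is a document-oriented NoSQL application, used mainly to store and index documents. Indexing creates or updates a document. Once a document is indexed, the whole document can be searched, sorted, and filtered, not just rows and columns of data. This is a fundamentally different way of thinking about data, and it is one of the reasons ElasticSearch can perform complex full-text searches.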

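Documents are represented as JSON objects because JSON is simple, concise, and easy to read. JSON serialization is supported by most programming languages and has become the standard format for NoSQL software.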

We will use the following random text for our full-text searches:

{
  "title": "He went",
  "random_text": "He went such dare good fact. The small own seven saved man age."
}
 
{
  "title": "He oppose",
  "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
}
 
{
  "title": "Repulsive questions",
  "random_text": "Repulsive questions contented him few extensive supported."
}
 
{
  "title": "Old education",
  "random_text": "Old education him departure any arranging one prevailed."
}

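Before indexing a document, we need to decide where to store it. ElasticSearch can hold multiple indices, and each index can contain many documents. We will use the following scheme: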

text: the index name
article: the type name
id: the unique ID of each text entry

A document can be indexed with the following request:

PUT http://localhost:9200/text/article/1
{
  "title": "He went",
  "random_text": "He went such dare good fact. The small own seven saved man age."
}

This request uses id=1. The other text entries can be added with the same request, incrementing the id each time.
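
As an illustration of "same request, incrementing id", here is a small sketch that builds one PUT request per sample document with the Python standard library. The helper name build_index_requests is hypothetical, and the base URL assumes the text index and article type described above. The sketch only constructs the requests. Passing each one to urllib.request.urlopen would send it to a running instance.

```python
import json
from urllib.request import Request

DOCUMENTS = [
    {"title": "He went",
     "random_text": "He went such dare good fact. The small own seven saved man age."},
    {"title": "He oppose",
     "random_text": "He oppose at thrown desire of no. Announcing impression "
                    "unaffected day his are unreserved indulgence."},
    {"title": "Repulsive questions",
     "random_text": "Repulsive questions contented him few extensive supported."},
    {"title": "Old education",
     "random_text": "Old education him departure any arranging one prevailed."},
]


def build_index_requests(documents, base_url="http://localhost:9200/text/article"):
    """Return one PUT request per document, using ids 1, 2, 3, ..."""
    requests = []
    for doc_id, document in enumerate(documents, start=1):
        requests.append(Request(
            url="{}/{}".format(base_url, doc_id),
            data=json.dumps(document).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        ))
    return requests


# Show what would be sent, without contacting the server.
for request in build_index_requests(DOCUMENTS):
    print(request.get_method(), request.full_url)
```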

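We can also create the mapping by hand before indexing any documents, stating explicitly which fields need full-text search and which only need exact matching (fields of type text are analyzed for full-text search, while fields of type keyword are matched exactly). On Elasticsearch 7.x, a mapping that names a type is accepted only when the request sets include_type_name=true. For example:

PUT /text?include_type_name=true
{
  "mappings": {
    "article": {
      "properties": {
        "title": { "type": "text" },
        "random_text": { "type": "text" }
      }
    }
  }
}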

Retrieving Documents

We have now added four documents. The following request counts how many documents the index holds:

GET http://localhost:9200/text/_count/
{
  "query": {
    "match_all": {}
  }
}

Response:

{
  "count": 4,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  }
}

This matches the number of documents we inserted. Next, fetch a specific document:

GET http://localhost:9200/text/article/1

The result is:

{
    "_index": "text",
    "_type": "article",
    "_id": "1",
    "_version": 1,
    "_seq_no": 0,
    "_primary_term": 1,
    "found": true,
    "_source": {
        "title": "He went",
        "random_text": "He went such dare good fact. The small own seven saved man age."
    }
}

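The response is the document we added earlier with ID 1.

Querying Documents

Now let's test full-text search with the following request:

GET http://localhost:9200/text/article/_search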
{
  "query": {
    "match": {
      "random_text": "him departure"
    }
  }
}

The response is:

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.4513469,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.28582606,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      }
    ]
  }
}

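We searched for "him departure" and got two results with different scores. The first result is the obvious one: it contains the full query text, and its score is 1.4513469.

The second result is returned because the document contains the word "him". A match query combines its terms with OR by default, so a document that contains any one of the terms is a match. Its score is 0.28582606.

By default, ElasticSearch sorts results by relevance score, that is, by how well each document matches the query. Note that the second result scores lower than the first, which indicates lower relevance.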
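
To see why exactly these two documents match, the sketch below mimics the "any term matches" behavior in plain Python. It is a toy model under stated assumptions: the tokenizer is a simple regular expression rather than ElasticSearch's analyzer, the function names are invented for this example, and the count of matched terms is only a stand-in for the real relevance score, which ElasticSearch computes differently.

```python
import re

DOCUMENTS = {
    1: "He went such dare good fact. The small own seven saved man age.",
    2: "He oppose at thrown desire of no. Announcing impression "
       "unaffected day his are unreserved indulgence.",
    3: "Repulsive questions contented him few extensive supported.",
    4: "Old education him departure any arranging one prevailed.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def match_any(query, documents):
    """Map document id -> number of distinct query terms found in that document."""
    terms = set(tokenize(query))
    hits = {}
    for doc_id, text in documents.items():
        matched = terms & set(tokenize(text))
        if matched:  # one matching term is enough (OR semantics)
            hits[doc_id] = len(matched)
    return hits


print(match_any("him departure", DOCUMENTS))  # {3: 1, 4: 2}
```

Document 4 contains both terms and document 3 contains only "him", which is consistent with the ordering of the two hits in the response above.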

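Fuzzy Queries

A fuzzy query treats two words that are "fuzzily" similar as if they were the same word. First, we need to define what fuzzy means.
Elasticsearch supports a maximum edit distance of 2, specified with the fuzziness parameter. The fuzziness parameter can also be set to AUTO, in which case the allowed edit distance depends on the length of the term: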

  • 0 for strings of one or two characters
  • 1 for strings of three, four, or five characters
  • 2 for strings of more than five characters
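
The AUTO rule in the list above can be written as a small function. This is only a sketch of the rule as stated here. The function name auto_fuzziness is made up, and it is not ElasticSearch's own code.

```python
def auto_fuzziness(term):
    """Maximum edit distance allowed for a term under the AUTO rule listed above."""
    length = len(term)
    if length <= 2:
        return 0  # one or two characters: must match exactly
    if length <= 5:
        return 1  # three to five characters: one edit allowed
    return 2      # more than five characters: two edits allowed


for term in ("he", "him", "departure"):
    print(term, auto_fuzziness(term))
```

For the query used in this article, "him" would be allowed one edit and "departure" two.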

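With an edit distance of 2, some of the results returned may seem unrelated to the query. For more relevant results and better performance, use an edit distance of 1. The distance here is the Levenshtein distance, a string metric that measures the difference between two sequences. The following request runs a fuzzy search with fuzziness set to 2:

GET http://localhost:9200/text/article/_search
{
  "query": {
    "match": {
      "random_text": {
        "query": "him departure",
        "fuzziness": "2"
      }
    }
  }
}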

Response:

{
  "took": 88,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.5834423,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "2",
        "_score": 0.41093433,
        "_source": {
          "title": "He oppose",
          "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "1",
        "_score": 0.0,
        "_source": {
          "title": "He went",
          "random_text": "He went such dare good fact. The small own seven saved man age."
        }
      }
    ]
  }
}

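The fuzzy query returns more results. Fuzzy queries should be used with care, because they can return results that are completely irrelevant.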
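
To make the edit distance concrete, here is a plain Levenshtein implementation and a helper that lists the words of a text within a given distance of a query term. Both are sketches for illustration: the function names are invented, the tokenizer is a simple regular expression, and ElasticSearch by default also counts a swap of two adjacent characters as a single edit, which this plain version does not.

```python
import re


def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def fuzzy_matches(query_term, text, max_distance):
    """Distinct words of text within max_distance edits of query_term, in order of appearance."""
    matches = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in matches and levenshtein(query_term, word) <= max_distance:
            matches.append(word)
    return matches


text = "He went such dare good fact. The small own seven saved man age."
print(fuzzy_matches("him", text, 2))  # ['he']
print(fuzzy_matches("him", text, 1))  # []
```

With a distance of 2, "him" is close enough to "he", which is presumably why a document that contains neither query word can still appear in the fuzzy results. With a distance of 1 the same document no longer matches.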

Summary

This article explained how to index documents and how to run full-text search queries against them through the ElasticSearch REST API.
