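Full-text search runs queries of one or more words or phrases against documents and returns the results that match the search conditions.
Elasticsearch is a search engine based on Apache Lucene, a free, open-source information retrieval software library. It provides a distributed full-text search engine with an HTTP web interface and schema-free JSON documents.
This article explores the Elasticsearch REST API and demonstrates basic query operations using plain HTTP requests.
To install Elasticsearch, refer to the official installation guide.
The RESTful API listens on port 9200. Use the following curl command to check that the instance is running correctly: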
curl -XGET 'http://localhost:9200/'
If you see the following response, the instance has started successfully:
{
  "name": "USER-20170915LA",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "7k8ij9iPT1uAkVqJxanAYQ",
  "version": {
    "number": "7.1.1",
    "build_flavor": "oss",
    "build_type": "zip",
    "build_hash": "7a013de",
    "build_date": "2019-05-23T14:04:00.380842Z",
    "build_snapshot": false,
    "lucene_version": "8.0.0",
    "minimum_wire_compatibility_version": "6.8.0",
    "minimum_index_compatibility_version": "6.0.0-beta1"
  },
  "tagline": "You Know, for Search"
}
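The same check can be scripted. The following is a minimal sketch in Python, using only the standard library, that reads the fields of interest from the root response. The helper name `summarize_root_response` is hypothetical, and the sketch parses a trimmed copy of the response above so that it runs without a server; against a live instance the body would presumably come from something like `urllib.request.urlopen("http://localhost:9200/").read()`.

```python
import json


def summarize_root_response(body: str) -> dict:
    """Extract the node name, cluster name and version details from the root response."""
    info = json.loads(body)
    version = info["version"]
    return {
        "name": info["name"],
        "cluster": info["cluster_name"],
        "version": version["number"],
        # The major version decides, for example, whether mapping types are still accepted.
        "major": int(version["number"].split(".")[0]),
        "lucene": version["lucene_version"],
    }


# Trimmed copy of the response shown above (assumption: same field layout).
sample = """
{
  "name": "USER-20170915LA",
  "cluster_name": "elasticsearch",
  "version": {"number": "7.1.1", "lucene_version": "8.0.0"},
  "tagline": "You Know, for Search"
}
"""

summary = summarize_root_response(sample)
print(summary)
```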
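Installing Elasticsearch Head
To run commands more conveniently, we can install Elasticsearch Head. It is also available as a Chrome extension, which can be found and installed from the Chrome Web Store. Start Elasticsearch first, then open the Head extension.
Elasticsearch is a document-oriented NoSQL store, used mainly to store and index documents. Indexing creates or updates a document; once a document is indexed, the complete document can be searched, sorted, and filtered, not just rows and columns of data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can perform complex full-text search.
Documents are represented as JSON objects because JSON is simple, concise, and easy to read. JSON serialization is supported by most programming languages, and it has become the standard format in NoSQL software.
We will run full-text searches against the following random text: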
{
"title": "He went",
"random_text": "He went such dare good fact. The small own seven saved man age."
}
{
"title": "He oppose",
"random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
}
{
"title": "Repulsive questions",
"random_text": "Repulsive questions contented him few extensive supported."
}
{
"title": "Old education",
"random_text": "Old education him departure any arranging one prevailed."
}
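Before indexing a document, we need to decide where to store it. Elasticsearch can hold multiple indices, and each index can hold multiple documents. We will use the following scheme:
text: the index name
article: the type name
id: the unique ID of each text entry
A document can be added to the index with the following command:
PUT http://localhost:9200/text/article/1/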
{
"title": "He went",
"random_text": "He went such dare good fact. The small own seven saved man age."
}
This request uses id=1. The other text entries can be added with the same command, incrementing the id each time.
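Indexing the four sample documents with incrementing ids can be sketched in Python as follows. This is an illustration under assumptions: the base URL points at a local instance, the helper name `build_index_request` is hypothetical, and the requests are only built, not sent, so the snippet runs without a server. Passing each request to `urllib.request.urlopen` would send it.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumption: local single-node instance

# The four sample documents from this article, in id order.
DOCUMENTS = [
    {"title": "He went",
     "random_text": "He went such dare good fact. The small own seven saved man age."},
    {"title": "He oppose",
     "random_text": "He oppose at thrown desire of no. Announcing impression "
                    "unaffected day his are unreserved indulgence."},
    {"title": "Repulsive questions",
     "random_text": "Repulsive questions contented him few extensive supported."},
    {"title": "Old education",
     "random_text": "Old education him departure any arranging one prevailed."},
]


def build_index_request(doc_id: int, document: dict) -> Request:
    """Build (but do not send) the PUT request that indexes one document."""
    return Request(
        url=f"{BASE_URL}/text/article/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Ids start at 1 and increase by one per document.
requests_to_send = [
    build_index_request(doc_id, doc)
    for doc_id, doc in enumerate(DOCUMENTS, start=1)
]

for req in requests_to_send:
    print(req.get_method(), req.full_url)
```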
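We can also create the mapping manually before inserting documents, stating explicitly which fields need full-text search and which only need exact matching. On Elasticsearch 7.x, a mapping that names a type is accepted only when the request sets include_type_name=true. For example:
PUT /text?include_type_name=true
{
"mappings": {
"article": {
"properties": {
"title": { "type" : "text" },
"random_text": { "type" : "text" }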
}
}
}
}
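To make the distinction between full-text and exact-match fields concrete, here is a small Python sketch that builds a mapping body with both kinds of field. It is an illustration, not part of the original setup: the helper `build_mapping` and the `category` field are hypothetical, and the body uses the typeless form that Elasticsearch 7.x accepts by default. Fields of type `text` are analyzed for full-text search, while `keyword` fields are matched exactly.

```python
import json


def build_mapping(full_text_fields, exact_match_fields):
    """Build an index-creation body with text and keyword fields."""
    properties = {}
    for name in full_text_fields:
        properties[name] = {"type": "text"}     # analyzed, full-text searchable
    for name in exact_match_fields:
        properties[name] = {"type": "keyword"}  # stored as-is, exact match only
    return {"mappings": {"properties": properties}}


# "category" is a made-up field used only to show an exact-match mapping.
mapping = build_mapping(["title", "random_text"], ["category"])
print(json.dumps(mapping, indent=2))
```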
Having added four documents, we can now check how many documents the index holds with the following command:
GET http://localhost:9200/text/_count/
{
"query": {
"match_all": {}
}
}
The response:
{
"count": 4,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
}
}
The count matches the number of documents we inserted. Next, retrieve a specific document:
GET http://localhost:9200/text/article/1/
The result:
{
"_index": "text",
"_type": "article",
"_id": "1",
"_version": 1,
"_seq_no": 0,
"_primary_term": 1,
"found": true,
"_source": {
"title": "He went",
"random_text": "He went such dare good fact. The small own seven saved man age."
}
}
The response is the document we added earlier with ID 1.
Now let's test full-text search with the following command:
GET http://localhost:9200/text/article/_search
{
"query": {
"match": {
"random_text": "him departure"
}
}
}
The response:
{
"took": 32,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1.4513469,
"hits": [
{
"_index": "text",
"_type": "article",
"_id": "4",
"_score": 1.4513469,
"_source": {
"title": "Old education",
"random_text": "Old education him departure any arranging one prevailed."
}
},
{
"_index": "text",
"_type": "article",
"_id": "3",
"_score": 0.28582606,
"_source": {
"title": "Repulsive questions",
"random_text": "Repulsive questions contented him few extensive supported."
}
}
]
}
}
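We searched for "him departure" and got two results with different scores. The first result is the obvious one: it contains the entire query text and scores 1.4513469.
The second result was returned because the document contains the single word "him"; it scores 0.28582606.
By default, Elasticsearch sorts results by relevance score, that is, by how well each document matches the query. Note that the second result scores lower than the first, which indicates lower relevance.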
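Why exactly two documents match can be reproduced offline. The sketch below is a simplified stand-in for what a match query does with its default OR operator: it lowercases and tokenizes both the query and each document, then keeps the documents that share at least one term with the query. The tokenizer is a rough approximation of the standard analyzer, the function names are hypothetical, and no relevance scoring is attempted.

```python
import re

# The random_text field of the four sample documents, keyed by id.
DOCS = {
    1: "He went such dare good fact. The small own seven saved man age.",
    2: "He oppose at thrown desire of no. Announcing impression unaffected "
       "day his are unreserved indulgence.",
    3: "Repulsive questions contented him few extensive supported.",
    4: "Old education him departure any arranging one prevailed.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_any_term(query, docs):
    """Return {doc_id: matched terms} for documents sharing a term with the query."""
    query_terms = set(tokenize(query))
    matches = {}
    for doc_id, text in docs.items():
        shared = query_terms & set(tokenize(text))
        if shared:
            matches[doc_id] = sorted(shared)
    return matches


matches = match_any_term("him departure", DOCS)
print(matches)
```

Document 4 shares both query terms and document 3 shares only "him", which is consistent with the ordering of the scores above.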
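A fuzzy query treats two words that are "fuzzily" similar as if they were the same word. First, we need to define what fuzzy means.
Elasticsearch supports a maximum edit distance of 2, set with the fuzziness parameter. The fuzziness parameter can also be set to AUTO, in which case the edit distance depends on the length of the term:
0 for terms of one or two characters
1 for terms of three to five characters
2 for terms longer than five characters
An edit distance of 2 tends to return results that look unrelated; an edit distance of 1 gives better results and better performance. Distance here means the Levenshtein distance, a string metric that measures the difference between two sequences. The following runs a fuzzy search:
GET http://localhost:9200/text/article/_search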
{
"query":
{
"match":
{
"random_text":
{
"query": "him departure",
"fuzziness": "2"
}
}
}
}
The response:
{
"took": 88,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 4,
"max_score": 1.5834423,
"hits": [
{
"_index": "text",
"_type": "article",
"_id": "4",
"_score": 1.4513469,
"_source": {
"title": "Old education",
"random_text": "Old education him departure any arranging one prevailed."
}
},
{
"_index": "text",
"_type": "article",
"_id": "2",
"_score": 0.41093433,
"_source": {
"title": "He oppose",
"random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
}
},
{
"_index": "text",
"_type": "article",
"_id": "3",
"_score": 0.2876821,
"_source": {
"title": "Repulsive questions",
"random_text": "Repulsive questions contented him few extensive supported."
}
},
{
"_index": "text",
"_type": "article",
"_id": "1",
"_score": 0.0,
"_source": {
"title": "He went",
"random_text": "He went such dare good fact. The small own seven saved man age."
}
}
]
}
}
The fuzzy query returns more results. Use fuzzy queries with care, because they can return results that are completely irrelevant.
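The effect of the edit distance can be checked with a small offline sketch. The code below implements the plain Levenshtein distance and then treats a document as a match when any of its terms lies within the allowed distance of any query term. It is a simplified model under assumptions: the tokenizer only approximates the standard analyzer, the function names are hypothetical, and details of the real fuzzy matching such as transposition handling and scoring are ignored.

```python
import re

DOCS = {
    1: "He went such dare good fact. The small own seven saved man age.",
    2: "He oppose at thrown desire of no. Announcing impression unaffected "
       "day his are unreserved indulgence.",
    3: "Repulsive questions contented him few extensive supported.",
    4: "Old education him departure any arranging one prevailed.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, docs, max_edits):
    """Ids of documents with a term within max_edits of any query term."""
    query_terms = tokenize(query)
    return sorted(
        doc_id
        for doc_id, text in docs.items()
        if any(levenshtein(q, t) <= max_edits
               for q in query_terms for t in tokenize(text))
    )


print(levenshtein("him", "his"), levenshtein("him", "he"))
print(fuzzy_match("him departure", DOCS, max_edits=2))
print(fuzzy_match("him departure", DOCS, max_edits=1))
```

In this model, a distance of 2 lets "he" stand in for "him", so every sample document matches, while a distance of 1 still admits "his" in document 2 but drops document 1.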
In this article we explained how to index documents and how to run full-text search queries against them using the Elasticsearch REST API.