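ES version: 7.3.1
Term frequency (TF): the more often the search term appears in a document, the higher that document scores.
Inverse document frequency (IDF): the rarer a term is across all documents, the higher its weight and the more it contributes to the score.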
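The two definitions above can be made concrete with a small calculation. The following is a minimal sketch of the textbook TF-IDF formula, not the scoring code Elasticsearch runs (7.x uses BM25 by default, which refines the same two ideas). The class name, method names, and the three-document corpus are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

class TfIdfSketch {
    // Term frequency: how many times the term occurs in one document.
    static long termFrequency(String term, List<String> docTokens) {
        return docTokens.stream().filter(term::equals).count();
    }

    // Inverse document frequency, textbook form: log(N / df).
    // Assumes the term occurs in at least one document (df > 0).
    static double inverseDocumentFrequency(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(doc -> doc.contains(term)).count();
        return Math.log((double) corpus.size() / df);
    }

    // Score of one term for one document: tf * idf.
    static double score(String term, List<String> doc, List<List<String>> corpus) {
        return termFrequency(term, doc) * inverseDocumentFrequency(term, corpus);
    }

    public static void main(String[] args) {
        // Hypothetical, already tokenized documents.
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("elasticsearch", "is", "a", "search", "engine"),
            Arrays.asList("lucene", "is", "a", "search", "library"),
            Arrays.asList("elasticsearch", "uses", "lucene"));
        List<String> doc = corpus.get(0);
        // "is" occurs in 2 of 3 documents, "engine" in only 1,
        // so "engine" gets the higher score for the first document.
        System.out.println("is:     " + score("is", doc, corpus));
        System.out.println("engine: " + score("engine", doc, corpus));
    }
}
```

Both terms occur once in the first document, so the difference in score comes entirely from the IDF factor.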
Mapping to relational database concepts:
ES | index | type | document |
RDBMS | database | table | row |
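Logical design: what a search application needs to be aware of. The basic unit for indexing and searching is the document, which you can think of as a row in a relational database. Documents are grouped into types, so a type contains documents much as a table contains rows. One or more types live in an index, the largest container, comparable to a database in the SQL world. Note that in 7.x mapping types are deprecated and each index effectively has a single type, _doc, which is what the examples here use.
Physical design: how Elasticsearch handles data behind the scenes. Elasticsearch divides each index into shards, and each shard can migrate between servers in the cluster. This matters when administering a cluster, because the physical configuration determines the cluster's performance, scalability, and availability.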
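By default in 7.x, each index consists of 1 primary shard, and each primary shard has 1 replica (before 7.0 the default was 5 primary shards).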
Shard: a shard is the smallest unit that ES handles. A shard is a Lucene index: a directory of files containing an inverted index.
Replica shards can be added and removed at runtime, whereas the number of primary shards is fixed when the index is created.
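An index consists of one or more primary shards and zero or more replica shards.
Inverted index: the inverted index structure lets ES tell you which documents contain a given term (word) without scanning all documents.
Example: what the first primary shard of the get-together index might contain. The shard is named get-together0; it is a Lucene index, that is, an inverted index. By default it stores the original document content plus additional information, such as the term dictionary and term frequencies, that helps with searching.
Term dictionary: the term dictionary maps each term to the documents that contain it. When searching, ES does not need to scan all documents for a term; it uses this dictionary to identify matching documents quickly.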
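To illustrate the idea of a term dictionary, here is a minimal in-memory sketch of an inverted index. It is not how Lucene stores its data on disk; the class name, the naive tokenizer, and the sample documents are assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InvertedIndexSketch {
    // Term dictionary: term -> IDs of the documents that contain it.
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Naive analysis: lowercase, split on anything that is not a letter or digit.
    void add(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // A lookup in the dictionary replaces a scan over every document.
    Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Elasticsearch Denver meetup");
        index.add(2, "Denver clojure group");
        System.out.println(index.search("denver"));  // prints [1, 2]
        System.out.println(index.search("clojure")); // prints [2]
    }
}
```

The cost of a search depends on the dictionary lookup and the size of the matching posting list, not on the total number of documents.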
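What happens when a document is indexed: by default, the system first selects a primary shard based on a hash of the document ID and sends the document to that shard. The document is then sent to all replicas of that primary shard to be indexed. This keeps the replicas in sync with the primary, which lets replicas serve search requests and be promoted to primary if the original primary becomes unavailable. Search requests can also be load-balanced across primary and replica shards.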
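The shard selection step can be sketched as "hash the document ID, then take it modulo the number of primary shards". The sketch below only illustrates that idea: String.hashCode() is a stand-in, since Elasticsearch applies its own hash function to the routing value (the document ID by default), and the class and method names are hypothetical.

```java
class ShardRoutingSketch {
    // Picks a primary shard for a document ID: hash(id) mod number of primaries.
    // String.hashCode() is a stand-in for illustration only; it is NOT the
    // hash function Elasticsearch uses.
    static int selectPrimaryShard(String docId, int numberOfPrimaryShards) {
        // floorMod keeps the result in [0, numberOfPrimaryShards) even for negative hashes.
        return Math.floorMod(docId.hashCode(), numberOfPrimaryShards);
    }

    public static void main(String[] args) {
        int primaries = 5; // hypothetical index with 5 primary shards
        for (String id : new String[] {"1", "2", "42"}) {
            System.out.println("doc " + id + " -> shard " + selectPrimaryShard(id, primaries));
        }
    }
}
```

Because the shard number depends on the modulus, changing the number of primary shards would send existing IDs to different shards, which is one way to see why that number is fixed once the index exists.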
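Cluster: a node is an instance of Elasticsearch. Once you start ES on a server, you have a node; start an ES instance on another server and you have another node. You can even run several ES processes on one server to have multiple nodes on it. Multiple nodes can join the same cluster. A cluster benefits performance and stability, but it has a drawback: you must make sure the nodes can communicate with each other quickly enough and that no split brain occurs (two parts of the cluster cannot communicate with each other, and each believes the other is down).
Horizontal scaling: add more nodes to the same cluster; the existing shards are rebalanced across all nodes.
Vertical scaling: give the ES nodes more hardware resources.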
Download: https://www.elastic.co/cn/downloads/elasticsearch
Extract the archive
Run bin/elasticsearch
(or bin\elasticsearch.bat on Windows)
Then open: http://localhost:9200/
1. Create a document
curl -X PUT "localhost:9200/customer/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
"name": "John Doe"
}
'
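Notes:
The pretty parameter: pass pretty=true or just pretty; the latter is used here, whether or not the request goes through curl. By default the JSON response comes back on a single line; pretty makes it more readable.
By default, ES automatically creates the customer index if it does not already exist, along with a new mapping for the document's fields.
-d is optional and passes the request body that follows.
-H sets a request header.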
Response:
{
"_index" : "customer",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
2. Retrieve a document
curl -X GET "localhost:9200/customer/_doc/1?pretty"
Response:
{
"_index" : "customer",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"_seq_no" : 0,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "John Doe"
}
}
3. Bulk operations with _bulk
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/_bulk?pretty&refresh" --data-binary "@accounts.json"
Note: a good starting point is batches of 1,000 to 5,000 documents with a total payload between 5MB and 15MB.
Test data: https://raw.githubusercontent.com/elastic/elasticsearch/master/docs/src/test/resources/accounts.json
Response:
{
"took" : 533,
"errors" : false,
"items" : [
{
"index" : {
"_index" : "bank",
"_type" : "_doc",
"_id" : "1",
"_version" : 1,
"result" : "created",
"forced_refresh" : true,
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1,
"status" : 201
}
},
... // omitted
}
Check the indices:
curl "localhost:9200/_cat/indices?v"
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open commodity 9srG6uQ4SCaM8A6U7q4epQ 5 1 0 0 1.3kb 1.3kb
yellow open bank M9VIEPM2TaSkPrRrhS-R9Q 1 1 1000 0 414.3kb 414.3kb
yellow open book-index g7zhCgucSoeJknd08ygxww 3 2 4 0 7.5kb 7.5kb
yellow open get-together Xb9iBol9R1WHJGQfs0GIeA 1 1 1 0 4.3kb 4.3kb
yellow open book xJ3-ThlRSHChfrRIxOOOyw 5 1 0 0 1.3kb 1.3kb
yellow open customer StLOulvsSlSM4aU_o6LL1Q 1 1 1 0 3.5kb 3.5kb
4. Search with _search
curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
]
}
'
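Note: by default, hits contains the first 10 matching documents.
Response:
{
"took" : 14,// time the query took, in ms
"timed_out" : false,// whether the search request timed out
"_shards" : {// shard statistics
"total" : 1,// total number of shards searched
"successful" : 1,// number of shards that succeeded
"skipped" : 0,// number of shards skipped
"failed" : 0// number of shards that failed
},
"hits" : {
"total" : {
"value" : 1000,// total number of matching documents
"relation" : "eq"
},
"max_score" : null,// highest relevance score among the matches (null here because the results are sorted by a field)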
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "0",
"_score" : null,// relevance score of the document (not computed here because the results are sorted by a field)
"_source" : {
"account_number" : 0,
"balance" : 16623,
"firstname" : "Bradshaw",
"lastname" : "Mckenzie",
"age" : 29,
"gender" : "F",
"address" : "244 Columbus Place",
"employer" : "Euron",
"email" : "[email protected]",
"city" : "Hobucken",
"state" : "CO"
},
"sort" : [// sort value(s) of this document, here its account_number (not its position in the results)
0
]
},
... // omitted
}
5. Paginate search results with from and size
curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": { "match_all": {} },
"sort": [
{ "account_number": "asc" }
],
"from": 10,
"size": 10
}
'
Note: from is the zero-based offset of the first hit to return, and size is the number of hits to return.
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "10",
"_score" : null,
"_source" : {
"account_number" : 10,
"balance" : 46170,
"firstname" : "Dominique",
"lastname" : "Park",
"age" : 37,
"gender" : "F",
"address" : "100 Gatling Place",
"employer" : "Conjurica",
"email" : "[email protected]",
"city" : "Omar",
"state" : "NJ"
},
"sort" : [
10
]
},
... // omitted
}
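The from and size parameters behave like a window over the sorted list of hits. The following sketch shows that windowing on a plain in-memory list; it says nothing about how Elasticsearch collects hits across shards, and the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class PaginationSketch {
    // Returns the window [from, from + size) of an already sorted hit list,
    // clamped to the end of the list. Assumes from >= 0 and size >= 0.
    static <T> List<T> page(List<T> sortedHits, int from, int size) {
        if (from >= sortedHits.size()) {
            return Collections.emptyList();
        }
        int to = Math.min(from + size, sortedHits.size());
        return new ArrayList<>(sortedHits.subList(from, to));
    }

    public static void main(String[] args) {
        List<Integer> accountNumbers = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            accountNumbers.add(i); // sorted ascending, like the account_number sort
        }
        // from = 10, size = 10 selects account numbers 10 through 19.
        System.out.println(page(accountNumbers, 10, 10));
    }
}
```

With from set to 10 the first returned element is the eleventh overall, which matches the response above, where the first hit has account_number 10.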
6. Query with match
curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": { "match": { "address": "mill lane" } }
}
'
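Note: returns customer documents in the bank index whose address field contains mill or lane (match combines the terms with OR by default).
Response: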
{
"took" : 15,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 19,
"relation" : "eq"
},
"max_score" : 9.507477,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "136",
"_score" : 9.507477,
"_source" : {
"account_number" : 136,
"balance" : 45801,
"firstname" : "Winnie",
"lastname" : "Holland",
"age" : 38,
"gender" : "M",
"address" : "198 Mill Lane",
"employer" : "Neteria",
"email" : "[email protected]",
"city" : "Urie",
"state" : "IL"
}
},
... // omitted
}
7. match_phrase: search for the whole phrase rather than individual terms
curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": { "match_phrase": { "address": "mill lane" } }
}
'
Note: finds customer documents whose address field contains the phrase "mill lane".
Response:
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 9.507477,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "136",
"_score" : 9.507477,
"_source" : {
"account_number" : 136,
"balance" : 45801,
"firstname" : "Winnie",
"lastname" : "Holland",
"age" : 38,
"gender" : "M",
"address" : "198 Mill Lane",
"employer" : "Neteria",
"email" : "[email protected]",
"city" : "Urie",
"state" : "IL"
}
}
]
}
}
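The difference between match and match_phrase explains why the first query returned 19 hits and the second only 1. The sketch below mimics the two behaviours on plain strings, assuming a naive lowercase and whitespace analyzer; it is not the Elasticsearch implementation, and the class name and the "990 Mill Road" address are illustrative only.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class MatchSketch {
    // Naive analyzer: lowercase and split on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // match with the default OR operator: any one query term is enough.
    static boolean match(String fieldValue, String query) {
        List<String> fieldTokens = tokenize(fieldValue);
        return tokenize(query).stream().anyMatch(fieldTokens::contains);
    }

    // match_phrase: the query terms must appear consecutively and in order.
    static boolean matchPhrase(String fieldValue, String query) {
        return Collections.indexOfSubList(tokenize(fieldValue), tokenize(query)) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(match("990 Mill Road", "mill lane"));       // true: "mill" matches
        System.out.println(matchPhrase("990 Mill Road", "mill lane")); // false: no "mill lane" phrase
        System.out.println(matchPhrase("198 Mill Lane", "mill lane")); // true
    }
}
```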
8. Compound queries with bool
curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": [
{ "match": { "age": "40" } }
],
"must_not": [
{ "match": { "state": "ID" } }
]
}
}
}
'
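Note: must clauses are required to match, should clauses are desirable, and must_not clauses must not match. For example, this query returns customer documents whose age is 40 and whose state is not ID.
Response: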
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 43,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "474",
"_score" : 1.0,
"_source" : {
"account_number" : 474,
"balance" : 35896,
"firstname" : "Obrien",
"lastname" : "Walton",
"age" : 40,
"gender" : "F",
"address" : "192 Ide Court",
"employer" : "Suremax",
"email" : "[email protected]",
"city" : "Crucible",
"state" : "UT"
}
},
... // omitted
}
9. Range query
curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"bool": {
"must": { "match_all": {} },
"filter": {
"range": {
"balance": {
"gte": 20000,
"lte": 30000
}
}
}
}
}
}
'
Note: finds customer documents with a balance between 20000 and 30000, inclusive.
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 217,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "bank",
"_type" : "_doc",
"_id" : "49",
"_score" : 1.0,
"_source" : {
"account_number" : 49,
"balance" : 29104,
"firstname" : "Fulton",
"lastname" : "Holt",
"age" : 23,
"gender" : "F",
"address" : "451 Humboldt Street",
"employer" : "Anocha",
"email" : "[email protected]",
"city" : "Sunriver",
"state" : "RI"
}
},
... // omitted
}
10. Aggregations
curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword"
}
}
}
}
'
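Note: groups all accounts in the bank index by state and returns the ten states with the most accounts, in descending order of account count.
Response: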
{
"took" : 22,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 743,
"buckets" : [
{
"key" : "TX",
"doc_count" : 30
},
{
"key" : "MD",
"doc_count" : 28
},
{
"key" : "ID",
"doc_count" : 27
},
{
"key" : "AL",
"doc_count" : 25
},
{
"key" : "ME",
"doc_count" : 25
},
{
"key" : "TN",
"doc_count" : 25
},
{
"key" : "WY",
"doc_count" : 25
},
{
"key" : "DC",
"doc_count" : 24
},
{
"key" : "MA",
"doc_count" : 24
},
{
"key" : "ND",
"doc_count" : 24
}
]
}
}
}
11. Nested aggregations
curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword"
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
'
Note: nests an avg aggregation inside the earlier group_by_state aggregation to compute the average account balance for each state.
Response:
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 743,
"buckets" : [
{
"key" : "TX",
"doc_count" : 30,
"average_balance" : {
"value" : 26073.3
}
},
{
"key" : "MD",
"doc_count" : 28,
"average_balance" : {
"value" : 26161.535714285714
}
},
{
"key" : "ID",
"doc_count" : 27,
"average_balance" : {
"value" : 24368.777777777777
}
},
{
"key" : "AL",
"doc_count" : 25,
"average_balance" : {
"value" : 25739.56
}
},
{
"key" : "ME",
"doc_count" : 25,
"average_balance" : {
"value" : 21663.0
}
},
{
"key" : "TN",
"doc_count" : 25,
"average_balance" : {
"value" : 28365.4
}
},
{
"key" : "WY",
"doc_count" : 25,
"average_balance" : {
"value" : 21731.52
}
},
{
"key" : "DC",
"doc_count" : 24,
"average_balance" : {
"value" : 23180.583333333332
}
},
{
"key" : "MA",
"doc_count" : 24,
"average_balance" : {
"value" : 29600.333333333332
}
},
{
"key" : "ND",
"doc_count" : 24,
"average_balance" : {
"value" : 26577.333333333332
}
}
]
}
}
}
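Conceptually, a terms aggregation with a nested avg aggregation is a "group by, then average" over the documents. The sketch below reproduces that on an in-memory list with Java streams, purely to show the shape of the computation. The Account class and the sample balances are made up, and unlike the terms aggregation it neither ranks buckets by document count nor limits them to ten.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class AverageBalanceSketch {
    // Hypothetical, minimal stand-in for an account document.
    static final class Account {
        final String state;
        final double balance;

        Account(String state, double balance) {
            this.state = state;
            this.balance = balance;
        }
    }

    // Bucket accounts by state (like the terms aggregation), then average the
    // balance inside each bucket (like the nested avg aggregation).
    static Map<String, Double> averageBalanceByState(List<Account> accounts) {
        return accounts.stream().collect(Collectors.groupingBy(
            (Account a) -> a.state,
            TreeMap::new,
            Collectors.averagingDouble((Account a) -> a.balance)));
    }

    public static void main(String[] args) {
        List<Account> accounts = Arrays.asList(
            new Account("TX", 100.0),
            new Account("TX", 300.0),
            new Account("MD", 50.0));
        System.out.println(averageBalanceByState(accounts)); // prints {MD=50.0, TX=200.0}
    }
}
```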
12. Sort aggregation results
curl -X GET "localhost:9200/bank/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 0,
"aggs": {
"group_by_state": {
"terms": {
"field": "state.keyword",
"order": {
"average_balance": "desc"
}
},
"aggs": {
"average_balance": {
"avg": {
"field": "balance"
}
}
}
}
}
}
'
Response:
{
"took" : 5,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1000,
"relation" : "eq"
},
"max_score" : null,
"hits" : [ ]
},
"aggregations" : {
"group_by_state" : {
"doc_count_error_upper_bound" : -1,
"sum_other_doc_count" : 827,
"buckets" : [
{
"key" : "CO",
"doc_count" : 14,
"average_balance" : {
"value" : 32460.35714285714
}
},
{
"key" : "NE",
"doc_count" : 16,
"average_balance" : {
"value" : 32041.5625
}
},
{
"key" : "AZ",
"doc_count" : 14,
"average_balance" : {
"value" : 31634.785714285714
}
},
{
"key" : "MT",
"doc_count" : 17,
"average_balance" : {
"value" : 31147.41176470588
}
},
{
"key" : "VA",
"doc_count" : 16,
"average_balance" : {
"value" : 30600.0625
}
},
{
"key" : "GA",
"doc_count" : 19,
"average_balance" : {
"value" : 30089.0
}
},
{
"key" : "MA",
"doc_count" : 24,
"average_balance" : {
"value" : 29600.333333333332
}
},
{
"key" : "IL",
"doc_count" : 22,
"average_balance" : {
"value" : 29489.727272727272
}
},
{
"key" : "NM",
"doc_count" : 14,
"average_balance" : {
"value" : 28792.64285714286
}
},
{
"key" : "LA",
"doc_count" : 17,
"average_balance" : {
"value" : 28791.823529411766
}
}
]
}
}
}
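1. Create a document (Java High Level REST Client)
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

RestHighLevelClient client = new RestHighLevelClient(
RestClient.builder(
new HttpHost("localhost", 9200, "http")
));
// Target the posts index and set the document ID explicitly
IndexRequest request = new IndexRequest("posts");
request.id("2");
String jsonString = "{" +
"\"user\":\"张三\"," +
"\"postDate\":\"2013-01-30\"," +
"\"message\":\"测试创建文档 Elasticsearch\"" +
"}";
request.source(jsonString, XContentType.JSON);
// Synchronous execution
IndexResponse indexResponse = client.index(request, RequestOptions.DEFAULT);
Open in a browser: http://localhost:9200/posts/_doc/2
Response: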
{
"_index":"posts",
"_type":"_doc",
"_id":"2",
"_version":1,
"_seq_no":1,
"_primary_term":1,
"found":true,
"_source":{
"user":"张三",
"postDate":"2013-01-30",
"message":"测试创建文档 Elasticsearch"
}
}
2. Retrieve a document
RestHighLevelClient client = new RestHighLevelClient(
RestClient.builder(
new HttpHost("localhost", 9200, "http")
));
GetRequest request = new GetRequest(
"posts",
"2");
// Configure which _source fields to include in the response
String[] includes = new String[]{"message", "*Date"};
String[] excludes = Strings.EMPTY_ARRAY;
FetchSourceContext fetchSourceContext =
new FetchSourceContext(true, includes, excludes);
request.fetchSourceContext(fetchSourceContext);
// Synchronous execution
GetResponse getResponse = null;
try {
getResponse = client.get(request, RequestOptions.DEFAULT);
} catch (ElasticsearchException e) {
if (e.status() == RestStatus.NOT_FOUND) {
// Not found (for example, the index does not exist)
}
}
// getResponse is still null if the request threw, so guard against that
if (getResponse != null && getResponse.isExists()) {
// Document source as a String
String sourceAsString = getResponse.getSourceAsString();
System.out.println(sourceAsString);
// Document source as a Map
Map<String, Object> sourceAsMap = getResponse.getSourceAsMap();
// Document source as a byte[]
byte[] sourceAsBytes = getResponse.getSourceAsBytes();
} else {
// The document was not retrieved
}
3. Check whether a document exists
RestHighLevelClient client = new RestHighLevelClient(
RestClient.builder(
new HttpHost("localhost", 9200, "http")/*,
new HttpHost("localhost", 9201, "http")*/
));
GetRequest getRequest = new GetRequest(
"posts",
"2");
//Disable fetching _source
getRequest.fetchSourceContext(new FetchSourceContext(false));
//Disable fetching stored fields.
getRequest.storedFields("_none_");
boolean exists = client.exists(getRequest, RequestOptions.DEFAULT);
System.out.println(exists);
4. Delete a document
RestHighLevelClient client = new RestHighLevelClient(
RestClient.builder(
new HttpHost("localhost", 9200, "http")/*,
new HttpHost("localhost", 9201, "http")*/
));
DeleteRequest request = new DeleteRequest(
"posts",
"2");
DeleteResponse deleteResponse = client.delete(
request, RequestOptions.DEFAULT);
if (deleteResponse.getResult() == DocWriteResponse.Result.NOT_FOUND) {
System.out.println("Document not found");
} else {
System.out.println("Deleted successfully");
}
5. Update a document
RestHighLevelClient client = new RestHighLevelClient(
RestClient.builder(
new HttpHost("localhost", 9200, "http")/*,
new HttpHost("localhost", 9201, "http")*/
));
UpdateRequest request = new UpdateRequest(
"posts",
"2");
String jsonString = "{" +
"\"updated\":\"2017-01-01\"," +
"\"reason\":\"daily update\"" +
"}";
request.doc(jsonString, XContentType.JSON);
UpdateResponse updateResponse = null;
try {
updateResponse = client.update(
request, RequestOptions.DEFAULT);
} catch (ElasticsearchException e) {
if (e.status() == RestStatus.NOT_FOUND) {
System.out.println("Document does not exist");
}
}
Open in a browser: http://localhost:9200/posts/_doc/2
Response:
{"_index":"posts","_type":"_doc","_id":"2","_version":2,"_seq_no":6,"_primary_term":1,"found":true,"_source":{"user":"张三","postDate":"2013-01-30","message":"测试创建文档 Elasticsearch","reason":"daily update","updated":"2017-01-01"}}
6. Term vectors API
RestHighLevelClient client = new RestHighLevelClient(
RestClient.builder(
new HttpHost("localhost", 9200, "http")/*,
new HttpHost("localhost", 9201, "http")*/
));
TermVectorsRequest request = new TermVectorsRequest("posts", "2");
request.setFields("reason");
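// Synchronous execution
TermVectorsResponse response =
client.termvectors(request, RequestOptions.DEFAULT);
// Walk through the term vector information
for (TermVectorsResponse.TermVector tv : response.getTermVectorsList()) {
// Name of the current field
String fieldname = tv.getFieldName();
// Field statistics: number of documents that contain this field
int docCount = tv.getFieldStatistics().getDocCount();
// Sum of the total term frequencies of all terms in this field
long sumTotalTermFreq =
tv.getFieldStatistics().getSumTotalTermFreq();
// Sum of the document frequencies of all terms in this field (not an inverse document frequency)
long sumDocFreq = tv.getFieldStatistics().getSumDocFreq();
if (tv.getTerms() != null) {
// Terms of the current field
List<TermVectorsResponse.TermVector.Term> terms =
tv.getTerms();
for (TermVectorsResponse.TermVector.Term term : terms) {
// Term text
String termStr = term.getTerm();
// Frequency of the term within this document
int termFreq = term.getTermFreq();
// Document frequency: number of documents containing the term.
// Absent from the response unless term statistics are requested, so boxed types are used.
Integer docFreq = term.getDocFreq();
// Total frequency of the term across all documents
Long totalTermFreq = term.getTotalTermFreq();
// Score of the term
Float score = term.getScore();
if (term.getTokens() != null) {
// Tokens of the term
List<TermVectorsResponse.TermVector.Token> tokens =
term.getTokens();
for (TermVectorsResponse.TermVector.Token token : tokens) {
// Position of the token
Integer position = token.getPosition();
// Start offset of the token
Integer startOffset = token.getStartOffset();
// End offset of the token
Integer endOffset = token.getEndOffset();
// Payload of the token
String payload = token.getPayload();
}
}
}
}
}
Response: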
{
"_index":"posts",
"_type":"_doc",
"_id":"2",
"_version":2,
"found":true,
"took":2,
"term_vectors":{
"reason":{
"field_statistics":{
"sum_doc_freq":2,
"doc_count":1,
"sum_ttf":2
},
"terms":{
"daily":{
"term_freq":1,
"tokens":[
{
"position":0,
"start_offset":0,
"end_offset":5
}
]
},
"update":{
"term_freq":1,
"tokens":[
{
"position":1,
"start_offset":6,
"end_offset":12
}
]
}
}
}
}
}
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started.html
https://www.elastic.co/guide/en/elasticsearch/client/java-rest/7.3/index.html
Book: Elasticsearch in Action (Chinese edition: 《Elasticsearch实战》)