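Relational database ⇒ Database ⇒ Table ⇒ Row ⇒ Column
Elasticsearch ⇒ Index ⇒ Type ⇒ Document ⇒ Field
Download: https://www.elastic.co/cn/downloads/elasticsearch
Structured data -> stored and queried in a relational database
Unstructured data -> 1. sequential scan (read from start to end) 2. full-text search (build a text library, then create an index and search it)
Full-text search is implemented here with Elasticsearch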
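Note: ES does not carry data over UDP. Node-to-node and client traffic uses TCP (port 9300) and HTTP (port 9200). UDP, which has no handshake and cannot guarantee that data is not lost in transit, was only used for multicast node discovery in early versions.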
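9300 is the TCP transport port; traffic between cluster nodes and from the TransportClient both goes through it
9200 is the HTTP port for the RESTful interface
The RESTful API runs over HTTP and uses JSON as its data exchange format
Data is stored in the data directory under the installation directory
Check that ES started successfully: http://localhost:9200/?pretty
While this data is inserted into Elasticsearch, Elasticsearch also builds an index for every field: an inverted index
The inverted index structure is what makes fast full-text search possible
An inverted index consists of a list of the unique words that appear in the documents and, for each word, a record of which documents it appears in.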
https://es.xiaoleilu.com/052_Mapping_Analysis/35_Inverted_index.html
Every term in the inverted index maps to one or more documents
Terms in the inverted index are sorted in ascending lexicographic order
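The structure described above can be sketched in a few lines of Python. This is a toy illustration only, not how Lucene stores its index: the function name `build_inverted_index` and the whitespace-plus-lowercase tokenization are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: split on whitespace and lowercase; real analyzers do more.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Keep terms in ascending lexicographic order, as the notes describe.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    1: "I love to go rock climbing",
    2: "I like to collect rock albums",
}
index = build_inverted_index(docs)
print(index["rock"])      # documents that contain "rock"
print(list(index)[:3])    # first terms in dictionary order
```

Looking up a term is then a single dictionary access instead of a scan over every document, which is the point of the sequential-scan versus full-text-search comparison earlier in these notes.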
List all indices: curl 'localhost:9200/_cat/indices?v'
Create an index: curl -X PUT 'http://localhost:9200/customer?pretty'
Delete an index: curl -X DELETE 'localhost:9200/customer?pretty'
https://blog.csdn.net/zhangbin666/article/details/73332538
Add a document
curl -H "Content-Type:application/json" -H "Data_Type:msg" -X POST --data '{
"first_name" : "Zhao",
"last_name" : "Jianyu",
"age" : 22,
"about" : "I love to go rock climbing",
"interests": [ "music" ]
}' localhost:9200/megacorp/employee/2
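The same request can be prepared from Python with only the standard library. This is a sketch under assumptions: the helper name `build_index_request` is made up for illustration, the host is assumed to be `localhost:9200`, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_index_request(host, index, doc_type, doc_id, doc):
    """Build (but do not send) a POST request that indexes one document."""
    url = f"http://{host}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(doc).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


doc = {
    "first_name": "Zhao",
    "last_name": "Jianyu",
    "age": 22,
    "about": "I love to go rock climbing",
    "interests": ["music"],
}
req = build_index_request("localhost:9200", "megacorp", "employee", 2, doc)
print(req.get_method(), req.full_url)
# To actually send it against a running node: urllib.request.urlopen(req)
```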
Get or delete a document: GET /megacorp/employee/1 or DELETE /megacorp/employee/1
Delete documents with a DSL query: _delete_by_query
curl -H "Content-Type:application/json" -X POST --data '{
"query": {
"bool": {
"filter": [
{
"exists": {
"field": "record_file",
"boost": 1
}
},
{
"term": {
"account_login_name": {
"value": "3004703",
"boost": 1
}
}
},
{
"range": {
"channel_time": {
"from": 1565280000,
"to": 1565971199,
"include_lower": true,
"include_upper": true,
"boost": 1
}
}
},
{
"bool": {
"must_not": [
{
"exists": {
"field": "asr_customer",
"boost": 1
}
},
{
"exists": {
"field": "asr_agent",
"boost": 1
}
},
{
"exists": {
"field": "stat_silence_rate",
"boost": 1
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
},
{
"bool": {
"should": [
{
"terms": {
"cdr_type": [
"cdr_ib",
"cdr_ob_agent",
"cdr_ob_customer"
],
"boost": 1
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
],
"adjust_pure_negative": true,
"boost": 1
}
}
}' 'localhost:9200/index_20190809_alias%2Cindex_20190810_alias/type/_delete_by_query'
Update a document: POST
Replace a document: PUT
Count documents
curl -XGET http://localhost:9200/_count?pretty
http://**.**.***.**:9200/_count?pretty (`?pretty` makes the response come back as pretty-printed JSON)
Six kinds of search:
phrase search
highlight search
Search all documents: GET /megacorp/employee/_search
Conditional search (query string): curl 'localhost:9200/megacorp/employee/_search?q=first_name:zhao'
Search sorted by price, descending (rarely used): GET /ecommerce/product/_search?q=name:yagao&sort=price:desc
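DSL (Domain Specific Language)
Query types that can appear inside "query": { ... }
match: the query text is analyzed (tokenized); any document containing one or more of the resulting terms is returned.
match_phrase: phrase match; the document must contain all the terms, adjacent and in the same order
multi_match: matches the query text against several fields
term: exact match, meaning the query text is not run through an analyzer and the document must contain the whole term (suited to numeric and date values)
terms: like term, but accepts a list of values and matches any of them
bool: compound query combining must, should and must_not clauses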
https://blog.csdn.net/tanga842428/article/details/75127418
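The query types listed above are plain JSON objects, so they compose easily in code. The sketch below builds a request body in Python; the helper names (`match`, `term`, `bool_query`) are made up for illustration and are not part of any Elasticsearch client library.

```python
import json


def match(field, text):
    """Analyzed full-text query on one field."""
    return {"match": {field: text}}


def term(field, value):
    """Exact-value query; the value is not analyzed."""
    return {"term": {field: value}}


def bool_query(must=None, should=None, must_not=None, filters=None):
    """Combine clauses, leaving out any clause list that is empty or missing."""
    clauses = {
        "must": must,
        "should": should,
        "must_not": must_not,
        "filter": filters,
    }
    return {"bool": {name: value for name, value in clauses.items() if value}}


body = {
    "query": bool_query(
        must=[match("about", "rock climbing")],
        must_not=[term("age", 22)],
    )
}
print(json.dumps(body, indent=2))
```

The printed JSON is what would be passed as `--data` to a `_search` request like the curl examples in these notes.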
The match query, one of the query types
GET /megacorp/employee/_search
{
"query" : {
"match" : {
"last_name" : "Smith"
}
}
}
curl -H "Content-Type:application/json" -H "Data_Type:msg" -X GET --data '
{
"query" : {
"match" : {
"last_name" : "Smith"
}
}
}' 'localhost:9200/megacorp/employee/_search'
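Filters are used to run range searches. The `filtered` query found in older guides was deprecated in 2.0 and removed in 5.0; a bool query with a filter clause is the equivalent:
GET /megacorp/employee/_search
{
"query" : {
"bool" : {
"must" : {
"match" : {
"last_name" : "smith"
}
},
"filter" : {
"range" : {
"age" : { "gt" : 30 }
}
}
}
}
}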
Search the about field for "rock climbing"
GET /megacorp/employee/_search
{
"query" : {
"match" : {
"about" : "rock climbing"
}
}
}
Search results
{
...
"hits": {
"total": 2,
"max_score": 0.16273327,
"hits": [
{
...
"_score": 0.16273327, <1> relevance score of the result
"_source": {
"first_name": "John",
"last_name": "Smith",
"age": 25,
"about": "I love to go rock climbing",
"interests": [ "sports", "music" ]
}
},
{
...
"_score": 0.016878016,
"_source": {
"first_name": "Jane",
"last_name": "Smith",
"age": 32,
"about": "I like to collect rock albums",
"interests": [ "music" ]
}
}
]
}
}
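A response like the one above is usually consumed by walking `hits.hits`. The sketch below works on a trimmed copy of that response. The helper names are hypothetical, and the handling of an object-valued `total` reflects the format used by 7.x and later, which these notes do not cover.

```python
def total_hits(response):
    """Return the hit count whether total is an int (as in these notes) or an object (7.x+)."""
    total = response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


def ranked_names(response):
    """List (score, first_name) pairs in the order ES returned them."""
    return [
        (hit["_score"], hit["_source"]["first_name"])
        for hit in response["hits"]["hits"]
    ]


# Trimmed copy of the search result shown above.
response = {
    "hits": {
        "total": 2,
        "max_score": 0.16273327,
        "hits": [
            {"_score": 0.16273327, "_source": {"first_name": "John"}},
            {"_score": 0.016878016, "_source": {"first_name": "Jane"}},
        ],
    }
}
print(total_hits(response), ranked_names(response))
```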
Reference: https://www.jianshu.com/p/873a363de2ee
{
"size":10, return 10 hits
"query":{
"bool":{
"adjust_pure_negative":true,
"boost":1
}
},
"_source":true whether to return the document source
}
// Total number of quality-checked records
.subAggregation(AggregationBuilders.filter("qcCount", QueryBuilders.boolQuery()
.should(QueryBuilders.existsQuery("qc_review_score"))
.should(QueryBuilders.existsQuery("qc_score"))))
// Total score from automatic quality checks
.subAggregation(AggregationBuilders.filter("autoQcScoreCountFilter", QueryBuilders.boolQuery()
.must(QueryBuilders.existsQuery("qcScore"))
.mustNot(QueryBuilders.existsQuery("qcReviewScore")))
.subAggregation(AggregationBuilders.sum("qcScoreCount").field("qcScore").missing(0)))
// Total manual review score (scores changed by hand after an automatic check + scores given by hand without an automatic check)
.subAggregation(AggregationBuilders.filter("qcReviewScoreCountFilter", QueryBuilders.boolQuery()
.filter(QueryBuilders.existsQuery(AsrField.QC_REVIEW_SCORE)))
.subAggregation(AggregationBuilders
.sum("qcScoreCount").field(AsrField.QC_REVIEW_SCORE).missing(0)))
Aggregations generate sophisticated analytics over your data. They are much like GROUP BY in SQL, but more powerful.
curl -H "Content-Type:application/json" -H "Data_Type:msg" -X GET --data '
{
"aggs": {
"all_interests": {
"terms": { "field": "interests" }
}
}
}' 'localhost:9200/megacorp/employee/_search'
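Problems encountered with aggregations
From 5.x on, operations such as sorting and aggregation on text fields rely on a separate data structure (fielddata) that is cached in memory; it is disabled by default and has to be enabled explicitly
Reference: https://blog.csdn.net/u011403655/article/details/71107415/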
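One way to enable fielddata is to update the field mapping. The sketch below only builds the JSON body; it assumes a 5.x/6.x cluster where the body would be sent with `PUT megacorp/_mapping/employee`, and the helper name `fielddata_mapping` is made up for the example.

```python
import json


def fielddata_mapping(field):
    """Mapping body that turns on fielddata for one text field."""
    return {"properties": {field: {"type": "text", "fielddata": True}}}


mapping = fielddata_mapping("interests")
print(json.dumps(mapping))
```

Fielddata can use a lot of heap. Where the index was created with the default dynamic mapping, aggregating on the `interests.keyword` sub-field may be a lighter alternative, though that depends on how the index was mapped.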
{
"took": 10, time taken, in milliseconds
"timed_out": false, whether the request timed out; false means it did not
"_shards": {
"total": 5, the data is split across 5 shards
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 4, number of results: 4 documents
"max_score": 1, the more relevant the match, the higher the score
"hits": [ detailed data for the documents that matched the search
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 1,
"_source": {
"first_name": "Zhao",
"last_name": "Jianyu",
"age": 22,
"about": "I love to go rock climbing",
"interests": [
"music"
]
}
},
{
"_index": "megacorp", index name
"_type": "employee", type name
"_id": "4",
"_score": 1,
"_source": {
"first_name": "Zhao",
"last_name": "Jianyu",
"age": 22,
"about": "I love to go rock climbing",
"interests": [
"music"
]
}
},
{
...data omitted
}
]
},
"aggregations": { aggregations are computed on the fly from the documents that match the query
"all_interests": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "music",
"doc_count": 3 three people like music
},
{
"key": "play",
"doc_count": 1 one person likes play
},
{
"key": "sports",
"doc_count": 1
}
]
}
}
}
curl -H "Content-Type:application/json" -H "Data_Type:msg" -X GET --data '
{
"query": {
"match": {
"last_name": "jianyu"
}
},
"aggs": {
"all_interests": {
"terms": {
"field": "interests"
}
}
}
}' 'localhost:9200/megacorp/employee/_search'
Result:
{
...part of the result omitted
"hits": {
"total": 2, only documents that match the query are returned
"max_score": 0.18232156,
"hits": [
{
"_index": "megacorp",
"_type": "employee",
"_id": "2",
"_score": 0.18232156,
"_source": {
"first_name": "Zhao",
"last_name": "Jianyu",
"age": 22,
"about": "I love to go rock climbing",
"interests": [
"music"
]
}
},
{
"_index": "megacorp",
"_type": "employee",
"_id": "4",
"_score": 0.18232156,
"_source": {
"first_name": "Zhao",
"last_name": "Jianyu",
"age": 22,
"about": "I love to go rock climbing",
"interests": [
"music"
]
}
}
]
},
"aggregations": { aggregation over the documents returned by the query
"all_interests": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [ buckets
{
"key": "music",
"doc_count": 2
}
]
}
}
}
Group by interest and compute the average age
curl -H "Content-Type:application/json" -H "Data_Type:msg" -X GET --data '
{
"aggs" : {
"all_interests" : {
"terms" : { "field" : "interests" },
"aggs" : {
"avg_age" : {
"avg" : { "field" : "age" }
}
}
}
}
}' 'localhost:9200/megacorp/employee/_search'
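Search results:
{
"aggregations": {
"all_interests": {
"doc_count_error_upper_bound": 0, upper bound on the doc_count error for terms that may exist but were not returned by this aggregation
"sum_other_doc_count": 0, number of documents in buckets not returned by this aggregation; both values exist because ES is distributed
"buckets": [{ and documents are spread across multiple shards
"key": "music",
"doc_count": 3,
"avg_age": {
"value": 23 the average age of people who like music is 23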
}
}, {
"key": "play",
"doc_count": 1,
"avg_age": {
"value": 33
}
}, {
"key": "sports",
"doc_count": 1,
"avg_age": {
"value": 25
}
}
]
}
}
}
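To make the GROUP BY comparison concrete, the sketch below recomputes the same buckets locally in Python. It is an emulation for illustration, not how Elasticsearch executes aggregations. The four sample documents are hypothetical, chosen so that the counts and averages match the result above.

```python
from collections import defaultdict


def terms_with_avg(docs, terms_field, avg_field):
    """Group documents by each value of terms_field and average avg_field per group."""
    groups = defaultdict(list)
    for doc in docs:
        # A document with several interests lands in several buckets, once each.
        for key in set(doc[terms_field]):
            groups[key].append(doc[avg_field])
    buckets = [
        {"key": key, "doc_count": len(values), "avg_age": sum(values) / len(values)}
        for key, values in groups.items()
    ]
    # Largest buckets first, ties broken by key, matching the order shown above.
    buckets.sort(key=lambda bucket: (-bucket["doc_count"], bucket["key"]))
    return buckets


# Hypothetical documents, not the actual contents of the megacorp index.
employees = [
    {"age": 22, "interests": ["music"]},
    {"age": 22, "interests": ["music"]},
    {"age": 25, "interests": ["sports", "music"]},
    {"age": 33, "interests": ["play"]},
]
buckets = terms_with_avg(employees, "interests", "age")
for bucket in buckets:
    print(bucket)
```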
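Analysis (tokenization) for building the inverted index
Split the text into individual terms and normalize each term
3.1.1 character filter: preprocessing before tokenization (strip HTML tags, convert special symbols)
3.1.2 tokenizer: splits the text into tokens
3.1.3 token filter: normalization (case conversion, synonym mapping)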
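3.2.1 standard analyzer: (default) lowercases tokens | removes punctuation and, when configured, stop words [a, an, the]; handles Chinese by splitting it into single characters
3.2.2 simple analyzer: first splits the text on non-letter characters | then lowercases the tokens, which also drops numeric characters
3.2.3 whitespace analyzer: only splits on whitespace; no Chinese support
3.2.4 language analyzers: analyzers tailored to specific languages; no Chinese support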
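The three stages (character filter, tokenizer, token filter) can be imitated with a few regular expressions. This is a toy pipeline for illustration only; the function names, the stop word list and the `&` to `and` mapping are assumptions and do not reproduce any built-in Elasticsearch analyzer.

```python
import re

STOP_WORDS = frozenset({"a", "an", "the"})


def char_filter(text):
    """Preprocess raw text: drop HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenizer(text):
    """Split the filtered text into word tokens."""
    return re.findall(r"\w+", text)


def token_filter(tokens):
    """Normalize tokens: lowercase them and drop stop words."""
    lowered = (token.lower() for token in tokens)
    return [token for token in lowered if token not in STOP_WORDS]


def analyze(text):
    """Run the three stages in order."""
    return token_filter(tokenizer(char_filter(text)))


print(analyze("<p>The Quick & Brown fox</p>"))
```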
Download from GitHub
Fields
If the coord factor is enabled (the default, "disable_coord": false), a document that contains more of the search keywords is treated as more relevant and gets a higher score.
If the coord factor is disabled ("disable_coord": true), the number of search keywords a document contains makes no difference; it is counted just once.
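As a rough illustration of the coord factor, Lucene's classic TF-IDF similarity defines it as the number of query terms found in the document divided by the total number of query terms. The sketch below computes only that ratio; the function name and the whitespace tokenization are assumptions, and the rest of the scoring formula is left out.

```python
def coord(query_terms, doc_terms):
    """Fraction of distinct query terms that appear in the document."""
    query = set(query_terms)
    return len(query & set(doc_terms)) / len(query)


query = ["rock", "climbing"]
doc_a = "i love to go rock climbing".split()
doc_b = "i like to collect rock albums".split()
print(coord(query, doc_a), coord(query, doc_b))
```

With the factor enabled, the first document keeps its full score while the second is scaled down because it matches only one of the two terms.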
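Tokenization modes
Optimal bulk size
A bulk request is loaded into memory, so if it is too large performance actually drops; the best bulk size has to be found by repeated testing. A common starting point is 1,000 to 5,000 documents per request, increased gradually. In terms of payload size, 5 to 15 MB per request works best.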
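A batching helper makes it easy to experiment with both limits. The sketch below is a simplified illustration: `chunk_bulk_actions` is a hypothetical name, the defaults (1,000 documents, 5 MB) are just the starting points mentioned above, and it sizes only the document lines, ignoring the action metadata line that a real `_bulk` body carries for each document.

```python
import json


def chunk_bulk_actions(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield batches of JSON lines that stay within both the count and byte limits."""
    batch, batch_bytes = [], 0
    for doc in docs:
        line = json.dumps(doc)
        size = len(line.encode("utf-8")) + 1  # +1 for the trailing newline
        # Start a new batch when adding this line would break either limit.
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(line)
        batch_bytes += size
    if batch:
        yield batch


docs = [{"n": i} for i in range(2500)]
batches = list(chunk_bulk_actions(docs, max_docs=1000))
print([len(batch) for batch in batches])
```

Tuning then comes down to changing `max_docs` and `max_bytes` between runs and measuring indexing throughput.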
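The root cause is sometimes said to be that ES uses UDP, a protocol with no handshake and no guarantee against data loss in transit. That is not accurate: bulk requests travel over HTTP (9200) or the TCP transport (9300).
A more likely problem is requests being rejected because the bulk thread pool is full.
Also check the logs on the ES side; any exceptions will show up there.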
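Rejections from a full bulk thread pool surface as HTTP status 429, and a common response is to retry with exponential backoff. The sketch below is a simplified illustration: `send` is a hypothetical callable that submits one payload and returns a status code, and the retry counts and delays are arbitrary. A real bulk response can also report 429 per item inside an overall 200 response, which this sketch does not inspect.

```python
import time


def send_with_retry(send, payload, max_retries=5, base_delay=0.5, sleep=time.sleep):
    """Call send(payload), retrying with doubling delays while it returns 429."""
    status = None
    for attempt in range(max_retries + 1):
        status = send(payload)
        if status != 429:
            return status
        if attempt < max_retries:
            # Wait 0.5s, 1s, 2s, ... before trying again.
            sleep(base_delay * 2 ** attempt)
    return status


# Fake sender for demonstration: rejects twice, then accepts.
calls = []


def fake_send(payload):
    calls.append(payload)
    return 429 if len(calls) < 3 else 200


delays = []
result = send_with_retry(fake_send, "bulk-body", sleep=delays.append)
print(result, delays)
```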
https://www.cnblogs.com/jianxuanbing/p/9410800.html