The three core pieces of document metadata: _index, _type, and _id.
Setting the document id yourself:
PUT /{index}/{type}/{id}
curl -X PUT 127.0.0.1:9200/articles/article/150000 -H 'Content-Type:application/json' -d '
{
"article_id": 150000,
"user_id": 1,
"title": "python是世界上最好的语言",
"content": "确实如此",
"status": 2,
"create_time": "2019-04-03"
}'
Letting ES auto-generate the id (omit the id; note that auto-id requires POST, not PUT):
POST /{index}/{type}
Fetching a document:
curl 127.0.0.1:9200/articles/article/150000?pretty
# fetch only part of the document
curl 127.0.0.1:9200/articles/article/150000?_source=title,content\&pretty
# HEAD checks whether the document exists
curl -i -X HEAD 127.0.0.1:9200/articles/article/150000
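The URLs above follow a fixed pattern. As a small illustration, a hypothetical Python helper (the name `doc_url` is made up, not part of any ES API) that assembles such URLs:

```python
def doc_url(host, index, doc_type, doc_id, source_fields=None):
    """Build the document URL used by the curl examples above.

    source_fields: optional list of field names for the _source filter.
    """
    url = "http://%s/%s/%s/%s" % (host, index, doc_type, doc_id)
    if source_fields:
        url += "?_source=" + ",".join(source_fields)
    return url

# Mirrors: curl 127.0.0.1:9200/articles/article/150000?_source=title,content
print(doc_url("127.0.0.1:9200", "articles", "article", 150000,
              ["title", "content"]))
# → http://127.0.0.1:9200/articles/article/150000?_source=title,content
```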
In Elasticsearch, documents are immutable: they cannot be modified in place (which sounds contradictory at first). Instead, to update an existing document you reindex or replace it, which can be done with the same index API.
For example, to change the title field you cannot do the following (sending only the title):
curl -X PUT 127.0.0.1:9200/articles/article/150000 -H 'Content-Type:application/json' -d '
{
"title": "python必须是世界上最好的语言"
}'
Instead, you must index the complete document:
curl -X PUT 127.0.0.1:9200/articles/article/150000 -H 'Content-Type:application/json' -d '
{
"article_id": 150000,
"user_id": 1,
"title": "python必须是世界上最好的语言",
"content": "确实如此",
"status": 2,
"create_time": "2019-04-03"
}'
In other words, documents cannot be modified in place. When you "update" one, Elasticsearch deletes the old document under the hood and indexes a new one in its place, which is why the complete document content must be reindexed.
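Because an update is really delete-plus-reindex, a common client-side pattern is read-merge-write: fetch the current _source, merge in the changed fields, and index the full document again. A minimal sketch of the merge step (a pure function, no ES connection assumed; the name `merged_doc` is illustrative):

```python
def merged_doc(current_source, changes):
    """Return the full document to reindex: the stored _source
    with only the changed fields overwritten."""
    doc = dict(current_source)  # copy, never mutate the fetched source
    doc.update(changes)
    return doc

current = {
    "article_id": 150000,
    "user_id": 1,
    "title": "python是世界上最好的语言",
    "content": "确实如此",
    "status": 2,
    "create_time": "2019-04-03",
}
full = merged_doc(current, {"title": "python必须是世界上最好的语言"})
# `full` now carries every field and can be PUT back as a complete document.
```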
Deleting a document:
curl -X DELETE 127.0.0.1:9200/articles/article/150000
Fetching multiple documents in one request with _mget:
curl -X GET 127.0.0.1:9200/_mget -d '
{
"docs": [
{
"_index": "articles",
"_type": "article",
"_id": 150000
},
{
"_index": "articles",
"_type": "article",
"_id": 150001
}
]
}'
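The _mget body above can be generated from a plain list of ids; a small sketch (the helper name `mget_body` is made up):

```python
def mget_body(index, doc_type, ids):
    """Build the request body for /_mget from a list of document ids."""
    return {
        "docs": [
            {"_index": index, "_type": doc_type, "_id": doc_id}
            for doc_id in ids
        ]
    }

body = mget_body("articles", "article", [150000, 150001])
```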
Requirement: import data from MySQL into ES.
Create the config file logstash_mysql.conf:
input{
jdbc {
jdbc_driver_library => "/home/python/mysql-connector-java-8.0.13/mysql-connector-java-8.0.13.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/toutiao?tinyInt1isBit=false"
jdbc_user => "root"
jdbc_password => "mysql"
jdbc_paging_enabled => "true"    # import in batches
jdbc_page_size => "1000"         # 1000 rows per batch
jdbc_default_timezone =>"Asia/Shanghai"    # time zone
statement => "select a.article_id as article_id,a.user_id as user_id, a.title as title, a.status as status, a.create_time as create_time, b.content as content from news_article_basic as a inner join news_article_content as b on a.article_id=b.article_id"
use_column_value => "true"
tracking_column => "article_id"    # column tracked for incremental runs (the ES document id itself is set by document_id in the output section)
clean_run => true
}
}
output{
elasticsearch {
hosts => "127.0.0.1:9200"
index => "articles"
document_id => "%{article_id}"
document_type => "article"
}
stdout {
codec => json_lines
}
}
Import the data: sudo /usr/share/logstash/bin/logstash -f ./logstash_mysql.conf
(This is really just data transfer between two systems: specify the source address and the destination address.)
Fetching by document ID
curl -X GET 127.0.0.1:9200/articles/article/1
curl -X GET 127.0.0.1:9200/articles/article/1?_source=title,user_id    # only the specified fields
curl -X GET 127.0.0.1:9200/articles/article/1?_source=false    # the result carries no _source field
Searching all documents
curl -X GET 127.0.0.1:9200/articles/article/_search?_source=title,user_id
Pagination
from: starting offset
size: number of documents to return
curl -X GET 127.0.0.1:9200/articles/article/_search?_source=title,user_id\&size=3
curl -X GET 127.0.0.1:9200/articles/article/_search?_source=title,user_id\&size=3\&from=10
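from/size map to page numbers in the obvious way; a hypothetical helper (name `page_params` is made up) for computing them from a 1-based page number:

```python
def page_params(page, per_page):
    """Translate a 1-based page number into ES from/size parameters."""
    return {"from": (page - 1) * per_page, "size": per_page}

# Page 1 starts at offset 0; page 4 with 3 per page starts at offset 9.
print(page_params(4, 3))   # {'from': 9, 'size': 3}
```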
Full-text search (query string)
curl -X GET 127.0.0.1:9200/articles/article/_search?q=content:python%20web\&_source=title,article_id\&pretty
curl -X GET 127.0.0.1:9200/articles/article/_search?q=title:python%20web,content:python%20web\&_source=title,article_id\&pretty
curl -X GET 127.0.0.1:9200/articles/article/_search?q=_all:python%20web\&_source=title,article_id\&pretty
# If you know the field, search it directly with q=field:...; otherwise use _all. _source only controls which fields appear in the returned results!
Full-text search with match
curl -X GET 127.0.0.1:9200/articles/article/_search -d'
{
"query" : {
"match" : {
"title" : "python web"
}
}
}'
# The same query is much cleaner with pretty output, paging, and _source filtering:
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
"from": 0, //分页
"size": 5,
"_source": ["article_id","title"],
"query" : {
"match" : {
"title" : "python web"
}
}
}'
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
"from": 0,
"size": 5,
"_source": ["article_id","title"],
"query" : {
"match" : {
"_all" : "python web 编程"
}
}
}'
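The match bodies above differ only in field, query text, and paging; a sketch that assembles them as plain dicts (the builder name `match_query` is made up; assumes ES 5.x, where _all is still available as a field):

```python
def match_query(field, text, source=None, from_=0, size=10):
    """Build a match-query request body like the curl examples above."""
    body = {
        "from": from_,
        "size": size,
        "query": {"match": {field: text}},
    }
    if source:
        body["_source"] = source
    return body

body = match_query("title", "python web",
                   source=["article_id", "title"], size=5)
```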
Phrase search with match_phrase (the query text is not split into independent terms; it must match as a whole phrase)
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
"size": 5,
"_source": ["article_id","title"],
"query" : {
"match_phrase" : {
"_all" : "python web" #不拆分,整体比较
}
}
}'
Exact-value lookup with term
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
"size": 5,
"_source": ["article_id","title", "user_id"],
"query" : {
"term" : {
"user_id" : 1
}
}
}'
Range query with range
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
"size": 5,
"_source": ["article_id","title", "user_id"],
"query" : {
"range" : {
"article_id": {
"gte": 3,
"lte": 5
}
}
}
}'
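term and range bodies follow the same dict shape; a sketch for the range case (the builder name `range_query` is made up; either bound may be omitted):

```python
def range_query(field, gte=None, lte=None, source=None, size=10):
    """Build a range-query body; leave a bound as None to omit it."""
    bounds = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    body = {"size": size, "query": {"range": {field: bounds}}}
    if source:
        body["_source"] = source
    return body

body = range_query("article_id", gte=3, lte=5,
                   source=["article_id", "title", "user_id"], size=5)
```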
Highlighted search with highlight (fields listed under highlight are the ones whose matches get wrapped in highlight tags)
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d '
{
"size":2,
"_source": ["article_id", "title", "user_id"],
"query": {
"match": {
"title": "python web 编程"
}
},
"highlight":{
"fields": {
"title": {} #对应上面的字段,将它高亮显示
}
}
}
'
Combined queries (bool)
must
The document must match these conditions to be included.
must_not
The document must not match these conditions to be included.
should
If the document matches any of these clauses, its _score is increased; otherwise there is no effect. They are mainly used to tune each document's relevance score.
filter
These clauses must match, but they run in a non-scoring, filtering context: they contribute nothing to the score and only include or exclude documents according to the filter criteria.
The example below must-matches on title while filtering down to user_id 2 without affecting the score:
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d '
{
"_source": ["title", "user_id"],
"query": {
"bool": { #bool表示组合一堆条件
"must": {
"match": {
"title": "python web"
}
},
"filter": { #直接过滤出id为2的文档,不影响分数
"term": {
"user_id": 2
}
}
}
}
}
'
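The bool clauses compose mechanically; a sketch of a builder mirroring the request above (the name `bool_query` is made up; must contributes to scoring, filter does not):

```python
def bool_query(must=None, filter_=None, source=None):
    """Combine a scoring (must) clause and a non-scoring (filter) clause."""
    bool_body = {}
    if must:
        bool_body["must"] = must
    if filter_:
        bool_body["filter"] = filter_
    body = {"query": {"bool": bool_body}}
    if source:
        body["_source"] = source
    return body

body = bool_query(
    must={"match": {"title": "python web"}},
    filter_={"term": {"user_id": 2}},
    source=["title", "user_id"],
)
```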
Sorting (entries earlier in the sort array take precedence: here, first by create_time, then by _score)
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
"size": 5,
"_source": ["article_id","title"],
"query" : {
"match" : {
"_all" : "python web"
}
},
"sort": [
{ "create_time": { "order": "desc" }}, 1.首先按照时间排序
{ "_score": { "order": "desc" }} 2.然后再按分数排序
]
}'
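The sort array is ordered, so building it from (field, order) pairs keeps the precedence explicit; a small sketch (the name `sort_clause` is made up):

```python
def sort_clause(*orderings):
    """Build a sort list; earlier (field, order) pairs take precedence."""
    return [{field: {"order": order}} for field, order in orderings]

# First by publish time, then by relevance, both descending.
sort = sort_clause(("create_time", "desc"), ("_score", "desc"))
```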
boost: raising a clause's weight to tune ranking (boost: 4 below multiplies that clause's contribution to the score by 4)
curl -X GET 127.0.0.1:9200/articles/article/_search?pretty -d'
{
"size": 5,
"_source": ["article_id","title"],
"query" : {
"match" : {
"title" : {
"query": "python web",
"boost": 4 #分数翻4倍
}
}
}
}'
For Elasticsearch 5.x, import the Python client as follows:
from elasticsearch5 import Elasticsearch
# addresses of the elasticsearch cluster servers
ES = [
    '127.0.0.1:9200'
]
# create the elasticsearch client
es = Elasticsearch(
    ES,
    # sniff the cluster on startup (we connect to a single node and discover
    # the rest of the cluster through it)
    sniff_on_start=True,
    # refresh node info when a connection to a cluster node fails
    sniff_on_connection_fail=True,
    # refresh node info every 60 seconds
    sniffer_timeout=60
)
How to search:
query = {
    'query': {
        'bool': {
            'must': [
                {'match': {'_all': 'python web'}}
            ],
            'filter': [
                {'term': {'status': 2}}
            ]
        }
    }
}
ret = es.search(index='articles', doc_type='article', body=query)
When a creator publishes an article, simply add it to the ES index:
doc = {
    'article_id': article.id,
    'user_id': article.user_id,
    'title': article.title,
    'content': article.content.content,
    'status': article.status,
    'create_time': article.ctime
}
current_app.es.index(index='articles', doc_type='article', body=doc, id=article.id)
For the existing articles index, Elasticsearch also provides a suggest query mode: given a misspelled input (text), it proposes corrections derived from the indexed data. Below, word-phrase is a custom name for the returned suggestion field, the phrase suggester runs against _all (all fields), and size: 1 asks for a single suggestion:
curl 127.0.0.1:9200/articles/article/_search?pretty -d '
{
"from": 0,
"size": 10,
"_source": false,
"suggest": { #对错误数据的建议
"text": "phtyon web", #错误输入,es根据索引库得出正确结果
"word-phrase": { #自定义的返回字段名称,依据来源_all(所有字段)
"phrase": {
"field": "_all",
"size": 1 #这个就是设置1条建议即可哦
}
}
}
}'
# the returned result
"suggest": {
    "word-phrase": [
        {
            "text": "phtyon web",
            "offset": 0,
            "length": 10,
            "options": [
                {
                    "text": "python web",
                    "score": 0.0001213
                }
            ]
        }
    ]
}
When we type the misspelled keyword phtyon web, ES can supply the correct spelling python web, derived from the indexed data.
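Pulling the corrected text out of a suggest response is just a walk over the structure shown above; a sketch tested against that sample shape (the function name `suggestion_texts` is made up; 'word-phrase' is the custom suggester name from the request):

```python
def suggestion_texts(response, suggester="word-phrase"):
    """Collect the option texts from an ES suggest response."""
    texts = []
    for entry in response.get("suggest", {}).get(suggester, []):
        for option in entry.get("options", []):
            texts.append(option["text"])
    return texts

sample = {
    "suggest": {
        "word-phrase": [
            {"text": "phtyon web", "offset": 0, "length": 10,
             "options": [{"text": "python web", "score": 0.0001213}]}
        ]
    }
}
print(suggestion_texts(sample))   # ['python web']
```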
Elasticsearch also provides an autocomplete feature, but it requires a special mapping type, completion (which makes ES build an index structure optimized for fast prefix lookups). The articles index built earlier therefore cannot serve autocompletion; a separate index has to be created for it:
curl -X PUT 127.0.0.1:9200/completions -H 'Content-Type: application/json' -d'
{
"settings" : {
"index": {
"number_of_shards" : 3,
"number_of_replicas" : 1
}
}
}
'
curl -X PUT 127.0.0.1:9200/completions/_mapping/words -H 'Content-Type: application/json' -d'
{
"words": {
"properties": {
"suggest": {
"type": "completion", # 按照completion构建倒排索引!!!
"analyzer": "ik_max_word"
}
}
}
}
'
Import the initial data with logstash.
Edit logstash_mysql_completion.conf:
input{
jdbc {
jdbc_driver_library => "/home/python/mysql-connector-java-8.0.13/mysql-connector-java-8.0.13.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://127.0.0.1:3306/toutiao?tinyInt1isBit=false"
jdbc_user => "root"
jdbc_password => "mysql"
jdbc_paging_enabled => "true"
jdbc_page_size => "1000"
jdbc_default_timezone =>"Asia/Shanghai"
statement => "select title as suggest from news_article_basic"
clean_run => true
}
}
output{
elasticsearch {
hosts => "127.0.0.1:9200"
index => "completions"
document_type => "words"
}
}
Run the command to import the data:
sudo /usr/share/logstash/bin/logstash -f ./logstash_mysql_completion.conf
Autocomplete suggestion query (title-suggest is a custom name for the returned field; prefix is the user-typed prefix):
curl 127.0.0.1:9200/completions/words/_search?pretty -d '
{
"suggest": {
"title-suggest" : { #自定义建议字段
"prefix" : "pyth", #指明前半部分
"completion" : {
"field" : "suggest"
}
}
}
}
'
class SuggestionResource(Resource):
    """
    Search suggestions
    """
    def get(self):
        """
        Get search suggestions
        """
        qs_parser = RequestParser()
        qs_parser.add_argument('q', type=inputs.regex(r'^.{1,50}$'), required=True, location='args')
        args = qs_parser.parse_args()
        q = args.q
        # first try the autocomplete suggestion query
        query = {
            'from': 0,
            'size': 10,
            '_source': False,
            'suggest': {
                'word-completion': {
                    'prefix': q,
                    'completion': {
                        'field': 'suggest'
                    }
                }
            }
        }
        ret = current_app.es.search(index='completions', body=query)
        options = ret['suggest']['word-completion'][0]['options']
        # if that returned nothing, fall back to a spelling-correction query
        if not options:
            query = {
                'from': 0,
                'size': 10,
                '_source': False,
                'suggest': {
                    'text': q,
                    'word-phrase': {
                        'phrase': {
                            'field': '_all',
                            'size': 1
                        }
                    }
                }
            }
            ret = current_app.es.search(index='articles', doc_type='article', body=query)
            options = ret['suggest']['word-phrase'][0]['options']
        results = []
        for option in options:
            if option['text'] not in results:
                results.append(option['text'])
        return {'options': results}