# Building Search with Elasticsearch

Tags (space-separated): python scrapy elasticsearch
## Introduction to Elasticsearch

Shortcomings of traditional search:

- no relevance scoring
- not distributed
- cannot parse rich search requests
- low efficiency
- no tokenization / word segmentation
## Installation and Usage

- elasticsearch-rtf (download from GitHub and use directly)
  - requires JDK 1.8
## Installing the head Plugin and Kibana

Install Node.js, then point npm at a mirror:

```
npm install -g cnpm --registry=https://registry.npm.taobao.org
```

Install the head plugin, and allow third-party plugins to access Elasticsearch by adding the following to `elasticsearch.yml`:

```
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS,HEAD,GET,POST,PUT,DELETE
http.cors.allow-headers: "X-Requested-With,Content-Type,Content-Length,X-Use"
```

Kibana: download it from the elastic website; it is used the same way as the plugin above. Starting the three tools:

```
elasticsearch.bat   # the search engine itself
cnpm run start      # head plugin, for browsing the data
kibana.bat          # Kibana, for working with the data
```
## Elasticsearch Concepts

- Cluster: one or more nodes organized together.
- Node: a single server within a cluster; each node has an identifier and a name (older versions assign a random Marvel character name by default).
- Shard: the ability to split an index into multiple pieces, allowing horizontal partitioning and capacity scaling; multiple shards serving requests in parallel improve performance and throughput.
- Replica: a backup of the data; if one node goes down, another takes over.

Rough analogy with MySQL:

| elasticsearch | mysql    |
| ------------- | -------- |
| index         | database |
| type          | table    |
| document      | row      |
| fields        | columns  |
## Inverted Index

- An inverted index comes from the practical need to look records up by attribute value. Each entry in the index holds an attribute value together with the addresses of all records that have that value. Because the records are determined from the attribute values, rather than attribute values from records, it is called an inverted index; a file carrying one is called an inverted index file, or inverted file for short.
- TF-IDF is a classic way to weight how relevant a term is to a document.
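The two ideas above can be sketched in a few lines of plain Python: a toy corpus (hypothetical, for illustration only), an inverted index mapping each term to the documents containing it, and a TF-IDF weight per term/document pair. Elasticsearch's real scoring is considerably more involved, but the principle is the same.

```python
import math
from collections import defaultdict

# hypothetical toy corpus: doc id -> list of tokens
docs = {
    1: ["python", "scrapy", "crawler"],
    2: ["python", "search", "engine"],
    3: ["search", "engine", "ranking"],
}

# inverted index: term -> set of document ids containing it
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

def tf_idf(term, doc_id):
    # term frequency: share of this document's tokens that are `term`
    tf = docs[doc_id].count(term) / len(docs[doc_id])
    # inverse document frequency: rarer terms score higher
    idf = math.log(len(docs) / len(inverted[term]))
    return tf * idf

print(sorted(inverted["search"]))      # documents containing "search"
print(tf_idf("ranking", 3))            # a rare term outweighs a common one
```

Looking up `inverted[term]` is what makes search fast: the engine never scans documents, it jumps straight from the term to the matching record addresses.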
## Basic Syntax

```
# Create an index
PUT lagou
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}

# Create a type and add documents
PUT lagou/job/1
{
  "name": "张三",
  "phone": "13333333333",
  "address": "滁州学院",
  "openid": "110110",
  "items": {
    "productid": "123132",
    "productQuantity": 2
  }
}

POST lagou/job/2
{
  "name": "李四",
  "phone": "13333333333",
  "address": "滁州学院",
  "openid": "111110",
  "items": {
    "productid": "123132",
    "productQuantity": 2
  }
}

# GET parameters
GET lagou/job/1
GET lagou/job/1?_source

# Update data (the field is "items", matching the documents above)
POST lagou/job/1/_update
{
  "doc": {
    "items": {
      "productid": "111111"
    }
  }
}
```
## Batch Operations

- `_mget`

```
GET _mget
{
  "docs": [
    { "_index": "testdb", "_type": "job1", "_id": 1 },
    { "_index": "testdb", "_type": "job2", "_id": 2 }
  ]
}

# The index (and then the type) can move into the URL
GET testdb/_mget
{
  "docs": [
    { "_type": "job1", "_id": 1 },
    { "_type": "job2", "_id": 2 }
  ]
}

GET testdb/job1/_mget
{
  "docs": [
    { "_id": 1 },
    { "_id": 2 }
  ]
}

GET testdb/job1/_mget
{
  "ids": [1, 2]
}
```
- `_bulk`

The request body is newline-delimited: each action-and-metadata line is followed by an optional source line, and the source must stay on a single line:

```
action_and_meta_data\n
optional_source\n
```

The four actions (note the metadata key is `_type`, not `type`):

```
{"index":{"_index":"testdb","_type":"job1","_id":1}}
{"fields":"values"}
{"update":{"_index":"testdb","_type":"job1","_id":1}}
{"doc":{"fields":"values"}}
{"create":{"_index":"testdb","_type":"job1","_id":1}}
{"fields":"values"}
{"delete":{"_index":"testdb","_type":"job1","_id":1}}
```

Example:

```
POST _bulk
{"index":{"_index":"testdb","_type":"job1","_id":1}}
{"name":"李四","phone":"13333333333","address":"滁州学院","openid":"111110","items":{"productid":"123132","productQuantity":2}}
```
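Because the bulk body is just newline-delimited JSON, it is easy to build programmatically. A small helper sketch in Python (the `testdb`/`job1` names follow the examples above; `build_bulk_body` is a hypothetical helper, not part of any library):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build an NDJSON _bulk request body: one action/metadata line,
    then the source document on the next single line."""
    lines = []
    for doc_id, source in docs.items():
        # action line: index each document under the given index/type/id
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # source line: must be exactly one line, keep non-ASCII readable
        lines.append(json.dumps(source, ensure_ascii=False))
    # the bulk API requires a trailing newline
    return "\n".join(lines) + "\n"

body = build_bulk_body("testdb", "job1", {
    1: {"name": "李四", "phone": "13333333333"},
})
print(body)
```

The resulting string can be POSTed to `/_bulk` with a `Content-Type: application/x-ndjson` header.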
## Mapping

A mapping defines a type for every field:

- string types: `text`, `keyword` (the old `string` type is deprecated since ES 5)
- numeric types: `long`, `integer`, `short`, `byte`, `double`, `float`
- date type: `date`
- boolean type: `boolean`
- binary type: `binary`
- complex types: `object`, `nested`
- geo types: `geo_point`, `geo_shape`
- specialized types: `ip`, `completion`

Common field attributes:

| Attribute      | Description                                                                                                   | Applies to |
| -------------- | ------------------------------------------------------------------------------------------------------------- | ---------- |
| store          | `yes` stores the field separately, `no` does not; default `no`                                                | all        |
| index          | `yes` means the field is analyzed, `no` means it is not; default `true`                                       | string     |
| null_value     | default value to use when the field is null, e.g. "NA"                                                        | all        |
| analyzer       | analyzer used for indexing and search; `standard` by default, alternatives include `whitespace`, `simple`, `english` | all |
| include_in_all | ES builds a special `_all` field per document so every field is searchable; set `false` to exclude a field    | all        |
| format         | pattern for date strings                                                                                      | date       |

```
PUT test
{
  "mappings": {
    "job": {
      "properties": {
        "title": {
          "type": "text"
        },
        "city": {
          "type": "keyword"
        },
        "company": {
          "properties": {
            "name": {
              "type": "text"
            }
          }
        },
        "publish_time": {
          "type": "date",
          "format": "yyyy-MM-dd"
        }
      }
    }
  }
}
```
## Elasticsearch Queries

- Basic queries
- `match` query (the query string is analyzed, so it matches the tokens produced at index time)

```
GET lagou/job/_search
{
  "query": {
    "match": {
      "title": "Python"
    }
  }
}

# Control the returned slice
GET lagou/job/_search
{
  "query": {
    "match": {
      "title": "Python"
    }
  },
  "from": 0,
  "size": 2
}

# All documents
GET lagou/job/_search
{
  "query": {
    "match_all": {}
  }
}

# Phrase query (every token of "python系统" must match)
GET lagou/job/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "python系统",
        "slop": 6
      }
    }
  }
}
```

`slop` is the maximum distance allowed between the matched tokens.
- `term` query (the query string is *not* analyzed, so it must exactly match a stored token)

```
GET lagou/_search
{
  "query": {
    "term": {
      "title": "python"
    }
  }
}
```
- `terms` query

```
GET lagou/_search
{
  "query": {
    "terms": {
      "title": ["python", "django"]
    }
  }
}
```
- `multi_match` query

```
GET lagou/_search
{
  "query": {
    "multi_match": {
      "query": "python",
      "fields": ["title^3", "desc"]
    }
  }
}
```

`title^3` triples the weight of matches in `title`.
- Selecting returned fields

```
GET lagou/_search
{
  "stored_fields": ["title", "company_name"],
  "query": {
    "match": {
      "title": "python"
    }
  }
}
```

Only fields mapped with `store: true` can be returned this way.
- Sorting results with `sort` (note that `sort` sits next to `query`, not inside it)

```
GET lagou/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [{
    "comments": {
      "order": "desc"
    }
  }]
}
```
- Range query

```
GET lagou/_search
{
  "query": {
    "range": {
      "comments": {
        "gte": 10,
        "lte": 20,
        "boost": 2.0
      }
    }
  }
}

GET lagou/_search
{
  "query": {
    "range": {
      "add_time": {
        "gte": "2017-04-01",
        "lte": "now"
      }
    }
  }
}
```

`gte`/`lte` mean greater/less than or equal; `boost` raises the weight of the clause.
- `wildcard` fuzzy matching

```
GET lagou/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "pyth*n",
        "boost": 2.0
      }
    }
  }
}
```
- Compound queries
- `bool` query (supports `must`, `should`, `must_not`, `filter`)

```
GET lagou/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "salary": 20
        }
      }
    }
  }
}

GET lagou/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "terms": {
          "salary": [10, 20]
        }
      }
    }
  }
}
```
- Inspecting the analyzer's output (the request key is `analyzer`, not `analyze`)

```
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ""
}
```

`ik_smart` is the coarser-grained alternative, producing fewer tokens.
- Combined filter query

```
GET lagou/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "salary": 20 } },
        { "term": { "title": "python" } }
      ],
      "must_not": {
        "term": {
          "price": 30
        }
      }
    }
  }
}
```
- Nested query (a `bool` clause can itself contain another `bool`)

```
GET lagou/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "salary": 20 } },
        {
          "bool": {
            "must": [
              { "term": { "salary": 20 } },
              { "term": { "title": "python" } }
            ]
          }
        }
      ]
    }
  }
}
```
- Filtering on empty vs non-empty fields

```
# documents where "tags" exists
GET lagou/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": {
          "field": "tags"
        }
      }
    }
  }
}

# documents where "tags" is missing
GET lagou/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "tags"
        }
      }
    }
  }
}
```
## Inserting Scrapy Data into Elasticsearch

- elasticsearch-dsl

```
pip install elasticsearch-dsl
```

- Wiring the data together
Creating the mapping (the models file):

```python
#!/usr/bin/env python3
# _*_ coding: utf-8 _*_
"""
@author 金全 JQ
@version 1.0 , 2017/11/9
@description Elasticsearch model for jobbole articles
"""
from datetime import datetime
from elasticsearch_dsl import DocType, Date, Nested, Boolean, \
    analyzer, InnerObjectWrapper, Completion, Keyword, Text, Integer
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts='localhost')


class ESArticleJobbole(DocType):
    # jobbole article document stored in Elasticsearch
    title = Text(analyzer='ik_max_word')
    url = Keyword()
    url_object_id = Keyword()
    create_time = Date()
    praise_nums = Integer()
    fav_nums = Integer()
    comment_nums = Integer()
    content = Text(analyzer='ik_max_word')
    tags = Text(analyzer='ik_max_word')
    fornat_image_url = Keyword()
    fornat_image_path = Keyword()

    class Meta:
        index = "jobbole"
        doc_type = "article"


if __name__ == "__main__":
    ESArticleJobbole.init()
```
The item class (in `items.py`):

```python
class JooblogArticleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    create_time = scrapy.Field(
        input_processor=MapCompose(get_datetime)
    )
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_num)
    )
    fav_nums = scrapy.Field(
        input_processor=MapCompose(get_num)
    )
    comment_nums = scrapy.Field(
        input_processor=MapCompose(get_num)
    )
    content = scrapy.Field()
    tags = scrapy.Field(
        input_processor=MapCompose(remove_comment_tag),
        output_processor=Join(",")
    )
    fornat_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)
    )
    fornat_image_path = scrapy.Field()

    def save_es(self):
        # copy the scraped fields into the es model and save it
        article = ESArticleJobbole()
        article.create_time = self['create_time']
        article.praise_nums = self['praise_nums']
        article.content = remove_tags(self['content'])
        article.fav_nums = self['fav_nums']
        article.comment_nums = self['comment_nums']
        article.title = self['title']
        article.tags = self['tags']
        article.fornat_image_path = self['fornat_image_path']
        article.fornat_image_url = self['fornat_image_url']
        article.url = self['url']
        article.meta.id = self['url_object_id']
        article.save()
        return
```
The pipeline (in `pipelines.py`):

```python
class EsArticlePippeline(object):
    # insert each item into Elasticsearch
    def process_item(self, item, spider):
        item.save_es()
        return item
```
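For Scrapy to invoke this pipeline, it has to be registered in the project's `settings.py`. A minimal sketch, assuming the project package is named `ArticleSpider` (hypothetical — substitute your own package name):

```python
# settings.py — "ArticleSpider" is a hypothetical project package name
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.EsArticlePippeline': 300,  # lower number = runs earlier
}
```

The integer is the pipeline's order in the chain; if you also use other pipelines (e.g. image download), give the Elasticsearch pipeline a higher number so it runs last.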
- Based on the imooc (慕课网) video course "聚焦Python分布式爬虫必学框架Scrapy 打造搜索引擎".
- Author of this post: XiaoJinZi (personal homepage); please credit the source when reposting.
- My abilities as a student are limited; corrections are welcome at [email protected].