# Building Search with Elasticsearch

Tags (space-separated): python scrapy elasticsearch
## Introduction to Elasticsearch

Shortcomings of traditional search:

- no relevance scoring
- not distributed
- cannot parse rich search requests
- low efficiency
- no tokenization / word segmentation
## Installation and Usage

- elasticsearch-rtf (download from GitHub and use directly)
  - requires JDK 1.8
## Installing the head Plugin and Kibana

Install Node.js, then point npm at a mirror:

```
npm install -g cnpm --registry=https://registry.npm.taobao.org
```

Install the head plugin, and allow third-party plugins to access Elasticsearch by adding the following to `elasticsearch.yml`:

```
http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS,HEAD,GET,POST,PUT,DELETE
http.cors.allow-headers: "X-Requested-With,Content-Type,Content-Length,X-Use"
```

Kibana: download it from the elastic website; it is used the same way as the plugin above. Starting the three tools:

```
elasticsearch.bat   # the search engine itself
cnpm run start      # head plugin, for browsing the data
kibana.bat          # Kibana, for working with the data
```
## Elasticsearch Concepts

- Cluster: one or more nodes organized together.
- Node: a single server within a cluster; each node has an identifier and a name (older versions assign a random Marvel character name by default).
- Shard: the ability to split an index into multiple pieces, allowing horizontal partitioning and capacity scaling; multiple shards serving requests in parallel improve performance and throughput.
- Replica: a backup of the data; if one node goes down, another takes over.

Rough analogy with MySQL:

| elasticsearch | mysql    |
| ------------- | -------- |
| index         | database |
| type          | table    |
| document      | row      |
| fields        | columns  |
## Inverted Index

- An inverted index comes from the practical need to look records up by attribute value. Each entry in the index holds an attribute value together with the addresses of all records that have that value. Because the records are determined from the attribute values, rather than attribute values from records, it is called an inverted index; a file carrying one is called an inverted index file, or inverted file for short.
- TF-IDF is a classic way to weight how relevant a term is to a document.
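The two ideas above can be sketched in a few lines of plain Python: a toy corpus (hypothetical, for illustration only), an inverted index mapping each term to the documents containing it, and a TF-IDF weight per term/document pair. Elasticsearch's real scoring is considerably more involved, but the principle is the same.

```python
import math
from collections import defaultdict

# hypothetical toy corpus: doc id -> list of tokens
docs = {
    1: ["python", "scrapy", "crawler"],
    2: ["python", "search", "engine"],
    3: ["search", "engine", "ranking"],
}

# inverted index: term -> set of document ids containing it
inverted = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        inverted[term].add(doc_id)

def tf_idf(term, doc_id):
    # term frequency: share of this document's tokens that are `term`
    tf = docs[doc_id].count(term) / len(docs[doc_id])
    # inverse document frequency: rarer terms score higher
    idf = math.log(len(docs) / len(inverted[term]))
    return tf * idf

print(sorted(inverted["search"]))      # documents containing "search"
print(tf_idf("ranking", 3))            # a rare term outweighs a common one
```

Looking up `inverted[term]` is what makes search fast: the engine never scans documents, it jumps straight from the term to the matching record addresses.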
## Basic Syntax

```
# Create an index
PUT lagou
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}

# Create a type and add documents
PUT lagou/job/1
{
  "name": "张三",
  "phone": "13333333333",
  "address": "滁州学院",
  "openid": "110110",
  "items": {
    "productid": "123132",
    "productQuantity": 2
  }
}

POST lagou/job/2
{
  "name": "李四",
  "phone": "13333333333",
  "address": "滁州学院",
  "openid": "111110",
  "items": {
    "productid": "123132",
    "productQuantity": 2
  }
}

# GET parameters
GET lagou/job/1
GET lagou/job/1?_source

# Update data (the field is "items", matching the documents above)
POST lagou/job/1/_update
{
  "doc": {
    "items": {
      "productid": "111111"
    }
  }
}
```
## Batch Operations

- `_mget`

```
GET _mget
{
  "docs": [
    { "_index": "testdb", "_type": "job1", "_id": 1 },
    { "_index": "testdb", "_type": "job2", "_id": 2 }
  ]
}

# The index (and then the type) can move into the URL
GET testdb/_mget
{
  "docs": [
    { "_type": "job1", "_id": 1 },
    { "_type": "job2", "_id": 2 }
  ]
}

GET testdb/job1/_mget
{
  "docs": [
    { "_id": 1 },
    { "_id": 2 }
  ]
}

GET testdb/job1/_mget
{
  "ids": [1, 2]
}
```
- `_bulk`

The request body is newline-delimited: each action-and-metadata line is followed by an optional source line, and the source must stay on a single line:

```
action_and_meta_data\n
optional_source\n
```

The four actions (note the metadata key is `_type`, not `type`):

```
{"index":{"_index":"testdb","_type":"job1","_id":1}}
{"fields":"values"}
{"update":{"_index":"testdb","_type":"job1","_id":1}}
{"doc":{"fields":"values"}}
{"create":{"_index":"testdb","_type":"job1","_id":1}}
{"fields":"values"}
{"delete":{"_index":"testdb","_type":"job1","_id":1}}
```

Example:

```
POST _bulk
{"index":{"_index":"testdb","_type":"job1","_id":1}}
{"name":"李四","phone":"13333333333","address":"滁州学院","openid":"111110","items":{"productid":"123132","productQuantity":2}}
```
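Because the bulk body is just newline-delimited JSON, it is easy to build programmatically. A small helper sketch in Python (the `testdb`/`job1` names follow the examples above; `build_bulk_body` is a hypothetical helper, not part of any library):

```python
import json

def build_bulk_body(index, doc_type, docs):
    """Build an NDJSON _bulk request body: one action/metadata line,
    then the source document on the next single line."""
    lines = []
    for doc_id, source in docs.items():
        # action line: index each document under the given index/type/id
        lines.append(json.dumps(
            {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        # source line: must be exactly one line, keep non-ASCII readable
        lines.append(json.dumps(source, ensure_ascii=False))
    # the bulk API requires a trailing newline
    return "\n".join(lines) + "\n"

body = build_bulk_body("testdb", "job1", {
    1: {"name": "李四", "phone": "13333333333"},
})
print(body)
```

The resulting string can be POSTed to `/_bulk` with a `Content-Type: application/x-ndjson` header.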
## Mapping

A mapping defines a type for every field:

- string types: `text`, `keyword` (the old `string` type is deprecated since ES 5)
- numeric types: `long`, `integer`, `short`, `byte`, `double`, `float`
- date type: `date`
- boolean type: `boolean`
- binary type: `binary`
- complex types: `object`, `nested`
- geo types: `geo_point`, `geo_shape`
- specialized types: `ip`, `completion`

Common field attributes:

| Attribute      | Description                                                                                                   | Applies to |
| -------------- | ------------------------------------------------------------------------------------------------------------- | ---------- |
| store          | `yes` stores the field separately, `no` does not; default `no`                                                | all        |
| index          | `yes` means the field is analyzed, `no` means it is not; default `true`                                       | string     |
| null_value     | default value to use when the field is null, e.g. "NA"                                                        | all        |
| analyzer       | analyzer used for indexing and search; `standard` by default, alternatives include `whitespace`, `simple`, `english` | all |
| include_in_all | ES builds a special `_all` field per document so every field is searchable; set `false` to exclude a field    | all        |
| format         | pattern for date strings                                                                                      | date       |

```
PUT test
{
  "mappings": {
    "job": {
      "properties": {
        "title": {
          "type": "text"
        },
        "city": {
          "type": "keyword"
        },
        "company": {
          "properties": {
            "name": {
              "type": "text"
            }
          }
        },
        "publish_time": {
          "type": "date",
          "format": "yyyy-MM-dd"
        }
      }
    }
  }
}
```
## Elasticsearch Queries

- Basic queries
- `match` query (the query string is analyzed, so it matches the tokens produced at index time)

```
GET lagou/job/_search
{
  "query": {
    "match": {
      "title": "Python"
    }
  }
}

# Control the returned slice
GET lagou/job/_search
{
  "query": {
    "match": {
      "title": "Python"
    }
  },
  "from": 0,
  "size": 2
}

# All documents
GET lagou/job/_search
{
  "query": {
    "match_all": {}
  }
}

# Phrase query (every token of "python系统" must match)
GET lagou/job/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "python系统",
        "slop": 6
      }
    }
  }
}
```

`slop` is the maximum distance allowed between the matched tokens.
- `term` query (the query string is *not* analyzed, so it must exactly match a stored token)

```
GET lagou/_search
{
  "query": {
    "term": {
      "title": "python"
    }
  }
}
```
- `terms` query

```
GET lagou/_search
{
  "query": {
    "terms": {
      "title": ["python", "django"]
    }
  }
}
```
- `multi_match` query

```
GET lagou/_search
{
  "query": {
    "multi_match": {
      "query": "python",
      "fields": ["title^3", "desc"]
    }
  }
}
```

`title^3` triples the weight of matches in `title`.
- Selecting returned fields

```
GET lagou/_search
{
  "stored_fields": ["title", "company_name"],
  "query": {
    "match": {
      "title": "python"
    }
  }
}
```

Only fields mapped with `store: true` can be returned this way.
- Sorting results with `sort` (note that `sort` sits next to `query`, not inside it)

```
GET lagou/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [{
    "comments": {
      "order": "desc"
    }
  }]
}
```
- Range query

```
GET lagou/_search
{
  "query": {
    "range": {
      "comments": {
        "gte": 10,
        "lte": 20,
        "boost": 2.0
      }
    }
  }
}

GET lagou/_search
{
  "query": {
    "range": {
      "add_time": {
        "gte": "2017-04-01",
        "lte": "now"
      }
    }
  }
}
```

`gte`/`lte` mean greater/less than or equal; `boost` raises the weight of the clause.
- `wildcard` fuzzy matching

```
GET lagou/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "pyth*n",
        "boost": 2.0
      }
    }
  }
}
```
- Compound queries
- `bool` query (supports `must`, `should`, `must_not`, `filter`)

```
GET lagou/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "term": {
          "salary": 20
        }
      }
    }
  }
}

GET lagou/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "terms": {
          "salary": [10, 20]
        }
      }
    }
  }
}
```
- Inspecting the analyzer's output (the request key is `analyzer`, not `analyze`)

```
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": ""
}
```

`ik_smart` is the coarser-grained alternative, producing fewer tokens.
- Combined filter query

```
GET lagou/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "salary": 20 } },
        { "term": { "title": "python" } }
      ],
      "must_not": {
        "term": {
          "price": 30
        }
      }
    }
  }
}
```
- Nested query (a `bool` clause can itself contain another `bool`)

```
GET lagou/_search
{
  "query": {
    "bool": {
      "should": [
        { "term": { "salary": 20 } },
        {
          "bool": {
            "must": [
              { "term": { "salary": 20 } },
              { "term": { "title": "python" } }
            ]
          }
        }
      ]
    }
  }
}
```
- Filtering on empty vs non-empty fields

```
# documents where "tags" exists
GET lagou/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": {
          "field": "tags"
        }
      }
    }
  }
}

# documents where "tags" is missing
GET lagou/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "tags"
        }
      }
    }
  }
}
```
## Inserting Scrapy Data into Elasticsearch

- elasticsearch-dsl

```
pip install elasticsearch-dsl
```

- Wiring the data together
Creating the mapping (the models file):

```python
#!/usr/bin/env python3
# _*_ coding: utf-8 _*_
"""
@author 金全 JQ
@version 1.0 , 2017/11/9
@description Elasticsearch model for jobbole articles
"""
from datetime import datetime
from elasticsearch_dsl import DocType, Date, Nested, Boolean, \
    analyzer, InnerObjectWrapper, Completion, Keyword, Text, Integer
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts='localhost')


class ESArticleJobbole(DocType):
    # jobbole article document stored in Elasticsearch
    title = Text(analyzer='ik_max_word')
    url = Keyword()
    url_object_id = Keyword()
    create_time = Date()
    praise_nums = Integer()
    fav_nums = Integer()
    comment_nums = Integer()
    content = Text(analyzer='ik_max_word')
    tags = Text(analyzer='ik_max_word')
    fornat_image_url = Keyword()
    fornat_image_path = Keyword()

    class Meta:
        index = "jobbole"
        doc_type = "article"


if __name__ == "__main__":
    ESArticleJobbole.init()
```
The item class (in `items.py`):

```python
class JooblogArticleItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    create_time = scrapy.Field(
        input_processor=MapCompose(get_datetime)
    )
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_num)
    )
    fav_nums = scrapy.Field(
        input_processor=MapCompose(get_num)
    )
    comment_nums = scrapy.Field(
        input_processor=MapCompose(get_num)
    )
    content = scrapy.Field()
    tags = scrapy.Field(
        input_processor=MapCompose(remove_comment_tag),
        output_processor=Join(",")
    )
    fornat_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)
    )
    fornat_image_path = scrapy.Field()

    def save_es(self):
        # copy the scraped fields into the es model and save it
        article = ESArticleJobbole()
        article.create_time = self['create_time']
        article.praise_nums = self['praise_nums']
        article.content = remove_tags(self['content'])
        article.fav_nums = self['fav_nums']
        article.comment_nums = self['comment_nums']
        article.title = self['title']
        article.tags = self['tags']
        article.fornat_image_path = self['fornat_image_path']
        article.fornat_image_url = self['fornat_image_url']
        article.url = self['url']
        article.meta.id = self['url_object_id']
        article.save()
        return
```
The pipeline (in `pipelines.py`):

```python
class EsArticlePippeline(object):
    # insert each item into Elasticsearch
    def process_item(self, item, spider):
        item.save_es()
        return item
```
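For Scrapy to invoke this pipeline, it has to be registered in the project's `settings.py`. A minimal sketch, assuming the project package is named `ArticleSpider` (hypothetical — substitute your own package name):

```python
# settings.py — "ArticleSpider" is a hypothetical project package name
ITEM_PIPELINES = {
    'ArticleSpider.pipelines.EsArticlePippeline': 300,  # lower number = runs earlier
}
```

The integer is the pipeline's order in the chain; if you also use other pipelines (e.g. image download), give the Elasticsearch pipeline a higher number so it runs last.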
- Based on the imooc (慕课网) video course "聚焦Python分布式爬虫必学框架Scrapy 打造搜索引擎".
- Author of this post: XiaoJinZi (personal homepage); please credit the source when reposting.
- My abilities as a student are limited; corrections are welcome at [email protected].