Chapter 10: Setting up Elasticsearch

Building search functionality with Elasticsearch

Tags (space-separated): python scrapy elasticsearch


Introduction to Elasticsearch

  • Limitations of traditional search
    • no relevance scoring
    • no distributed deployment
    • cannot parse complex search requests
    • low efficiency
    • no proper word segmentation (tokenization)

Installation and usage

  • elasticsearch-rtf (download from GitHub and run)

    • requires JDK 1.8
  • installing the head plugin and Kibana

    • install Node.js

    • set an npm mirror

    npm install -g cnpm --registry=https://registry.npm.taobao.org
    
    • install the head plugin

    • allow third-party plugins to connect (add to elasticsearch.yml)

    http.cors.enabled: true
    http.cors.allow-origin: "*"
    http.cors.allow-methods: OPTIONS,HEAD,GET,POST,PUT,DELETE
    http.cors.allow-headers: "X-Requested-With,Content-Type,Content-Length,X-Use"
    
    • install Kibana
      download it from the official elastic site; installation and use are the same as for the plugin above

    • starting the three tools (a quick connectivity check follows this list)

    elasticsearch.bat //the search engine itself
    cnpm run start  //the head plugin, for browsing the data
    kibana.bat  //Kibana, for querying and operating on the data
    
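    • quick connectivity check
      with the default configuration Elasticsearch listens on port 9200, the head plugin on 9100, and Kibana on 5601; the minimal Python sketch below (host and port are assumed to be the defaults) simply asks the Elasticsearch root endpoint for its version

    # minimal check that the Elasticsearch node is reachable (assumes the default localhost:9200)
    import json
    from urllib.request import urlopen

    with urlopen("http://localhost:9200") as resp:
        info = json.load(resp)            # the root endpoint returns cluster and version info as JSON
    print(info["version"]["number"])      # prints the Elasticsearch version if the node is running
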

Elasticsearch concepts

  • Cluster: one or more nodes organized together

  • Node: a single server within a cluster; each node has an identifier, by default a randomly assigned Marvel character name

  • Shard: the ability to split an index into multiple pieces, allowing horizontal partitioning and capacity scaling; multiple shards can serve requests in parallel, improving performance and throughput

  • Replica: a backup copy of the data; if one node goes down, another one takes over

    elasticsearch          mysql
    index                  database
    type                   table
    document               row
    fields                 columns

Inverted index

  • An inverted index arises from the practical need to look up records by the values of their attributes. Each entry in the index contains an attribute value together with the addresses of all records that have that value. Because records are located from attribute values, rather than attribute values being determined from records, it is called an inverted index; a file holding one is called an inverted index file, or inverted file for short.
  • TF-IDF (a toy Python sketch of these two ideas follows the console examples below)
  • Query syntax
    # create an index
    PUT lagou
    {
      "settings": {
        "index":{
          "number_of_shards":5,
          "number_of_replicas":1
        }
      }
    }
    # create a type and add documents
    PUT lagou/job/1
    {
        "name":"张三",
        "phone":"13333333333",
        "address":"滁州学院",
        "openid":"110110",
        "items":{
            "productid":"123132",
            "productQuantity":2
        }
    }
    POST lagou/job/2
    {
        "name":"李四",
        "phone":"13333333333",
        "address":"滁州学院",
        "openid":"111110",
        "items":{
            "productid":"123132",
            "productQuantity":2
        }
    }
    
    # GET requests
    GET lagou/job/1
    GET lagou/job/1?_source
    
    # update a document
    POST lagou/job/1/_update
    {
      "doc":{
        "items":{
          "productid":"111111"
        }
      }
    }
    
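  • The two ideas above can be illustrated with a toy Python sketch (purely illustrative, not how Elasticsearch implements them): build a term-to-document-id inverted index, then score a term in a document with TF-IDF.

    # toy inverted index and TF-IDF scoring, for illustration only
    import math
    from collections import defaultdict

    docs = {1: "python web development", 2: "python data analysis", 3: "web design"}

    inverted = defaultdict(set)            # term -> set of ids of documents containing the term
    for doc_id, text in docs.items():
        for term in text.split():
            inverted[term].add(doc_id)

    def tf_idf(term, doc_id):
        tokens = docs[doc_id].split()
        tf = tokens.count(term) / len(tokens)               # term frequency within this document
        idf = math.log(len(docs) / len(inverted[term]))     # rarer terms get a higher weight
        return tf * idf

    print(sorted(inverted["python"]))      # -> [1, 2]
    print(round(tf_idf("python", 1), 3))   # -> 0.135
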

Bulk operations

  • _mget

    GET _mget
    {
        "docs":[
            {
                "_index":"testdb",
                "_type":"job1",
                "_id":1
            },
            {
                "_index":"testdb",
                "_type":"job2",
                "_id":2
            }
        ]
    }
    GET testdb/_mget
    {
        "docs":[
            {
                "_type":"job1",
                "_id":1
            },
            {
                "_type":"job2",
                "_id":2
            }
        ]
    }
    GET testdb/job1/_mget
    {
        "docs":[
            {
                "_id":1
            },
            {
                "_id":2
            }
        ]
    }
    GET testdb/job1/_mget
    {
        "ids":[1,2]
    }
    
  • bulk operations (a Python sketch using the client's bulk helper follows these examples)

    action_and_meta_data\n
    optional_source\n
    
    {"index":{"_index":"testdb","_type":"job1","_id":1}}
    {"fields":"values"}
    {"update":{"_index":"testdb","_type":"job1","_id":1}}
    {"doc":{"fields":"values"}}
    {"create":{"_index":"testdb","_type":"job1","_id":1}}
    {"fields":"values"}
    {"delete":{"_index":"testdb","_type":"job1","_id":1}}
    
    
    POST _bulk
    {"index":{"_index":"testdb","_type":"job1","_id":1}}
    {"name":"李四","phone":"13333333333","address":"滁州学院","openid":"111110","items":{"productid":"123132","productQuantity":2}}
    
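  • bulk from Python
    the bulk body does not have to be assembled by hand from Python: the bulk helper in the elasticsearch client builds the newline-delimited body from a list of action dicts; a minimal sketch, reusing the illustrative index/type names above and assuming a local node on the default port (older client versions accept "host:port", newer ones require a full http:// URL)

    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    es = Elasticsearch(["localhost:9200"])

    actions = [
        {"_op_type": "index", "_index": "testdb", "_type": "job1", "_id": i,
         "_source": {"name": "user%d" % i, "phone": "13333333333"}}
        for i in range(1, 4)
    ]

    success, errors = bulk(es, actions)   # the helper serializes the actions into the _bulk NDJSON body
    print(success, errors)
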
  • Mapping
    a mapping defines a type for every field

    string types: text, keyword (the plain string type has been deprecated since ES 5)
    numeric types: long, integer, short, byte, double, float
    date type: date
    boolean type: boolean
    binary type: binary
    complex types: object, nested
    geo types: geo_point, geo_shape
    specialized types: ip, completion
    
    Common field attributes:
    store: "yes" stores the field separately, "no" does not; the default is no (applies to: all)
    index: controls whether the field is analyzed and indexed for search; the default is true (applies to: string)
    null_value: a default value to use when the field is null, e.g. "na" (applies to: all)
    analyzer: the analyzer used at index and search time; the default is standard, alternatives include whitespace, simple and english (applies to: all)
    include_in_all: by default ES builds a special _all field so that every field can be searched; set this to false to keep a field out of _all (applies to: all)
    format: the pattern string for date values (applies to: date)
    PUT test
    {
        "mappings":{
            "job":{
                "properties":{
                    "title":{
                        "type":"text"
                    },
                    "city":{
                        "type":"keyword"
                    },
                    "company":{
                        "properties":{
                            "name":{
                                "type":"text"
                            }
                        }
                    },
                    "publish_time":{
                        "type":"date",
                        "format":"yyyy-MM-dd"
                    }
                }
            }
        }
    }
    

Elasticsearch queries

  • Simple queries
    • match query (the query string is analyzed, so it matches fields that were tokenized at index time)
    GET lagou/job/_search
    {
        "query":{
            "match":{
                "title":"Python"
            }
        }
    }
    # controlling the number of returned results
    GET lagou/job/_search
    {
        "query":{
            "match":{
                "title":"Python"
            }
        },
        "from":0,
        "size":2
    }
    # all results
    GET lagou/job/_search
    {
        "query":{
            "match_all":{}
        }
    }
    # phrase query (matches documents containing the phrase "python系统")
    GET lagou/job/_search
    {
        "query":{
            "match_phrase":{
                "title":{
                    "query":"python系统",
                    "slop":6  # maximum number of positions allowed between the terms
                }
            }
        }
    }
    
    • term query (the query term is not analyzed; it must exactly match a token stored in the index)
    GET lagou/_search
    {
        "query":{
            "term":{
                "title":"python"
            }
        }
    }
    
    • terms query
    GET lagou/_search
    {
        "query":{
            "terms":{
                "title":["python","django"]
            }
        }
    }
    
    • multi_match query
    GET lagou/_search
    {
        "query":{
            "multi_match":{
                "query":"python",
                "fields":["title^3","desc"] #title^3出现的权重
            }
        }
    }
    
    • specifying the returned fields
    GET lagou/_search
    {
        "stored_fields":["title","company_name"], # only fields with store set to true
        "query":{
            "match":{
                "title":"python"
            }
        }
    }
    
    • sorting the results with sort
    GET lagou/_search
    {
        "query":{
            "match_all":{}
        },
        "sort":[{
            "comments":{
                "order":"desc" # descending
            }
        }]
    }
    
    • range query
    GET lagou/_search
    {
        "query":{
            "range":{
                "comments":{
                    "gte":10,   # 大于等于
                    "lte":20,   # 小于等于
                    "boost":2.0 # 权重
                }
            }
        }
    }
    GET lagou/_search
    {
        "query":{
            "range":{
                "add_time":{
                    "gte":"2017-04-01", # 大于等于
                    "lte":"now"         # 小于等于
                }
            }
        }
    }
    
    • wildcard (pattern) query
    GET lagou/_search
    {
        "query":{
            "wildcard":{
                "title":{
                    "value":"pyth*n", # 模糊
                    "boost":2.0
                }
            }
        }
    }
    
  • Compound queries (a Python sketch of sending these queries from code appears at the end of this query section)
    • bool query (combines must, should, must_not and filter clauses)
    GET lagou/_search
    {
        "query":{
            "bool":{
                "must":{
                    "match_all":{}
                },
                "filter":{
                    "term":{
                        "aslary":20
                    }
                }
            }
        }
    }
    GET lagou/_search
    {
        "query":{
            "bool":{
                "must":{
                    "match_all":{}
                },
                "filter":{
                    "terms":{
                        "aslary":[10,20]
                    }
                }
            }
        }
    }
    
    • inspecting how the analyzer tokenizes text
    GET _analyze
    {
        "analyzer":"ik_max_word", # ik_smart produces a coarser segmentation with fewer tokens
        "text":""
    }
    
    • combined filter query
    GET lagou/_search
    {
        "query":{
            "bool":{
                "should":[
                {
                  "term":{"salary":20}
                },
                 {
                  "term":{"title":"python"}
                }
                ],
                "must_not":{
                    "term":{
                        "price":30
                    }
                }
            }
        }
    }
    
    • nested bool query
    GET lagou/_search
    {
        "query":{
            "bool":{
                "should":[
                    {
                        "term":{"salary":20}
                    },
                    {
                        "bool":{
                            "must":[
                                {
                                    "term":{"salary":20}
                                },
                                {
                                    "term":{"title":"python"}
                                }
                            ]
                        }
                    }
                ]
            }
        }
    }
    
    • filtering on empty and non-empty fields
    GET lagou/_search
    {
        "query":{
            "bool":{
                "filter":{
                    "exists":{  
                        "field":"tags"
                    }
                }
            }
        }
    }
    GET lagou/_search
    {
        "query":{
            "bool":{
                "must_not":{
                    "exists":{  
                        "field":"tags"
                    }
                }
            }
        }
    }
    
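  • querying from Python
    every query in this section can also be sent from code with the low-level elasticsearch client, which is how the Scrapy project in the next section talks to the server; a minimal sketch, assuming a local node and the lagou index from the earlier examples (5.x/6.x-era client API)

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["localhost:9200"])

    body = {"query": {"match": {"title": "python"}}}   # the same JSON used in the Kibana console
    result = es.search(index="lagou", body=body)

    for hit in result["hits"]["hits"]:
        print(hit["_score"], hit["_source"].get("title"))
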

Inserting Scrapy data into Elasticsearch

  • elasticsearch-dsl

    pip install elasticsearch-dsl
    
  • Data integration

    # creating the mapping
    # models
    #!/usr/bin/env python3
    # _*_ coding: utf-8 _*_
    """
     @author 金全 JQ
     @version 1.0 , 2017/11/9
     @description Elasticsearch model for jobbole articles
    """
    
    from datetime import datetime
    from elasticsearch_dsl import DocType, Date, Nested, Boolean, \
        analyzer, InnerObjectWrapper, Completion, Keyword, Text,Integer
    from elasticsearch_dsl.connections import connections
    connections.create_connection(hosts='localhost')
    
    class ESArticleJobbole(DocType):
        # jobbole article content stored in Elasticsearch
        title = Text(analyzer='ik_max_word')
        url = Keyword()
        url_object_id = Keyword()
        create_time = Date()
        praise_nums = Integer()
        fav_nums = Integer()
        comment_nums = Integer()
        content = Text(analyzer='ik_max_word')
        tags = Text(analyzer='ik_max_word')
        fornat_image_url = Keyword()
        fornat_image_path = Keyword()
    
        class Meta:
            index = "jobbole"
            doc_type = "article"
    
    if __name__ == "__main__":
        ESArticleJobbole.init()
    
    # item
    class JooblogArticleItem(scrapy.Item):
        title = scrapy.Field()
        url = scrapy.Field()
        url_object_id = scrapy.Field()
        create_time = scrapy.Field(
            input_processor=MapCompose(get_datetime)
        )
        praise_nums = scrapy.Field(
            input_processor=MapCompose(get_num)
        )
        fav_nums = scrapy.Field(
            input_processor=MapCompose(get_num)
        )
        comment_nums = scrapy.Field(
            input_processor=MapCompose(get_num)
        )
        content = scrapy.Field()
        tags = scrapy.Field(
            input_processor=MapCompose(remove_comment_tag),
            output_processor=Join(",")
        )
        fornat_image_url = scrapy.Field(
            output_processor=MapCompose(return_value)
        )
        fornat_image_path = scrapy.Field()
    
        def save_es(self):
            article = ESArticleJobbole()
            article.create_time = self['create_time']
            article.praise_nums = self['praise_nums']
            article.content = remove_tags(self['content'])
            article.fav_nums = self['fav_nums']
            article.comment_nums = self['comment_nums']
            article.title = self['title']
            article.tags = self['tags']
            article.fornat_image_path = self['fornat_image_path']
            article.fornat_image_url = self['fornat_image_url']
            article.url = self['url']
            article.meta.id = self['url_object_id']
            article.save()
            return
    
    # pipeline
    class EsArticlePippeline(object):
        # insert the item into Elasticsearch
        def process_item(self, item, spider):
            item.save_es()
            return item
    
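  • enabling the pipeline
    the pipeline above only runs once it is registered in the project's settings.py; a minimal sketch, assuming the class lives in the project's pipelines.py (the module path and priority value below are illustrative)

    # settings.py
    ITEM_PIPELINES = {
        "ArticleSpider.pipelines.EsArticlePippeline": 300,   # illustrative project/module path and priority
    }
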

  • Original video course: imooc (慕课网), "Scrapy, the must-learn framework for Python distributed crawlers: building a search engine"
  • Author of this post: XiaoJinZi (personal homepage). Please credit the source when reposting.
  • I am a student with limited experience; my email is [email protected]. Please point out any omissions or mistakes.
