7. Elasticsearch

Don't lose yourself in other people's opinions; listen to your own inner voice and stay true to who you are!


References:

  1. ElasticSearch 第三弹,核心概念介绍 (ElasticSearch part 3: core concepts)
  2. https://www.cnblogs.com/Neeo/articles/10593037.html

Previously we simulated human tap actions with U2 and parsed the captured data with mitmdump;
in this chapter we will learn to store that data in the Elasticsearch component and analyze it with Kibana.

7-1 Elasticsearch introduction and installation

1. Introduction to Elasticsearch

As the heart of the Elastic Stack, Elasticsearch is a distributed, document-oriented, RESTful search and analytics engine. It supports structured and unstructured queries and does not require a schema to be defined up front. Elasticsearch can be used as a search engine and is commonly applied to web-scale log analytics, real-time application monitoring, and clickstream analytics.


https://www.elastic.co/cn/elasticsearch/

2. Installing Elasticsearch (the download commands below are from the official docs and reference 7.10.1; the session recorded here installed 7.7.0)
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-amd64.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-amd64.deb.sha512
shasum -a 512  -c elasticsearch-7.10.1-amd64.deb.sha512 
sudo dpkg -i elasticsearch-7.10.1-amd64.deb

# Install
user1@imooc:~/data$ sudo dpkg -i elasticsearch-7.7.0-amd64.deb 

# Installation location
user1@imooc:~/data$ ls /usr/share/elasticsearch/
bin  jdk  lib  modules  NOTICE.txt  plugins  README.asciidoc

# the 's' in the group permissions is the setgid bit (group elasticsearch)
user1@imooc:~/data$ ls -ld /etc/elasticsearch/
drwxr-s--- 3 root elasticsearch 4096 12月 18 10:05 /etc/elasticsearch/

# Configuration file
user1@imooc:~/data$ sudo nano /etc/elasticsearch/elasticsearch.yml

network.host: 0.0.0.0   
cluster.initial_master_nodes: node-1


# Start the service
user1@imooc:~/data$ service elasticsearch start
[sudo] user1 的密码: 
user1@imooc:~/data$ curl http://localhost:9200   # check that it is running
{
  "name" : "imooc",
  "cluster_name" : "elasticsearch",         # 集群名称
  "cluster_uuid" : "Djbdx0q4RUy6zYEAGxeHpQ",
  "version" : {
    "number" : "7.7.0",                            # 版本号码
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "81a1e9eda8e6183f5237786246f6dced26a10eaf",
    "build_date" : "2020-05-12T02:01:37.602180Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"       # 搜索引擎
}

user1@imooc:~/data$ service elasticsearch restart


user1@imooc:~/data$ ps -ef | grep elasticsearch
elastic+ 108187      1 17 11:48 ?        00:00:14 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-1918501227772571995 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=536870912 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=deb -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
elastic+ 108373 108187  0 11:48 ?        00:00:00 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller
user1    108448 105669  0 11:49 pts/7    00:00:00 grep --color=auto elasticsearch

user1@imooc:~/data$ cat /etc/passwd
elasticsearch:x:124:128::/nonexistent:/bin/false

Note: Elasticsearch runs under its own dedicated elasticsearch user (see /etc/passwd above); it cannot be started as root.

Open http://192.168.0.103:9200/ in a browser; if the response matches the output of curl http://localhost:9200, Elasticsearch is installed and running correctly.

7-2 Installing the Kibana visualization component, and basic CRUD in Elasticsearch

1. Elasticsearch basics
Document

A unit of data that can be indexed - for example a user document or a product document. Documents are in JSON format.

A document is roughly equivalent to a row in a MySQL table.
Terminology

Index

An index is analogous to a table in MySQL.

Node

A node is a single running instance of Elasticsearch.

A server in the cluster is a node; a node stores data and takes part in the cluster's indexing and search. For a node to join a cluster, it only needs to be configured with the cluster name. By default, if we start several nodes and they can discover each other, they automatically form a cluster. This works out of the box, but it is not reliable and can lead to split-brain, so in practice it is strongly recommended to configure the cluster explicitly.

Cluster

One or more servers running Elasticsearch nodes, organized together, form a cluster; these nodes hold the data jointly and provide search together.

A cluster has a name that uniquely identifies it - the cluster name. The default is elasticsearch, and only nodes with the same cluster name will form a cluster.

The cluster name can be configured in config/elasticsearch.yml:

cluster.name: javaboy-es

In a cluster, the health status has three colors: green, yellow, red:
Green: healthy - all primary and replica shards are working normally.
Yellow: warning - all primary shards are working, but at least one replica shard is not.
Red: the cluster cannot work properly (at least one primary shard is unavailable).
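As a quick way to check this status programmatically, here is a minimal sketch using the Python elasticsearch client (the host 192.168.0.103 is the one used later in this chapter; adjust it to your own setup):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

# Cluster health: the "status" key is "green", "yellow" or "red"
health = es.cluster.health()
print(health["cluster_name"], health["status"], health["number_of_nodes"])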

2. Installing the Kibana component
1. Introduction to Kibana

Kibana is used to query and analyze the data stored in Elasticsearch. It ships with the classic visualizations - bar charts, line charts, pie charts, donut charts and more - built on top of Elasticsearch's aggregation capabilities, and its Dashboard feature can combine these visualizations into a single big screen. As an example (shown in a figure in the course), Kibana can be used to analyze and summarize the visitor access logs of a web application.

2. Installing Kibana


Kibana installation docs: https://www.elastic.co/guide/en/kibana/current/deb.html

Reference: Kibana and Elasticsearch are installed on the Linux host with IP = 192.168.0.101
It is best to keep the two configurations consistent.

wget https://artifacts.elastic.co/downloads/kibana/kibana-7.10.1-amd64.deb
shasum -a 512 kibana-7.10.1-amd64.deb
sudo dpkg -i kibana-7.10.1-amd64.deb

# Install
user1@imooc:~/data$ sudo dpkg -i kibana-7.7.0-amd64.deb 

user1@imooc:~/data$ ls /etc/kibana/
kibana.yml
user1@imooc:~/data$ sudo nano /etc/kibana/kibana.yml   # configuration file

server.port: 5601
server.host: "192.168.0.103"
elasticsearch.hosts: ["http://localhost:9200"]
i18n.locale: "zh-CN"
# Chinese UI locale

# Start
user1@imooc:~/data$ service kibana start
user1@imooc:~/data$ ps -ef|grep kibana
kibana   109436      1 64 12:45 ?        00:00:33 /usr/share/kibana/bin/../node/bin/node /usr/share/kibana/bin/../src/cli -c /etc/kibana/kibana.yml
user1    109516 105669  0 12:46 pts/7    00:00:00 grep --color=auto kibana

Kibana can take 1 to 5 minutes to become ready after starting, depending on your machine's specs.
Web UI: http://192.168.0.103:5601/app/kibana#/home

Click the wrench icon (Dev Tools) to open the console.
3. CRUD from the Kibana console

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/cat-indices.html

Create, update and delete

Creating an index

PUT test1            # create an index
GET _cat/indices     # list indices
DELETE test1         # delete an index

# Insert three documents
PUT test1/_doc/1
{
  "name":"大壮",
  "age":18,
  "location":"北京"
}

PUT test1/_doc/2
{
  "name":"张三",
  "age":18,
  "location":"上海"
}


PUT test1/_doc/3
{
  "name":"李四",
  "age":18,
  "location":"广州"
}
# --------------------------------
{
  "_index" : "test1",
  "_type" : "_doc",
  "_id" : "3",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2,
  "_primary_term" : 1
}


# View the inserted data
GET test1/_search
#-----------------------------------
{
  "took" : 738,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {       # 3 条数据
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "大壮",
          "age" : 18,
          "location" : "北京"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "张三",
          "age" : 18,
          "location" : "上海"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "李四",
          "age" : 18,
          "location" : "广州"
        }
      }
    ]
  }
}
Updating data (update)
POST test1/_update/1
{
  "doc":{
    "location":"北京市朝阳区"
  }
}
#--------------------------------------------
{
  "_index" : "test1",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,             # 版本
  "result" : "updated",       # 状态
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}

# View the updated document
GET test1/_doc/1
#---------------------------------
{
  "_index" : "test1",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 3,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "大壮",
    "age" : 18,
    "location" : "北京市朝阳区"
  }
}

# Delete a document
DELETE test1/_doc/1

# Delete the entire index (removes all of its data)
DELETE test1

7-3 Advanced Elasticsearch queries

What exactly does _doc mean? (See 7-11 below.)

1. Elasticsearch supports two ways of querying:
  1. Query string - simple queries appended to the URL;
  2. Query DSL statements in the request body.
# Get a single document vs. search everything
GET test1/_doc/1
GET test1/_search

GET test1/_search?q=name="李四"

# name is a text field: the query is analyzed and terms are matched with OR
# name.keyword is not analyzed: it requires an exact, full match
GET test1/_search
{
  "query":{
    "match":{
      "name":"李四"
    }
  }
}

GET test1/_search
{
  "query":{
    "match":{
      "name.keyword":"李四"
    }
  }
}

# match_all is equivalent to: select * from table
GET test1/_search
{
  "query":{
    "match_all": {}
  }
}

7-4 Sorting query results

The sort clause orders the results (by default, results are ordered by _score, descending).
desc is descending, asc is ascending.

# sort orders the results (default ordering is by _score, descending)
# desc = descending, asc = ascending
GET test1/_search
{
  "query":{
    "match": {
      "name.keyword": "李四"
    }
  }
}

# Sort ascending (sort on the keyword sub-field; sorting directly on a text field is disabled by default)
GET test1/_search
{
  "query":{
    "match": {
      "name.keyword": "李四"
    }
  },
  "sort":[
    {
      "name":{
        "order": "asc"
      }
    }]
  
}

7-5 Paginated queries

When there is a lot of data you usually don't want to return everything at once, so you paginate the query.

from - offset of the first hit (starting at index 0)
size - maximum number of hits to return

# from=1, size=5: skip the first hit and return up to 5 (with only 3 documents, 2 hits come back)
GET test1/_search
{
  "query":{
    "match_all": {}
  },
  "from":1,
  "size":5
}
#------------------------------------------------
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,   # 总数=3
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "李四",
          "age" : 18,
          "location" : "广州"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "大壮",
          "age" : 18,
          "location" : "北京市朝阳区"
        }
      }
    ]
  }
}
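The same from/size pagination can be driven from Python; a small sketch, assuming the test1 index above and a page size of 2:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

page_size = 2
page = 0
while True:
    resp = es.search(index="test1", body={
        "query": {"match_all": {}},
        "from": page * page_size,
        "size": page_size
    })
    hits = resp["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(hit["_id"], hit["_source"])
    page += 1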

7-6 Boolean (bool) queries

The bool query is the most commonly used compound query.


1. Find people whose country is 蜀国, gender is 男 and age is 39
must - every condition must match (logical AND)

GET test1/_search
{
  "query":{
    "bool":{
      "must": [
        {
          "match":{
            "country": "蜀国"
          }
        },
        {
          "match": {
            "gender": "男"
          }
        },
        {
          "match": {
            "age": "39"
          }
        }
      ]
    }
  }
}
#---------------------------------------------------------
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 2.508318,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 2.508318,
        "_source" : {
          "name" : "诸葛亮",
          "gender" : "男",
          "age" : 39,
          "country" : "蜀国",
          "tags" : "智谋超群"
        }
      }
    ]
  }
}

2. Find people whose tag is 闭月羞花 or whose age is 39
should - at least one condition must match (logical OR)


GET test1/_search
{
  "query":{
    "bool":{
      "should": [
        {
          "match":{
            "tags.keyword": "闭月羞花"
          }
        },
        {
          "match": {
            "age": "39"
          }
        }
      ]
    }
  }
}
# ---------- 2 hits returned (output omitted)

3. Find people whose age is not 39, whose tag is not 闭月羞花, and whose gender is not 男
must_not - none of the conditions may match (logical NOT)

GET test1/_search
{
  "query":{
    "bool":{
      "must_not": [
        {
          "match":{
            "tags.keyword": "闭月羞花"
          }
        },
        {
          "match": {
            "age": "39"
          }
        },
        {
          "match":{
            "gender": "男"
          }
        }
      ]
    }
  }
}
#----------------------------------------------------------------
{
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "10",
        "_score" : 0.0,
        "_source" : {
          "name" : "大乔",
          "gender" : "女",
          "age" : 19,
          "country" : "吴国",
          "tags" : "沉鱼落雁"
        }
      }
    ]
  }
}

4. Find people aged 39 or older whose country is 魏国
gt greater than; lt less than; gte greater than or equal; lte less than or equal

# gt greater than; lt less than; gte >=; lte <=
GET test1/_search
{
  "query":{
    "bool":{
      "must": [
        {
          "match":{
            "country.keyword": "魏国"
          }
        }
      ],
      "filter": [
        {"range": {
          "age": {
            "gte": 39
          }
        }}
      ]
    }
  }
}
#--------------------------------------
"hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.2321435,
        "_source" : {
          "name" : "司马懿",
          "gender" : "男",
          "age" : 41,
          "country" : "魏国",
          "tags" : "暗藏韬略"
        }
      }
    ]
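The same bool query (must plus a range filter) can be sent from the Python client; a minimal sketch, assuming the same test1 index:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

body = {
    "query": {
        "bool": {
            "must": [{"match": {"country.keyword": "魏国"}}],
            "filter": [{"range": {"age": {"gte": 39}}}]
        }
    }
}
resp = es.search(index="test1", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_source"]["age"])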

7-7 Filtering the returned fields

Add a _source clause listing the fields you want back:
_source: ["name", "age"]

GET test1/_search
{
  "query":{
    "bool":{
      "must": [
        {
          "match":{
            "country.keyword": "魏国"
          }
        }
      ],
      "filter": [
        {"range": {
          "age": {
            "gte": 38
          }
        }}
      ]
    }
  },
  "_source": ["name","age"]
}
#--------------------------------------
"hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.2321435,
        "_source" : {
          "name" : "司马懿",
          "age" : 41
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 1.2321435,
        "_source" : {
          "name" : "许绪",
          "age" : 38
        }
      }
    ]

7-8 Highlighting

pre_tags and post_tags wrap each matched term in custom HTML markup (CSS styles are supported); the default is <em>...</em>.

# Highlight matches (intended style here: bold red; see the Python sketch after this block)
GET test1/_search
{
  "query": {
    "match": {
      "tags": "沉鱼落雁"
    }
  },
  "highlight": {
    "pre_tags": "", 
    "post_tags": "", 
    "fields": {
      "tags": {}
    }
  }
}
#---------------------------------------
"hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "10",
        "_score" : 8.317766,
        "_source" : {
          "name" : "大乔",
          "gender" : "女",
          "age" : 19,
          "country" : "吴国",
          "tags" : "沉鱼落雁"
        },
        "highlight" : {
          "tags" : [
            ""
          ]
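The pre_tags/post_tags values above were lost when this page was saved; purely as an illustration, here is a sketch of the same highlight query from Python with an assumed bold-red tag (any HTML markup works; this is not necessarily what the course used):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

body = {
    "query": {"match": {"tags": "沉鱼落雁"}},
    "highlight": {
        # assumed markup for "bold red"
        "pre_tags": "<b style='color:red'>",
        "post_tags": "</b>",
        "fields": {"tags": {}}
    }
}
resp = es.search(index="test1", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["highlight"]["tags"])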

7-9 Aggregation queries

Sum, max, min, average

1. Average age in 吴国

GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_avg":{
      "avg":{
        "field":"age"
      }
    }
  }
}
#-----------------------------------
"aggregations" : {
    "age_avg" : {
      "value" : 24.6
    }
  }



# Also restrict the returned fields
GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_avg":{
      "avg":{
        "field":"age"
      }
    }
  },
  "_source": ["name","age"]
}
#------------------------------------------------
"aggregations" : {
    "age_avg" : {
      "value" : 24.6
    }
  }


# Hide the hits with size=0 and return only the aggregation
GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_avg":{
      "avg":{
        "field":"age"
      }
    }
  },
  "_source": ["name","age"],
  "size": 0
}

2. Maximum age in 吴国 (the max aggregation returns the value, not the person)

GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_max":{
      "max":{
        "field":"age"
      }
    }
  },
  "_source": ["name","age"],
  "size": 2
}
#-------------------------------------
"aggregations" : {
    "age_max" : {
      "value" : 38.0
    }
  }

3. Sum of all ages in 吴国

GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_sum":{
      "sum":{
        "field":"age"
      }
    }
  },
  "_source": ["name","age"],
  "size": 0
}
#---------------------------------
"aggregations" : {
    "age_sum" : {
      "value" : 123.0
    }
  }
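The same metric aggregations are available from the Python client; a minimal sketch computing the average, max and sum of age for 吴国 in one request (size=0 hides the hits):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

body = {
    "size": 0,
    "query": {"match": {"country.keyword": "吴国"}},
    "aggs": {
        "age_avg": {"avg": {"field": "age"}},
        "age_max": {"max": {"field": "age"}},
        "age_sum": {"sum": {"field": "age"}}
    }
}
resp = es.search(index="test1", body=body)
print(resp["aggregations"])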

7-10 Bucketed (group) queries

On top of each bucket, run an avg sub-aggregation.

# Range buckets with an avg sub-aggregation
GET test1/_search
{ "size":0,
  "query": {
    "match_all": {
    }
  },
  "aggs":{
    "age_group":{
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 10,
            "to": 20
          },
          {
            "from": 20,
            "to": 30
          },
          {
            "from": 30,
            "to": 40
          },
          {
            "from": 40,
            "to": 50
          }
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
      }
    }
  }
}
#---------------------------------------------------
"hits" : [ ]
  },
  "aggregations" : {
    "age_group" : {
      "buckets" : [
        {
          "key" : "10.0-20.0",
          "from" : 10.0,
          "to" : 20.0,
          "doc_count" : 4,
          "age_avg" : {
            "value" : 18.25
          }
        },
        {
          "key" : "20.0-30.0",
          "from" : 20.0,
          "to" : 30.0,
          "doc_count" : 1,
          "age_avg" : {
            "value" : 21.0
          }
        },
        {
          "key" : "30.0-40.0",
          "from" : 30.0,
          "to" : 40.0,
          "doc_count" : 4,
          "age_avg" : {
            "value" : 36.25
          }
        },
        {
          "key" : "40.0-50.0",
          "from" : 40.0,
          "to" : 50.0,
          "doc_count" : 2,
          "age_avg" : {
            "value" : 40.5
          }
        }
      ]

7-11 What is _doc used for?

_doc is not a document type. In versions before 6.0, an index could hold multiple mapping types and data was written as PUT index/type/id. From 6.x onward each index may contain only a single type, and in 7.x mapping types are deprecated altogether: _doc is simply a fixed placeholder in the request path (PUT index/_doc/id), kept for URL compatibility. (The course slides compare the pre-6.0, 6.x and 7.x request formats.)

7-12 The three mapping modes of Elasticsearch

In a relational database you have to define the table structure up front;

in Elasticsearch, the "table structure" (the mapping) can be created automatically:

PUT test1/_doc/1
{
  "name":"王百万"
}

PUT test1/_doc/2
{
  "name":"赵金条",
  "age":18
}

GET test1
#----------------------------
{
  "test1" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"   # 长
        },
        "name" : {
          "type" : "text",    # 文本
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1608297148199",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "eNYY8AxJRV6o2KUdpnTVng",
        "version" : {
          "created" : "7070099"
        },
        "provided_name" : "test1"
      }
    }
  }
}

1. mappings
Field types

Define the mapping explicitly (two fields: name and age)

PUT mappings_test1
{
  "mappings": {
    "properties": {
      "name":{
        "type":"text"
      },
      "age":{
        "type":"long"
      }
    }
  }
}
#-------------------------------------
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "mappings_test1"
}

PUT mappings_test1/_doc/1
{
  "name":"大壮",
  "age":18
}

GET mappings_test1/_search
The three mapping modes

Dynamic mapping: Elasticsearch infers the field types and creates the mapping automatically (the default, as in the test1 example above).

Static mapping
"dynamic": false,
When a new field such as company is added, Elasticsearch still stores the data, but does not add the field to the mapping or index it, so searching on company finds nothing.

# Create the index
PUT mappings_test2
{
  "mappings": {
    "dynamic": false, 
    "properties": {
      "name":{
        "type":"text"
      },
      "age":{
        "type":"long"
      }
    }
  }
}

# Insert a document (company is not in the mapping)
PUT mappings_test2/_doc/1
{
  "name":"张三",
  "age":19,
  "company":"imooc"
}

# Search on the un-mapped field
GET mappings_test2/_search
{
  "query": {
    "match": {
      "compang": "imooc"
    }
  }
}
#--------------------------
"hits" : [ ]
  

Strict mapping
Any field not defined in the mapping causes the write to be rejected with an error.

# Create the index
PUT mappings_test3
{
  "mappings": {
    "dynamic": "strict", 
    "properties": {
      "name":{
        "type":"text"
      },
      "age":{
        "type":"long"
      }
    }
  }
}


PUT mappings_test3/_doc/1
{
  "name":"张三",
  "age":19,
  "company":"imooc"
}
#------------------------
{
  "error" : {
    "root_cause" : [
      {
        "type" : "strict_dynamic_mapping_exception",
        "reason" : "mapping set to strict, dynamic introduction of [company] within [_doc] is not allowed"
      }
    ],
    "type" : "strict_dynamic_mapping_exception",
    "reason" : "mapping set to strict, dynamic introduction of [company] within [_doc] is not allowed"
  },
  "status" : 400
}
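From the Python client, the same strict mapping surfaces as a RequestError; a minimal sketch, assuming the mappings_test3 index created above:

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError

es = Elasticsearch(hosts="192.168.0.103:9200")

try:
    # "company" is not declared in the strict mapping, so this write is rejected
    es.index(index="mappings_test3", id=2, body={"name": "李四", "age": 20, "company": "imooc"})
except RequestError as e:
    print(e.status_code, e.error)   # 400 strict_dynamic_mapping_exception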

7-13 Elasticsearch analyzers

The analysis process

After a document is sent to Elasticsearch, and before it is added to the inverted index, Elasticsearch runs it through a series of processing steps:

  • Character filtering: character filters transform the raw characters.
  • Tokenization: the text is split into one or more tokens.
  • Token filtering: token filters transform each token.
  • Token indexing: the tokens are finally stored in the Lucene inverted index.
1. Standard analyzer: standard tokenizer

https://www.cnblogs.com/Neeo/articles/10593037.html

The standard tokenizer is a grammar-based tokenizer that works well for most European languages; it also handles Unicode text, with a default maximum token length of 255, and it removes punctuation such as commas and full stops.

Upper case is converted to lower case; text is split on word boundaries (spaces); special symbols are dropped; Chinese is split into single characters.

POST _analyze
{
  "analyzer": "standard",
  "text":"Hello World&端午节"
}
#----------------------------------------
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "端",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "",
      "position" : 2
    },
    {
      "token" : "午",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "",
      "position" : 3
    },
    {
      "token" : "节",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "",
      "position" : 4
    }
  ]
}

2. Simple analyzer: simple analyzer

Splits on any non-letter character (which is why the & disappears), lowercases the tokens, and does not split Chinese characters apart.

POST _analyze
{
  "analyzer": "simple",
  "text":"Hello World&端午节"
}
#-------------------------------------------
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "端午节",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    }
  ]
}
3. Whitespace analyzer: whitespace analyzer

Splits only on whitespace and does not change case.

POST _analyze
{
  "analyzer": "whitespace",
  "text":"Hello World&端午节"
}
#-------------------------
{
  "tokens" : [
    {
      "token" : "Hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "World&端午节",
      "start_offset" : 6,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    }
  ]
}



The IK analyzer

Good support for Chinese word segmentation.


After installation, Elasticsearch must be restarted before the analyzer can be used.

Install the IK analyzer with the elasticsearch-plugin tool
user1@imooc:~$ cd /usr/share/elasticsearch/bin/
user1@imooc:/usr/share/elasticsearch/bin$ sudo ./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.7.0/elasticsearch-analysis-ik-7.7.0.zip
[sudo] user1 的密码: 
-> Installing https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.7.0/elasticsearch-analysis-ik-7.7.0.zip
-> Downloading https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.7.0/elasticsearch-analysis-ik-7.7.0.zip
[=================================================] 100%   
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.net.SocketPermission * connect,resolve
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed analysis-ik

# Restart Elasticsearch
user1@imooc:~$ service elasticsearch restart

http://192.168.0.103:9200/          # check that Elasticsearch responds
Using the IK analyzer


ik_max_word - fine-grained segmentation;
ik_smart - coarse-grained segmentation.

POST _analyze
{
  "analyzer": "ik_smart",
  "text":"八百标兵奔北坡"
}
#------------------------------------
{
  "tokens" : [
    {
      "token" : "八百",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "标兵",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "奔北",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "坡",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 3
    }
  ]
}


POST _analyze
{
  "analyzer": "ik_max_word",
  "text":"端午节"
}
#---------------------------------------------
{
  "tokens" : [
    {
      "token" : "端午节",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "端午",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "节",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    }
  ]
}


7-14 Changing the Elasticsearch analyzer

Change the analyzer to ik_max_word; queries will then match on whole words rather than single characters.

# test1 uses the default standard analyzer
PUT test1

# Create an index whose title field uses the IK analyzer
PUT test2
{
  "mappings":{
    "properties":{
      "title":{
        "type":"text",
        "analyzer":"ik_max_word"
      }
    }
  }
}
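To check which analyzer a field actually uses, you can run _analyze against the index; a small sketch from Python, assuming the test2 index just created (requires the IK plugin to be installed):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

# Analyze the text with the analyzer configured on test2's title field (ik_max_word)
resp = es.indices.analyze(index="test2", body={"field": "title", "text": "八百标兵奔北坡"})
print([t["token"] for t in resp["tokens"]])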

7-15 The inverted index

Reference: 1. 聊聊 Elasticsearch 的倒排索引

Elasticsearch is a distributed, multi-tenant, full-text search engine.
Inverted index: typing a keyword into the Baidu search page and getting back matching pages is the classic use of an inverted index.

1. Why is it called an inverted index?
Before search engines existed, we typed in a URL and then read the site's content; the direction was:

document -> to -> words  

From a document to the words inside it - this is called a forward index.
Later, we wanted to be able to type in a word and find the documents that contain it, or are related to it:

word -> to -> documents

This kind of index is called an inverted index; translated literally it would be a "reverse index", and in Chinese it is conventionally rendered as 倒排索引.

2. Elasticsearch builds an inverted index for every field.
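To make the idea concrete, here is a tiny toy sketch in Python that builds a word -> documents mapping by hand (just the concept; it has nothing to do with Elasticsearch's real Lucene implementation):

from collections import defaultdict

docs = {
    1: "hello world",
    2: "hello elasticsearch",
    3: "world of search",
}

# forward index: document -> words
# inverted index: word -> documents that contain the word
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted[word].add(doc_id)

print(inverted["hello"])   # {1, 2}
print(inverted["world"])   # {1, 3}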

7-16 CRUD against Elasticsearch from Python

Install the elasticsearch Python client

pip install elasticsearch


from elasticsearch import Elasticsearch

# Implicit connection: defaults to 127.0.0.1:9200
es = Elasticsearch()
print(es.ping())
#----------------------------
True     # connected successfully

# Explicit host
es = Elasticsearch(hosts="127.0.0.1:9200")
print(es.ping())

print(es.ping(), es.info())
# Create an index
es.indices.create(index="imooc")
# List indices
print(es.cat.indices())
# Delete the index
print(es.indices.delete(index="imooc"))

# Index a document (the index is created automatically if needed)
print(es.index(index="imooc", body={"name":"大壮", "age":18}))
# Search the index
print(es.search(index="imooc"))

# Index a document with an explicit id
print(es.index(index="imooc", body={"name":"张三", "age":20}, id=1))
# Search all documents
print(es.search(index="imooc"))

# Get a single document by id
print(es.get(index="imooc", id=1))

# Search with a query DSL body
print(es.search(index="imooc", body={
    "query": {
        "match":{
            "name": "大壮"
        }
    }
}))

# Update a document
body = {
    "doc":{
        "tags":"imooc"
    }
}
print(es.get(index="imooc", id=1))
print(es.update(index="imooc", id=1, body=body))
print(es.get(index="imooc", id=1))

# Delete a single document
print(es.delete(index="imooc", id=1))

7-17 Bulk-inserting data into Elasticsearch from Python

1. Insert 10,000 documents in bulk

1. Naive approach: insert the documents one by one in a loop

import time

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

def timer(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        print("耗时{:.3f}".format(time.time()-start_time))
        return res

    return wrapper

@timer
def handle_es():
    for value in range(10000):
        es.index(index="value_demo", body={"value": value})

if __name__ == '__main__':
    handle_es()
#----------------------------------
耗时14.522

2. Optimization 1: build one big list of actions first, then insert it with helpers.bulk
Roughly a 37x speed-up (14.5 s -> 0.39 s).

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts="192.168.0.103:9200")

def timer(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        print("耗时{:.3f}".format(time.time()-start_time))
        return res

    return wrapper

@timer
def handle_es():
    data = [{"_index":"value_demo", "_source":{"value": value}} for value in range(10000)]
    # bulk-insert the prepared actions via helpers
    helpers.bulk(es, data)

if __name__ == '__main__':
    handle_es()
#------------------------------------
耗时0.390

3. Optimization 2: pass a generator instead of a list
The generator is handed straight to helpers.bulk, which consumes it and performs the inserts, so the whole action list never has to be built in Python memory first.
(The 耗时0.000 printed below only times the creation of the generator; the actual bulk call runs outside the decorated function.)

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts="192.168.0.103:9200")

def timer(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        print("耗时{:.3f}".format(time.time()-start_time))
        return res

    return wrapper

@timer
def handle_es():
    for value in range(10000):
        yield {"_index":"value_demo", "_source":{"value":value}}

if __name__ == '__main__':
    helpers.bulk(es, actions=handle_es())
    print(es.count(index="value_demo"))
#-----------------------------
耗时0.000
{'count': 40000, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
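If a single bulk call is still not fast enough, the helpers module also offers parallel_bulk, which sends chunks from several threads; a minimal sketch (thread_count and chunk_size are illustrative values, tune them for your cluster):

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts="192.168.0.103:9200")

def gen_actions():
    for value in range(10000):
        yield {"_index": "value_demo", "_source": {"value": value}}

start = time.time()
# parallel_bulk returns a generator of (ok, info) tuples; it must be consumed
for ok, info in helpers.parallel_bulk(es, gen_actions(), thread_count=4, chunk_size=1000):
    if not ok:
        print(info)
print("耗时{:.3f}".format(time.time() - start))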
