7. Elasticsearch

Don't lose yourself in other people's opinions; listen to your own inner voice and stay true to who you are!


References:

  1. ElasticSearch 第三弹,核心概念介绍 (ElasticSearch part 3: core concepts)
  2. https://www.cnblogs.com/Neeo/articles/10593037.html

Previously we simulated human tap actions with U2 and parsed the captured data with mitmdump;
in this chapter we will learn to store that data in the Elasticsearch component and analyze it with Kibana.

7-1 Elasticsearch introduction and installation

1. Introduction to Elasticsearch

As the heart of the Elastic Stack, Elasticsearch is a distributed, document-oriented, RESTful search and analytics engine. It supports structured and unstructured queries and does not require a schema to be defined up front. Elasticsearch can be used as a search engine and is commonly applied to web-scale log analytics, real-time application monitoring, and clickstream analytics.


https://www.elastic.co/cn/elasticsearch/

2. Installing Elasticsearch (the download commands below are from the official docs and reference 7.10.1; the session recorded here installed 7.7.0)
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-amd64.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-amd64.deb.sha512
shasum -a 512  -c elasticsearch-7.10.1-amd64.deb.sha512 
sudo dpkg -i elasticsearch-7.10.1-amd64.deb

# Install
user1@imooc:~/data$ sudo dpkg -i elasticsearch-7.7.0-amd64.deb 

# Installation location
user1@imooc:~/data$ ls /usr/share/elasticsearch/
bin  jdk  lib  modules  NOTICE.txt  plugins  README.asciidoc

# the 's' in the group permissions is the setgid bit (group elasticsearch)
user1@imooc:~/data$ ls -ld /etc/elasticsearch/
drwxr-s--- 3 root elasticsearch 4096 12月 18 10:05 /etc/elasticsearch/

# Configuration file
user1@imooc:~/data$ sudo nano /etc/elasticsearch/elasticsearch.yml

network.host: 0.0.0.0   
cluster.initial_master_nodes: node-1


# Start the service
user1@imooc:~/data$ service elasticsearch start
[sudo] user1 的密码: 
user1@imooc:~/data$ curl http://localhost:9200   # check that it is running
{
  "name" : "imooc",
  "cluster_name" : "elasticsearch",         # 集群名称
  "cluster_uuid" : "Djbdx0q4RUy6zYEAGxeHpQ",
  "version" : {
    "number" : "7.7.0",                            # 版本号码
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "81a1e9eda8e6183f5237786246f6dced26a10eaf",
    "build_date" : "2020-05-12T02:01:37.602180Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"       # 搜索引擎
}

user1@imooc:~/data$ service elasticsearch restart


user1@imooc:~/data$ ps -ef | grep elasticsearch
elastic+ 108187      1 17 11:48 ?        00:00:14 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-1918501227772571995 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=536870912 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=deb -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
elastic+ 108373 108187  0 11:48 ?        00:00:00 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller
user1    108448 105669  0 11:49 pts/7    00:00:00 grep --color=auto elasticsearch

user1@imooc:~/data$ cat /etc/passwd
elasticsearch:x:124:128::/nonexistent:/bin/false

Note: Elasticsearch runs under its own dedicated elasticsearch user (see /etc/passwd above); it cannot be started as root.

Open http://192.168.0.103:9200/ in a browser; if the response matches the output of curl http://localhost:9200, Elasticsearch is installed and running correctly.

7-2 Installing the Kibana visualization component, and basic CRUD in Elasticsearch

1. Elasticsearch basics
Document

A unit of data that can be indexed - for example a user document or a product document. Documents are in JSON format.

A document is roughly equivalent to a row in a MySQL table.
Terminology

Index

An index is analogous to a table in MySQL.

Node

A node is a single running instance of Elasticsearch.

A server in the cluster is a node; a node stores data and takes part in the cluster's indexing and search. For a node to join a cluster, it only needs to be configured with the cluster name. By default, if we start several nodes and they can discover each other, they automatically form a cluster. This works out of the box, but it is not reliable and can lead to split-brain, so in practice it is strongly recommended to configure the cluster explicitly.

Cluster

One or more servers running Elasticsearch nodes, organized together, form a cluster; these nodes hold the data jointly and provide search together.

A cluster has a name that uniquely identifies it - the cluster name. The default is elasticsearch, and only nodes with the same cluster name will form a cluster.

The cluster name can be configured in config/elasticsearch.yml:

cluster.name: javaboy-es

In a cluster, the health status has three colors: green, yellow, red:
Green: healthy - all primary and replica shards are working normally.
Yellow: warning - all primary shards are working, but at least one replica shard is not.
Red: the cluster cannot work properly (at least one primary shard is unavailable).
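As a quick way to check this status programmatically, here is a minimal sketch using the Python elasticsearch client (the host 192.168.0.103 is the one used later in this chapter; adjust it to your own setup):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

# Cluster health: the "status" key is "green", "yellow" or "red"
health = es.cluster.health()
print(health["cluster_name"], health["status"], health["number_of_nodes"])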

2. Installing the Kibana component
1. Introduction to Kibana

Kibana is used to query and analyze the data stored in Elasticsearch. It ships with the classic visualizations - bar charts, line charts, pie charts, donut charts and more - built on top of Elasticsearch's aggregation capabilities, and its Dashboard feature can combine these visualizations into a single big screen. As an example (shown in a figure in the course), Kibana can be used to analyze and summarize the visitor access logs of a web application.

2. Installing Kibana


Kibana installation docs: https://www.elastic.co/guide/en/kibana/current/deb.html

Reference: Kibana and Elasticsearch are installed on the Linux host with IP = 192.168.0.101
It is best to keep the two configurations consistent.

wget https://artifacts.elastic.co/downloads/kibana/kibana-7.10.1-amd64.deb
shasum -a 512 kibana-7.10.1-amd64.deb
sudo dpkg -i kibana-7.10.1-amd64.deb

# Install
user1@imooc:~/data$ sudo dpkg -i kibana-7.7.0-amd64.deb 

user1@imooc:~/data$ ls /etc/kibana/
kibana.yml
user1@imooc:~/data$ sudo nano /etc/kibana/kibana.yml   # configuration file

server.port: 5601
server.host: "192.168.0.103"
elasticsearch.hosts: ["http://localhost:9200"]
i18n.locale: "zh-CN"
# Chinese UI locale

# Start
user1@imooc:~/data$ service kibana start
user1@imooc:~/data$ ps -ef|grep kibana
kibana   109436      1 64 12:45 ?        00:00:33 /usr/share/kibana/bin/../node/bin/node /usr/share/kibana/bin/../src/cli -c /etc/kibana/kibana.yml
user1    109516 105669  0 12:46 pts/7    00:00:00 grep --color=auto kibana

Kibana can take 1 to 5 minutes to become ready after starting, depending on your machine's specs.
Web UI: http://192.168.0.103:5601/app/kibana#/home

Click the wrench icon (Dev Tools) to open the console.
3. CRUD from the Kibana console

https://www.elastic.co/guide/en/elasticsearch/reference/7.7/cat-indices.html

Create, update and delete

Creating an index

PUT test1            # create an index
GET _cat/indices     # list indices
DELETE test1         # delete an index

# Insert three documents
PUT test1/_doc/1
{
  "name":"大壮",
  "age":18,
  "location":"北京"
}

PUT test1/_doc/2
{
  "name":"张三",
  "age":18,
  "location":"上海"
}


PUT test1/_doc/3
{
  "name":"李四",
  "age":18,
  "location":"广州"
}
# --------------------------------
{
  "_index" : "test1",
  "_type" : "_doc",
  "_id" : "3",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 2,
  "_primary_term" : 1
}


# View the inserted data
GET test1/_search
#-----------------------------------
{
  "took" : 738,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {       # 3 条数据
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "大壮",
          "age" : 18,
          "location" : "北京"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "name" : "张三",
          "age" : 18,
          "location" : "上海"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "李四",
          "age" : 18,
          "location" : "广州"
        }
      }
    ]
  }
}
Updating data (update)
POST test1/_update/1
{
  "doc":{
    "location":"北京市朝阳区"
  }
}
#--------------------------------------------
{
  "_index" : "test1",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,             # 版本
  "result" : "updated",       # 状态
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 3,
  "_primary_term" : 1
}

# View the updated document
GET test1/_doc/1
#---------------------------------
{
  "_index" : "test1",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 2,
  "_seq_no" : 3,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "name" : "大壮",
    "age" : 18,
    "location" : "北京市朝阳区"
  }
}

# Delete a document
DELETE test1/_doc/1

# Delete the entire index (removes all of its data)
DELETE test1

7-3 Advanced Elasticsearch queries

What exactly does _doc mean? (See 7-11 below.)

1. Elasticsearch supports two ways of querying:
  1. Query string - simple queries appended to the URL;
  2. Query DSL statements in the request body.
# Get a single document vs. search everything
GET test1/_doc/1
GET test1/_search

GET test1/_search?q=name="李四"

# name is a text field: the query is analyzed and terms are matched with OR
# name.keyword is not analyzed: it requires an exact, full match
GET test1/_search
{
  "query":{
    "match":{
      "name":"李四"
    }
  }
}

GET test1/_search
{
  "query":{
    "match":{
      "name.keyword":"李四"
    }
  }
}

# match_all is equivalent to: select * from table
GET test1/_search
{
  "query":{
    "match_all": {}
  }
}

7-4 Sorting query results

The sort clause orders the results (by default, results are ordered by _score, descending).
desc is descending, asc is ascending.

# sort orders the results (default ordering is by _score, descending)
# desc = descending, asc = ascending
GET test1/_search
{
  "query":{
    "match": {
      "name.keyword": "李四"
    }
  }
}

# Sort ascending (sort on the keyword sub-field; sorting directly on a text field is disabled by default)
GET test1/_search
{
  "query":{
    "match": {
      "name.keyword": "李四"
    }
  },
  "sort":[
    {
      "name":{
        "order": "asc"
      }
    }]
  
}

7-5 Paginated queries

When there is a lot of data you usually don't want to return everything at once, so you paginate the query.

from - offset of the first hit (starting at index 0)
size - maximum number of hits to return

# from=1, size=5: skip the first hit and return up to 5 (with only 3 documents, 2 hits come back)
GET test1/_search
{
  "query":{
    "match_all": {}
  },
  "from":1,
  "size":5
}
#------------------------------------------------
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,   # 总数=3
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "name" : "李四",
          "age" : 18,
          "location" : "广州"
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "name" : "大壮",
          "age" : 18,
          "location" : "北京市朝阳区"
        }
      }
    ]
  }
}
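The same from/size pagination can be driven from Python; a small sketch, assuming the test1 index above and a page size of 2:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

page_size = 2
page = 0
while True:
    resp = es.search(index="test1", body={
        "query": {"match_all": {}},
        "from": page * page_size,
        "size": page_size
    })
    hits = resp["hits"]["hits"]
    if not hits:
        break
    for hit in hits:
        print(hit["_id"], hit["_source"])
    page += 1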

7-6 Boolean (bool) queries

The bool query is the most commonly used compound query.


1. Find people whose country is 蜀国, gender is 男 and age is 39
must - every condition must match (logical AND)

GET test1/_search
{
  "query":{
    "bool":{
      "must": [
        {
          "match":{
            "country": "蜀国"
          }
        },
        {
          "match": {
            "gender": "男"
          }
        },
        {
          "match": {
            "age": "39"
          }
        }
      ]
    }
  }
}
#---------------------------------------------------------
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 2.508318,
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "5",
        "_score" : 2.508318,
        "_source" : {
          "name" : "诸葛亮",
          "gender" : "男",
          "age" : 39,
          "country" : "蜀国",
          "tags" : "智谋超群"
        }
      }
    ]
  }
}

2. Find people whose tag is 闭月羞花 or whose age is 39
should - at least one condition must match (logical OR)


GET test1/_search
{
  "query":{
    "bool":{
      "should": [
        {
          "match":{
            "tags.keyword": "闭月羞花"
          }
        },
        {
          "match": {
            "age": "39"
          }
        }
      ]
    }
  }
}
# ---------- 2 hits returned (output omitted)

3. Find people whose age is not 39, whose tag is not 闭月羞花, and whose gender is not 男
must_not - none of the conditions may match (logical NOT)

GET test1/_search
{
  "query":{
    "bool":{
      "must_not": [
        {
          "match":{
            "tags.keyword": "闭月羞花"
          }
        },
        {
          "match": {
            "age": "39"
          }
        },
        {
          "match":{
            "gender": "男"
          }
        }
      ]
    }
  }
}
#----------------------------------------------------------------
{
    "hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "10",
        "_score" : 0.0,
        "_source" : {
          "name" : "大乔",
          "gender" : "女",
          "age" : 19,
          "country" : "吴国",
          "tags" : "沉鱼落雁"
        }
      }
    ]
  }
}

4. Find people aged 39 or older whose country is 魏国
gt greater than; lt less than; gte greater than or equal; lte less than or equal

# gt greater than; lt less than; gte >=; lte <=
GET test1/_search
{
  "query":{
    "bool":{
      "must": [
        {
          "match":{
            "country.keyword": "魏国"
          }
        }
      ],
      "filter": [
        {"range": {
          "age": {
            "gte": 39
          }
        }}
      ]
    }
  }
}
#--------------------------------------
"hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.2321435,
        "_source" : {
          "name" : "司马懿",
          "gender" : "男",
          "age" : 41,
          "country" : "魏国",
          "tags" : "暗藏韬略"
        }
      }
    ]
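The same bool query (must plus a range filter) can be sent from the Python client; a minimal sketch, assuming the same test1 index:

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

body = {
    "query": {
        "bool": {
            "must": [{"match": {"country.keyword": "魏国"}}],
            "filter": [{"range": {"age": {"gte": 39}}}]
        }
    }
}
resp = es.search(index="test1", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_source"]["age"])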

7-7 Filtering the returned fields

Add a _source clause listing the fields you want back:
_source: ["name", "age"]

GET test1/_search
{
  "query":{
    "bool":{
      "must": [
        {
          "match":{
            "country.keyword": "魏国"
          }
        }
      ],
      "filter": [
        {"range": {
          "age": {
            "gte": 38
          }
        }}
      ]
    }
  },
  "_source": ["name","age"]
}
#--------------------------------------
"hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.2321435,
        "_source" : {
          "name" : "司马懿",
          "age" : 41
        }
      },
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "7",
        "_score" : 1.2321435,
        "_source" : {
          "name" : "许绪",
          "age" : 38
        }
      }
    ]

7-8 Highlighting

pre_tags and post_tags wrap each matched term in custom HTML markup (CSS styles are supported); the default is <em>...</em>.

# Highlight matches (intended style here: bold red; see the Python sketch after this block)
GET test1/_search
{
  "query": {
    "match": {
      "tags": "沉鱼落雁"
    }
  },
  "highlight": {
    "pre_tags": "", 
    "post_tags": "", 
    "fields": {
      "tags": {}
    }
  }
}
#---------------------------------------
"hits" : [
      {
        "_index" : "test1",
        "_type" : "_doc",
        "_id" : "10",
        "_score" : 8.317766,
        "_source" : {
          "name" : "大乔",
          "gender" : "女",
          "age" : 19,
          "country" : "吴国",
          "tags" : "沉鱼落雁"
        },
        "highlight" : {
          "tags" : [
            ""
          ]
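The pre_tags/post_tags values above were lost when this page was saved; purely as an illustration, here is a sketch of the same highlight query from Python with an assumed bold-red tag (any HTML markup works; this is not necessarily what the course used):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

body = {
    "query": {"match": {"tags": "沉鱼落雁"}},
    "highlight": {
        # assumed markup for "bold red"
        "pre_tags": "<b style='color:red'>",
        "post_tags": "</b>",
        "fields": {"tags": {}}
    }
}
resp = es.search(index="test1", body=body)
for hit in resp["hits"]["hits"]:
    print(hit["highlight"]["tags"])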

7-9 Aggregation queries

Sum, max, min, average

1. Average age in 吴国

GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_avg":{
      "avg":{
        "field":"age"
      }
    }
  }
}
#-----------------------------------
"aggregations" : {
    "age_avg" : {
      "value" : 24.6
    }
  }



# Also restrict the returned fields
GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_avg":{
      "avg":{
        "field":"age"
      }
    }
  },
  "_source": ["name","age"]
}
#------------------------------------------------
"aggregations" : {
    "age_avg" : {
      "value" : 24.6
    }
  }


# Hide the hits with size=0 and return only the aggregation
GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_avg":{
      "avg":{
        "field":"age"
      }
    }
  },
  "_source": ["name","age"],
  "size": 0
}

2. Maximum age in 吴国 (the max aggregation returns the value, not the person)

GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_max":{
      "max":{
        "field":"age"
      }
    }
  },
  "_source": ["name","age"],
  "size": 2
}
#-------------------------------------
"aggregations" : {
    "age_max" : {
      "value" : 38.0
    }
  }

3. Sum of all ages in 吴国

GET test1/_search
{
  "query": {
    "match": {
      "country.keyword": "吴国"
    }
  },
  "aggs":{
    "age_sum":{
      "sum":{
        "field":"age"
      }
    }
  },
  "_source": ["name","age"],
  "size": 0
}
#---------------------------------
"aggregations" : {
    "age_sum" : {
      "value" : 123.0
    }
  }
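The same metric aggregations are available from the Python client; a minimal sketch computing the average, max and sum of age for 吴国 in one request (size=0 hides the hits):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

body = {
    "size": 0,
    "query": {"match": {"country.keyword": "吴国"}},
    "aggs": {
        "age_avg": {"avg": {"field": "age"}},
        "age_max": {"max": {"field": "age"}},
        "age_sum": {"sum": {"field": "age"}}
    }
}
resp = es.search(index="test1", body=body)
print(resp["aggregations"])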

7-10 Bucketed (group) queries

On top of each bucket, run an avg sub-aggregation.

# Range buckets with an avg sub-aggregation
GET test1/_search
{ "size":0,
  "query": {
    "match_all": {
    }
  },
  "aggs":{
    "age_group":{
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 10,
            "to": 20
          },
          {
            "from": 20,
            "to": 30
          },
          {
            "from": 30,
            "to": 40
          },
          {
            "from": 40,
            "to": 50
          }
        ]
      },
      "aggs": {
        "age_avg": {
          "avg": {
            "field": "age"
          }
        }
      }
      }
    }
  }
}
#---------------------------------------------------
"hits" : [ ]
  },
  "aggregations" : {
    "age_group" : {
      "buckets" : [
        {
          "key" : "10.0-20.0",
          "from" : 10.0,
          "to" : 20.0,
          "doc_count" : 4,
          "age_avg" : {
            "value" : 18.25
          }
        },
        {
          "key" : "20.0-30.0",
          "from" : 20.0,
          "to" : 30.0,
          "doc_count" : 1,
          "age_avg" : {
            "value" : 21.0
          }
        },
        {
          "key" : "30.0-40.0",
          "from" : 30.0,
          "to" : 40.0,
          "doc_count" : 4,
          "age_avg" : {
            "value" : 36.25
          }
        },
        {
          "key" : "40.0-50.0",
          "from" : 40.0,
          "to" : 50.0,
          "doc_count" : 2,
          "age_avg" : {
            "value" : 40.5
          }
        }
      ]

7-11 What is _doc used for?

_doc is not a document type. In versions before 6.0, an index could hold multiple mapping types and data was written as PUT index/type/id. From 6.x onward each index may contain only a single type, and in 7.x mapping types are deprecated altogether: _doc is simply a fixed placeholder in the request path (PUT index/_doc/id), kept for URL compatibility. (The course slides compare the pre-6.0, 6.x and 7.x request formats.)

7-12 The three mapping modes of Elasticsearch

In a relational database you have to define the table structure up front;

in Elasticsearch, the "table structure" (the mapping) can be created automatically:

PUT test1/_doc/1
{
  "name":"王百万"
}

PUT test1/_doc/2
{
  "name":"赵金条",
  "age":18
}

GET test1
#----------------------------
{
  "test1" : {
    "aliases" : { },
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "long"   # 长
        },
        "name" : {
          "type" : "text",    # 文本
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1608297148199",
        "number_of_shards" : "1",
        "number_of_replicas" : "1",
        "uuid" : "eNYY8AxJRV6o2KUdpnTVng",
        "version" : {
          "created" : "7070099"
        },
        "provided_name" : "test1"
      }
    }
  }
}

1. mappings
Field types

Define the mapping explicitly (two fields: name and age)

PUT mappings_test1
{
  "mappings": {
    "properties": {
      "name":{
        "type":"text"
      },
      "age":{
        "type":"long"
      }
    }
  }
}
#-------------------------------------
{
  "acknowledged" : true,
  "shards_acknowledged" : true,
  "index" : "mappings_test1"
}

PUT mappings_test1/_doc/1
{
  "name":"大壮",
  "age":18
}

GET mappings_test1/_search
The three mapping modes

Dynamic mapping: Elasticsearch infers the field types and creates the mapping automatically (the default, as in the test1 example above).

Static mapping
"dynamic": false,
When a new field such as company is added, Elasticsearch still stores the data, but does not add the field to the mapping or index it, so searching on company finds nothing.

# Create the index
PUT mappings_test2
{
  "mappings": {
    "dynamic": false, 
    "properties": {
      "name":{
        "type":"text"
      },
      "age":{
        "type":"long"
      }
    }
  }
}

# Insert a document (company is not in the mapping)
PUT mappings_test2/_doc/1
{
  "name":"张三",
  "age":19,
  "company":"imooc"
}

# Search on the un-mapped field
GET mappings_test2/_search
{
  "query": {
    "match": {
      "compang": "imooc"
    }
  }
}
#--------------------------
"hits" : [ ]
  

Strict mapping
Any field not defined in the mapping causes the write to be rejected with an error.

# Create the index
PUT mappings_test3
{
  "mappings": {
    "dynamic": "strict", 
    "properties": {
      "name":{
        "type":"text"
      },
      "age":{
        "type":"long"
      }
    }
  }
}


PUT mappings_test3/_doc/1
{
  "name":"张三",
  "age":19,
  "company":"imooc"
}
#------------------------
{
  "error" : {
    "root_cause" : [
      {
        "type" : "strict_dynamic_mapping_exception",
        "reason" : "mapping set to strict, dynamic introduction of [company] within [_doc] is not allowed"
      }
    ],
    "type" : "strict_dynamic_mapping_exception",
    "reason" : "mapping set to strict, dynamic introduction of [company] within [_doc] is not allowed"
  },
  "status" : 400
}
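From the Python client, the same strict mapping surfaces as a RequestError; a minimal sketch, assuming the mappings_test3 index created above:

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError

es = Elasticsearch(hosts="192.168.0.103:9200")

try:
    # "company" is not declared in the strict mapping, so this write is rejected
    es.index(index="mappings_test3", id=2, body={"name": "李四", "age": 20, "company": "imooc"})
except RequestError as e:
    print(e.status_code, e.error)   # 400 strict_dynamic_mapping_exception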

7-13 Elasticsearch analyzers

The analysis process

After a document is sent to Elasticsearch, and before it is added to the inverted index, Elasticsearch runs it through a series of processing steps:

  • Character filtering: character filters transform the raw characters.
  • Tokenization: the text is split into one or more tokens.
  • Token filtering: token filters transform each token.
  • Token indexing: the tokens are finally stored in the Lucene inverted index.
1. Standard analyzer: standard tokenizer

https://www.cnblogs.com/Neeo/articles/10593037.html

The standard tokenizer is a grammar-based tokenizer that works well for most European languages; it also handles Unicode text, with a default maximum token length of 255, and it removes punctuation such as commas and full stops.

Upper case is converted to lower case; text is split on word boundaries (spaces); special symbols are dropped; Chinese is split into single characters.

POST _analyze
{
  "analyzer": "standard",
  "text":"Hello World&端午节"
}
#----------------------------------------
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "",
      "position" : 1
    },
    {
      "token" : "端",
      "start_offset" : 12,
      "end_offset" : 13,
      "type" : "",
      "position" : 2
    },
    {
      "token" : "午",
      "start_offset" : 13,
      "end_offset" : 14,
      "type" : "",
      "position" : 3
    },
    {
      "token" : "节",
      "start_offset" : 14,
      "end_offset" : 15,
      "type" : "",
      "position" : 4
    }
  ]
}

2. Simple analyzer: simple analyzer

Splits on any non-letter character (which is why the & disappears), lowercases the tokens, and does not split Chinese characters apart.

POST _analyze
{
  "analyzer": "simple",
  "text":"Hello World&端午节"
}
#-------------------------------------------
{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "端午节",
      "start_offset" : 12,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    }
  ]
}
3. Whitespace analyzer: whitespace analyzer

Splits only on whitespace and does not change case.

POST _analyze
{
  "analyzer": "whitespace",
  "text":"Hello World&端午节"
}
#-------------------------
{
  "tokens" : [
    {
      "token" : "Hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "World&端午节",
      "start_offset" : 6,
      "end_offset" : 15,
      "type" : "word",
      "position" : 1
    }
  ]
}



The IK analyzer

Good support for Chinese word segmentation.


After installation, Elasticsearch must be restarted before the analyzer can be used.

Install the IK analyzer with the elasticsearch-plugin tool
user1@imooc:~$ cd /usr/share/elasticsearch/bin/
user1@imooc:/usr/share/elasticsearch/bin$ sudo ./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.7.0/elasticsearch-analysis-ik-7.7.0.zip
[sudo] user1 的密码: 
-> Installing https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.7.0/elasticsearch-analysis-ik-7.7.0.zip
-> Downloading https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.7.0/elasticsearch-analysis-ik-7.7.0.zip
[=================================================] 100%   
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@     WARNING: plugin requires additional permissions     @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.net.SocketPermission * connect,resolve
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.

Continue with installation? [y/N]y
-> Installed analysis-ik

# Restart Elasticsearch
user1@imooc:~$ service elasticsearch restart

http://192.168.0.103:9200/          # check that Elasticsearch responds
Using the IK analyzer


ik_max_word - fine-grained segmentation;
ik_smart - coarse-grained segmentation.

POST _analyze
{
  "analyzer": "ik_smart",
  "text":"八百标兵奔北坡"
}
#------------------------------------
{
  "tokens" : [
    {
      "token" : "八百",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "标兵",
      "start_offset" : 2,
      "end_offset" : 4,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "奔北",
      "start_offset" : 4,
      "end_offset" : 6,
      "type" : "CN_WORD",
      "position" : 2
    },
    {
      "token" : "坡",
      "start_offset" : 6,
      "end_offset" : 7,
      "type" : "CN_CHAR",
      "position" : 3
    }
  ]
}


POST _analyze
{
  "analyzer": "ik_max_word",
  "text":"端午节"
}
#---------------------------------------------
{
  "tokens" : [
    {
      "token" : "端午节",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "CN_WORD",
      "position" : 0
    },
    {
      "token" : "端午",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "CN_WORD",
      "position" : 1
    },
    {
      "token" : "节",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "CN_CHAR",
      "position" : 2
    }
  ]
}


7-14 Changing the Elasticsearch analyzer

Change the analyzer to ik_max_word; queries will then match on whole words rather than single characters.

# test1 uses the default standard analyzer
PUT test1

# Create an index whose title field uses the IK analyzer
PUT test2
{
  "mappings":{
    "properties":{
      "title":{
        "type":"text",
        "analyzer":"ik_max_word"
      }
    }
  }
}
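To check which analyzer a field actually uses, you can run _analyze against the index; a small sketch from Python, assuming the test2 index just created (requires the IK plugin to be installed):

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

# Analyze the text with the analyzer configured on test2's title field (ik_max_word)
resp = es.indices.analyze(index="test2", body={"field": "title", "text": "八百标兵奔北坡"})
print([t["token"] for t in resp["tokens"]])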

7-15 The inverted index

Reference: 1. 聊聊 Elasticsearch 的倒排索引

Elasticsearch is a distributed, multi-tenant, full-text search engine.
Inverted index: typing a keyword into the Baidu search page and getting back matching pages is the classic use of an inverted index.

1. Why is it called an inverted index?
Before search engines existed, we typed in a URL and then read the site's content; the direction was:

document -> to -> words  

From a document to the words inside it - this is called a forward index.
Later, we wanted to be able to type in a word and find the documents that contain it, or are related to it:

word -> to -> documents

This kind of index is called an inverted index; translated literally it would be a "reverse index", and in Chinese it is conventionally rendered as 倒排索引.

2. Elasticsearch builds an inverted index for every field.
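To make the idea concrete, here is a tiny toy sketch in Python that builds a word -> documents mapping by hand (just the concept; it has nothing to do with Elasticsearch's real Lucene implementation):

from collections import defaultdict

docs = {
    1: "hello world",
    2: "hello elasticsearch",
    3: "world of search",
}

# forward index: document -> words
# inverted index: word -> documents that contain the word
inverted = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        inverted[word].add(doc_id)

print(inverted["hello"])   # {1, 2}
print(inverted["world"])   # {1, 3}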

7-16 CRUD against Elasticsearch from Python

Install the elasticsearch Python client

pip install elasticsearch


from elasticsearch import Elasticsearch

# Implicit connection: defaults to 127.0.0.1:9200
es = Elasticsearch()
print(es.ping())
#----------------------------
True     # connected successfully

# Explicit host
es = Elasticsearch(hosts="127.0.0.1:9200")
print(es.ping())

print(es.ping(), es.info())
# Create an index
es.indices.create(index="imooc")
# List indices
print(es.cat.indices())
# Delete the index
print(es.indices.delete(index="imooc"))

# Index a document (the index is created automatically if needed)
print(es.index(index="imooc", body={"name":"大壮", "age":18}))
# Search the index
print(es.search(index="imooc"))

# Index a document with an explicit id
print(es.index(index="imooc", body={"name":"张三", "age":20}, id=1))
# Search all documents
print(es.search(index="imooc"))

# Get a single document by id
print(es.get(index="imooc", id=1))

# Search with a query DSL body
print(es.search(index="imooc", body={
    "query": {
        "match":{
            "name": "大壮"
        }
    }
}))

# Update a document
body = {
    "doc":{
        "tags":"imooc"
    }
}
print(es.get(index="imooc", id=1))
print(es.update(index="imooc", id=1, body=body))
print(es.get(index="imooc", id=1))

# Delete a single document
print(es.delete(index="imooc", id=1))

7-17 Bulk-inserting data into Elasticsearch from Python

1. Insert 10,000 documents in bulk

1. Naive approach: insert the documents one by one in a loop

import time

from elasticsearch import Elasticsearch

es = Elasticsearch(hosts="192.168.0.103:9200")

def timer(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        print("耗时{:.3f}".format(time.time()-start_time))
        return res

    return wrapper

@timer
def handle_es():
    for value in range(10000):
        es.index(index="value_demo", body={"value": value})

if __name__ == '__main__':
    handle_es()
#----------------------------------
耗时14.522

2. Optimization 1: build one big list of actions first, then insert it with helpers.bulk
Roughly a 37x speed-up (14.5 s -> 0.39 s).

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts="192.168.0.103:9200")

def timer(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        print("耗时{:.3f}".format(time.time()-start_time))
        return res

    return wrapper

@timer
def handle_es():
    data = [{"_index":"value_demo", "_source":{"value": value}} for value in range(10000)]
    # bulk-insert the prepared actions via helpers
    helpers.bulk(es, data)

if __name__ == '__main__':
    handle_es()
#------------------------------------
耗时0.390

3. Optimization 2: pass a generator instead of a list
The generator is handed straight to helpers.bulk, which consumes it and performs the inserts, so the whole action list never has to be built in Python memory first.
(The 耗时0.000 printed below only times the creation of the generator; the actual bulk call runs outside the decorated function.)

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts="192.168.0.103:9200")

def timer(func):
    def wrapper(*args, **kwargs):
        start_time = time.time()
        res = func(*args, **kwargs)
        print("耗时{:.3f}".format(time.time()-start_time))
        return res

    return wrapper

@timer
def handle_es():
    for value in range(10000):
        yield {"_index":"value_demo", "_source":{"value":value}}

if __name__ == '__main__':
    helpers.bulk(es, actions=handle_es())
    print(es.count(index="value_demo"))
#-----------------------------
耗时0.000
{'count': 40000, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
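If a single bulk call is still not fast enough, the helpers module also offers parallel_bulk, which sends chunks from several threads; a minimal sketch (thread_count and chunk_size are illustrative values, tune them for your cluster):

import time
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(hosts="192.168.0.103:9200")

def gen_actions():
    for value in range(10000):
        yield {"_index": "value_demo", "_source": {"value": value}}

start = time.time()
# parallel_bulk returns a generator of (ok, info) tuples; it must be consumed
for ok, info in helpers.parallel_bulk(es, gen_actions(), thread_count=4, chunk_size=1000):
    if not ok:
        print(info)
print("耗时{:.3f}".format(time.time() - start))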
