Don't lose yourself in other people's opinions; listen to your own inner voice and stay true to yourself!
References:
- ElasticSearch, part 3: core concepts
- https://www.cnblogs.com/Neeo/articles/10593037.html
Previously we simulated human clicks with U2 (uiautomator2) and parsed the captured traffic with mitmdump.
In this chapter we store that data in Elasticsearch and analyze it with Kibana.
7-1 Elasticsearch introduction and installation
1. Introduction to Elasticsearch
As the heart of the Elastic Stack, Elasticsearch is a distributed, document-oriented, RESTful search and analytics engine. It supports structured and unstructured queries and does not require a schema to be defined up front. Elasticsearch can be used as a search engine and is commonly applied to web-scale log analytics, real-time application monitoring, and clickstream analysis.
https://www.elastic.co/cn/elasticsearch/
2. Installing Elasticsearch
Note: the session below uses 7.7.0 throughout; whichever version you pick, keep Elasticsearch, Kibana, and any plugins (such as the IK analyzer later in this chapter) on exactly the same version.
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.0-amd64.deb
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.0-amd64.deb.sha512
shasum -a 512 -c elasticsearch-7.7.0-amd64.deb.sha512
sudo dpkg -i elasticsearch-7.7.0-amd64.deb
# Install
user1@imooc:~/data$ sudo dpkg -i elasticsearch-7.7.0-amd64.deb
# Install location
user1@imooc:~/data$ ls /usr/share/elasticsearch/
bin jdk lib modules NOTICE.txt plugins README.asciidoc
# Note the setgid (s) bit, which grants the elasticsearch group access
user1@imooc:~/data$ ls -ld /etc/elasticsearch/
drwxr-s--- 3 root elasticsearch 4096 Dec 18 10:05 /etc/elasticsearch/
# Configuration file
user1@imooc:~/data$ sudo nano /etc/elasticsearch/elasticsearch.yml
network.host: 0.0.0.0
cluster.initial_master_nodes: node-1   # must match node.name; set node.name: node-1 as well
# Start the service
user1@imooc:~/data$ service elasticsearch start
[sudo] password for user1:
user1@imooc:~/data$ curl http://localhost:9200 # check that it is running
{
"name" : "imooc",
"cluster_name" : "elasticsearch", # cluster name
"cluster_uuid" : "Djbdx0q4RUy6zYEAGxeHpQ",
"version" : {
"number" : "7.7.0", # version number
"build_flavor" : "default",
"build_type" : "deb",
"build_hash" : "81a1e9eda8e6183f5237786246f6dced26a10eaf",
"build_date" : "2020-05-12T02:01:37.602180Z",
"build_snapshot" : false,
"lucene_version" : "8.5.1",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search" # the search-engine tagline
}
user1@imooc:~/data$ service elasticsearch restart
user1@imooc:~/data$ ps -ef | grep elasticsearch
elastic+ 108187 1 17 11:48 ? 00:00:14 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX:+ShowCodeDetailsInExceptionMessages -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=SPI,COMPAT -Xms1g -Xmx1g -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -Djava.io.tmpdir=/tmp/elasticsearch-1918501227772571995 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=536870912 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=deb -Des.bundled_jdk=true -cp /usr/share/elasticsearch/lib/* org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet
elastic+ 108373 108187 0 11:48 ? 00:00:00 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller
user1 108448 105669 0 11:49 pts/7 00:00:00 grep --color=auto elasticsearch
user1@imooc:~/data$ cat /etc/passwd
elasticsearch:x:124:128::/nonexistent:/bin/false
Note: Elasticsearch refuses to run as root. The Debian package creates a dedicated elasticsearch system user (with /bin/false as its shell, as shown above), and the service runs under that account.
Open http://192.168.0.103:9200/ in a browser; if it returns the same information as curl http://localhost:9200, Elasticsearch is installed and running correctly.
7-2 Installing the visualization component Kibana; basic CRUD in Elasticsearch
1. Elasticsearch basics
Document
A unit of data that can be indexed, e.g. a document for a user or for a product. Documents are JSON objects. A document is roughly analogous to a row in a MySQL table.
Index (Index)
An index is a collection of documents with similar characteristics, roughly analogous to a MySQL table.
Node (Node)
A node is a single running instance of Elasticsearch.
Each server in a cluster is a node; nodes store data and take part in the cluster's indexing and search. For a node to join a cluster, it only needs to be configured with the cluster's name. By default, if you start several nodes and they can discover each other, they automatically form a cluster. This default behavior is not reliable, however, and can lead to split-brain situations, so in practice you should always configure the cluster explicitly.
Cluster (Cluster)
One or more servers running Elasticsearch nodes form a cluster; the nodes hold the data together and jointly provide search.
A cluster has a name that uniquely identifies it, called the cluster name; the default is elasticsearch. Only nodes configured with the same cluster name join the same cluster.
You can set the cluster name in config/elasticsearch.yml:
cluster.name: javaboy-es
Cluster health is reported as one of three colors (note: this describes the cluster, not an individual node):
Green: healthy; all primary and replica shards are allocated and working normally.
Yellow: warning; all primary shards are active, but at least one replica shard is unassigned.
Red: at least one primary shard is unassigned, so part of the data is unavailable.
2. Installing Kibana
1. Introduction to Kibana
Kibana lets you query and analyze the data stored in Elasticsearch. It ships with classic visualizations (bar charts, line charts, pie charts, donut charts, and so on) built on top of Elasticsearch's aggregations, and its Dashboard feature combines visualizations into a single large display, e.g. for analyzing a web application's visitor logs.
2. Installing Kibana
Kibana installation docs: https://www.elastic.co/guide/en/kibana/current/deb.html
Note: in this setup Kibana is installed on the same Linux host as Elasticsearch (192.168.0.103), and the Kibana version should match the Elasticsearch version exactly.
wget https://artifacts.elastic.co/downloads/kibana/kibana-7.7.0-amd64.deb
shasum -a 512 kibana-7.7.0-amd64.deb
sudo dpkg -i kibana-7.7.0-amd64.deb
# Install
user1@imooc:~/data$ sudo dpkg -i kibana-7.7.0-amd64.deb
user1@imooc:~/data$ ls /etc/kibana/
kibana.yml
user1@imooc:~/data$ sudo nano /etc/kibana/kibana.yml # configuration file
server.port: 5601
server.host: "192.168.0.103"
elasticsearch.hosts: ["http://localhost:9200"]
i18n.locale: "zh-CN"
# UI language: Chinese
# Start the service
user1@imooc:~/data$ service kibana start
user1@imooc:~/data$ ps -ef|grep kibana
kibana 109436 1 64 12:45 ? 00:00:33 /usr/share/kibana/bin/../node/bin/node /usr/share/kibana/bin/../src/cli -c /etc/kibana/kibana.yml
user1 109516 105669 0 12:46 pts/7 00:00:00 grep --color=auto kibana
Startup can take one to five minutes, depending on the machine.
Web UI: http://192.168.0.103:5601/app/kibana#/home
3. CRUD from the Kibana Dev Tools console
https://www.elastic.co/guide/en/elasticsearch/reference/7.7/cat-indices.html
Create, update, and delete
Index management:
PUT test1 # create an index
GET _cat/indices # list all indices
DELETE test1 # delete an index
# Insert 3 documents
PUT test1/_doc/1
{
"name":"大壮",
"age":18,
"location":"北京"
}
PUT test1/_doc/2
{
"name":"张三",
"age":18,
"location":"上海"
}
PUT test1/_doc/3
{
"name":"李四",
"age":18,
"location":"广州"
}
# --------------------------------
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "3",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 2,
"_primary_term" : 1
}
# View the inserted documents
GET test1/_search
#-----------------------------------
{
"took" : 738,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : { # 3 documents in total
"value" : 3,
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "大壮",
"age" : 18,
"location" : "北京"
}
},
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "2",
"_score" : 1.0,
"_source" : {
"name" : "张三",
"age" : 18,
"location" : "上海"
}
},
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "李四",
"age" : 18,
"location" : "广州"
}
}
]
}
}
Updating a document (update)
POST test1/_update/1
{
"doc":{
"location":"北京市朝阳区"
}
}
#--------------------------------------------
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "1",
"_version" : 2, # version bumped by the update
"result" : "updated", # operation result
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 3,
"_primary_term" : 1
}
# View the updated document
GET test1/_doc/1
#---------------------------------
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "1",
"_version" : 2,
"_seq_no" : 3,
"_primary_term" : 1,
"found" : true,
"_source" : {
"name" : "大壮",
"age" : 18,
"location" : "北京市朝阳区"
}
}
# Delete one document
DELETE test1/_doc/1
# Delete the whole index (and all documents in it)
DELETE test1
7-3 Advanced queries in Elasticsearch
What does _doc mean? (See 7-11.)
1. The two ways to query Elasticsearch
- Query string (q=...), for simple queries;
- Query DSL, a JSON request body.
# Fetch one document vs. search everything
GET test1/_doc/1
GET test1/_search
GET test1/_search?q=name:李四
# name is a text field: the query is analyzed and the terms match with OR semantics
# name.keyword is not analyzed: the whole value must match exactly
GET test1/_search
{
"query":{
"match":{
"name":"李四"
}
}
}
GET test1/_search
{
"query":{
"match":{
"name.keyword":"李四"
}
}
}
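The difference between matching on the text field and on its .keyword sub-field can be sketched in plain Python. This is a toy model, not the real analyzer: it assumes the standard analyzer's one-token-per-CJK-character behavior, and the sample documents are hypothetical.

```python
def analyze(text):
    # toy stand-in for the standard analyzer: one token per CJK character
    return list(text)

def match_text(docs, field, query):
    # match on a text field: the query is analyzed, terms combine with OR
    terms = set(analyze(query))
    return [d for d in docs if terms & set(analyze(d[field]))]

def match_keyword(docs, field, query):
    # match on field.keyword: no analysis, the whole value must be equal
    return [d for d in docs if d[field] == query]

docs = [{"name": "李四"}, {"name": "李逵"}, {"name": "张三"}]
print(match_text(docs, "name", "李四"))     # 李四 and 李逵 both contain the term 李
print(match_keyword(docs, "name", "李四"))  # only the exact value 李四
```

This is why a match on name for 李四 can surface documents that merely share one character, while name.keyword returns only exact matches.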
# match_all is like: select * from table
GET test1/_search
{
"query":{
"match_all": {}
}
}
7-4 Sorting query results in Elasticsearch
Results are ordered with the sort clause; "order": "asc" sorts ascending, "desc" descending. Without an explicit sort, hits are ordered by relevance (_score, descending). Note that you cannot sort directly on an analyzed text field; sort on its .keyword sub-field instead.
# Query without an explicit sort: ordered by _score
GET test1/_search
{
"query":{
"match": {
"name.keyword": "李四"
}
}
}
# Sort ascending by name (using the .keyword sub-field, since text fields cannot be sorted directly)
GET test1/_search
{
"query":{
"match": {
"name.keyword": "李四"
}
},
"sort":[
{
"name.keyword":{
"order": "asc"
}
}]
}
7-5 Paginated queries in Elasticsearch
With a large result set you don't want to return everything at once, so you query page by page:
from - 0-based offset of the first hit to return
size - maximum number of hits per page
# Skip the first hit and return up to 5 (only 2 of the 3 documents remain)
GET test1/_search
{
"query":{
"match_all": {}
},
"from":1,
"size":5
}
#------------------------------------------------
{
"took" : 0,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 3, # total = 3
"relation" : "eq"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "3",
"_score" : 1.0,
"_source" : {
"name" : "李四",
"age" : 18,
"location" : "广州"
}
},
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"name" : "大壮",
"age" : 18,
"location" : "北京市朝阳区"
}
}
]
}
}
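from/size is plain offset pagination; what the example above returns can be sketched as list slicing in Python (the hit list is hypothetical):

```python
def paginate(hits, from_=0, size=10):
    # from: 0-based offset of the first hit; size: max hits to return
    return hits[from_:from_ + size]

hits = ["doc_3", "doc_1", "doc_2"]      # 3 hits in index order
print(paginate(hits, from_=1, size=5))  # skips 1 hit; only 2 remain
```

This also explains why the response above holds two hits even though size is 5: after skipping one of the three documents, only two are left.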
7-6 Boolean queries
The bool query is the most common way to combine queries.
1. Find people whose country is 蜀国, gender is 男, and age is 39
must: every clause has to match (AND semantics)
GET test1/_search
{
"query":{
"bool":{
"must": [
{
"match":{
"country": "蜀国"
}
},
{
"match": {
"gender": "男"
}
},
{
"match": {
"age": "39"
}
}
]
}
}
}
#---------------------------------------------------------
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 2.508318,
"hits" : [
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "5",
"_score" : 2.508318,
"_source" : {
"name" : "诸葛亮",
"gender" : "男",
"age" : 39,
"country" : "蜀国",
"tags" : "智谋超群"
}
}
]
}
}
2. Find people whose tag is 闭月羞花 or whose age is 39
should: at least one clause has to match (OR semantics)
GET test1/_search
{
"query":{
"bool":{
"should": [
{
"match":{
"tags.keyword": "闭月羞花"
}
},
{
"match": {
"age": "39"
}
}
]
}
}
}
3. Find people whose age is not 39, whose tag is not 闭月羞花, and whose gender is not 男
must_not: none of the clauses may match (NOT semantics)
GET test1/_search
{
"query":{
"bool":{
"must_not": [
{
"match":{
"tags.keyword": "闭月羞花"
}
},
{
"match": {
"age": "39"
}
},
{
"match":{
"gender": "男"
}
}
]
}
}
}
#----------------------------------------------------------------
{
"hits" : [
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "10",
"_score" : 0.0,
"_source" : {
"name" : "大乔",
"gender" : "女",
"age" : 19,
"country" : "吴国",
"tags" : "沉鱼落雁"
}
}
]
}
}
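The three bool clauses can be sketched as predicate combinators in Python. This is a simplified model (it ignores scoring and minimum_should_match details), and the sample documents merely echo the ones above:

```python
def bool_query(docs, must=(), should=(), must_not=()):
    # must: every clause matches (AND); must_not: no clause matches (NOT);
    # should: at least one clause matches when there are no must clauses (OR)
    results = []
    for d in docs:
        if not all(p(d) for p in must):
            continue
        if any(p(d) for p in must_not):
            continue
        if should and not must and not any(p(d) for p in should):
            continue
        results.append(d)
    return results

docs = [
    {"name": "诸葛亮", "gender": "男", "age": 39, "country": "蜀国"},
    {"name": "大乔", "gender": "女", "age": 19, "country": "吴国"},
]
# must: country=蜀国 AND gender=男 AND age=39
print(bool_query(docs, must=[lambda d: d["country"] == "蜀国",
                             lambda d: d["gender"] == "男",
                             lambda d: d["age"] == 39]))
# must_not: age != 39 AND gender != 男
print(bool_query(docs, must_not=[lambda d: d["age"] == 39,
                                 lambda d: d["gender"] == "男"]))
```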
4. Find people from 魏国 whose age is >= 39
Range operators: gt (>), gte (>=), lt (<), lte (<=)
# gt >, gte >=, lt <, lte <= ; ranges go in a filter clause (no scoring, cacheable)
GET test1/_search
{
"query":{
"bool":{
"must": [
{
"match":{
"country.keyword": "魏国"
}
}
],
"filter": [
{"range": {
"age": {
"gte": 39
}
}}
]
}
}
}
#--------------------------------------
"hits" : [
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.2321435,
"_source" : {
"name" : "司马懿",
"gender" : "男",
"age" : 41,
"country" : "魏国",
"tags" : "暗藏韬略"
}
}
]
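The range operators map directly onto Python comparisons; a minimal sketch (the sample documents echo the response above and are otherwise hypothetical):

```python
def range_filter(docs, field, gt=None, gte=None, lt=None, lte=None):
    # gt >, gte >=, lt <, lte <= — the same operators as the range query
    out = []
    for d in docs:
        v = d[field]
        if gt is not None and not v > gt:
            continue
        if gte is not None and not v >= gte:
            continue
        if lt is not None and not v < lt:
            continue
        if lte is not None and not v <= lte:
            continue
        out.append(d)
    return out

docs = [{"name": "司马懿", "age": 41}, {"name": "许绪", "age": 38}]
print(range_filter(docs, "age", gte=39))  # only 司马懿 (41 >= 39)
```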
7-7 Filtering the returned fields
Use source filtering to return only the fields you need:
"_source": ["name", "age"]
GET test1/_search
{
"query":{
"bool":{
"must": [
{
"match":{
"country.keyword": "魏国"
}
}
],
"filter": [
{"range": {
"age": {
"gte": 38
}
}}
]
}
},
"_source": ["name","age"]
}
#--------------------------------------
"hits" : [
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "4",
"_score" : 1.2321435,
"_source" : {
"name" : "司马懿",
"age" : 41
}
},
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "7",
"_score" : 1.2321435,
"_source" : {
"name" : "许绪",
"age" : 38
}
}
]
7-8 Highlighting
pre_tags and post_tags wrap each highlighted term in custom markup, including inline CSS, e.g. <b style='color:red'> ... </b>.
# Highlight matches in bold red
GET test1/_search
{
"query": {
"match": {
"tags": "沉鱼落雁"
}
},
"highlight": {
"pre_tags": "<b style='color:red'>",
"post_tags": "</b>",
"fields": {
"tags": {}
}
}
}
#---------------------------------------
"hits" : [
{
"_index" : "test1",
"_type" : "_doc",
"_id" : "10",
"_score" : 8.317766,
"_source" : {
"name" : "大乔",
"gender" : "女",
"age" : 19,
"country" : "吴国",
"tags" : "沉鱼落雁"
},
"highlight" : {
"tags" : [
"<b style='color:red'>沉鱼落雁</b>"
]
7-9 Aggregation queries in Elasticsearch
Metric aggregations: sum, max, min, avg.
1. Average age in 吴国
GET test1/_search
{
"query": {
"match": {
"country.keyword": "吴国"
}
},
"aggs":{
"age_avg":{
"avg":{
"field":"age"
}
}
}
}
#-----------------------------------
"aggregations" : {
"age_avg" : {
"value" : 24.6
}
}
# Also filter the returned fields
GET test1/_search
{
"query": {
"match": {
"country.keyword": "吴国"
}
},
"aggs":{
"age_avg":{
"avg":{
"field":"age"
}
}
},
"_source": ["name","age"]
}
#------------------------------------------------
"aggregations" : {
"age_avg" : {
"value" : 24.6
}
}
# Hide the matching documents with size: 0
GET test1/_search
{
"query": {
"match": {
"country.keyword": "吴国"
}
},
"aggs":{
"age_avg":{
"avg":{
"field":"age"
}
}
},
"_source": ["name","age"],
"size": 0
}
2. Maximum age in 吴国 (max returns the value, not the person; keep size > 0 if you also want to see documents)
GET test1/_search
{
"query": {
"match": {
"country.keyword": "吴国"
}
},
"aggs":{
"age_max":{
"max":{
"field":"age"
}
}
},
"_source": ["name","age"],
"size": 2
}
#-------------------------------------
"aggregations" : {
"age_max" : {
"value" : 38.0
}
}
3. Sum of all ages in 吴国
GET test1/_search
{
"query": {
"match": {
"country.keyword": "吴国"
}
},
"aggs":{
"age_sum":{
"sum":{
"field":"age"
}
}
},
"_source": ["name","age"],
"size": 0
}
#---------------------------------
"aggregations" : {
"age_sum" : {
"value" : 123.0
}
}
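The metric aggregations above boil down to simple reductions over the age values of the matching documents. A Python sketch, with hypothetical ages chosen to reproduce the values shown above:

```python
ages = [19, 21, 38, 18, 27]  # hypothetical ages of the 吴国 documents

aggs = {
    "age_avg": sum(ages) / len(ages),  # the avg aggregation
    "age_max": max(ages),              # the max aggregation
    "age_sum": sum(ages),              # the sum aggregation
}
print(aggs)  # {'age_avg': 24.6, 'age_max': 38, 'age_sum': 123}
```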
7-10 Grouped (bucket) queries
A range aggregation groups documents into buckets, and a nested sub-aggregation computes the average within each bucket.
# Group by age range, then average each group
GET test1/_search
{ "size":0,
"query": {
"match_all": {
}
},
"aggs":{
"age_group":{
"range": {
"field": "age",
"ranges": [
{
"from": 10,
"to": 20
},
{
"from": 20,
"to": 30
},
{
"from": 30,
"to": 40
},
{
"from": 40,
"to": 50
}
]
},
"aggs": {
"age_avg": {
"avg": {
"field": "age"
}
}
}
}
}
}
}
#---------------------------------------------------
"hits" : [ ]
},
"aggregations" : {
"age_group" : {
"buckets" : [
{
"key" : "10.0-20.0",
"from" : 10.0,
"to" : 20.0,
"doc_count" : 4,
"age_avg" : {
"value" : 18.25
}
},
{
"key" : "20.0-30.0",
"from" : 20.0,
"to" : 30.0,
"doc_count" : 1,
"age_avg" : {
"value" : 21.0
}
},
{
"key" : "30.0-40.0",
"from" : 30.0,
"to" : 40.0,
"doc_count" : 4,
"age_avg" : {
"value" : 36.25
}
},
{
"key" : "40.0-50.0",
"from" : 40.0,
"to" : 50.0,
"doc_count" : 2,
"age_avg" : {
"value" : 40.5
}
}
]
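A bucket aggregation with a nested metric can be sketched like this. The ages are hypothetical, chosen to reproduce the bucket counts and averages above; note that "from" is inclusive and "to" is exclusive, as in the range aggregation:

```python
def range_buckets(ages, ranges):
    buckets = []
    for lo, hi in ranges:
        vals = [a for a in ages if lo <= a < hi]  # from inclusive, to exclusive
        buckets.append({
            "key": f"{lo}.0-{hi}.0",
            "doc_count": len(vals),
            # nested sub-aggregation: average within this bucket
            "age_avg": {"value": sum(vals) / len(vals) if vals else None},
        })
    return buckets

ages = [18, 18, 18, 19, 21, 34, 34, 38, 39, 40, 41]  # hypothetical
for b in range_buckets(ages, [(10, 20), (20, 30), (30, 40), (40, 50)]):
    print(b)
```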
7-11 What is _doc for?
Before version 6.0, Elasticsearch addressed documents as index/type/id, where the type was a user-defined document type and an index could hold several of them.
_doc is not a document type: types were deprecated starting in 6.0 and removed in 7.x, and _doc is simply the fixed placeholder that remains in the URL path, as in PUT test1/_doc/1.
7-12 The three mapping modes of Elasticsearch
In MySQL you must define the table schema first; in Elasticsearch, by contrast, the "schema" (the mapping) is created automatically from the documents you insert:
PUT test1/_doc/1
{
"name":"王百万"
}
PUT test1/_doc/2
{
"name":"赵金条",
"age":18
}
GET test1
#----------------------------
{
"test1" : {
"aliases" : { },
"mappings" : {
"properties" : {
"age" : {
"type" : "long" # inferred numeric type
},
"name" : {
"type" : "text", # inferred text type
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1608297148199",
"number_of_shards" : "1",
"number_of_replicas" : "1",
"uuid" : "eNYY8AxJRV6o2KUdpnTVng",
"version" : {
"created" : "7070099"
},
"provided_name" : "test1"
}
}
}
}
1. mappings
Field types
Define the mapping explicitly (two fields: name and age):
PUT mappings_test1
{
"mappings": {
"properties": {
"name":{
"type":"text"
},
"age":{
"type":"long"
}
}
}
}
#-------------------------------------
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "mappings_test1"
}
PUT mappings_test1/_doc/1
{
"name":"大壮",
"age":18
}
GET mappings_test1/_search
The three mapping modes
Dynamic mapping (the default): Elasticsearch infers field types and creates the mapping automatically.
Static mapping:
"dynamic": false,
When a new field such as company is added, Elasticsearch stores it in _source but does not add it to the mapping, so you cannot search on it.
# Create an index with "dynamic": false
PUT mappings_test2
{
"mappings": {
"dynamic": false,
"properties": {
"name":{
"type":"text"
},
"age":{
"type":"long"
}
}
}
}
# Insert a document containing an undefined field (company)
PUT mappings_test2/_doc/1
{
"name":"张三",
"age":19,
"company":"imooc"
}
# Query the unmapped field: no hits come back
GET mappings_test2/_search
{
"query": {
"match": {
"company": "imooc"
}
}
}
#--------------------------
"hits" : [ ]
Strict mapping
With "dynamic": "strict", any document containing an undefined field is rejected with an error:
# Create the index
PUT mappings_test3
{
"mappings": {
"dynamic": "strict",
"properties": {
"name":{
"type":"text"
},
"age":{
"type":"long"
}
}
}
}
PUT mappings_test3/_doc/1
{
"name":"张三",
"age":19,
"company":"imooc"
}
#------------------------
{
"error" : {
"root_cause" : [
{
"type" : "strict_dynamic_mapping_exception",
"reason" : "mapping set to strict, dynamic introduction of [company] within [_doc] is not allowed"
}
],
"type" : "strict_dynamic_mapping_exception",
"reason" : "mapping set to strict, dynamic introduction of [company] within [_doc] is not allowed"
},
"status" : 400
}
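The three dynamic modes can be modeled in a few lines of Python. This is a simplified sketch of the behavior, not the real indexing path:

```python
def index_doc(mapping, doc, dynamic=True):
    """dynamic=True: unknown fields are added to the mapping;
    dynamic=False: unknown fields are stored but not mapped (unsearchable);
    dynamic="strict": unknown fields are rejected with an error."""
    for field, value in doc.items():
        if field not in mapping:
            if dynamic == "strict":
                raise ValueError(f"strict_dynamic_mapping_exception: [{field}]")
            if dynamic is True:
                mapping[field] = "long" if isinstance(value, int) else "text"
    return doc

mapping = {"name": "text", "age": "long"}
# dynamic=False: accepted, but company never enters the mapping
index_doc(dict(mapping), {"name": "张三", "age": 19, "company": "imooc"}, dynamic=False)
# dynamic="strict": the undefined field is rejected
try:
    index_doc(dict(mapping), {"name": "张三", "company": "imooc"}, dynamic="strict")
except ValueError as e:
    print(e)  # strict_dynamic_mapping_exception: [company]
```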
7-13 Elasticsearch analyzers
The analysis process
After a document reaches Elasticsearch and before it enters the inverted index, Elasticsearch runs it through a series of steps:
- Character filtering: transform characters with character filters.
- Tokenization: split the text into individual tokens (terms).
- Token filtering: transform each token with token filters.
- Indexing: store the resulting tokens in the Lucene inverted index.
1. The standard analyzer (standard tokenizer)
https://www.cnblogs.com/Neeo/articles/10593037.html
The standard tokenizer is a grammar-based tokenizer that works well for most European languages. It also handles Unicode text, with a default maximum token length of 255, and removes punctuation such as commas and periods.
In short: it lowercases, splits on whitespace and punctuation, strips special characters, and splits CJK text into single-character tokens.
POST _analyze
{
"analyzer": "standard",
"text":"Hello World&端午节"
}
#----------------------------------------
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "<ALPHANUM>",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "<ALPHANUM>",
"position" : 1
},
{
"token" : "端",
"start_offset" : 12,
"end_offset" : 13,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "午",
"start_offset" : 13,
"end_offset" : 14,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "节",
"start_offset" : 14,
"end_offset" : 15,
"type" : "<IDEOGRAPHIC>",
"position" : 4
}
]
}
2. The simple analyzer
Splits on non-letter characters and lowercases; CJK text is kept together rather than split per character.
POST _analyze
{
"analyzer": "simple",
"text":"Hello World&端午节"
}
#-------------------------------------------
{
"tokens" : [
{
"token" : "hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "world",
"start_offset" : 6,
"end_offset" : 11,
"type" : "word",
"position" : 1
},
{
"token" : "端午节",
"start_offset" : 12,
"end_offset" : 15,
"type" : "word",
"position" : 2
}
]
}
3. The whitespace analyzer
Splits on whitespace only; no lowercasing and no other processing.
POST _analyze
{
"analyzer": "whitespace",
"text":"Hello World&端午节"
}
#-------------------------
{
"tokens" : [
{
"token" : "Hello",
"start_offset" : 0,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "World&端午节",
"start_offset" : 6,
"end_offset" : 15,
"type" : "word",
"position" : 1
}
]
}
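The behavior of the three built-in analyzers on the example text can be approximated in Python with regular expressions. This is a rough sketch; the real analyzers are considerably more sophisticated:

```python
import re

def simple(text):
    # simple analyzer: split on anything that is not a letter, then lowercase
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def whitespace(text):
    # whitespace analyzer: split on whitespace only, keep case
    return text.split()

def standard(text):
    # standard analyzer (approx.): like simple, but CJK runs become single-character tokens
    tokens = []
    for t in simple(text):
        if re.search(r"[\u4e00-\u9fff]", t):
            tokens.extend(t)
        else:
            tokens.append(t)
    return tokens

text = "Hello World&端午节"
print(standard(text))    # ['hello', 'world', '端', '午', '节']
print(simple(text))      # ['hello', 'world', '端午节']
print(whitespace(text))  # ['Hello', 'World&端午节']
```

The three printed results match the _analyze responses shown above.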
The IK analyzer
IK has much better support for Chinese.
Install it with the elasticsearch-plugin tool (the plugin version must match your Elasticsearch version exactly) and restart Elasticsearch afterwards:
Installing the IK analyzer as an Elasticsearch plugin
user1@imooc:~$ cd /usr/share/elasticsearch/bin/
user1@imooc:/usr/share/elasticsearch/bin$ sudo ./elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.7.0/elasticsearch-analysis-ik-7.7.0.zip
[sudo] password for user1:
-> Installing https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.7.0/elasticsearch-analysis-ik-7.7.0.zip
-> Downloading https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.7.0/elasticsearch-analysis-ik-7.7.0.zip
[=================================================] 100%
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: plugin requires additional permissions @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
* java.net.SocketPermission * connect,resolve
See http://docs.oracle.com/javase/8/docs/technotes/guides/security/permissions.html
for descriptions of what these permissions allow and the associated risks.
Continue with installation? [y/N]y
-> Installed analysis-ik
# Restart Elasticsearch
user1@imooc:~$ service elasticsearch restart
http://192.168.0.103:9200/ # check that it responds again
Using the IK analyzer
ik_max_word: fine-grained segmentation (emits every plausible word);
ik_smart: coarse-grained segmentation (emits the fewest, longest words);
POST _analyze
{
"analyzer": "ik_smart",
"text":"八百标兵奔北坡"
}
#------------------------------------
{
"tokens" : [
{
"token" : "八百",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "标兵",
"start_offset" : 2,
"end_offset" : 4,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "奔北",
"start_offset" : 4,
"end_offset" : 6,
"type" : "CN_WORD",
"position" : 2
},
{
"token" : "坡",
"start_offset" : 6,
"end_offset" : 7,
"type" : "CN_CHAR",
"position" : 3
}
]
}
POST _analyze
{
"analyzer": "ik_max_word",
"text":"端午节"
}
#---------------------------------------------
{
"tokens" : [
{
"token" : "端午节",
"start_offset" : 0,
"end_offset" : 3,
"type" : "CN_WORD",
"position" : 0
},
{
"token" : "端午",
"start_offset" : 0,
"end_offset" : 2,
"type" : "CN_WORD",
"position" : 1
},
{
"token" : "节",
"start_offset" : 2,
"end_offset" : 3,
"type" : "CN_CHAR",
"position" : 2
}
]
}
7-14 Changing the analyzer of an Elasticsearch field
When a field is mapped with the ik_max_word analyzer, queries against it match on whole words instead of single characters.
# test1 used the standard analyzer
PUT test1
# Use the IK analyzer instead
PUT test2
{
"mappings":{
"properties":{
"title":{
"type":"text",
"analyzer":"ik_max_word"
}
}
}
}
7-15 The inverted index
Reference: 1. 聊聊 Elasticsearch 的倒排索引
Elasticsearch is a distributed, multi-tenant-capable full-text search engine.
The inverted index is what powers keyword search: it is the structure that turns the keywords you type into a search box into the list of matching documents.
1. Why is it called an inverted index?
Before search engines, we typed in a URL and read whatever that site contained. The direction of access was:
document -> to -> words
Going from a document to the words it contains is a forward index.
Later, we wanted the opposite: type in a word and find the documents that contain it or relate to it:
word -> to -> documents
This kind of index is called an inverted index.
2. Elasticsearch builds an inverted index for every field.
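A minimal inverted index can be built in a few lines of Python (toy example; the documents are hypothetical):

```python
from collections import defaultdict

def build_inverted_index(docs):
    # word -> ids of the documents containing it: the "inverted" direction
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = {1: "you know for search", 2: "search and analytics"}
index = build_inverted_index(docs)
print(sorted(index["search"]))  # [1, 2] — the documents containing "search"
print(sorted(index["know"]))    # [1]
```

Looking up a word is then a dictionary access rather than a scan over every document, which is exactly what makes full-text search fast.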
7-16 CRUD on Elasticsearch from Python
Install the Python client:
pip install elasticsearch
from elasticsearch import Elasticsearch
# Implicit connection: defaults to 127.0.0.1:9200
es = Elasticsearch()
print(es.ping())
#----------------------------
True # connected successfully
# Explicit connection
es = Elasticsearch(hosts="127.0.0.1:9200")
print(es.ping())
print(es.ping(), es.info())
# Create an index
es.indices.create(index="imooc")
# List indices
print(es.cat.indices())
# Delete an index
print(es.indices.delete(index="imooc"))
# Index a document (creates the index on the fly; the id is auto-generated)
print(es.index(index="imooc", body={"name": "大壮", "age": 18}))
# Search
print(es.search(index="imooc"))
# Index a document with an explicit id
print(es.index(index="imooc", body={"name": "张三", "age": 20}, id=1))
# Search all documents
print(es.search(index="imooc"))
# Get a single document by id
print(es.get(index="imooc", id=1))
# Search with a query DSL body
print(es.search(index="imooc", body={
    "query": {
        "match": {
            "name": "大壮"
        }
    }
}))
# Update a document
body = {
    "doc": {
        "tags": "imooc"
    }
}
print(es.get(index="imooc", id=1))
print(es.update(index="imooc", id=1, body=body))
print(es.get(index="imooc", id=1))
# Delete a single document
print(es.delete(index="imooc", id=1))
7-17 Bulk-inserting data into Elasticsearch from Python
1. Insert 10,000 documents
1. Naive approach: insert one document per call in a loop
import time
from elasticsearch import Elasticsearch
es = Elasticsearch(hosts="192.168.0.103:9200")
def timer(func):
def wrapper(*args, **kwargs):
start_time = time.time()
res = func(*args, **kwargs)
print("耗时{:.3f}".format(time.time()-start_time))
return res
return wrapper
@timer
def handle_es():
for value in range(10000):
es.index(index="value_demo", body={"value": value})
if __name__ == '__main__':
handle_es()
#----------------------------------
耗时14.522
2. Optimization 1: build one big list of actions and hand it to helpers.bulk
Roughly 37x faster (14.522s vs 0.390s);
import time
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch(hosts="192.168.0.103:9200")
def timer(func):
def wrapper(*args, **kwargs):
start_time = time.time()
res = func(*args, **kwargs)
print("耗时{:.3f}".format(time.time()-start_time))
return res
return wrapper
@timer
def handle_es():
data = [{"_index":"value_demo", "_source":{"value": value}} for value in range(10000)]
# bulk-insert the whole batch in one call
helpers.bulk(es, data)
if __name__ == '__main__':
handle_es()
#------------------------------------
耗时0.390
3. Optimization 2: pass a generator of actions
Hand helpers.bulk a generator and let the client iterate and batch the inserts itself;
Elasticsearch can ingest data far faster than a Python loop of single index() calls can feed it.
import time
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch(hosts="192.168.0.103:9200")
def timer(func):
def wrapper(*args, **kwargs):
start_time = time.time()
res = func(*args, **kwargs)
print("耗时{:.3f}".format(time.time()-start_time))
return res
return wrapper
@timer
def handle_es():
for value in range(10000):
yield {"_index":"value_demo", "_source":{"value":value}}
if __name__ == '__main__':
helpers.bulk(es, actions=handle_es())
print(es.count(index="value_demo"))
#-----------------------------
耗时0.000
{'count': 40000, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}}
Note: the timer prints ~0.000 here because calling handle_es() merely builds the generator; the actual insertion happens inside helpers.bulk, outside the timed call. The count is 40000 because the value_demo index still holds the documents from the earlier runs.