蒋一清

Elasticsearch (Part 3)

1. Document Analysis

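Analysis consists of two steps:

First, a block of text is split into individual terms suitable for an inverted index.
Then, these terms are normalized into a standard form to improve their "searchability", or recall.

An analyzer does this work. An analyzer is really just three functions wrapped into a single package: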

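Character filters: First, the string passes through each character filter in order. Their job is to tidy up the string before tokenization. A character filter can be used to strip out HTML, or to convert & into and.
Tokenizer: Next, the string is split into individual terms by a tokenizer. A simple tokenizer might split the text into terms whenever it encounters whitespace or punctuation.
Token filters: Finally, each term passes through each token filter in order. This process may change terms (for example, lowercasing Quick), remove terms (for example, stop words such as a, and, and the), or add terms (for example, synonyms such as jump and leap).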
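The three stages can be sketched in plain Java to make the order of operations concrete. This is only a toy model under stated assumptions, not how Elasticsearch implements analysis: the class name ToyAnalyzer, the regular expression used for splitting, and the two-word stop list are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy three-stage analyzer: character filter -> tokenizer -> token filters. */
final class ToyAnalyzer {

    // Hypothetical, deliberately tiny stop-word list.
    private static final Set<String> STOP_WORDS = Set.of("a", "the");

    /** Character filter: tidy the raw string before tokenization. */
    static String charFilter(String text) {
        return text.replace("&", " and ");
    }

    /** Tokenizer: split on anything that is not a letter or a digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    /** Token filters: lowercase every term, then drop stop words. */
    static List<String> tokenFilter(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lower = token.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(lower)) {
                out.add(lower);
            }
        }
        return out;
    }

    /** Runs the three stages in the order described above. */
    static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        // prints [quick, and, brown, fox]
        System.out.println(analyze("The Quick & Brown fox"));
    }
}
```

Because the character filter runs first, the & has already become the word and by the time the tokenizer sees the string.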

1.1 Built-in Analyzers

Elasticsearch also ships with prepackaged analyzers that can be used directly. The most important ones are listed below. To show how they differ, let's see which terms each analyzer produces from the following string:

"Set the shape to semi-transparent by calling set_trans(5)"

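Standard analyzer

The standard analyzer is the default analyzer used by Elasticsearch. It is the most common choice for analyzing text in any language. It splits text on word boundaries as defined by the Unicode Consortium, removes most punctuation, and finally lowercases the terms. It produces: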

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

Simple analyzer

The simple analyzer splits the text at every character that is not a letter and lowercases the terms. It produces:

set, the, shape, to, semi, transparent, by, calling, set, trans

Whitespace analyzer

The whitespace analyzer splits the text on whitespace. It produces:

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)
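The difference between the simple and whitespace analyzers is easy to reproduce with the standard library alone. The sketch below only approximates their documented behaviour with regular expressions; the class and method names are hypothetical and no Elasticsearch code is involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class BuiltInAnalyzerSketch {

    /** Approximates the simple analyzer: split at every non-letter, then lowercase. */
    static List<String> simple(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return terms;
    }

    /** Approximates the whitespace analyzer: split on whitespace only, keep case. */
    static List<String> whitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "Set the shape to semi-transparent by calling set_trans(5)";
        System.out.println(simple(text));
        System.out.println(whitespace(text));
    }
}
```

For the sample sentence, the two methods return the same term lists as shown above for the simple and whitespace analyzers.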

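Language analyzers

Language-specific analyzers are available for many languages. They take the characteristics of the given language into account. For example, the english analyzer ships with a set of English stop words (common words such as and or the, which have little effect on relevance) and removes them. Because it understands the rules of English grammar, this analyzer can also reduce English words to their stems. It produces:

set, shape, semi, transpar, call, set_tran, 5

Note that transparent, calling, and set_trans have been reduced to their stem forms.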

1.2 IK Analyzer

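When the default analyzer is used on Chinese text, the text is split character by character, whereas what we actually want is for it to be split into words. For this we use the IK Chinese analysis plugin, which can be downloaded from the Releases page of medcl/elasticsearch-analysis-ik on GitHub.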

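Unzip the download, place the extracted folder in the plugins directory under the ES root directory, and restart ES to enable it. This time we add a new parameter, "analyzer": "ik_max_word", to the request body: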

{
    "text": "测试单词",
    "analyzer": "ik_max_word"
}

ik_max_word: splits the text at the finest granularity
ik_smart: splits the text at the coarsest granularity

Tokenization result:

{
    "tokens": [
        {
            "token": "测试",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 0
        },
        {
            "token": "单词",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 1
        }
    ]
}

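The vocabulary can also be extended. If we analyze "弗雷尔卓德" (Freljord) as things stand, it is split character by character, which is of little use; we want it treated as a single word, so we need to define a custom dictionary entry.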

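Go to the ik folder inside the plugins folder under the ES root directory, enter its config directory, create a file named custom.dic, and write 弗雷尔卓德 into it. Then open IKAnalyzer.cfg.xml, register the new custom.dic there, and restart the ES server.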

{
    "text": "弗雷尔卓德",
    "analyzer": "ik_max_word"
}

Tokenization result:

{
    "tokens": [
        {
            "token": "弗雷尔卓德",
            "start_offset": 0,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 0
        }
    ]
}
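To see why adding a word to the dictionary changes the result, consider a toy dictionary-based segmenter. The sketch below uses forward maximum matching, which is a textbook technique and is not claimed to be the algorithm IK actually uses; the class name, method name, and the sample dictionaries are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class DictionarySegmenter {

    /**
     * Forward maximum matching: at each position take the longest dictionary
     * word, falling back to a single character when nothing matches.
     */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxWordLength);
            // Shrink the candidate until it is in the dictionary or one character long.
            while (end > start + 1 && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            tokens.add(text.substring(start, end));
            start = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Without the custom entry the unknown word falls apart into characters.
        System.out.println(segment("弗雷尔卓德", Set.of("测试", "单词"), 5));
        // With the custom entry it is kept as one token.
        System.out.println(segment("弗雷尔卓德", Set.of("测试", "单词", "弗雷尔卓德"), 5));
    }
}
```

The only difference between the two calls is the extra dictionary entry, which plays the role of the line added to custom.dic.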

1.3 Custom Analyzers

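Although Elasticsearch comes with a number of ready-made analyzers, its real power in this area is that you can create custom analyzers by combining character filters, tokenizers, and token filters in a configuration suited to your particular data. As described in the document analysis section above, an analyzer is a wrapper that combines three kinds of functions in one package, executed in order: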

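Character filters

Character filters are used to tidy up a string before it is tokenized. For example, if our text is in HTML format, it will contain HTML tags such as <p> or <div> that we do not want indexed. We can use the html_strip character filter to remove all HTML tags and to convert HTML entities such as &Aacute; into the corresponding Unicode character Á. An analyzer may have zero or more character filters.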

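Tokenizers

An analyzer must have exactly one tokenizer. The tokenizer breaks the string up into individual terms or tokens. The standard tokenizer, which is used in the standard analyzer, breaks a string into individual terms on word boundaries and removes most punctuation, but other tokenizers with different behaviour exist.

For example, the keyword tokenizer outputs exactly the same string it received, without any tokenization. The whitespace tokenizer splits text on whitespace only. The pattern tokenizer splits text on matches of a regular expression.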

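Token filters

After tokenization, the resulting token stream is passed through the specified token filters in the specified order. Token filters may change, add, or remove tokens. We have already mentioned the lowercase and stop token filters, but Elasticsearch offers many more. Stemming token filters reduce words to their stems. The asciifolding filter removes diacritics, converting a term like "très" into "tres". The ngram and edge_ngram token filters can produce tokens suitable for partial matching or autocomplete. Next, let's see how to create a custom analyzer: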

# PUT http://localhost:9200/my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "&_to_and": {
                    "type": "mapping",
                    "mappings": [
                        "&=> and "
                    ]
                }
            },
            "filter": {
                "my_stopwords": {
                    "type": "stop",
                    "stopwords": [
                        "the",
                        "a"
                    ]
                }
            },
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "char_filter": [
                        "html_strip",
                        "&_to_and"
                    ],
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "my_stopwords"
                    ]
                }
            }
        }
    }
}
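Two of the token filters named before this configuration, ASCII folding and edge n-grams, are simple enough to approximate with the standard library. The following is a rough sketch of what such filters do to a single term, assuming Unicode decomposition is an acceptable stand-in for folding; the class and method names are hypothetical and this is not the Lucene implementation.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;

final class TokenFilterSketch {

    /** Rough stand-in for ASCII folding: decompose, then drop combining marks. */
    static String foldAscii(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    /** Edge n-grams: prefixes of the term from minGram to maxGram characters. */
    static List<String> edgeNgrams(String term, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        int longest = Math.min(maxGram, term.length());
        for (int len = minGram; len <= longest; len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(foldAscii("tr\u00e8s"));      // prints tres
        System.out.println(edgeNgrams("quick", 1, 3));   // prints [q, qu, qui]
    }
}
```

Indexing the prefixes q, qu, and qui is what lets a partially typed word match the full term in an autocomplete scenario.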

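2. Kibana

Kibana is a free and open user interface that lets you visualize your Elasticsearch data and navigate the Elastic Stack. You can do anything from tracking query load to understanding how requests flow through your applications.

Download: Kibana 7.17.16 | Elastic

Edit the config/kibana.yml file: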

# Default port
server.port: 5601
# Address of the ES server
elasticsearch.hosts: ["http://localhost:9200"]
# Index name
kibana.index: ".kibana"
# UI language (Chinese)
i18n.locale: "zh-CN"

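On Windows, run bin/kibana.bat; once it has started, open http://localhost:5601.

3. Framework Integration

3.1 Introduction to Spring Data

Spring Data is an open-source framework that simplifies access to relational databases, non-relational databases, and index stores, and that supports cloud services. Its main goal is to make data access quick and convenient, with support for map-reduce frameworks and cloud data services. Spring Data greatly simplifies data access code for JPA, Elasticsearch, and other stores, so that data can be accessed and manipulated with almost no implementation code. Besides CRUD, it also covers common features such as paging and sorting.

Spring Data Elasticsearch simplifies Elasticsearch operations on top of the Spring Data API by wrapping the native Elasticsearch client API. It provides Spring Data integration for the Elasticsearch search engine. Its key feature is a POJO-centric model for interacting with Elasticsearch documents, which makes it easy to write a repository-style data access layer.

Configuration class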

package com.songzhishu.es.config;

import lombok.Data;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.elasticsearch.config.AbstractElasticsearchConfiguration;

/**
 * @BelongsProject: elaticsearch-7-spring-data
 * @BelongsPackage: com.songzhishu.es.config
 * @Author: 斗痘侠
 * @CreateTime: 2024-02-05  13:13
 * @Description: ES configuration class
 * @Version: 1.0
 */
@ConfigurationProperties(prefix = "elasticsearch")
@Configuration
@Data
public class ElasticsearchConfig extends AbstractElasticsearchConfiguration {

    private String host;
    private int port;

    /**
     * Configures the connection to ES.
     *
     * @return the high-level REST client
     */
    @Override
    public RestHighLevelClient elasticsearchClient() {
        // Connect to ES at the configured host and port
        RestClientBuilder builder = RestClient.builder(new HttpHost(host, port));
        // Wrap the low-level builder in a high-level client
        return new RestHighLevelClient(builder);
    }
}

Repository interface

package com.songzhishu.es.dao;

import com.songzhishu.es.pojo.Product;
import org.springframework.data.elasticsearch.repository.ElasticsearchRepository;
import org.springframework.stereotype.Repository;

/**
 * @BelongsProject: elaticsearch-7-spring-data
 * @BelongsPackage: com.songzhishu.es.dao
 */
@Repository
public interface ProductDao extends ElasticsearchRepository<Product, Long> {

}

Entity

package com.songzhishu.es.pojo;

import lombok.AllArgsConstructor;
import lombok.Data;
import lombok.NoArgsConstructor;
import lombok.ToString;
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

/**
 * @BelongsProject: elaticsearch-7-spring-data
 * @BelongsPackage: com.songzhishu.es.pojo
 * @Author: 斗痘侠
 * @CreateTime: 2024-02-05  13:10
 * @Description: Product entity class
 * @Version: 1.0
 */
@Data
@NoArgsConstructor
@AllArgsConstructor
@ToString
@Document(indexName = "product", shards = 3, replicas = 1) // index name, number of shards, number of replicas
public class Product {

    @Id
    private Long id; // product id

    @Field(type = FieldType.Text, analyzer = "ik_max_word") // analyzed with ik_max_word (finest-grained tokenization)
    private String title; // product title

    @Field(type = FieldType.Keyword) // not analyzed
    private String category; // product category

    @Field(type = FieldType.Double) // floating-point number
    private Double price; // product price

    @Field(type = FieldType.Keyword, index = false) // not analyzed and not indexed (cannot be searched)
    private String images; // product image URL

}

Test classes

package com.songzhishu.es.test;

import com.songzhishu.es.dao.ProductDao;
import com.songzhishu.es.pojo.Product;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Sort;
import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;

import javax.annotation.Resource;
import java.util.ArrayList;
import java.util.List;


/**
 * @BelongsProject: elaticsearch-7-spring-data
 * @BelongsPackage: com.songzhishu.es.test
 * @Author: 斗痘侠
 * @CreateTime: 2024-02-05  13:25
 * @Description: Test class for index and document operations
 * @Version: 1.0
 */
@SpringBootTest
public class SpringDataESIndexTest {

    @Autowired
    private ElasticsearchRestTemplate elasticsearchRestTemplate;

    @Resource
    private ProductDao productDao;

    @Test
    public void createIndex() {
        // Index creation: the index is built from the @Document annotation on Product; creation fails if it already exists
        System.out.println("Index created");
    }

    @Test
    public void deleteIndex() {
        // Delete the index
        boolean deleted = elasticsearchRestTemplate.indexOps(Product.class).delete();
        System.out.println("Index deleted: " + deleted);
    }

    /**
     * Document operations: create a document
     */
    @Test
    public void testDocument() {
        // Create a document
        Product product = new Product();
        product.setId(8L);
        product.setTitle("一加手机1");
        product.setCategory("手机");
        product.setPrice(2599.00);
        product.setImages("http://www.baidu.com");
        productDao.save(product);
        System.out.println("Document created");
    }

    /**
     * Document operations: delete a document
     */
    @Test
    public void testDeleteDocument() {
        // Delete the document by id
        productDao.deleteById(8L);
        System.out.println("Document deleted");
    }

    /**
     * Document operations: update a document
     */
    @Test
    public void testUpdateDocument() {
        // Update a document: saving with an existing id overwrites that document
        Product product = new Product();
        product.setId(1L);
        product.setTitle("华为手机");
        product.setCategory("手机");
        product.setPrice(2999.00);
        product.setImages("http://www.baidu.com");
        productDao.save(product);
        System.out.println("Document updated");
    }

    /**
     * Document operations: query a document
     */
    @Test
    public void testQueryDocument() {
        // Query a document by id
        Product product = productDao.findById(1L).get();
        System.out.println(product);
    }

    /**
     * Document operations: query all documents
     */
    @Test
    public void testQueryAll() {
        // Query all documents
        Iterable<Product> products = productDao.findAll();
        products.forEach(System.out::println);
    }

    /**
     * Document operations: bulk insert
     */
    @Test
    public void testAddAll() {
        // Bulk insert
        List<Product> list = new ArrayList<>();
        for (int i = 8; i < 18; i++) {
            Product product = new Product((long) i, "oppo手机" + i, "手机", 4999.00, "http://www.baidu.com");
            list.add(product);
        }
        productDao.saveAll(list);
        System.out.println("Bulk insert done");
    }

    /**
     * Document operations: bulk delete
     */
    @Test
    public void testDeleteAll() {
        // Delete all documents
        productDao.deleteAll();
        System.out.println("Bulk delete done");
    }

    /**
     * Document operations: paged query
     */
    @Test
    public void testQueryPage() {
        // Paged query: set the paging parameters
        int page = 0; // page index (zero-based)
        int size = 5; // number of documents per page

        Sort sort = Sort.by(Sort.Direction.DESC, "id"); // sort by id in descending order

        PageRequest request = PageRequest.of(page, size, sort);
        Iterable<Product> products = productDao.findAll(request);

        products.forEach(System.out::println);
    }


}
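The paged query in the test class uses a zero-based page index together with a page size and a sort order. What that combination selects can be sketched on an in-memory list; this is only an illustration of the paging arithmetic, with a hypothetical class name, and it does not touch Spring Data or Elasticsearch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PagingSketch {

    /** Returns one zero-based page of ids, sorted in descending order. */
    static List<Long> pageDescending(List<Long> ids, int page, int size) {
        List<Long> sorted = new ArrayList<>(ids);
        sorted.sort(Comparator.reverseOrder());
        // Page n starts at offset n * size; clamp both ends to the list bounds.
        int from = Math.min(page * size, sorted.size());
        int to = Math.min(from + size, sorted.size());
        return new ArrayList<>(sorted.subList(from, to));
    }

    public static void main(String[] args) {
        // Same ids as the bulk insert test: 8 to 17.
        List<Long> ids = new ArrayList<>();
        for (long id = 8; id < 18; id++) {
            ids.add(id);
        }
        System.out.println(pageDescending(ids, 0, 5)); // prints [17, 16, 15, 14, 13]
        System.out.println(pageDescending(ids, 1, 5)); // prints [12, 11, 10, 9, 8]
    }
}
```

With ten documents and a page size of 5, page 0 holds the five highest ids and page 1 the rest; any later page is empty.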

package com.songzhishu.es.test;

import com.songzhishu.es.dao.ProductDao;
import com.songzhishu.es.pojo.Product;
import org.elasticsearch.cluster.metadata.AliasAction;
import org.elasticsearch.index.query.QueryBuilders;

import org.elasticsearch.index.query.TermQueryBuilder;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.elasticsearch.search.sort.SortBuilders;
import org.elasticsearch.search.sort.SortOrder;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.elasticsearch.core.AggregationsContainer;
import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;
import org.springframework.data.elasticsearch.core.SearchHits;

import org.springframework.data.elasticsearch.core.clients.elasticsearch7.ElasticsearchAggregations;
import org.springframework.data.elasticsearch.core.query.*;


import javax.annotation.Resource;

/**
 * @BelongsProject: elaticsearch-7-spring-data
 * @BelongsPackage: com.songzhishu.es.test
 * @Author: 斗痘侠
 * @CreateTime: 2024-02-05  14:11
 * @Description: Test class for queries
 * @Version: 1.0
 */
@SpringBootTest
public class SpringDataESSearchTest {

    @Autowired
    private ElasticsearchRestTemplate elasticsearchRestTemplate;

    @Resource
    private ProductDao productDao;

    /**
     * Query test: term query
     */
    @Test
    public void testTermQuery() {
        /*NativeSearchQueryBuilder queryBuilder = new NativeSearchQueryBuilder();

        // term query
        queryBuilder.withQuery(QueryBuilders.termQuery("title", "手机"));

        SearchHits<Product> search = elasticsearchRestTemplate.search(queryBuilder.build(), Product.class);

        search.forEach(productSearchHit -> {
            System.out.println(productSearchHit.getContent());
        });*/

        // Search on the title field using the Criteria API
        CriteriaQuery query = new CriteriaQuery(new Criteria("title").is("手机"));

        SearchHits<Product> search = elasticsearchRestTemplate.search(query, Product.class);
        search.forEach(productSearchHit -> {
            System.out.println(productSearchHit.getContent());
        });
    }


    /**
     * Query test: paged query
     */
    @Test
    public void testPageQuery() {

        NativeSearchQuery query = new NativeSearchQueryBuilder()
                .withPageable(PageRequest.of(0, 5))
                .withQuery(QueryBuilders.termQuery("title", "手机"))
                .build();

        SearchHits<Product> search = elasticsearchRestTemplate.search(query, Product.class);

        search.forEach(productSearchHit -> {
            System.out.println(productSearchHit.getContent());
        });
    }

    /**
     * Query test: sorted query
     */
    @Test
    public void testSortQuery() {

        NativeSearchQuery query = new NativeSearchQueryBuilder()
                .withPageable(PageRequest.of(0, 5))
                .withQuery(QueryBuilders.termQuery("title", "手机"))
                .withSorts(SortBuilders.fieldSort("price").order(SortOrder.DESC))
                .build();

        SearchHits<Product> search = elasticsearchRestTemplate.search(query, Product.class);

        search.forEach(productSearchHit -> {
            System.out.println(productSearchHit.getContent());
        });
    }


    /**
     * Query test: aggregation query
     */
    @Test
    public void testAggQuery() {

        NativeSearchQuery query = new NativeSearchQueryBuilder()
                .addAggregation(AggregationBuilders.terms("categoryAgg").field("category"))
                .build();

        SearchHits<Product> search = elasticsearchRestTemplate.search(query, Product.class);

        ElasticsearchAggregations aggregations = (ElasticsearchAggregations) search.getAggregations();

        // Print the returned aggregations keyed by name
        System.out.println(aggregations.aggregations().asMap());
    }

    /**
     * Query test: highlight query
     */
    @Test
    public void testHighLightQuery() {

        NativeSearchQuery query = new NativeSearchQueryBuilder()
                .withQuery(QueryBuilders.termQuery("title", "手机"))
                .withHighlightFields(new HighlightBuilder.Field("title").preTags("").postTags(""))
                .build();

        SearchHits<Product> search = elasticsearchRestTemplate.search(query, Product.class);

        search.forEach(productSearchHit -> {
            System.out.println(productSearchHit.getHighlightField("title"));
        });
    }

}
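The term queries in the search tests look a term up exactly as given, without analyzing it, so they only match if that exact term was produced when the document was indexed. A toy inverted index makes this visible. The sketch is an assumption-laden illustration: the class name is hypothetical, and the token lists passed to index are assumed analyzer output, not the verified output of ik_max_word.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

final class ToyInvertedIndex {

    // term -> ids of the documents that contain it
    private final Map<String, SortedSet<Long>> postings = new HashMap<>();

    /** Indexes a document under the terms its analyzer produced. */
    void index(long docId, List<String> terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Term query: look the term up exactly as given, with no analysis. */
    SortedSet<Long> termQuery(String term) {
        return postings.getOrDefault(term, Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        // Assumed tokens for the titles used in the tests above.
        index.index(1L, List.of("华为", "手机"));
        index.index(8L, List.of("一加", "手机", "1"));
        System.out.println(index.termQuery("手机"));     // both documents
        System.out.println(index.termQuery("华为手机")); // empty: never indexed as one term
    }
}
```

This is also why the choice of analyzer on the title field matters: it decides which terms exist in the index for a term query to hit.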

4. Configuration Options

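Parameter	Value	Description
cluster.name	elasticsearch	Name of the ES cluster; the default is elasticsearch. It is advisable to change it to a name related to the stored data. ES automatically discovers nodes on the same network segment that share the same cluster name.
node.name	node-1	Name of the node within the cluster; it must be unique within a cluster. Once set, the node name cannot be changed. It can also be set to the server's host name, for example node.name: ${HOSTNAME}.
node.master	true	Whether this node is eligible to be elected master; the default is true. A value of true only makes the node eligible; whether it actually becomes master is decided by election.
node.data	true	Whether this node stores index data; the default is true. Creating, deleting, updating, and querying data all happen on data nodes.
index.number_of_shards	1	Number of primary shards per index; the default is 1. It can also be set when an index is created. The right value depends on the data volume; if the data volume is small, 1 is the most efficient.
index.number_of_replicas	1	Default number of replicas per index; the default is 1. More replicas improve cluster availability, but more data has to be synchronized when writing to the index.
transport.tcp.compress	true	Whether data transferred between nodes is compressed; the default is false (no compression).
discovery.zen.minimum_master_nodes	1	Minimum number of master-eligible nodes that must take part in electing a master; the default is 1. With the default, split brain may occur when the network is unstable. A sensible value is (master_eligible_nodes / 2) + 1, where master_eligible_nodes is the number of master-eligible nodes in the cluster.
discovery.zen.ping.timeout	3s	Timeout for the ping used when automatically discovering other nodes in the cluster; the default is 3 seconds. On poor networks it should be set higher, so that a node is not wrongly judged dead and its shards relocated.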
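The quorum formula given for discovery.zen.minimum_master_nodes uses integer division, which is easy to get wrong by hand for even node counts. A minimal sketch of the arithmetic follows; the class and method names are hypothetical.

```java
final class MasterQuorum {

    /** Returns (masterEligibleNodes / 2) + 1 using integer division. */
    static int minimumMasterNodes(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("at least one master-eligible node is required");
        }
        return masterEligibleNodes / 2 + 1;
    }

    public static void main(String[] args) {
        for (int nodes = 1; nodes <= 5; nodes++) {
            System.out.println(nodes + " master-eligible nodes -> quorum " + minimumMasterNodes(nodes));
        }
    }
}
```

For 3 master-eligible nodes the result is 2, and for 5 it is 3, so a minority partition can never elect its own master.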

5. New API

 
        
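<dependency>
    <groupId>org.elasticsearch.plugin</groupId>
    <artifactId>x-pack-sql-jdbc</artifactId>
    <version>7.17.16</version>
</dependency>

<dependency>
    <groupId>co.elastic.clients</groupId>
    <artifactId>elasticsearch-java</artifactId>
    <version>7.17.16</version>
</dependency>

<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.16.0</version>
</dependency>

<dependency>
    <groupId>jakarta.json</groupId>
    <artifactId>jakarta.json-api</artifactId>
    <version>2.1.2</version>
</dependency>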

5.1 Basic Operations

5.1.1 Indices

// client is an ElasticsearchClient instance
// Create an index
CreateIndexRequest request = new CreateIndexRequest.Builder().index("myindex").build();
final CreateIndexResponse createIndexResponse = client.indices().create(request);

System.out.println("Index created: " + createIndexResponse.acknowledged());

// Look up the index
GetIndexRequest getIndexRequest = new GetIndexRequest.Builder().index("myindex").build();
final GetIndexResponse getIndexResponse = client.indices().get(getIndexRequest);

System.out.println("Index found: " + getIndexResponse.result());

// Delete the index
DeleteIndexRequest deleteIndexRequest = new DeleteIndexRequest.Builder().index("myindex").build();

final DeleteIndexResponse delete = client.indices().delete(deleteIndexRequest);
final boolean acknowledged = delete.acknowledged();

System.out.println("Index deleted: " + acknowledged);

5.1.2 Documents

// Create a document
IndexRequest<User> indexRequest = new IndexRequest.Builder<User>()
        .index("myindex")
        .id(user.getId().toString())
        .document(user)
        .build();
final IndexResponse index = client.index(indexRequest);
System.out.println("Document operation result: " + index.result());

// Create documents in bulk
final List<BulkOperation> operations = new ArrayList<>();
for (int i = 1; i <= 5; i++) {
    final CreateOperation.Builder<User> builder = new CreateOperation.Builder<>();
    builder.index("myindex");
    builder.id("200" + i);
    builder.document(new User(2000 + i, 30 + i * 10, "zhangsan" + i, "beijing", 1000 + i * 1000));

final CreateOperation
