Elasticsearch + Logstash: Syncing MySQL Data and Tokenized Full-Text Search

While building this site, I wanted full-text search over my notes and articles, so I decided to integrate Elasticsearch (ES) and use its tokenization (analysis) features to implement site-wide full-text search.

Installing Elasticsearch

Download the ES archive from the official site, extract it, and edit the yml file under config (config/elasticsearch.yml):

cluster.name: legolas
node.name: node-1
http.port: 9200
# transport.tcp.port: 9300  (port used between cluster nodes)
network.host: 127.0.0.1

# CORS settings required by the elasticsearch-head plugin
http.cors.enabled: true
http.cors.allow-origin: "*"

Run elasticsearch.bat and open http://localhost:9200/ in a browser; if the request returns without an error, the node is up.
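The root endpoint answers with a small JSON summary, roughly like the following (the exact fields and version number depend on the ES release you downloaded):

{
  "name" : "node-1",
  "cluster_name" : "legolas",
  "version" : {
    "number" : "6.x.x"
  },
  "tagline" : "You Know, for Search"
}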
To inspect the indices more conveniently, we can install the elasticsearch-head plugin. First install a Node.js environment,

then install grunt: npm install -g grunt-cli (check the version with grunt --version).

Download elasticsearch-head (it must not be placed under Elasticsearch's plugins or modules directories; the ES root directory is fine), extract it, and edit head/Gruntfile.js:

connect: {
    server: {
        options: {
            port: 9100,
            hostname: '*',
            base: '.',
            keepalive: true
        }
    }
}

Install elasticsearch-head's dependencies: run npm install inside the elasticsearch-head directory.
To run the plugin, start Node.js from the head source directory: grunt server
Then open http://localhost:9100/ in a browser.
If startup fails with Gruntfile.js errors about missing packages, install them with the following commands:

npm install grunt-contrib-clean --registry=https://registry.npm.taobao.org
npm install grunt-contrib-concat --registry=https://registry.npm.taobao.org
npm install grunt-contrib-watch --registry=https://registry.npm.taobao.org
npm install grunt-contrib-connect --registry=https://registry.npm.taobao.org
npm install grunt-contrib-copy --registry=https://registry.npm.taobao.org
npm install grunt-contrib-jasmine --registry=https://registry.npm.taobao.org

Installing the IK analyzer:
On GitHub, find the release matching your ES version (the plugin version must match the ES version exactly), download the source, and build it with mvn package; the zip ends up in target/releases. Create an ik folder under ES's plugins directory and extract the zip into it.
The pinyin analyzer is installed the same way (substitute pinyin for ik).

Installing Logstash

1. Download and extract Logstash, create a mysql folder in its root directory, and copy mysql-connector-java-5.1.27.jar into it. Then write jdbc.conf, plus one SQL file per synced table defining how to query it; for a full sync, select * from [table] is all you need.
jdbc.conf:

# Logstash: sync MySQL tables into Elasticsearch
input {
    stdin {
    }
    jdbc {
        type => "t_article"
        # MySQL connection string
        jdbc_connection_string => "jdbc:mysql://192.168.1.131:3306/ds"
        # credentials
        jdbc_user => "root"
        jdbc_password => "**********"
        jdbc_driver_library => "E:\Program\logstash-6.7.0\mysql\mysql-connector-java-5.1.27.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        jdbc_paging_enabled => "true"
        jdbc_page_size => "50000"
        # SQL to run: the absolute path of the SQL file created earlier
        statement_filepath => "E:\Program\logstash-6.7.0\mysqletc\article.sql"
        # polling schedule, cron-style fields (left to right): minute, hour,
        # day of month, month, day of week; all * means run every minute
        schedule => "* * * * *"
    }

    jdbc {
        type => "t_note"
        # MySQL connection string
        jdbc_connection_string => "jdbc:mysql://192.168.1.131:3306/ds"
        # credentials
        jdbc_user => "root"
        jdbc_password => "**********"
        jdbc_driver_library => "E:\Program\logstash-6.7.0\mysql\mysql-connector-java-5.1.27.jar"
        jdbc_driver_class => "com.mysql.jdbc.Driver"
        jdbc_paging_enabled => "true"
        jdbc_page_size => "50000"
        # SQL to run: the absolute path of the SQL file created earlier
        statement_filepath => "E:\Program\logstash-6.7.0\mysqletc\note.sql"
        schedule => "* * * * *"
    }
}

filter {
    json {
        source => "message"
        remove_field => ["message"]
    }
}

output {
    if [type] == "t_article" {
        elasticsearch {
            # ES host and port
            hosts => ["192.168.1.131:9200"]
            # index name
            index => "article"
            #user => "elastic"
            #password => "123456"
            # document_id must be the primary-key column of the queried table
            document_id => "%{id}"
        }
    }
    if [type] == "t_note" {
        elasticsearch {
            hosts => ["192.168.1.131:9200"]
            index => "note"
            document_id => "%{id}"
        }
    }
    stdout {
        # emit JSON lines to the console
        codec => json_lines
    }
}

note.sql:

select * from t_note where locked = 0
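article.sql, referenced by the first jdbc block, is the analogous full-sync query for the article table; presumably something like:

select * from t_article where locked = 0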

2. Run Logstash from the logstash/bin directory: logstash -f ../mysql/jdbc.conf
Logstash will then keep syncing the MySQL data into Elasticsearch on the configured schedule.

Integrating Elasticsearch with Spring

First, add the Elasticsearch transport client dependency:

 
<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>transport</artifactId>
    <version>6.6.0</version>
    <exclusions>
        <exclusion>
            <artifactId>jackson-core</artifactId>
            <groupId>com.fasterxml.jackson.core</groupId>
        </exclusion>
    </exclusions>
</dependency>

Next we define a TransportClient (the Java API client for talking to ES) and hand it to the Spring container to manage. The TransportClient accesses the ES cluster as an external visitor: it is not part of the cluster and does not affect how the cluster runs.

@Configuration
public class MyConfig {
    @Bean
    public TransportClient client() throws UnknownHostException {
        TransportAddress node = new TransportAddress(
                InetAddress.getByName("192.168.1.118"), 9300
        );

        //.put("client.transport.sniff", true)自动嗅探整个集群的状态,把集群中其他ES节点的ip添加到本地的客户端列表中
        Settings settings = Settings.builder().put("cluster.name", "legolas").build();

        TransportClient client = new PreBuiltTransportClient(settings);
        client.addTransportAddress(node);
        return client;
    }
}
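With the bean defined, any service can have the client injected. A minimal sketch of using it (the EsDemoService name and the lookup are illustrative, not from the project):

@Service
public class EsDemoService {

    @Autowired
    private TransportClient client;

    // fetch one document from the "article" index (type "doc") by id
    public String findArticleJson(String id) {
        GetResponse response = client.prepareGet("article", "doc", id).get();
        return response.getSourceAsString();
    }
}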

We then call the TransportClient instance from the service layer to create, read, update, and delete ES documents.

public String queryArticles(@PathVariable("content") String content, @PathVariable("num") Integer num, ModelMap model) {
    try {
        List<String> contentSearchTerm = ESUtils.handlingSearchContent(client, "article", content);
        BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
        boolQuery.must(QueryBuilders.termQuery("user_id", Consts.DEFAULT_LEGOLASID));
        boolQuery.must(QueryBuilders.termQuery("locked", 0));
        SearchRequestBuilder builder = client.prepareSearch("article").setTypes("doc");

        // OR the tokenized search terms together across title and content_text
        BoolQueryBuilder termQuery = QueryBuilders.boolQuery();
        if (contentSearchTerm != null && contentSearchTerm.size() > 0) {
            for (String searchTerm : contentSearchTerm) {
                termQuery.should(QueryBuilders.matchPhrasePrefixQuery("title", searchTerm))
                        .should(QueryBuilders.matchPhrasePrefixQuery("content_text", searchTerm));
            }
        }
        boolQuery.must(termQuery);
        builder.setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
                .setQuery(boolQuery)
                .setFrom(10 * (num - 1))
                .setSize(10)
                .addSort("update_time", SortOrder.DESC);
        // run the search once and reuse the response for both paging and results
        SearchResponse response = builder.execute().actionGet();
        // page count = ceil(totalHits / 10), at least 1
        long totalHits = response.getHits().getTotalHits();
        long total = totalHits % 10 == 0 ? totalHits / 10 : totalHits / 10 + 1;
        if (total == 0) {
            total = 1;
        }
        List<ArticleSum> result = new ArrayList<>();
        for (SearchHit hit : response.getHits()) {
            try {
                ArticleSum recogcardInfo = MyHightlightBuilder.setArticleSumHighlighter(hit, contentSearchTerm);
                result.add(recogcardInfo);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        model.addAttribute("articles", result);
        model.addAttribute("pageNum", num);
        model.addAttribute("totalPage", total);
    } catch (Exception e) {
        // if Elasticsearch is unavailable, fall back to a database query matching on title
        PageInfo result = articleService.listUnLockedArticlesByUserIdAndTitleWithPage(Consts.DEFAULT_LEGOLASID, content, 1);
        List atcs = result.getList();
        MyHightlightBuilder.setArticleTitleHighlighter(atcs, content);
        model.addAttribute("articles", atcs);
        model.addAttribute("pageNum", result.getPageNum());
        model.addAttribute("totalPage", result.getPages());
        e.printStackTrace();
        return "views/index";
    }
    return "views/index";
}

After receiving the content request parameter, we first tokenize it with ES's analyzer: in the ESUtils class we define a handlingSearchContent method, which calls Elasticsearch's IK analyzer through getIkAnalyzeSearchTerms.

public static List<String> handlingSearchContent(TransportClient client, String index, String searchContent) {
    List<String> searchTermResultList = new ArrayList<>();
    try {
        // split on commas to get the list of raw search terms
        List<String> searchTermList = Arrays.asList(searchContent.split(","));
        searchTermList.forEach(searchTerm -> {
            // keep the raw term itself (covers terms like "will" that the analyzer would otherwise drop)
            searchTermResultList.add(searchTerm);
            // append the term's IK tokens
            searchTermResultList.addAll(getIkAnalyzeSearchTerms(client, index, searchTerm));
        });
    } catch (Exception e) {
        e.printStackTrace();
    }
    return searchTermResultList;
}
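For example, a comma-separated query might be expanded like this (the exact tokens depend on the IK dictionary):

List<String> terms = ESUtils.handlingSearchContent(client, "article", "spring,全文检索");
// terms might contain: ["spring", "全文检索", "全文", "检索"]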

public static List<String> getIkAnalyzeSearchTerms(TransportClient client, String index, String searchContent) {
    AnalyzeRequestBuilder ikRequest = new AnalyzeRequestBuilder(client,
            AnalyzeAction.INSTANCE, index, searchContent);
    // use IK's coarse-grained "smart" segmentation
    ikRequest.setTokenizer("ik_smart");
    // ikRequest.setTokenizer("ik_max_word"); // fine-grained alternative
    List<AnalyzeResponse.AnalyzeToken> ikTokenList = ikRequest.execute().actionGet().getTokens();
    // collect the token strings
    List<String> searchTermList = new ArrayList<>();
    ikTokenList.forEach(ikToken -> searchTermList.add(ikToken.getTerm()));
    return handlingIkResultTerms(searchTermList);
}



private static List<String> handlingIkResultTerms(List<String> searchTermList) {
    // keep only multi-character tokens; adjust this filter to taste
    List<String> phraseList = new ArrayList<>();
    searchTermList.forEach(term -> {
        if (term.length() > 1) {
            phraseList.add(term);
        }
    });
    return phraseList;
}

QueryBuilders.boolQuery builds the default compound query for combining leaf or other compound clauses: its must method behaves like AND, while should behaves like OR; matchPhrasePrefixQuery performs phrase-prefix matching. The tokenized terms are OR-ed into the bool query in a loop, and the combined query is finally executed through the SearchRequestBuilder.
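As a compact illustration of those semantics (a standalone snippet reusing the field names of the article index):

// must ≈ AND: every must clause has to match
BoolQueryBuilder filter = QueryBuilders.boolQuery()
        .must(QueryBuilders.termQuery("locked", 0));
// should ≈ OR: a bool query containing only should clauses matches
// when at least one of them matches
BoolQueryBuilder any = QueryBuilders.boolQuery()
        .should(QueryBuilders.matchPhrasePrefixQuery("title", "elastic"))
        .should(QueryBuilders.matchPhrasePrefixQuery("content_text", "elastic"));
// nest the OR-group under the AND, exactly as queryArticles does above
filter.must(any);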
To highlight the search keywords in the results we write our own helper: create a MyHightlightBuilder class with a setNoteHighlighter method (each entity type can get its own variant):

public static Note setNoteHighlighter(SearchHit hit, List<String> searchTerms) {
    String sourceAsString = hit.getSourceAsString();
    // deserialize the JSON source into the entity
    Note recogcardInfo = JSON.parseObject(sourceAsString, Note.class);
    // wrap every occurrence of every search term in highlight tags
    StringBuilder sb = new StringBuilder(recogcardInfo.getContent());
    for (String searchTerm : searchTerms) {
        int n = StringUtils.appearNumber(sb.toString(), searchTerm);
        for (int i = 1; i <= n; i++) {
            sb.insert(StringUtils.positionAppearN(sb.toString(), searchTerm, i), PRE_TAG[4]);
            sb.insert(StringUtils.positionAppearN(sb.toString(), searchTerm, i) + searchTerm.length(), END_TAG);
        }
    }
    recogcardInfo.setContent(sb.toString());
    return recogcardInfo;
}
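StringUtils.appearNumber and StringUtils.positionAppearN are small custom string helpers whose implementations are not shown here; a minimal sketch of what they need to do:

// how many times term appears in text
public static int appearNumber(String text, String term) {
    int count = 0, idx = 0;
    while ((idx = text.indexOf(term, idx)) != -1) {
        count++;
        idx += term.length();
    }
    return count;
}

// start index of the n-th (1-based) occurrence of term in text, or -1
public static int positionAppearN(String text, String term, int n) {
    int idx = -1;
    for (int i = 0; i < n; i++) {
        idx = text.indexOf(term, idx + 1);
        if (idx == -1) {
            return -1;
        }
    }
    return idx;
}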

The service layer applies the highlighter to each hit in a loop:

SearchResponse response = builder.get();
for (SearchHit hit : response.getHits()) {
    // hit.getSourceAsMap() would give the raw source directly;
    // here we convert each document's JSON into an entity
    Note recogcardInfo = MyHightlightBuilder.setNoteHighlighter(hit, contentSearchTerm);
    result.add(recogcardInfo);
}

Updating a document in Elasticsearch:

UpdateRequest update = new UpdateRequest("article", "doc", article.getId().toString());
XContentBuilder builder = XContentFactory.jsonBuilder().startObject();
builder.field("title", article.getTitle())
        .field("summary", article.getSummary())
        .field("content", article.getContent())
        .field("content_text", article.getContentText())
        .field("count", article.getCount())
        .field("locked", article.getLocked());
builder.endObject();
update.doc(builder);
// block until the update is acknowledged
client.update(update).actionGet();

Deleting a document from Elasticsearch:

// prepareDelete only builds the request; get() actually executes it
client.prepareDelete("article", "doc", id.toString()).get();
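For completeness, indexing a new document by hand follows the same pattern (in this setup Logstash normally creates the documents, so this is just a sketch reusing an XContentBuilder like the one in the update example):

client.prepareIndex("article", "doc", article.getId().toString())
        .setSource(builder)
        .get();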
