Modifying the ik analyzer source to update its dictionary incrementally, straight from a database

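 Any discussion of Chinese analyzers for es has to include the ik analyzer. ik currently has two ways of obtaining its main words and stop words:

First, from main.dic and stopword.dic under the ik\config directory. Every change to these files requires a restart before it takes effect.
Second, from an endpoint that returns the full word list, with the endpoint path set in ik's configuration. This approach has to return every word on every refresh, so it is not efficient.

 The goal here is to connect to the database directly over jdbc and update the dictionary incrementally. What we need to do is find the methods that add main words and stop words, then fetch the words from the database over jdbc and pass them to those methods.

 Download the ik source. I downloaded version 7.17.6. Because my es is 7.17.7, I changed the version to 7.17.7 after downloading to avoid an error at startup.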

How the dictionary is loaded

(1) Find the Dictionary.initial method
  Words are loaded inside Dictionary.initial. This method loads the words from each dictionary file and also starts a scheduled task that fetches words from the remote endpoint to refresh the dictionary.
(2) Next, follow singleton.loadMainDict -> loadExtDict -> loadDictFile
  Here dict.fillSegment is the call that adds a main word.
(3) Likewise, _StopWords.fillSegment is the call that loads a stop word.
  So all we have to do is obtain the words and call the matching fillSegment to load them.
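
 To make the role of fillSegment and disableSegment more concrete, here is a minimal sketch of a dictionary tree. It is a simplified stand-in and not ik's actual DictSegment: the class TrieSketch and its methods fill, disable and contains are names invented for this illustration. The point it tries to show is that adding a word and masking a word are both cheap, in-place operations on a live structure, which is why they can be called repeatedly from a background task.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for a dictionary tree; hypothetical, not ik's DictSegment. */
final class TrieSketch {

    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean wordEnd; // true when a complete, enabled word ends at this node
    }

    private final Node root = new Node();

    /** Adds a word, or re-enables it if it was disabled earlier. */
    void fill(char[] word) {
        Node node = root;
        for (char c : word) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.wordEnd = true;
    }

    /** Masks a word without removing nodes, so longer words sharing the prefix survive. */
    void disable(char[] word) {
        Node node = root;
        for (char c : word) {
            node = node.children.get(c);
            if (node == null) {
                return; // the word was never added
            }
        }
        node.wordEnd = false;
    }

    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return false;
            }
        }
        return node.wordEnd;
    }

    public static void main(String[] args) {
        TrieSketch dict = new TrieSketch();
        dict.fill("elastic".toCharArray());
        dict.fill("elasticsearch".toCharArray());
        dict.disable("elastic".toCharArray());
        System.out.println(dict.contains("elastic"));       // false
        System.out.println(dict.contains("elasticsearch")); // true
    }
}
```

 Calling fill twice with the same word, or disable on a word that is already masked, changes nothing in this sketch. That idempotence is what makes it safe for an incremental job to re-apply the same rows.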

Preparation

(1) Table design
 Main word table:

CREATE TABLE `es_dic_main` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(100) NOT NULL COMMENT 'word',
  `moditime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `ifdel` char(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='main words';

 Stop word table:

CREATE TABLE `es_dic_stop` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `word` varchar(100) NOT NULL COMMENT 'stop word',
  `moditime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `ifdel` char(1) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci COMMENT='stop words';

(2) Create the jdbc configuration file jdbc.properties in the /config directory:

jdbc.url=jdbc:mysql://cckg.liulingjie.cn:3306/test?useUnicode=true&characterEncoding=utf8&autoReconnect=true&useSSL=false&serverTimezone=Asia/Shanghai
jdbc.username=your_username
jdbc.password=your_password
# incremental query for main words
main.word.sql=SELECT * FROM es_dic_main WHERE moditime >= ?
# incremental query for stop words
stop.word.sql=SELECT * FROM es_dic_stop WHERE moditime >= ?
# polling interval (seconds)
interval=10
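
 The two queries above rely on a moving timestamp: each poll asks for rows whose moditime is at or after the latest moditime seen so far. The sketch below replays that logic on an in-memory list so the behaviour is easy to see. Row, SyncResult and poll are hypothetical names used only here, and moditime is modelled as a plain long rather than a SQL timestamp.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory sketch of the incremental query; all types here are illustrative. */
final class IncrementalSyncSketch {

    static final class Row {
        final String word;
        final long moditime;   // stands in for the moditime column
        final boolean deleted; // stands in for ifdel = '1'

        Row(String word, long moditime, boolean deleted) {
            this.word = word;
            this.moditime = moditime;
            this.deleted = deleted;
        }
    }

    static final class SyncResult {
        final List<String> added = new ArrayList<>();
        final List<String> removed = new ArrayList<>();
        long watermark; // latest moditime seen, used as the next query parameter
    }

    /** Mirrors "WHERE moditime >= ?" and splits the rows by the soft-delete flag. */
    static SyncResult poll(List<Row> table, long watermark) {
        SyncResult result = new SyncResult();
        result.watermark = watermark;
        for (Row row : table) {
            if (row.moditime < watermark) {
                continue; // unchanged since the previous poll
            }
            (row.deleted ? result.removed : result.added).add(row.word);
            result.watermark = Math.max(result.watermark, row.moditime);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> table = new ArrayList<>();
        table.add(new Row("elasticsearch", 1000L, false));
        table.add(new Row("kibana", 2000L, false));
        table.add(new Row("logstash", 3000L, true));

        SyncResult first = poll(table, 0L);
        // prints: [elasticsearch, kibana] [logstash] 3000
        System.out.println(first.added + " " + first.removed + " " + first.watermark);

        SyncResult second = poll(table, first.watermark);
        // prints: [] [logstash] 3000
        System.out.println(second.added + " " + second.removed + " " + second.watermark);
    }
}
```

 Note what the second poll returns: with >=, the rows sitting exactly on the boundary are read again every time. That is harmless as long as loading a word twice has no effect. Switching to > would avoid the repeat, but a row written in the same second as the boundary, after the poll, could then be missed.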

(3) Add the jdbc dependency to pom.xml:

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>8.0.21</version>
</dependency>

(4) Add the following to src/main/assemblies/plugin.xml so that the mysql driver jar is included when packaging:

        <dependencySet>
            <outputDirectory/>
            <useProjectArtifact>true</useProjectArtifact>
            <useTransitiveFiltering>true</useTransitiveFiltering>
            <includes>
                <include>mysql:mysql-connector-java</include>
            </includes>
        </dependencySet>


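Implementation

 The overall flow involves two classes: Dictionary, and a class we create ourselves, JdbcMonitor.
  Dictionary: reads the configuration, loads words, and starts the dictionary update task.
  JdbcMonitor: a class that implements the Runnable interface. It reads words from the database over jdbc and calls Dictionary's methods to load them.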

(1) Add the following methods to the Dictionary class to expose an api for adding and disabling words
 Code:

    protected void fillSegmentMain(String word) {
        _MainDict.fillSegment(word.trim().toCharArray());
    }

    protected void disableSegmentMain(String word) {
        _MainDict.disableSegment(word.trim().toCharArray());
    }

    protected void fillSegmentStop(String word) {
        _StopWords.fillSegment(word.trim().toCharArray());
    }

    protected void disableSegmentStop(String word) {
        _StopWords.disableSegment(word.trim().toCharArray());
    }

(2) Read jdbc.properties in the Dictionary constructor
 Code:

// org.wltea.analyzer.cfg.JdbcConfig
public class JdbcConfig {

    private String url;

    private String username;

    private String password;

    private String mainWordSql;

    private String stopWordSql;

    private Integer interval;
    // all-args constructor, getters and setters omitted
}

// In Dictionary. The two declarations below are assumed: the original only references them.
    private static final String PATH_JDBC_CONFIG = "jdbc.properties";

    private JdbcConfig jdbcConfig;

    private Dictionary(Configuration cfg) {
        // ...... omitted

        // read the jdbc configuration
        setJdbcConfig();
    }

    private void setJdbcConfig() {
        Path file = PathUtils.get(getDictRoot(), Dictionary.PATH_JDBC_CONFIG);
        Properties properties = new Properties();
        // try-with-resources closes the stream even when loading fails
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            properties.load(in);
        } catch (Exception e) {
            logger.error("load jdbc.properties failed", e);
        }
        jdbcConfig = new JdbcConfig(
                properties.getProperty("jdbc.url"),
                properties.getProperty("jdbc.username"),
                properties.getProperty("jdbc.password"),
                properties.getProperty("main.word.sql"),
                properties.getProperty("stop.word.sql"),
                // fall back to 10 seconds when the key is missing instead of failing on null
                Integer.valueOf(properties.getProperty("interval", "10"))
        );
    }
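
 Because a missing or mistyped key in jdbc.properties only shows up later as a failed connection, it can help to check the file as soon as it is read. The following is a small sketch of such a check, assuming the key names used in this article. JdbcSettingsSketch, firstMissingKey and intervalSeconds are hypothetical helpers, not part of ik, and the 10 second default is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical validation helper for the keys used in jdbc.properties. */
final class JdbcSettingsSketch {

    static final String[] REQUIRED = {
            "jdbc.url", "jdbc.username", "jdbc.password", "main.word.sql", "stop.word.sql"
    };

    static final int DEFAULT_INTERVAL_SECONDS = 10; // assumed default

    /** Parses properties text; a real plugin would read the file instead. */
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    /** Returns the first required key that is absent or blank, or null if all are set. */
    static String firstMissingKey(Properties properties) {
        for (String key : REQUIRED) {
            String value = properties.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                return key;
            }
        }
        return null;
    }

    /** Reads "interval", falling back to the default for missing, invalid or non-positive values. */
    static int intervalSeconds(Properties properties) {
        String raw = properties.getProperty("interval");
        if (raw == null) {
            return DEFAULT_INTERVAL_SECONDS;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return value > 0 ? value : DEFAULT_INTERVAL_SECONDS;
        } catch (NumberFormatException e) {
            return DEFAULT_INTERVAL_SECONDS;
        }
    }

    public static void main(String[] args) {
        Properties p = parse("jdbc.url=jdbc:mysql://localhost:3306/test\ninterval=abc\n");
        System.out.println(firstMissingKey(p)); // jdbc.username
        System.out.println(intervalSeconds(p)); // 10
    }
}
```

 With a check like this, setJdbcConfig could log the name of the missing key and skip starting the update task, rather than letting every poll fail with a connection error.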

(3) Create the JdbcMonitor class, which periodically connects to the database, reads the words and updates the dictionary

package org.wltea.analyzer.dic;

import org.apache.logging.log4j.Logger;
import org.elasticsearch.SpecialPermission;
import org.wltea.analyzer.cfg.JdbcConfig;
import org.wltea.analyzer.help.ESPluginLoggerFactory;

import java.security.AccessController;
import java.security.PrivilegedAction;
import java.sql.*;
import java.util.ArrayList;
import java.util.List;

/**
 * @author liulingjie
 * @date 2022/11/29 20:36
 */
public class JdbcMonitor implements Runnable {

    // declared before the static block so the block can log through it
    private static final Logger logger = ESPluginLoggerFactory.getLogger(JdbcMonitor.class.getName());

    static {
        try {
            // load the MySQL driver so that it registers itself with DriverManager
            Class.forName("com.mysql.cj.jdbc.Driver");
        } catch (Exception e) {
            logger.error("failed to load the MySQL jdbc driver", e);
        }
    }

    /**
     * jdbc configuration
     */
    private JdbcConfig jdbcConfig;
    /**
     * time of the last main word update
     */
    private Timestamp mainLastModitime = Timestamp.valueOf("2022-01-01 00:00:00");
    /**
     * time of the last stop word update
     */
    private Timestamp stopLastModitime = Timestamp.valueOf("2022-01-01 00:00:00");

    public JdbcMonitor(JdbcConfig jdbcConfig) {
        this.jdbcConfig = jdbcConfig;
    }

    @Override
    public void run() {
        SpecialPermission.check();
        AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
            this.runUnprivileged();
            return null;
        });
    }

    /**
     * Loads the main words and stop words
     */
    public void runUnprivileged() {
        //Dictionary.getSingleton().reLoadMainDict();
        loadWords();
    }

    private void loadWords() {
        List<String> mainWords = new ArrayList<>();
        List<String> delMainWords = new ArrayList<>();
        List<String> stopWords = new ArrayList<>();
        List<String> delStopWords = new ArrayList<>();

        setAllWordList(mainWords, delMainWords, stopWords, delStopWords);

        mainWords.forEach(w -> Dictionary.getSingleton().fillSegmentMain(w));
        delMainWords.forEach(w -> Dictionary.getSingleton().disableSegmentMain(w));
        stopWords.forEach(w -> Dictionary.getSingleton().fillSegmentStop(w));
        delStopWords.forEach(w -> Dictionary.getSingleton().disableSegmentStop(w));


        logger.info("ik dic refresh from db. mainLastModitime: {} stopLastModitime: {}", mainLastModitime, stopLastModitime);
    }

    /**
     * Fetches the main words and stop words
     *
     * @param mainWords
     * @param delMainWords
     * @param stopWords
     * @param delStopWords
     */
    private void setAllWordList(List<String> mainWords, List<String> delMainWords, List<String> stopWords, List<String> delStopWords) {
        Connection connection = null;
        try {
            connection = DriverManager.getConnection(jdbcConfig.getUrl(), jdbcConfig.getUsername(), jdbcConfig.getPassword());
            setWordList(connection, jdbcConfig.getMainWordSql(), mainLastModitime, mainWords, delMainWords);
            setWordList(connection, jdbcConfig.getStopWordSql(), stopLastModitime, stopWords, delStopWords);
        } catch (SQLException throwables) {
            logger.error("jdbc load words failed: mainLastModitime-{} stopLastModitime-{}", mainLastModitime, stopLastModitime);
            logger.error("jdbc load words failed", throwables);
        } finally {

            if (connection != null) {
                try {
                    connection.close();
                } catch (SQLException throwables) {
                    logger.error("failed to close connection");
                    logger.error(throwables.getMessage());
                }
            }
        }
    }

    /**
     * Queries the database for words changed since lastModitime
     *
     * @param connection
     * @param sql
     * @param lastModitime
     * @param words
     * @param delWords
     */
    private void setWordList(Connection connection, String sql, Timestamp lastModitime, List<String> words, List<String> delWords) {
        PreparedStatement prepareStatement = null;
        ResultSet result = null;

        try {
            prepareStatement = connection.prepareStatement(sql);
            prepareStatement.setTimestamp(1, lastModitime);
            result = prepareStatement.executeQuery();

            while (result.next()) {
                String word = result.getString("word");
                Timestamp moditime = result.getTimestamp("moditime");
                String ifdel = result.getString("ifdel");

                if ("1".equals(ifdel)) {
                    delWords.add(word);
                } else {
                    words.add(word);
                }

                // keep the latest modification time
                if (moditime.after(lastModitime)) {
                    lastModitime.setTime(moditime.getTime());
                }
            }
        } catch (SQLException throwables) {
            logger.error("jdbc load words failed: {}", lastModitime);
            logger.error(throwables.getMessage());
        } finally {
            if (result != null) {
                try {
                    result.close();
                } catch (SQLException throwables) {
                    logger.error("failed to close result set");
                    logger.error(throwables.getMessage());
                }
            }

            if (prepareStatement != null) {
                try {
                    prepareStatement.close();
                } catch (SQLException throwables) {
                    logger.error("failed to close prepareStatement");
                    logger.error(throwables.getMessage());
                }
            }
        }
    }
}
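
 The class above closes the connection, statement and result set by hand in finally blocks. As an alternative, here is a sketch of the same query step written with try-with-resources, which closes all three in reverse order automatically. WordQuerySketch, Batch and load are hypothetical names. The column names word, ifdel and moditime are the ones from the tables in this article, and since is assumed to be non-null.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical variant of the query step using try-with-resources. */
final class WordQuerySketch {

    static final class Batch {
        final List<String> words = new ArrayList<>();
        final List<String> deletedWords = new ArrayList<>();
        Timestamp latest; // latest moditime seen, starts at the value passed in
        boolean failed;   // true when the query did not complete
    }

    static Batch load(String url, String user, String password, String sql, Timestamp since) {
        Batch batch = new Batch();
        batch.latest = since;
        // resources declared here are closed automatically, in reverse order
        try (Connection connection = DriverManager.getConnection(url, user, password);
             PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setTimestamp(1, since);
            try (ResultSet result = statement.executeQuery()) {
                while (result.next()) {
                    String word = result.getString("word");
                    if ("1".equals(result.getString("ifdel"))) {
                        batch.deletedWords.add(word);
                    } else {
                        batch.words.add(word);
                    }
                    Timestamp moditime = result.getTimestamp("moditime");
                    if (moditime != null && moditime.after(batch.latest)) {
                        batch.latest = moditime;
                    }
                }
            }
        } catch (SQLException e) {
            batch.failed = true;
            System.err.println("jdbc load words failed: " + e.getMessage());
        }
        return batch;
    }
}
```

 Returning the new timestamp in the batch, instead of mutating the Timestamp passed in, lets the caller decide whether to advance it. One possible policy is to keep the old value whenever failed is true, so the next poll retries the same window.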

(4) Finally, start the scheduled task in the Dictionary.initial method
 Code:

public static synchronized void initial(Configuration cfg) {
    if (singleton == null) {
        synchronized (Dictionary.class) {
            if (singleton == null) {
                singleton = new Dictionary(cfg);
                ......
                // start the incremental update from the database
                pool.scheduleAtFixedRate(new JdbcMonitor(singleton.jdbcConfig), 10, singleton.jdbcConfig.getInterval(), TimeUnit.SECONDS);
            }
        }
    }
}
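
 One detail of scheduleAtFixedRate is worth keeping in mind: according to its Javadoc, if any execution of the task throws an exception, subsequent executions are suppressed. JdbcMonitor catches SQLException, but an unexpected RuntimeException would silently stop the updates. A possible safeguard is a small wrapper like the sketch below. SafeTaskSketch and guarded are hypothetical names, and the failure counter exists only to make the behaviour observable.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical wrapper that keeps a periodic task alive after a runtime failure. */
final class SafeTaskSketch {

    /** Wraps a task so that runtime exceptions are recorded instead of propagating. */
    static Runnable guarded(Runnable task, AtomicInteger failures) {
        return () -> {
            try {
                task.run();
            } catch (RuntimeException e) {
                failures.incrementAndGet();
                System.err.println("scheduled task failed: " + e);
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger runs = new AtomicInteger();
        AtomicInteger failures = new AtomicInteger();
        // fails on its first run only, to show that later runs still happen
        Runnable flaky = () -> {
            if (runs.incrementAndGet() == 1) {
                throw new IllegalStateException("first run fails");
            }
        };

        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleAtFixedRate(guarded(flaky, failures), 0, 50, TimeUnit.MILLISECONDS);
        Thread.sleep(300);
        pool.shutdownNow();
        System.out.println("runs=" + runs.get() + " failures=" + failures.get());
    }
}
```

 In the plugin, the equivalent would be passing guarded(new JdbcMonitor(...), counter) to pool.scheduleAtFixedRate, or putting the same try/catch inside JdbcMonitor.run.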

(5) Finally, build with mvn clean package. The plugin package is generated under ~\target\releases
(6) Extract it into <es install path>/plugins/ik and restart es
