After installing Elasticsearch (https://blog.csdn.net/sweeper_freedoman/article/details/86227778), the next step is getting data into it. The project data is a very common setup: it lives in MySQL. Searching for a way to import from MySQL into Elasticsearch turned up the Logstash tool. A quick look at the documentation shows that, judging from its installation and configuration, it works by periodically executing a user-defined SQL query, which is easy to understand and therefore easy to maintain. This post records the installation process on a local Ubuntu Linux machine.
① Reinstall Java
Logstash depends on Java 8. The OpenJDK that ships with the local Ubuntu 18.04 system does not match, so Java 8 has to be installed first.
root@ubuntu:~# java --version
openjdk 10.0.2 2018-07-17
OpenJDK Runtime Environment (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4)
OpenJDK 64-Bit Server VM (build 10.0.2+13-Ubuntu-1ubuntu0.18.04.4, mixed mode)
root@ubuntu:~# apt remove openjdk*
root@ubuntu:~# apt autoremove --purge
root@ubuntu:~# apt install openjdk-8-jdk
root@ubuntu:~# java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-0ubuntu0.18.04.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
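Removing the newer JDK is not strictly required; if it needs to stay installed, Ubuntu's update-alternatives can switch the default java binary to Java 8 instead. A rough sketch (the exact JVM directory depends on the architecture and the packages installed locally):

# as root, pick the Java 8 entry from the registered alternatives interactively
update-alternatives --config java
# point JAVA_HOME at the Java 8 install for the current shell (path may differ)
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64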
② Install Logstash
root@ubuntu:~# su ubuntu
ubuntu@ubuntu:/root$ cd
ubuntu@ubuntu:~$ wget https://artifacts.elastic.co/downloads/logstash/logstash-6.5.4.tar.gz
ubuntu@ubuntu:~$ tar -xz -f logstash-6.5.4.tar.gz
Learning from the lesson of the Elasticsearch installation, switch to a regular (non-root) user first here as well.
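Before moving on, the unpacked copy can be sanity-checked by asking it for its version (a quick check, not part of the original steps; run it as the same non-root user):

# print the Logstash version to confirm the tarball unpacked correctly
~/logstash-6.5.4/bin/logstash --version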
③ Run the example
ubuntu@ubuntu:~$ cd logstash-6.5.4/bin/
ubuntu@ubuntu:~/logstash-6.5.4/bin$ bash logstash -e 'input { stdin { } } output { stdout {} }'
…… omitted ……
The stdin plugin is now waiting for input:
[2019-01-11T12:02:39,680][INFO ][logstash.agent ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2019-01-11T12:02:39,830][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
hello world
{
"@version" => "1",
"host" => "ubuntu",
"@timestamp" => 2019-01-11T03:03:27.533Z,
"message" => "hello world"
}
[2019-01-11T12:03:49,284][INFO ][logstash.pipeline ] Pipeline has terminated {:pipeline_id=>"main", :thread=>"#"}
This is the "hello world" example from the official documentation (exit with CTRL-D); successful output confirms that Logstash is installed OK.
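While the example pipeline is still running (before pressing CTRL-D), the Logstash API endpoint announced on port 9600 in the log above can also be queried from another terminal; a quick sketch, assuming curl is available:

# query the node info endpoint of the Logstash API started on port 9600
curl -XGET 'http://localhost:9600/?pretty'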
④ Install the JDBC driver
Using Logstash to import MySQL data into Elasticsearch requires the MySQL JDBC driver (Connector/J). If it is already available in the local environment, this step can be skipped.
ubuntu@ubuntu:~/logstash-6.5.4/bin$ cd
ubuntu@ubuntu:~$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-8.0.13.tar.gz
ubuntu@ubuntu:~$ tar -xz -f mysql-connector-java-8.0.13.tar.gz
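The full path of the extracted jar is what the `jdbc_driver_library` option in the next step points at, so it is worth confirming where it ended up:

# this jar path is referenced later by jdbc_driver_library in the Logstash config
ls /home/ubuntu/mysql-connector-java-8.0.13/mysql-connector-java-8.0.13.jar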
⑤ Configure Logstash
ubuntu@ubuntu:~$ cd logstash-6.5.4/config/
ubuntu@ubuntu:~/logstash-6.5.4/config$ vim logstash.sql
ubuntu@ubuntu:~/logstash-6.5.4/config$ cat logstash.sql
SELECT *
FROM `test`.`case_parse` AS cp
WHERE cp.`source_id` > :sql_last_value
ubuntu@ubuntu:~/logstash-6.5.4/config$ vim logstash.cnf
ubuntu@ubuntu:~/logstash-6.5.4/config$ cat logstash.cnf
input {
  jdbc {
    jdbc_driver_library => "/home/ubuntu/mysql-connector-java-8.0.13/mysql-connector-java-8.0.13.jar"
    jdbc_driver_class => "com.mysql.cj.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://localhost:3306/test"
    jdbc_user => "root"
    jdbc_password => "1234"
    statement_filepath => "/home/ubuntu/logstash-6.5.4/config/logstash.sql"
    use_column_value => true
    tracking_column => "source_id"
    tracking_column_type => "numeric"
    last_run_metadata_path => "/home/ubuntu/logstash-6.5.4/.logstash_jdbc_last_run"
    schedule => "* * * * *"
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "table"
    document_id => "%{source_id}"
    codec => "json"
  }
}
This step is mainly about writing two files (the .sql file is optional: the query can also be placed directly in the .cnf file via the `statement` option). The .sql file holds the query that pulls the data; `sql_last_value` in it is a built-in Logstash variable that drives incremental updates. The .cnf file configures the JDBC input and Elasticsearch output options.
The input block sets the database access options. `tracking_column` names the column used to track update progress, here the `source_id` field, an INT PRIMARY KEY. The asterisks after `schedule` should look familiar: they are the cron-style time fields (minute, hour, day of month, month, day of week), so with this configuration the data sync runs once per minute. The output block defines the Elasticsearch connection, along with the index name, the document_id, and the codec.
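Before letting the schedule kick in, Logstash can validate the configuration file without starting the pipeline, using the --config.test_and_exit flag (a quick check, not part of the original walkthrough):

# parse logstash.cnf and exit, reporting any syntax errors
~/logstash-6.5.4/bin/logstash -f ~/logstash-6.5.4/config/logstash.cnf --config.test_and_exit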
The example table is defined as follows.
CREATE TABLE `case_parse` (
  `source_id` INT(10) UNSIGNED NOT NULL COMMENT 'case source id',
  `name` VARCHAR(255) NULL DEFAULT NULL COMMENT 'case name' COLLATE 'utf8mb4_general_ci',
  `source` VARCHAR(5000) NULL DEFAULT NULL COMMENT 'message source' COLLATE 'utf8mb4_general_ci',
  PRIMARY KEY (`source_id`)
)
COMMENT='case information parse record'
COLLATE='utf8_general_ci'
ENGINE=InnoDB
;
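To give the first scheduled query something to pick up, a couple of rows can be inserted beforehand; the values below are made up purely for illustration, not taken from the original data:

# insert two hypothetical sample rows into the test.case_parse table
mysql -u root -p test -e "INSERT INTO case_parse (source_id, name, source) VALUES (1, 'demo case', 'demo source'), (2, 'another case', 'another source');"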
⑥ Run the import
ubuntu@ubuntu:~/logstash-6.5.4/config$ bash /home/ubuntu/elasticsearch-6.5.4/bin/elasticsearch -d -p Zoo.pid
ubuntu@ubuntu:~/logstash-6.5.4/config$ bash ../bin/logstash -f logstash.cnf
…… omitted ……
[2019-01-11T14:30:01,101][INFO ][logstash.inputs.jdbc ] (0.014795s) SELECT *
FROM `test`.`case_parse` AS cp
WHERE cp.`source_id` > 0
[2019-01-11T14:31:00,367][INFO ][logstash.inputs.jdbc ] (0.003792s) SELECT *
FROM `test`.`case_parse` AS cp
WHERE cp.`source_id` > 233
[2019-01-11T14:32:00,061][INFO ][logstash.inputs.jdbc ] (0.001589s) SELECT *
FROM `test`.`case_parse` AS cp
WHERE cp.`source_id` > 233
…… omitted ……
Before running the import, make sure both the MySQL server and the Elasticsearch service are running; here the Elasticsearch service was started first. Logstash is then run with the "-f" parameter pointing at the import configuration file defined above. Looking at the log output, the query executed for each sync, i.e. the one defined in the .sql file, is printed at roughly one-minute intervals. The largest value of the numeric primary key in the example table is 233, and no new rows were written during the run, so the later output simply repeats. Open http://localhost:9200/table/_search?pretty in a browser to view the imported data in JSON format; "table" in the URL is the custom value of the "index" parameter in the .cnf file.
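The same check works from the command line; for example, with curl against the index configured above:

# fetch the documents indexed into "table" in a readable format
curl -XGET 'http://localhost:9200/table/_search?pretty'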
References:
https://www.elastic.co/downloads/logstash
https://www.elastic.co/guide/en/logstash/6.5/installing-logstash.html
https://www.elastic.co/guide/en/logstash/current/first-event.html
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-jdbc.html
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html
https://www.elastic.co/guide/en/logstash/6.5/running-logstash-command-line.html