The previous post covered the installation and configuration of Kibana, the management tool for the Elasticsearch search engine; this post walks through the installation and configuration of the data import tool Logstash.
Logstash is a powerful data processing tool. It can perform full or incremental transfers from a wide variety of data sources, normalize data into a standard format, and emit formatted output, and it is commonly used for log processing. Its workflow consists of three stages: input (collect the data), filter (parse and transform it), and output (ship it to a destination).
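As a minimal sketch of those three stages (the plugins used here are placeholders for illustration, not the setup described below), a pipeline configuration file looks like this:

input {
  # stage 1: read events from a data source
  stdin { }
}
filter {
  # stage 2: parse and transform events, e.g. json { source => "message" }
}
output {
  # stage 3: ship events to a destination
  stdout { codec => rubydebug }
}

The rest of this post covers two files: logstash.yml, the runtime settings file (annotated below), and logstash.conf, the pipeline definition (shown afterwards for several JDBC data sources).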
# Settings file in YAML
#
# Settings can be specified either in hierarchical form, e.g.:
#
#   pipeline:
#     batch:
#       size: 125
#       delay: 5
#
# Or as flat keys:
#
# pipeline.batch.size: 125
# pipeline.batch.delay: 5
#
# ------------ Node identity ------------
#
# Use a descriptive name for the node:
# Set the node name
# node.name: test
#
# If omitted the node name will default to the machine's host name
#
# ------------ Data path ------------------
#
# Which directory should be used by logstash and its plugins
# for any persistent needs. Defaults to LOGSTASH_HOME/data
# Directory where Logstash keeps persistent data such as the UUID file
path.data: /data/es/logstash-5.6.1
#
# ------------ Pipeline Settings --------------
#
# Set the number of workers that will, in parallel, execute the filters+outputs
# stage of the pipeline.
#
# This defaults to the number of the host's CPU cores.
# Number of pipeline worker threads; setting this to the number of CPU cores is recommended
pipeline.workers: 10
#
# How many workers should be used per output plugin instance
# Number of worker threads per output plugin instance; also recommended to match the CPU core count
pipeline.output.workers: 10
#
# How many events to retrieve from inputs before sending to filters+workers
# Number of events sent per batch; raising the default helps avoid overloading the ES cluster with network I/O
# Default is 125; larger values process data more efficiently but use more memory, so tune as needed
pipeline.batch.size: 3000
#
# How long to wait before dispatching an undersized batch to filters+workers
# Value is in milliseconds.
# Dispatch delay, i.e. how long to wait for a full batch before sending an undersized one; default is 5
pipeline.batch.delay: 100
#
# Force Logstash to exit during shutdown even if there are still inflight
# events in memory. By default, logstash will refuse to quit until all
# received events have been pushed to the outputs.
#
# WARNING: enabling this can lead to data loss during shutdown
#
# pipeline.unsafe_shutdown: false
#
# ------------ Pipeline Configuration Settings --------------
#
# Where to fetch the pipeline configuration for the main pipeline
#
# Path to the pipeline configuration file
path.config: /data/es/logstash-5.6.1/config/logstash.conf
#
# Pipeline configuration string for the main pipeline
#
# config.string:
#
# At startup, test if the configuration is valid and exit (dry run)
#
# config.test_and_exit: false
#
# Periodically check if the configuration has changed and reload the pipeline
# This can also be triggered manually through the SIGHUP signal
#
# config.reload.automatic: false
#
# How often to check if the pipeline configuration has changed (in seconds)
#
# config.reload.interval: 3
#
# Show fully compiled configuration as debug log message
# NOTE: --log.level must be 'debug'
#
# config.debug: false
#
# When enabled, process escaped characters such as \n and \" in strings in the
# pipeline configuration files.
#
# config.support_escapes: false
#
# ------------ Module Settings ---------------
# Define modules here. Modules definitions must be defined as an array.
# The simple way to see this is to prepend each `name` with a `-`, and keep
# all associated variables under the `name` they are associated with, and
# above the next, like this:
#
# modules:
#   - name: MODULE_NAME
#     var.PLUGINTYPE1.PLUGINNAME1.KEY1: VALUE
#     var.PLUGINTYPE1.PLUGINNAME1.KEY2: VALUE
#     var.PLUGINTYPE2.PLUGINNAME1.KEY1: VALUE
#     var.PLUGINTYPE3.PLUGINNAME3.KEY1: VALUE
#
# Module variable names must be in the format of
#
# var.PLUGIN_TYPE.PLUGIN_NAME.KEY
#
# modules:
#
# ------------ Queuing Settings --------------
#
# Internal queuing model, "memory" for legacy in-memory based queuing and
# "persisted" for disk-based acked queueing. Defaults is memory
#
# queue.type: memory
#
# If using queue.type: persisted, the directory path where the data files will be stored.
# Default is path.data/queue
#
path.queue: /data/es/logstash-5.6.1/data/queue
#
# If using queue.type: persisted, the page data files size. The queue data consists of
# append-only data files separated into pages. Default is 250mb
#
# queue.page_capacity: 250mb
#
# If using queue.type: persisted, the maximum number of unread events in the queue.
# Default is 0 (unlimited)
#
# queue.max_events: 0
#
# If using queue.type: persisted, the total capacity of the queue in number of bytes.
# If you would like more unacked events to be buffered in Logstash, you can increase the
# capacity using this setting. Please make sure your disk drive has capacity greater than
# the size specified here. If both max_bytes and max_events are specified, Logstash will pick
# whichever criteria is reached first
# Default is 1024mb or 1gb
#
# queue.max_bytes: 1024mb
#
# If using queue.type: persisted, the maximum number of acked events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
# queue.checkpoint.acks: 1024
#
# If using queue.type: persisted, the maximum number of written events before forcing a checkpoint
# Default is 1024, 0 for unlimited
#
# queue.checkpoint.writes: 1024
#
# If using queue.type: persisted, the interval in milliseconds when a checkpoint is forced on the head page
# Default is 1000, 0 for no periodic checkpoint.
#
# queue.checkpoint.interval: 1000
#
# ------------ Dead-Letter Queue Settings --------------
# Flag to turn on dead-letter queue.
#
# dead_letter_queue.enable: false
# If using dead_letter_queue.enable: true, the maximum size of each dead letter queue. Entries
# will be dropped if they would increase the size of the dead letter queue beyond this setting.
# Default is 1024mb
# dead_letter_queue.max_bytes: 1024mb
# If using dead_letter_queue.enable: true, the directory path where the data files will be stored.
# Default is path.data/dead_letter_queue
#
path.dead_letter_queue: /data/es/logstash-5.6.1/data/dead_letter_queue
#
# ------------ Metrics Settings --------------
#
# Bind address for the metrics REST endpoint
#
# http.host: "127.0.0.1"
#
# Bind port for the metrics REST endpoint; this option also accepts a range
# (9600-9700) and logstash will pick up the first available port.
#
# http.port: 9600-9700
#
# ------------ Debugging Settings --------------
#
# Options for log.level:
# * fatal
# * error
# * warn
# * info (default)
# * debug
# * trace
#
# log.level: info
# Directory for Logstash's own log files (this must be a directory; log4j2 itself is configured in config/log4j2.properties)
path.logs: /data/es/logstash-5.6.1/logs
#
# ------------ Other Settings --------------
#
# Where to find custom plugins
# path.plugins: []
The pipeline itself is defined in logstash.conf; the following sections show connection configurations for several JDBC data sources.
(1) Greenplum (PostgreSQL) connection configuration for Logstash:
#1. Input stage: connect to the database
input {
# Remove the stdin{} block when Logstash runs in the background
stdin {
}
jdbc {
# Path to the JDBC driver jar; download it yourself, and the path can be customized
jdbc_driver_library => "../lib/greenplum-1.0.jar"
# JDBC driver class
jdbc_driver_class => "com.pivotal.jdbc.GreenplumDriver"
# JDBC connection URL
jdbc_connection_string=>"jdbc:pivotal:greenplum://127.0.0.1:2345;DatabaseName=testIndex"
# JDBC username
jdbc_user => "root"
# JDBC password
jdbc_password => "1234"
# Whether to import with paging; set to false for a full (one-shot) import
jdbc_paging_enabled => "false"
# Page size, i.e. the number of rows imported per batch
jdbc_page_size => "1000"
# Whether to lowercase all column names
lowercase_column_names => "false"
# Path to the SQL file that selects the data to import (customizable)
statement_filepath => "/data/es/logstash-5.6.1/config/jdbc.sql"
# Schedule: how often to run, cron-style (default every minute); the fields are minute, hour, day of month, month, day of week
schedule => "* * * * *"
# Whether to clear the record in last_run_metadata_path; if true, every run re-imports all of the data from scratch
clean_run => false
# Whether to record the last run; if true, the last tracking_column value is written to the file given by last_run_metadata_path, which is needed for incremental imports (the path is customizable)
record_last_run => true
last_run_metadata_path => "/data/es/logstash-5.6.1/data/jdbc.lastrun"
# Whether to track a column value; by default the timestamp is tracked. If true, the last recorded column value is used as the incremental marker; if false, the time of the previous import is recorded instead
use_column_value => true
# Column used as the incremental condition; the next import starts from the last recorded value
tracking_column => inputtime
}
}
#2. Filter stage: parse and format the data
filter {
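# Parse the JSON string in the message field into top-level fields, then drop the raw message field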
json {
source => "message"
remove_field => ["message"]
}
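# Add a timestamp field equal to @timestamp converted to local time plus 8 hours (intended to give Beijing time when the host clock runs in UTC)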
ruby { code => "event.set('timestamp', event.get('@timestamp').time.localtime + 8*60*60)" }
}
#3. Output stage: write the data to Elasticsearch
output {
elasticsearch {
# Elasticsearch host address
hosts => "127.0.0.1:9200"
# Index name (Elasticsearch index names must be lowercase)
index => "testindex"
# Document type name
document_type => "test"
# Document id; map it to the table's primary key column if needed (by default Elasticsearch generates ids automatically)
#document_id => "%{id}"
# Flush once the number of buffered events reaches flush_size
flush_size => 1000
# Also flush once idle_flush_time seconds have passed since the last flush
idle_flush_time => 15
# Overwrite the index template on startup; if you don't need a template, delete or comment out the next two lines
template_overwrite => true
# Template path; a generic template can be downloaded from https://download.csdn.net/download/alan_liuyue/11241484
template => "/data/es/logstash-5.6.1/template/logstash.json"
}
stdout {
codec => json_lines
}
}
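Once the pipeline has run, a quick way to check that documents reached Elasticsearch (assuming the address and index name from the config above):

curl 'http://127.0.0.1:9200/testindex/_count'
curl 'http://127.0.0.1:9200/_cat/indices?v'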
(2) Oracle connection configuration for Logstash:
input {
stdin {
}
jdbc {
jdbc_driver_library => "../lib/ojdbc14-10.2.0.3.0.jar"
jdbc_driver_class => "Java::oracle.jdbc.driver.OracleDriver"
jdbc_connection_string => "jdbc:oracle:thin:root/1234@//127.0.0.1:2345/orcl"
jdbc_user => "root"
jdbc_password => "1234"
jdbc_paging_enabled => "false"
jdbc_page_size => "1000"
statement_filepath => "/data/es/logstash-5.6.1/config/sql/test.sql"
schedule => "* * * * *"
clean_run => false
record_last_run => true
last_run_metadata_path => "/data/es/logstash-5.6.1/data/test.lastrun"
use_column_value => true
tracking_column => inputtime
type => "test"
}
}
(3) SQL Server connection configuration for Logstash:
input {
stdin {
}
jdbc {
jdbc_driver_library => "../lib/sqljdbc4.jar"
jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
jdbc_connection_string => "jdbc:sqlserver://127.0.0.1:2345;databaseName=TESTDB "
jdbc_user => "root"
jdbc_password => "1234"
jdbc_paging_enabled => "false"
jdbc_page_size => "1000"
statement_filepath => "/data/es/logstash-5.6.1/config/sql/test.sql"
schedule => "* * * * *"
clean_run => false
record_last_run => true
last_run_metadata_path => "/data/es/logstash-5.6.1/data/test.lastrun"
use_column_value => true
tracking_column => inputtime
type => "test"
}
}
(4) MySQL connection configuration for Logstash:
input {
stdin {
}
jdbc {
jdbc_driver_library => "../lib/mysql-connector-java-6.0.5.jar"
jdbc_driver_class => "com.mysql.jdbc.Driver"
jdbc_connection_string => "jdbc:mysql://127.0.0.1:2345/TESTDB"
jdbc_user => "root"
jdbc_password => "1234"
jdbc_paging_enabled => "false"
jdbc_page_size => "1000"
statement_filepath => "/data/es/logstash-5.6.1/config/sql/test.sql"
schedule => "* * * * *"
clean_run => false
record_last_run => true
last_run_metadata_path => "/data/es/logstash-5.6.1/data/test.lastrun"
use_column_value => true
tracking_column => inputtime
type => "test"
}
}
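One caveat for mysql-connector-java 6.x specifically: the current driver class name is com.mysql.cj.jdbc.Driver (the old com.mysql.jdbc.Driver still loads but logs a deprecation warning), and depending on the MySQL server's timezone and SSL settings you may also need extra URL parameters, for example:

jdbc_connection_string => "jdbc:mysql://127.0.0.1:2345/TESTDB?useSSL=false&serverTimezone=Asia/Shanghai"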
Go into the bin directory;
Run nohup ./logstash & to start Logstash in the background;
To stop it, run ps aux|grep logstash to find the process, then kill that process;
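For example (assuming the install directory used throughout this post):

cd /data/es/logstash-5.6.1/bin
nohup ./logstash > /dev/null 2>&1 &    # reads path.config from logstash.yml
ps aux | grep logstash                 # note the pid
kill <pid>                             # Logstash shuts down after flushing in-flight events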
1. JDK version incompatibility: Logstash needs JDK 1.8 or later. Workaround:
Without changing the current JDK environment variables, add the following at the top of the logstash script in the bin directory:
export JAVA_HOME=/usr/local/jdk1.8.0_121
export PATH=$JAVA_HOME/bin:$PATH
2. The incremental marker sql_last_value defaults to a timestamp value. To use a custom column instead, edit the stored sql_last_value by hand once (in the last-run file), then tell Logstash which column to track (record_last_run => true; tracking_column => stringField). Logstash will then rewrite that value from the last imported record and use it for incremental imports;
3. The template used in the output stage mainly serves to map string fields to the ik analyzer; if you don't need it, simply remove the template settings (a minimal sketch is shown after these notes);
4. PostgreSQL, Oracle, SQL Server, MySQL and other databases all use different connection settings, so switch the connection configuration to match your data source. The driver jars used above are greenplum-1.0.jar, mysql-connector-java-6.0.5.jar, ojdbc14-10.2.0.3.0.jar and sqljdbc4.jar; download them yourself if needed;
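For reference, a minimal sketch of such a template (the index pattern and the choice of ik_max_word are illustrative assumptions; the downloadable template linked above is more complete):

{
  "template": "testindex*",
  "mappings": {
    "_default_": {
      "dynamic_templates": [
        {
          "strings_use_ik": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "analyzer": "ik_max_word"
            }
          }
        }
      ]
    }
  }
}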
1. How to write test.sql:
select * from tableName where inputtime > :sql_last_value
Note: the SQL syntax is ordinary SQL; for an incremental import, the incremental column must be used in the condition.
:sql_last_value is the built-in placeholder, and Logstash automatically substitutes the value read from test.lastrun;
2. How to write test.lastrun:
--- 2017-12-26 00:00:00
Note: if the incremental column is a time type, write the time of the first import in the format above. If the incremental column is a string,
for example "20171226000000", then the line must instead be written as --- '20171226000000'; otherwise the incremental import will not work.
Practice is the sole criterion for testing truth; get hands-on and see for yourself~