- Data collection and validation scheme
- HDFS data collection
- Filebeat
- Kafka-Eagle
- ES data collection
- Logstash
- ES-Kibana
- Setting an index lifecycle policy in Kibana
- Elastalert
- MySQL data collection
Repository
https://gitee.com/carollia/tools-docker
It contains the Dockerfiles for the Docker images used below; you can build and rename your own images, or simply use the public ones.
HDFS data collection
Filebeat -> Kafka -> Flink -> Hive
The goal is to load data into the data warehouse for operational data analysis.
Filebeat ships the logs to Kafka, and Kafka-Eagle is used for simple per-topic message counts to verify that the data is correct.
Filebeat
Filebeat is a lightweight log shipper written in Go. It is very reliable, guaranteeing at-least-once delivery of logs, and it handles the usual problems of log collection, such as resuming reads after an interruption, file renames, and truncated logs.
To make Filebeat work out of the box, a dedicated Dockerfile was designed so that the different behaviors can be switched on purely through environment variables in docker-compose.yml; a sample log line matching this setup is sketched after the compose file.
version: "3"
services:
  filebeat:
    restart: always
    image: carollia/filebeat:v7.9.3_docker
    container_name: filebeat
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
    extra_hosts: # IP-to-hostname mappings
      - "master:IP"
      - "slaves01:IP"
      - "slaves02:IP"
    volumes:
      - /var/lib/docker/containers:/var/lib/docker/containers
      - /var/run/docker.sock:/var/run/docker.sock
    environment:
      CONTAINERS_IDS: '["*"]' # edit here: "*" collects all containers; put specific container IDs in the list to collect only those containers' logs
      CONTAINERS_PATH: "\\/var\\/lib\\/docker\\/containers"
      OUTPUT_CONSOLE_ENABLE: "true" # 1. true prints to the console; must not be true at the same time as the Kafka output
      OUTPUT_KAFKA_ENABLE: "false" # 2. true ships to Kafka (set the console output to false)
      KAFKA_HOSTS: "['slaves02:9092']" # Kafka broker address
      KAFKA_VESION: "2.0.0"
      KAFKA_INVALID_TOPIC: "filebeat-invalid" # Kafka topic for invalid data: non-JSON messages, or messages without the message.type used by the rule below
      KAFKA_TOPIC: "filebeat-%{[message.type]}" # route messages to different topics by the log's type field
      KAFKA_REQUIRED_ACKS: 1 # Kafka ack level
      KAFKA_COMPRESSION: "snappy"
      KAFKA_MAX_MESSAGE_BYTES: 1000000
      PROCESSORS_JSON_FILEDS: "['message']"
      PROCESSORS_JSON_PROCESS_ARRAY: "true"
      PROCESSORS_JSON_MAX_DEPTH: 3 # JSON parsing depth
      PROCESSORS_JSON_TARGET: "message"
      PROCESSORS_JSON_OVERWRITE_KEYS: "true"
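To make the routing concrete, here is a minimal sketch (not from the original setup) of the kind of JSON log line this configuration can handle: the type field drives the Kafka topic, while app, level, msg and ts are illustrative field names assumed here because similar fields are used later by the Logstash and Elastalert configs. A line that is not JSON, or has no type, ends up in the filebeat-invalid topic.

# Hypothetical application log line; only "type" is required by the routing rule above,
# the other field names are illustrative assumptions.
import datetime
import json

log = {
    "type": "nginx",      # Filebeat routes this line to the Kafka topic "filebeat-nginx"
    "app": "dataq",       # later used by Logstash to build the index name
    "level": "info",
    "msg": "request handled in 12 ms",
    "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}

# Containers write to stdout; Docker stores the line under /var/lib/docker/containers,
# where Filebeat tails it and decodes the JSON into the "message" field.
print(json.dumps(log))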
MySQL 8
version: '3'
services:
  # user: admin, pwd: :manager
  # user: appconfig, pwd: :appconfig
  mysql:
    restart: always
    container_name: mysql
    image: mysql:8.0.20
    ports:
      - "3306:3306"
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
    volumes:
      - ./my.conf:/etc/mysql/my.conf
    environment:
      MYSQL_ROOT_PASSWORD: "123456"
my.conf
[mysqld]
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
datadir = /opt
#secure-file-priv= /var/lib/mysql
secure-file-priv=
symbolic-links=0
log-bin=/var/lib/mysql/mysql-bin
binlog_format=row
binlog_cache_size=32m
max_binlog_cache_size=64m
max_binlog_size=512m
long_query_time=1
log_output=FILE
log-error=/var/lib/mysql/mysql-error.log
slow_query_log=1
slow_query_log_file=/var/lib/mysql/slow_statement.log
general_log=0
general_log_file=/var/lib/mysql/general_statement.log
binlog_expire_logs_seconds=1728000
relay-log=/var/lib/mysql/relay-bin
relay-log-index=/var/lib/mysql/relay-bin.index
master-info-repository=TABLE
relay-log-info-repository=TABLE
!includedir /etc/mysql/conf.d/
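Before pointing Canal (further below) at this instance, it can be worth confirming that the binlog settings from my.conf are actually in effect. A minimal sketch, assuming pymysql is installed and the root password from the compose file above:

# Quick sanity check (not part of the original setup): binlog must be on and in ROW
# format for the Canal collector described later in this article.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=3306, user="root", password="123456")
try:
    with conn.cursor() as cur:
        cur.execute("SHOW VARIABLES WHERE Variable_name IN ('log_bin', 'binlog_format')")
        for name, value in cur.fetchall():
            print(name, "=", value)   # expect log_bin = ON, binlog_format = ROW
finally:
    conn.close()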
Kafka-Eagle
Kafka-Eagle is used to verify the data in Kafka: creating and deleting topics, querying data with KQL, and so on. A scripted per-topic count check is also sketched after the compose file.
version: '3'
services:
  kafka-eagle:
    restart: always
    container_name: kafka-eagle
    image: carollia/kafka-eagle:v2.0.2 # image name
    ports:
      - "8048:8048"
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
    extra_hosts:
      - "master:IP"
      - "slaves01:IP"
      - "slaves02:IP"
    environment:
      ZK_CLUSTER_ALIAS: "test" # multiple clusters can be listed here, comma-separated, e.g. test,local,dev
      ZK_LOCAL: "" # if "local" is listed above, set the ZooKeeper address of the local cluster here
      ZK_TEST: "master:2181" # if "test" is listed above, set the ZooKeeper address of the test cluster here
      ZK_DEV: "" # if "dev" is listed above, set the ZooKeeper address of the dev cluster here
      MYSQL_URL: "ip:port" # database ip:port
      MYSQL_DATABASE: "ke" # database name
      MYSQL_USER: "user" # database user
      MYSQL_PWD: "password" # database password
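Besides the Kafka-Eagle UI, the per-topic message counts can also be checked from a script. A minimal sketch, assuming kafka-python is installed and the broker address used elsewhere in this article; the topic names come from the Filebeat and Logstash configs:

# Count the messages still retained in each topic (the Kafka-Eagle UI shows the same
# information). Not part of the original setup.
from kafka import KafkaConsumer, TopicPartition

def topic_message_count(topic, bootstrap="slaves02:9092"):
    consumer = KafkaConsumer(bootstrap_servers=bootstrap)
    partitions = consumer.partitions_for_topic(topic) or set()
    tps = [TopicPartition(topic, p) for p in partitions]
    begin = consumer.beginning_offsets(tps)
    end = consumer.end_offsets(tps)
    consumer.close()
    # retained messages = sum over partitions of (latest offset - earliest offset)
    return sum(end[tp] - begin[tp] for tp in tps)

for t in ["filebeat-nginx", "filebeat-es", "filebeat-invalid"]:
    print(t, topic_message_count(t))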
ES data collection
Filebeat -> Kafka -> Logstash -> ES -> Elastalert (log alerting)
The goal is to collect server-side and client-side logs and raise alerts on them, so that the health of the services can be monitored in time and the user experience improved.
Logstash
Logstash is a free and open server-side data processing pipeline that ingests data from multiple sources and transforms it, and its integration with Elasticsearch is very mature. The Docker setup below creates one index per day; combined with the expiry policy configured in ES/Kibana, old indices are deleted on schedule. A quick check of the resulting daily indices is sketched after the compose file.
version: "3"
services:
  logstash:
    restart: always
    image: carollia/logstash:v7.9.3_kafka
    container_name: logstash-kafka
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
    extra_hosts:
      - "master:IP"
      - "slaves01:IP"
      - "slaves02:IP"
    healthcheck:
      disable: true
    environment:
      KAFKA_BROKER: "slaves02:9092" # Kafka brokers to consume from
      KAFKA_GROUP_ID: "logstash-kafka" # Kafka consumer group ID
      LOGSTASH_CLIENT: "logstash-kafka" # Kafka consumer client ID
      KAFKA_AUTO_OFFSET_RESERT: "earliest" # Kafka offset reset policy
      KAFKA_ENABLE_AUTO_COMMIT: "true"
      KAFKA_TOPICS: "['filebeat-nginx','filebeat-es']" # list of Kafka topics to consume
      ES_HOST: "['master:9200']" # Elasticsearch output address
      ES_INDEX: "%{[message][app]}-logstash-%{+YYYY.MM.dd}" # index names must be lowercase; one index per day
      ES_USERNAME: "elastic" # Elasticsearch username
      ES_PASSWORD: "elastic" # Elasticsearch password
      STDOUT: "true" # whether to also print to the console
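To confirm that one index per day is actually being created, the _cat API can be queried directly. A minimal sketch, assuming the requests library and the ES address and credentials from the configs above:

# List the daily Logstash indices; not part of the original setup.
import requests

resp = requests.get(
    "http://master:9200/_cat/indices/*-logstash-*?v&s=index",
    auth=("elastic", "elastic"),
    timeout=10,
)
print(resp.text)   # one line per index, e.g. <app>-logstash-2021.01.01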
ES-Kibana
Elasticsearch is an open-source search engine built on Apache Lucene™; at its core is Lucene, to date the most advanced, best-performing, and most feature-rich search engine library. Elasticsearch is simple to use and offers very powerful full-text search.
Kibana is an open-source analytics and visualization platform that works with Elasticsearch. It makes it easier to work with the data stored in Elasticsearch, from advanced data analysis to visualizing the data in charts.
version: "3.2"
volumes:
  elastic_data_7.5.2:
services:
  # run on the host first: sysctl -w vm.max_map_count=262144
  es:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.9.3
    container_name: es
    volumes:
      - elastic_data_7.5.2:/usr/share/elasticsearch/data
    ports:
      - "9200:9200"
      - "9300:9300"
    environment:
      http.cors.allow-headers: "Authorization"
      node.name: "es"
      cluster.name: "docker-cluster"
      network.host: "0.0.0.0"
      network.publish_host: _eth0_
      discovery.type: "single-node"
      bootstrap.memory_lock: "true"
      xpack.security.enabled: "true"
      ES_JAVA_OPTS: "-Xmx1g -Xms1g"
      ELASTIC_PASSWORD: "elastic"
      node.max_local_storage_nodes: 20
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 655350
        hard: 655350
      nproc: 655350
    logging:
      driver: "json-file"
      options:
        max-size: "1g"
        max-file: "1"
    restart: always
  kibana:
    image: docker.elastic.co/kibana/kibana:7.5.2
    container_name: kibana
    ports:
      - "5601:5601"
    environment:
      # only upper-case environment variables take effect; lower-case ones are ignored
      SERVER_NAME: "kibana"
      SERVER_HOST: "0.0.0.0"
      ELASTICSEARCH_HOSTS: "http://es:9200"
      KIBANA_INDEX: ".kibana"
      ELASTICSEARCH_USERNAME: "elastic"
      ELASTICSEARCH_PASSWORD: "elastic"
    logging:
      driver: "json-file"
      options:
        max-size: "1g"
        max-file: "1"
    restart: always
    depends_on:
      - "es"
Setting an index lifecycle policy in Kibana
1. Create an index template
2. Create an expiry (delete) policy
3. Apply the expiry policy to the index template
Roughly equivalent Elasticsearch API calls are sketched below.
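The three steps are normally done in the Kibana UI; the sketch below performs them through the Elasticsearch API instead. The policy name logstash-expire, the template name logstash-daily, and the 7-day retention are illustrative assumptions, not values from this article; adjust them to the real index pattern and retention.

# Create a delete-after-expiry ILM policy and attach it to an index template; assumes
# `pip install requests` and the ES address/credentials used above.
import requests

ES = "http://master:9200"
AUTH = ("elastic", "elastic")

# step 2: policy that deletes indices once they are 7 days old (illustrative value)
r1 = requests.put(f"{ES}/_ilm/policy/logstash-expire", auth=AUTH, json={
    "policy": {"phases": {
        "delete": {"min_age": "7d", "actions": {"delete": {}}},
    }},
})

# steps 1 + 3: index template for the daily Logstash indices, with the policy attached
r2 = requests.put(f"{ES}/_index_template/logstash-daily", auth=AUTH, json={
    "index_patterns": ["*-logstash-*"],
    "template": {"settings": {"index.lifecycle.name": "logstash-expire"}},
})

print(r1.status_code, r2.status_code)   # 200 on success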
Elastalert
Elastalert is an Elasticsearch alerting tool developed in Python by Yelp. The Docker image below packages error-log alerts, alerts on custom log levels, and alerts for when no logs are produced at all. A simple smoke test for the error-log rule is sketched after the compose file.
version: "3.2"
services:
  elastalert:
    image: carollia/elastalert:dingtalk
    container_name: es-alert
    logging:
      driver: "json-file"
      options:
        max-size: "500m"
        max-file: "1"
    restart: always
    healthcheck:
      disable: true
    environment:
      ELASTALERT_CONFIG_RUN_EVERY_SECONDS: 60
      ELASTALERT_CONFIG_BUNFFER_TIME: 15
      ELASTALERT_CONFIG_ES_HOST: "xxxxx" # todo: ES IP address
      ELASTALERT_CONFIG_ES_PORT: 9200
      ELASTALERT_CONFIG_ES_USERNAME: "elastic" # todo: ES username
      ELASTALERT_CONFIG_ES_PASSWORD: "elastic" # todo: ES password
      ELASTALERT_CONFIG_WRITEBACK_INDEX: "elastalert_status"
      ELASTALERT_CONFIG_ALERT_TIME_LIMIT_DAYS: 2
      ELASTALERT_CONFIG_SMTP_HOST: "smtp.163.com"
      ELASTALERT_CONFIG_SMTP_PORT: 25
      ELASTALERT_CONFIG_SMTP_SSL: "false"
      SMTP_USER: "[email protected]" # todo: 163 mail account
      SMTP_PASSWORD: "xxx" # todo: 163 mail authorization code
      # error-log alert: fires when the log's level field contains "error"
      RULE_NAME: "ES日志异常-DataQ"
      RULE_INDEX: "logstash-dataq*"
      RULE_FILTER_LEVEL: "error" # the level field contains "error"
      # warning-log alert: fires when the log's level field contains "warn"
      RULE_WARN_NAME: "ES日志异常-Events上报失败"
      RULE_WARN_INDEX: "logstash-dataq*"
      RULE_WARN_FILTER_LEVEL: "warn"
      # shared settings for the error and warning alerts
      RULE_TYPE: "frequency"
      RULE_NUM_EVENTS: 1
      RULE_TIMEFRAME_HOURS: 1
      RULE_ALERT: "elastalert_modules.dingtalk_alert.DingTalkAlerter"
      RULE_DINGTALK_WEBHOOK: "https:\\/\\/oapi.dingtalk.com\\/robot\\/send?access_token=xxx" # todo: DingTalk robot webhook
      RULE_DINGTALK_MSGTYPE: "text"
      RULE_FROM_ADDR: "[email protected]" # todo: sending 163 mail address
      RULE_EMAIL: "[email protected]" # todo: receiving mail address
      # alert when no logs are produced
      RULE_SPARE_ERROR_NAME: "EventsReport日志异常"
      RULE_SPARE_ERROR_INDEX: "logstash-event*"
      RULE_SPARE_ERROR_TEXT_ALERT: "ES日志异常:EventsReport服务10分钟未产生日志"
      RULE_FILTER_APP: "event"
      RULE_SPARE_ERROR_TYPE: "flatline"
      RULE_SPARE_ERROR_NUM_EVENTS: 1
      RULE_SPARE_ERROR_TIMEFRAME_MINUTES: 600
      RULE_SPARE_ERROR_ALERT: "elastalert_modules.dingtalk_alert.DingTalkAlerter"
      RULE_SPARE_ERROR_DINGTALK_WEBHOOK: "https:\\/\\/oapi.dingtalk.com\\/robot\\/send?access_token=xxx" # todo: DingTalk robot webhook
      RULE_SPARE_ERROR_DINGTALK_MSGTYPE: "text"
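To verify the alerting path end to end, one can index a single error-level document into an index that matches RULE_INDEX and wait for the frequency rule to fire. A minimal sketch; the index name logstash-dataq-test and the level/@timestamp field names are assumptions based on the rule settings above, not confirmed by the article:

# Smoke test (not part of the original setup): write one error document and expect a
# DingTalk/email alert within the rule's run interval.
import datetime
import requests

doc = {
    "@timestamp": datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    "level": "error",                  # assumed field the rule filters on
    "app": "dataq",
    "msg": "elastalert smoke test",
}
resp = requests.post(
    "http://master:9200/logstash-dataq-test/_doc",   # hypothetical index matching logstash-dataq*
    json=doc,
    auth=("elastic", "elastic"),
    timeout=10,
)
print(resp.status_code, resp.json())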
MySQL data collection
Binlog -> canal -> Kafka
This is mainly used to collect database change logs (binlog) and is included here as an extra note. Running multiple Docker instances gives high availability; this is not clustering in the usual sense but failover: when one machine goes down, another takes over the collection, and at any given moment only one instance is active. A quick consumer for inspecting the resulting Kafka topic is sketched after the compose file.
version: "3"
services:
  canal:
    restart: always
    image: carollia/canal:1.1.5_kafka
    container_name: canal
    deploy:
      mode: replicated
      replicas: 1
      restart_policy:
        condition: on-failure
    extra_hosts:
      - "master:IP"
      - "slaves01:IP"
      - "slaves02:IP"
    healthcheck:
      disable: true
    environment:
      # ZooKeeper address
      CANAL_ZKSERVERS: "slaves02:2181"
      # tcp, kafka, rocketMQ, rabbitMQ
      CANAL_SERVERMODE: "kafka"
      # use Druid to parse all DDL statements to resolve database and table names
      CANAL_INSTANCE_FILTER_DRUID_DDL: "true"
      # whether to ignore DDL statements (data definition language: create, drop, alter)
      CANAL_INSTANCE_FILTER_QUERY_DDL: "true"
      # whether to ignore DML statements (data manipulation language: insert, delete, update, select)
      CANAL_INSTANCE_FILTER_QUERY_DML: "false"
      # whether to ignore DCL statements (data control language: grant, revoke)
      CANAL_INSTANCE_FILTER_QUERY_DCL: "true"
      # whether to ignore errors when the table schema cannot be resolved from the binlog
      CANAL_INSTANCE_FILTER_TABLE_ERROR: "false"
      # Kafka producer settings
      KAFKA_BOOTSTRAP_SERVERS: "slaves02:9092"
      KAFKA_ACKS: "all"
      KAFKA_COMPRESSION_TYPE: "snappy"
      KAFKA_BATCH_SIZE: 16384
      KAFKA_MAX_REQUEST_SIZE: 1048576
      KAFKA_BUFFER_MEMORY: 33554432
      KAFKA_RETRIES: 1
      # mysql instance config
      CANAL_DESTINATIONS: "example_dev"
      MYSQL_SLAVEID: 123456
      # binlog file and position to start from when connecting to the MySQL primary
      CANAL_INSTANCE_MASTER_JOURNAL_NAME: ""
      CANAL_INSTANCE_MASTER_POSITION: ""
      # binlog timestamp to start from when connecting to the MySQL primary
      CANAL_INSTANCE_MASTER_TIMESTAMP: ""
      # whether to subscribe using MySQL GTID mode
      CANAL_INSTANCE_GTIDON: "false"
      CANAL_INSTANCE_MASTER_GTID: ""
      # MySQL connection settings
      CANAL_INSTANCE_DEFAULT_DATABASENAME: "database" # todo: database to collect
      CANAL_INSTANCE_MASTER_ADDRESS: "IP:PORT" # todo: MySQL address and port of the source database
      CANAL_INSTANCE_DBUSERNAME: "user" # todo: username for the source database
      CANAL_INSTANCE_DBPASSWORD: "password" # todo: password for the source database
      #CANAL_INSTANCE_FILTER_REGEX: "account\\..*,account_log\\..*,money\\..*"
      CANAL_INSTANCE_FILTER_REGEX: "money\\.charge_prop_config" # todo: table whitelist regex
      CANAL_INSTANCE_FILTER_BLACK_REGEX: "mysql\\.slave_.*" # todo: table blacklist regex
      # Kafka topic settings
      CANAL_MQ_TOPIC: "canal-binlog" # Kafka topic to publish to
      CANAL_MQ_PARTITIONSNUM: 3 # number of partitions of the target topic
      CANAL_MQ_PARTITIONHASH: ".*\\..*" # partition hash rule for distributing messages
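To inspect what Canal actually pushes, the target topic can be consumed directly. A minimal sketch, assuming kafka-python is installed, the broker and topic configured above, and Canal's default flat JSON message format:

# Print the change events arriving on the canal-binlog topic; not part of the original setup.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "canal-binlog",
    bootstrap_servers="slaves02:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    # each message describes an insert/update/delete on the whitelisted table(s)
    print(record.partition, record.offset, record.value)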
Articles in this series
Part 1: Automated deployment with Ambari
Part 2: Event tracking design and SDK source code
Part 3: Data collection and validation scheme
Part 4: Real-time ETL: Kafka -> Flink -> Hive
Part 5: ETL user data processing: Kafka -> Spark -> Kudu
Part 6: Presto analytics model SQL and UDF functions
Part 7: User profiling and retention prediction models