In the previous post we built a real-time log analysis platform. Its main job is to detect sensitive keywords in the logs, and the rules differ per log source: some are defaults, some are user-specific. For example, system A may forbid the word "hello" in its logs, while system B forbids the word "world". When a user adds a new sensitive keyword, the application must refresh its local cache in near real time (by default, when Storm starts it loads the default rules into the bolt once, instead of querying the DB every time). Here we implement this with Redis publish/subscribe.
Redis pub/sub resembles the observer pattern: a subscriber can subscribe to multiple channels, and when a publisher publishes a message to a channel, every client subscribed to that channel receives it and can run its own follow-up logic.
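As a warm-up, here is a minimal sketch of the Jedis pub/sub API we rely on below (Jedis 2.9; the host 127.0.0.1 and the channel name "demo" are just for illustration):

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

public class PubSubDemo {
    public static void main(String[] args) throws InterruptedException {
        // Subscriber: subscribe() blocks, so it gets its own thread.
        Thread subscriber = new Thread(() -> {
            try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
                jedis.subscribe(new JedisPubSub() {
                    @Override
                    public void onMessage(String channel, String message) {
                        System.out.println(channel + " -> " + message);
                        unsubscribe(); // let the demo terminate after one message
                    }
                }, "demo");
            }
        });
        subscriber.start();

        Thread.sleep(1000); // crude wait for the subscription to be established
        try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
            jedis.publish("demo", "hello subscribers"); // delivered to all current subscribers
        }
        subscriber.join();
    }
}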
Before looking at the actual implementation, let's review the lifecycle of a Storm bolt and what its main methods are for:
public class IBoltDemo implements IBolt {
    @SuppressWarnings("rawtypes")
    @Override
    public void prepare(Map stormConf, TopologyContext context,
            OutputCollector collector) {
        /**
         * 1. Provides the environment the bolt runs in.
         */
    }
    @Override
    public void execute(Tuple input) {
        /**
         * 1. Processes one input tuple at a time; the tuple object carries metadata
         *    about which component/stream/task it came from.
         * 2. An IBolt need not process the tuple immediately; it may hold on to the
         *    tuple and process it later.
         * 3. If you implement IBasicBolt instead, you do not have to ack() manually
         *    (see the BaseBasicBolt sketch below).
         */
    }
    @Override
    public void cleanup() {
        /**
         * 1. Called when a bolt is about to shut down. It is NOT guaranteed to run,
         *    e.g. when the worker is killed with kill -9.
         */
    }
    /**
     * Bolt lifecycle: the IBolt object is created on the client machine, serialized
     * into the topology, and submitted to Nimbus. Nimbus then launches worker
     * processes, which deserialize the bolt and call prepare() before processing starts.
     */
}
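Point 3 above deserves a quick illustration. A minimal sketch (assuming storm-core 1.1.0, as in the pom below) of a bolt extending BaseBasicBolt, where the framework acks each tuple automatically after execute() returns; the pass-through logic is just an example:

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class PassThroughBolt extends BaseBasicBolt {
    private static final long serialVersionUID = 1L;

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // No explicit ack(): BaseBasicBolt acks automatically after execute() returns.
        collector.emit(new Values(input.getString(0)));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}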
All of our initialization happens in the prepare() method. Different log sources may need different topology configuration (topology name, the Redis channel to listen on, and so on), so at deployment time there may be several main() entry points, i.e. each topic gets its own topology for receiving and processing. The bolt, however, should stay as generic as possible: inside the bolt we load the rule set that matches the project name passed in. Let's look at the code first.
pom.xml
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>flume</groupId>
  <artifactId>storm</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>
  <name>storm</name>
  <url>http://maven.apache.org</url>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <spring.version>4.3.3.RELEASE</spring.version>
  </properties>
  <dependencies>
    <dependency>
      <groupId>redis.clients</groupId>
      <artifactId>jedis</artifactId>
      <version>2.9.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-pool2</artifactId>
      <version>2.4.2</version>
    </dependency>
    <dependency>
      <groupId>commons-pool</groupId>
      <artifactId>commons-pool</artifactId>
      <version>1.6</version>
    </dependency>
    <dependency>
      <groupId>org.apache.curator</groupId>
      <artifactId>curator-framework</artifactId>
      <version>2.7.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-core</artifactId>
      <version>1.1.0</version>
      <scope>provided</scope>
      <exclusions>
        <exclusion>
          <groupId>ch.qos.logback</groupId>
          <artifactId>logback-classic</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>log4j-over-slf4j</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-api</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.clojure</groupId>
          <artifactId>clojure</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.kafka</groupId>
      <artifactId>kafka_2.11</artifactId>
      <version>0.9.0.1</version>
      <exclusions>
        <exclusion>
          <groupId>org.apache.zookeeper</groupId>
          <artifactId>zookeeper</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
        <exclusion>
          <groupId>log4j</groupId>
          <artifactId>log4j</artifactId>
        </exclusion>
      </exclusions>
    </dependency>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-kafka</artifactId>
      <version>1.0.0</version>
    </dependency>
    <dependency>
      <groupId>com.googlecode.json-simple</groupId>
      <artifactId>json-simple</artifactId>
      <version>1.1.1</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-core</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-beans</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-aop</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>org.springframework</groupId>
      <artifactId>spring-context</artifactId>
      <version>${spring.version}</version>
    </dependency>
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.26</version>
    </dependency>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-jdbc</artifactId>
      <version>1.0.1</version>
    </dependency>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-redis</artifactId>
      <version>1.1.0</version>
    </dependency>
  </dependencies>
  <build>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <artifactId>maven-assembly-plugin</artifactId>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
        </configuration>
        <executions>
          <execution>
            <id>make-assembly</id>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
Spout
package com.log.storm;

import java.util.Arrays;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class CounterTopology {
    public static void main(String[] args) {
        try {
            String kafkaZookeeper = "192.168.80.132:2181";
            BrokerHosts brokerHosts = new ZkHosts(kafkaZookeeper);
            SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, "test", "/kafka2storm", "id");
            spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
            spoutConfig.zkServers = Arrays.asList(new String[] { "192.168.80.132" });
            spoutConfig.zkPort = 2181;
            KafkaSpout kafkaSpout = new KafkaSpout(spoutConfig);
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", kafkaSpout, 1);
            /** This log channel comes from project project01. */
            builder.setBolt("parseBolt01", new ParseBolt("192.168.80.132", 6379, "project01"), 3)
                    .shuffleGrouping("spout");
            builder.setBolt("insertbolt01", PersistentBolt.getJdbcInsertBolt(), 1).shuffleGrouping("parseBolt01");
            Config config = new Config();
            config.setDebug(false);
            config.put("projectName", "project01");
            if (args != null && args.length > 0) {
                config.setNumWorkers(1);
                StormSubmitter.submitTopology(args[0], config, builder.createTopology());
            } else {
                config.setMaxTaskParallelism(3);
                LocalCluster cluster = new LocalCluster();
                cluster.submitTopology("special-topology", config, builder.createTopology());
                Thread.sleep(50000);
                cluster.killTopology("special-topology");
                cluster.shutdown();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
In the spout setup you can see that we consume the Kafka topic "test". The Redis channel the bolt listens on is "project01", and the project name set in the Config is also "project01" (the bolt reads this value to query the DB for that project's default rules). If another product's logs need to be collected and analyzed, we only change the Redis channel and the project name.
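For instance, a second entry point for a hypothetical project02 would differ only in these values (the topic "test02", client id "id02", and channel/project name "project02" are made-up examples; the bolt classes are reused as-is):

package com.log.storm;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.kafka.BrokerHosts;
import org.apache.storm.kafka.KafkaSpout;
import org.apache.storm.kafka.SpoutConfig;
import org.apache.storm.kafka.StringScheme;
import org.apache.storm.kafka.ZkHosts;
import org.apache.storm.spout.SchemeAsMultiScheme;
import org.apache.storm.topology.TopologyBuilder;

public class CounterTopology02 {
    public static void main(String[] args) throws Exception {
        BrokerHosts brokerHosts = new ZkHosts("192.168.80.132:2181");
        // A different Kafka topic per product; the topic name here is hypothetical.
        SpoutConfig spoutConfig = new SpoutConfig(brokerHosts, "test02", "/kafka2storm", "id02");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new KafkaSpout(spoutConfig), 1);
        // Only the Redis channel and project name change; ParseBolt itself stays generic.
        builder.setBolt("parseBolt02", new ParseBolt("192.168.80.132", 6379, "project02"), 3)
                .shuffleGrouping("spout");
        builder.setBolt("insertbolt02", PersistentBolt.getJdbcInsertBolt(), 1).shuffleGrouping("parseBolt02");

        Config config = new Config();
        config.put("projectName", "project02");
        config.setNumWorkers(1);
        StormSubmitter.submitTopology("project02-topology", config, builder.createTopology());
    }
}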
Bolt
package com.log.storm;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.PreparedStatement;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;
import redis.clients.jedis.JedisPubSub;
public class ParseBolt extends BaseRichBolt {
    private static final Logger logger = LoggerFactory.getLogger(ParseBolt.class);
    private static final long serialVersionUID = -5508421065181891596L;
    private OutputCollector collector;
    private String host;
    private int port;
    private String channel;
    private JedisPool jedisPool;
    /** Key: project name; value: the set of rules configured for that project. */
    private Map<String, Set<String>> dataRuleConfigMap = new ConcurrentHashMap<String, Set<String>>();
    private String url = "jdbc:mysql://192.168.80.132:3306/logmonitor?useUnicode=true&characterEncoding=UTF-8";
    private String username = "root";
    private String password = "123456";
    private String driver = "com.mysql.jdbc.Driver";

    public ParseBolt(String host, int port, String channel) {
        this.host = host;
        this.port = port;
        this.channel = channel;
    }
    class ListenerThread extends Thread {
        JedisPool pool;
        String channel;

        public ListenerThread(JedisPool pool, String channel) {
            this.pool = pool;
            this.channel = channel;
        }

        @Override
        public void run() {
            JedisPubSub jedisPubSub = new JedisPubSub() {
                @Override
                public void onMessage(String channel, String message) {
                    /**
                     * When notified that the rule set has changed, update the local cache.
                     */
                    logger.info("Current thread: " + Thread.currentThread().getName()
                            + "; listening on channel: " + channel + "; received update message: " + message);
                    Set<String> rules = dataRuleConfigMap.get(channel);
                    if (null != rules) {
                        rules.add(message);
                    }
                }
            };
            Jedis jedis = pool.getResource();
            try {
                // subscribe() blocks this thread for as long as the subscription lives.
                jedis.subscribe(jedisPubSub, channel);
            } finally {
                jedis.close();
            }
        }
    }
@SuppressWarnings("rawtypes")
@Override
public void prepare(Map stormConf, TopologyContext context,
OutputCollector collector) {
this.collector = collector;
jedisPool = new JedisPool(new JedisPoolConfig(), host, port);
ListenerThread listenerThread = new ListenerThread(jedisPool, channel);
listenerThread.start();
String projectName = (String) stormConf.get("projectName");
/**
* 第一次当bolt启动的时候,初始化本地缓存里面的规则信息(也就是默认的规则),这里可以根据不同的项目初始化不同的规则配置
* 也可以查DB或者properties配置加载
*/
try {
Class.forName(driver);
Connection conn = DriverManager.getConnection(url,username,password);
state = conn.createStatement();
rs = state.executeQuery("select * from t_rule_config where projectname='"+projectName+"'");
Set projetNameSet = new HashSet();
while(rs.next()) {
String config = rs.getString(3);
logger.info("当前线程:" + Thread.currentThread().getName() + "数据库信息:" + config);
projetNameSet.add(config);
}
dataRuleConfigMap.put(projectName, projetNameSet);
} catch (ClassNotFoundException e) {
e.printStackTrace();
} catch (SQLException e) {
e.printStackTrace();
}
}
    @Override
    public void execute(Tuple tuple) {
        String message = tuple.getString(0);
        logger.info("bolt receive message : " + message);
        boolean matched = false;
        for (final Map.Entry<String, Set<String>> entry : dataRuleConfigMap.entrySet()) {
            for (final String ruleConfig : entry.getValue()) {
                logger.info("Current thread: " + Thread.currentThread().getName()
                        + "; cached rule: " + ruleConfig);
            }
        }
        for (final Map.Entry<String, Set<String>> entry : dataRuleConfigMap.entrySet()) {
            for (final String ruleConfig : entry.getValue()) {
                if (message.contains(ruleConfig)) {
                    logger.info("Current thread: " + Thread.currentThread().getName()
                            + "; received log message: " + message
                            + "; contains sensitive keyword: " + ruleConfig);
                    matched = true;
                    break;
                }
            }
        }
        if (matched) {
            collector.emit(tuple, new Values(message));
        }
        collector.ack(tuple);
    }
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("message"));
    }
}
When the bolt is initialized, prepare() does three things:
1. Initialize the JedisPool.
2. Start a thread that uses Redis pub/sub to listen for changes on the project's channel.
3. Query the DB and load the default rule set for this project name.
Persisting to DB
package com.log.storm;
import java.sql.Types;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.storm.jdbc.bolt.JdbcInsertBolt;
import org.apache.storm.jdbc.common.Column;
import org.apache.storm.jdbc.common.ConnectionProvider;
import org.apache.storm.jdbc.common.HikariCPConnectionProvider;
import org.apache.storm.jdbc.mapper.JdbcMapper;
import org.apache.storm.jdbc.mapper.SimpleJdbcMapper;
import org.apache.storm.shade.com.google.common.collect.Lists;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@SuppressWarnings("serial")
public class PersistentBolt {
private static Logger logger = LoggerFactory.getLogger(PersistentBolt.class);
private static Map hikariConfigMap = new HashMap() {{
put("dataSourceClassName", "com.mysql.jdbc.jdbc2.optional.MysqlDataSource");
put("dataSource.url", "jdbc:mysql://192.168.80.132:3306/logmonitor?useUnicode=true&characterEncoding=UTF-8");
put("dataSource.user", "root");
put("dataSource.password", "123456");
}};
public static ConnectionProvider connectionProvider = new HikariCPConnectionProvider(hikariConfigMap);
public static JdbcInsertBolt getJdbcInsertBolt() {
JdbcInsertBolt jdbcInsertBolt = null;
@SuppressWarnings("rawtypes")
List schemaColumns = Lists.newArrayList(new Column("message", Types.VARCHAR));
for(@SuppressWarnings("rawtypes") final Column column : schemaColumns) {
if(null != column) {
logger.info("column:" + column.toString());
}
}
if(null != schemaColumns && !schemaColumns.isEmpty()) {
JdbcMapper simpleJdbcMapper = new SimpleJdbcMapper(schemaColumns);
jdbcInsertBolt = new JdbcInsertBolt(connectionProvider, simpleJdbcMapper)
.withInsertQuery("insert into t_source_log(message) values(?)")
.withQueryTimeoutSecs(50);
}
return jdbcInsertBolt;
}
}
Package the project and submit the topology:
./storm jar ../jars/storm-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.log.storm.CounterTopology storm2mysql
Simulate a Kafka producer client:
[root@slave kafka_2.10-0.10.1.0]# bin/kafka-console-producer.sh --broker-list 192.168.80.132:9092 --topic test --producer.config config/producer.properties
Start redis-server:
[root@slave src]# ./redis-server ../redis.conf >/dev/null 2>&1 &
[5] 3696
Start redis-cli:
[root@slave src]# ./redis-cli -h 192.168.80.132
When Storm starts the topology, the initialization code in prepare() runs. The log shows:
2018-02-08 04:14:17.678 c.l.s.ParseBolt [INFO] Current thread: Thread-6-parseBolt01-executor[3 3]; rule from DB: 身份证
2018-02-08 04:14:17.679 c.l.s.ParseBolt [INFO] Current thread: Thread-6-parseBolt01-executor[3 3]; rule from DB: 户口本
2018-02-08 04:14:17.680 o.a.s.d.executor [INFO] Prepared bolt parseBolt01:(3)
2018-02-08 04:14:17.681 c.l.s.ParseBolt [INFO] Current thread: Thread-14-parseBolt01-executor[5 5]; rule from DB: 身份证
2018-02-08 04:14:17.682 c.l.s.ParseBolt [INFO] Current thread: Thread-14-parseBolt01-executor[5 5]; rule from DB: 户口本
2018-02-08 04:14:17.683 o.a.s.d.executor [INFO] Prepared bolt parseBolt01:(5)
2018-02-08 04:14:17.685 c.l.s.ParseBolt [INFO] Current thread: Thread-16-parseBolt01-executor[4 4]; rule from DB: 身份证
2018-02-08 04:14:17.700 c.l.s.ParseBolt [INFO] Current thread: Thread-16-parseBolt01-executor[4 4]; rule from DB: 户口本
Note: because parseBolt's parallelism is set to 3 (parallelism mainly speeds up Storm's processing), there are 3 bolt instances in the Storm cluster, which is why three threads each performed the initialization.
Check the DB: project01 has two default rules, 身份证 ("ID card") and 户口本 ("household register"), and both have been loaded into the local cache. Now let's produce a message whose content is exactly 户口本. It matches one of our rules, i.e. it counts as sensitive information. The log shows:
2018-02-08 04:14:55.848 c.l.s.ParseBolt [INFO] Current thread: Thread-14-parseBolt01-executor[5 5]; cached rule: 户口本
2018-02-08 04:14:55.856 c.l.s.ParseBolt [INFO] Current thread: Thread-14-parseBolt01-executor[5 5]; cached rule: 身份证
2018-02-08 04:14:55.857 c.l.s.ParseBolt [INFO] Current thread: Thread-14-parseBolt01-executor[5 5]; received log message: 户口本; contains sensitive keyword: 户口本
Now a user has a new requirement: the word "haha" should also be treated as sensitive information for project01. All they need to do is the following (a minimal sketch of the backend call follows the list):
1. Log in to a management page.
2. Enter the project name and the sensitive keyword.
3. The backend saves the keyword to the DB for persistence, and calls the Redis publish API to notify subscribers of the new sensitive keyword.
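A minimal sketch of what that backend call might look like. The RulePublisher class and the t_rule_config column list in the insert are assumptions for illustration, not code from this project (ParseBolt only tells us that the table has a projectname column and that the rule text is read from it):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import redis.clients.jedis.Jedis;

public class RulePublisher {
    /**
     * Persist a new sensitive keyword, then notify all subscribed bolts.
     * Column names in the insert statement are illustrative assumptions.
     */
    public static void addRule(String projectName, String keyword) throws Exception {
        // 1. Persist to the DB so the rule survives restarts.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://192.168.80.132:3306/logmonitor?useUnicode=true&characterEncoding=UTF-8",
                "root", "123456");
             PreparedStatement ps = conn.prepareStatement(
                 "insert into t_rule_config(projectname, config) values(?, ?)")) {
            ps.setString(1, projectName);
            ps.setString(2, keyword);
            ps.executeUpdate();
        }
        // 2. Publish on the project's channel; every ParseBolt instance updates its cache.
        try (Jedis jedis = new Jedis("192.168.80.132", 6379)) {
            jedis.publish(projectName, keyword);
        }
    }

    public static void main(String[] args) throws Exception {
        addRule("project01", "haha");
    }
}

Equivalently, for a quick manual test you can publish straight from the redis-cli session started above: publish project01 haha.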
Storm's log shows:
2018-02-08 04:17:45.640 c.l.s.ParseBolt [INFO] Current thread: Thread-24; listening on channel: project01; received update message: haha
2018-02-08 04:17:45.641 c.l.s.ParseBolt [INFO] Current thread: Thread-22; listening on channel: project01; received update message: haha
2018-02-08 04:17:45.648 c.l.s.ParseBolt [INFO] Current thread: Thread-23; listening on channel: project01; received update message: haha
All three instances received the update, so "haha" is now also one of our sensitive keywords. Produce one more message from Kafka:
2018-02-08 04:18:27.489 c.l.s.ParseBolt [INFO] bolt receive message : haha
2018-02-08 04:18:27.490 c.l.s.ParseBolt [INFO] Current thread: Thread-6-parseBolt01-executor[3 3]; cached rule: haha
2018-02-08 04:18:27.490 c.l.s.ParseBolt [INFO] Current thread: Thread-6-parseBolt01-executor[3 3]; cached rule: 户口本
2018-02-08 04:18:27.490 c.l.s.ParseBolt [INFO] Current thread: Thread-6-parseBolt01-executor[3 3]; cached rule: 身份证
2018-02-08 04:18:27.494 c.l.s.ParseBolt [INFO] Current thread: Thread-6-parseBolt01-executor[3 3]; received log message: haha; contains sensitive keyword: haha
Check the database again:
OK. With that, Redis pub/sub has been integrated into Storm. If you have questions or want the source code, leave a comment.