In the previous post (Real-Time Big Data Processing with Storm, Part 1), I covered Storm's basic concepts and principles. In this post we start building our own application on top of the API Storm provides. Getting started with Storm application development is easy, thanks to the simple API its designers have carefully crafted for us.
1. Setting Up the Development Environment
In production, a Storm cluster runs on a distributed set of Linux machines. Fortunately, Storm provides a local mode (Local Mode) that makes developing Storm topologies convenient, and local mode also works on Windows, so setting up a local-mode Storm development environment is simple: on top of an existing Java development environment, just install and configure the Maven project-management tool in Eclipse. That is all it takes.
2. Creating the Project
Create a new Maven Project in Eclipse. The newly created project contains a pom.xml file. Developing a Storm application requires Storm's jars; you used to have to download them yourself and import them into the project, but Maven has done away with all of that: whenever you need a third-party jar, search the Maven Central repository and copy the dependency into pom.xml, and Maven manages the project's jar dependencies for you. Storm has many releases by now; search Maven Central and pick a stable release. I use Storm 0.9.3. With the Storm dependency added, pom.xml looks like this:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.yistory</groupId>
  <artifactId>WordCount</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <packaging>jar</packaging>

  <name>WordCount</name>
  <url>http://maven.apache.org</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>3.8.1</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.storm</groupId>
      <artifactId>storm-core</artifactId>
      <version>0.9.3</version>
    </dependency>
  </dependencies>
</project>
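With the dependency declared, a quick sanity check from the project root (assuming Maven is installed and on the PATH) confirms everything resolves; the first run downloads storm-core and its transitive dependencies:

mvn clean compile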
3. Writing the Storm Code
The project counts how many times each word appears in the generated sentences. The business logic is trivial; our focus is on how to develop a Storm application. Developing a Storm application is the process of building a topology, which is really constructing a directed acyclic graph (DAG). A DAG consists of nodes and directed edges; in Storm, the nodes are Spouts or Bolts, and the edges are the connections between a Spout and a Bolt, or between two Bolts.
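Concretely, the topology built in this post is a four-node DAG:

SentenceSpout ---> SplitSentenceBolt ---> WordCountBolt ---> ReportBolt
  (sentences)          (words)           (word, count)     (final report)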
Storm abstracts the data flowing through a Topology as a Stream, and the source of a stream is a Spout. In this example, the node that randomly generates sentences is the Spout, and a Spout takes concrete form as a class. There are two ways to write a Spout class in Storm. The first is to implement the IRichSpout interface: create a class, say WordEmitter, and declare that it implements the interface; Eclipse will then remind you to implement all of its methods, and choosing to add the implementations generates the following code.
package com.yistory.WordCount.spouts;
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.IRichSpout;
import backtype.storm.topology.OutputFieldsDeclarer;
public class WordEmitter implements IRichSpout{
public void ack(Object arg0) {
// TODO Auto-generated method stub
}
public void activate() {
// TODO Auto-generated method stub
}
public void close() {
// TODO Auto-generated method stub
}
public void deactivate() {
// TODO Auto-generated method stub
}
public void fail(Object arg0) {
// TODO Auto-generated method stub
}
public void nextTuple() {
// TODO Auto-generated method stub
}
public void open(Map arg0, TopologyContext arg1, SpoutOutputCollector arg2) {
// TODO Auto-generated method stub
}
public void declareOutputFields(OutputFieldsDeclarer arg0) {
// TODO Auto-generated method stub
}
public Map getComponentConfiguration() {
// TODO Auto-generated method stub
return null;
}
}
The second way is to extend the BaseRichSpout base class, in which case only the three essential methods need to be filled in:
package com.yistory.WordCount.spouts;
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
public class WordEmitter extends BaseRichSpout {
public void nextTuple() {
// TODO Auto-generated method stub
}
public void open(Map arg0, TopologyContext arg1, SpoutOutputCollector arg2) {
// TODO Auto-generated method stub
}
public void declareOutputFields(OutputFieldsDeclarer arg0) {
// TODO Auto-generated method stub
}
}
Both implementing the IRichSpout interface and extending the BaseRichSpout class let you implement a Spout's business logic. The difference is that implementing IRichSpout forces you to implement every method of the interface, while extending BaseRichSpout requires only the three essential Spout methods. I recommend the latter: the Spout class stays lean instead of bloated. In effect, Storm's designers applied the adapter design pattern here. The code below is the complete sentence-generating Spout class; I have added comments, so it should be easy to follow.
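To see why this works, here is a simplified sketch of the adapter idea. It mirrors what BaseRichSpout does but is not Storm's actual source: the adapter implements the full interface with no-op defaults, so subclasses override only the methods they care about.

import java.util.Map;
import backtype.storm.topology.IRichSpout;
// Sketch of the adapter pattern behind BaseRichSpout (illustrative, not Storm source).
public abstract class NoOpSpoutAdapter implements IRichSpout {
// Lifecycle and reliability callbacks default to doing nothing:
public void ack(Object msgId) {}
public void fail(Object msgId) {}
public void activate() {}
public void deactivate() {}
public void close() {}
public Map<String, Object> getComponentConfiguration() { return null; }
// open(), nextTuple() and declareOutputFields() are deliberately not
// implemented here, so every subclass must provide just those three methods.
}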
package com.yistory.WordCount.spouts;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;
public class SentenceSpout extends BaseRichSpout {
private static final long serialVersionUID = 4608825077450573093L;
private ConcurrentHashMap<UUID, Values> pending;
private SpoutOutputCollector collector;
private String[] sentences = {
"connecting the dots",
"love and loss",
"keep looking",
"do not settle",
"stay hungry",
"stay foolish"
};
private int index;
/**
 * In Storm this method acts like the Spout's constructor: it is called
 * when the component is initialized, so initialization work generally goes here
 */
public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
this.index = 0;
this.collector = collector;
this.pending = new ConcurrentHashMap<UUID, Values>();
}
/**
 * Declares the fields of the tuples this Spout emits
 */
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("sentence"));
}
/**
 * Storm calls this method over and over while the topology runs;
 * think of it as the body of an endless loop
 */
public void nextTuple() {
Values value = new Values(sentences[index]);
UUID msgId = UUID.randomUUID();
this.pending.put(msgId, value);
this.collector.emit(value,msgId);
index++;
if(index >= sentences.length){
index = 0;
}
// Sleep for 100 milliseconds between emissions
Utils.sleep(100);
}
/**
 * Called when a tuple has been fully processed downstream
 */
public void ack(Object msgId){
this.pending.remove(msgId);
}
/**
 * Called when a tuple was not fully processed; re-emit it from the pending map
 */
public void fail(Object msgId){
this.collector.emit(this.pending.get(msgId),msgId);
}
}
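One caveat about this at-least-once pattern: if downstream processing stalls and tuples are neither acked nor failed, the pending map grows without bound. Storm's standard remedy is to cap the number of in-flight tuples per spout task through the topology configuration. A minimal sketch (the helper class name is mine, and the cap of 1000 is illustrative, not a recommendation):

import backtype.storm.Config;
public class ReliabilityConfig {
// Builds a Config that limits how many emitted-but-unacked tuples each
// spout task may have outstanding; once the cap is reached, Storm stops
// calling nextTuple() until acks or fails drain the backlog.
public static Config withSpoutCap() {
Config config = new Config();
config.setMaxSpoutPending(1000); // illustrative value
return config;
}
}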
Next come the Bolts. As with Spouts, there are two ways to implement a Bolt in Storm: implement the IRichBolt interface, or extend the BaseRichBolt class. The trade-off between the two is exactly the same as for Spouts.
The first Bolt splits each sentence into words and passes them on to the downstream Bolt; the code follows.
package com.yistory.WordCount.bolts;
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
public class SplitSentenceBolt extends BaseRichBolt {
private static final long serialVersionUID = 2390867112177953110L;
private OutputCollector collector;
/**
 * In Storm this method acts like the Bolt's constructor: it is called
 * when the component is initialized, so initialization work generally goes here
 */
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
}
/**
 * Declares the fields of the tuples this Bolt emits
 */
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
/**
 * Called once for every tuple that arrives on a stream this Bolt subscribes to
 */
public void execute(Tuple tuple) {
String sentence = tuple.getStringByField("sentence");
String[] words = sentence.split(" ");
for(String word:words){
word = word.trim();
// Anchor the emitted tuple to the input tuple
this.collector.emit(tuple,new Values(word));
}
// Tell Storm (and ultimately the Spout) that this tuple has been fully processed
this.collector.ack(tuple);
}
}
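As an aside, manually anchoring and acking every tuple is boilerplate that Storm can take off your hands: the framework also provides BaseBasicBolt, which anchors emitted tuples and acks the input automatically. A sketch of the same split logic on top of it (the class name is mine; BaseBasicBolt itself is standard Storm API):

package com.yistory.WordCount.bolts;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
public class SplitSentenceBasicBolt extends BaseBasicBolt {
public void execute(Tuple tuple, BasicOutputCollector collector) {
for (String word : tuple.getStringByField("sentence").split(" ")) {
// Anchoring and acking happen behind the scenes
collector.emit(new Values(word.trim()));
}
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word"));
}
}

The next Bolt counts word occurrences. Its per-task HashMap is only correct because the topology (shown later) uses fieldsGrouping on "word", which guarantees that every occurrence of a given word reaches the same WordCountBolt task.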
package com.yistory.WordCount.bolts;
import java.util.HashMap;
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
public class WordCountBolt extends BaseRichBolt {
private static final long serialVersionUID = 360868701353402042L;
private OutputCollector collector;
private HashMap<String, Integer> counters;
public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
this.collector = collector;
counters = new HashMap<String, Integer>();
}
public void declareOutputFields(OutputFieldsDeclarer declarer) {
declarer.declare(new Fields("word","count"));
}
public void execute(Tuple tuple) {
String word = tuple.getStringByField("word");
Integer count = counters.get(word);
if(null == count){
count = 0;
}
count++;
this.counters.put(word, count);
// Anchor the emitted tuple to the input tuple
this.collector.emit(tuple,new Values(word,count));
// Acknowledge the input tuple to the upstream Bolt
this.collector.ack(tuple);
}
}
The last Bolt prints the word counts once the topology finishes running. (This is done purely for demonstration; in a production environment Storm keeps running until you actively stop it, and cleanup() is only reliably invoked in local mode.)
package com.yistory.WordCount.bolts;
import java.util.HashMap;
import java.util.Map;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;
public class ReportBolt extends BaseRichBolt {
private static final long serialVersionUID = -1884042962508663765L;
private HashMap<String, Integer> counts;
public void prepare(Map conf, TopologyContext context, OutputCollector arg2) {
this.counts = new HashMap<String, Integer>();
}
/**
 * This Bolt emits nothing downstream
 */
public void declareOutputFields(OutputFieldsDeclarer arg0) {
}
public void execute(Tuple tuple) {
String word = tuple.getStringByField("word");
Integer count = tuple.getIntegerByField("count");
this.counts.put(word, count);
}
public void cleanup(){
System.out.println("******count result******");
for (Map.Entry<String, Integer> entry : counts.entrySet()) {
System.out.println(entry.getKey() + ": " + entry.getValue());
}
}
}
package com.yistory.WordCount;
import com.yistory.WordCount.bolts.ReportBolt;
import com.yistory.WordCount.bolts.SplitSentenceBolt;
import com.yistory.WordCount.bolts.WordCountBolt;
import com.yistory.WordCount.spouts.SentenceSpout;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
import backtype.storm.utils.Utils;
public class WordCountTopology {
private static final String SENTENCE_SPOUT_ID = "sentence-spout";
private static final String SPLIT_BOLT_ID = "split-bolt";
private static final String COUNT_BOLT_ID = "count-bolt";
private static final String REPORT_BOLT_ID = "report-bolt";
private static final String TOPOLOGY_NAME = "word-count-topology";
public static void main(String[] args){
SentenceSpout spout = new SentenceSpout();
SplitSentenceBolt splitBolt = new SplitSentenceBolt();
WordCountBolt countBolt = new WordCountBolt();
ReportBolt reportBolt = new ReportBolt();
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout(SENTENCE_SPOUT_ID, spout);
// SentenceSpout ---> SplitSentenceBolt
builder.setBolt(SPLIT_BOLT_ID, splitBolt)
.shuffleGrouping(SENTENCE_SPOUT_ID);
// SplitSentenceBolt ---> WordCountBolt
builder.setBolt(COUNT_BOLT_ID, countBolt)
.fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));
// WordCountBolt ---> ReportBolt
builder.setBolt(REPORT_BOLT_ID, reportBolt)
.globalGrouping(COUNT_BOLT_ID);
Config config = new Config();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology(TOPOLOGY_NAME, config,
builder.createTopology());
// Sleep for 10 seconds to give the topology time to run
Utils.sleep(10000);
cluster.killTopology(TOPOLOGY_NAME);
cluster.shutdown();
}
}
Running a Storm topology in local mode is simple: on the project's entry point (the class containing the main method), choose Run As > Java Application.
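On a real cluster you would not use LocalCluster; instead you package the project as a jar, submit it with the storm jar command, and have the main method call StormSubmitter. A minimal sketch, assuming the topology is wired exactly as above (the class name and worker count are illustrative):

package com.yistory.WordCount;
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
public class ClusterSubmit {
public static void main(String[] args) throws Exception {
TopologyBuilder builder = new TopologyBuilder();
// ... set the spout and bolts exactly as in WordCountTopology ...
Config config = new Config();
config.setNumWorkers(2); // illustrative worker count
StormSubmitter.submitTopology("word-count-topology", config,
builder.createTopology());
}
}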
With all of the code above in place, the overall project structure looks like this (laid out per the package declarations):
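WordCount/
├── pom.xml
└── src/main/java/
    └── com/yistory/WordCount/
        ├── WordCountTopology.java
        ├── spouts/
        │   └── SentenceSpout.java
        └── bolts/
            ├── SplitSentenceBolt.java
            ├── WordCountBolt.java
            └── ReportBolt.java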
When the local run ends (killTopology triggers ReportBolt's cleanup()), the word-count report is printed to the console.
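The exact numbers vary from run to run (the spout cycles through six sentences every 100 ms for about 10 seconds), so the figures below are illustrative only:

******count result******
and: 17
connecting: 17
do: 16
dots: 17
foolish: 16
hungry: 16
keep: 17
looking: 17
loss: 17
love: 17
not: 16
settle: 16
stay: 32
the: 17

"stay" tallies roughly twice the others because it appears in two of the six source sentences.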
With that, a first Storm project is complete; development on Storm generally follows this same pattern. The ordering of topics in this post may not be ideal, so it may help to skim the whole thing once and then study each part in detail.