Kafka can be used for basic log aggregation and is a general-purpose
producer-consumer messaging system with high throughput.
I feel Flume focuses more on aggregation whereas Kafka focuses on super-fast
messaging. Even though both can be sources for Storm, if you want to achieve
log aggregation and log data processing together, Flume might be the better
fit; if you want extremely low-latency processing of log data, Kafka would be
the winner.
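A minimal sketch of wiring Kafka into a topology with the storm-kafka contrib spout, assuming a local ZooKeeper at localhost:2181 and a hypothetical "logs" topic; check the SpoutConfig constructor against the storm-kafka version you actually use:

```java
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.ZkHosts;

public class KafkaLogTopology {
    public static void main(String[] args) throws Exception {
        // ZooKeeper ensemble used by the Kafka brokers (assumed address).
        ZkHosts zkHosts = new ZkHosts("localhost:2181");
        // Read the hypothetical "logs" topic; the last two arguments are the
        // ZK root and consumer id under which the spout stores its offsets.
        SpoutConfig spoutConfig = new SpoutConfig(zkHosts, "logs", "/kafka-log-spout", "log-reader");

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-log-spout", new KafkaSpout(spoutConfig), 2);
        // Downstream bolts (parsing, aggregation, ...) would be attached here.

        new LocalCluster().submitTopology("kafka-log-demo", new Config(), builder.createTopology());
    }
}
```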
Min, so it sounds like Kafka is more suitable for the real-time solution
provided by Storm. Aggregation calculations can already be done inside a
Storm bolt with a real ESP engine, for example Drools Fusion or Esper. I am
building a solution with Drools Fusion.
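Not Drools Fusion or Esper themselves, but as a rough illustration of doing the aggregation inside a bolt, here is a plain Storm bolt that keeps a running count per log level; the "level" field name is an assumption:

```java
import java.util.HashMap;
import java.util.Map;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Keeps a running count per log level and emits the updated count downstream.
public class LogLevelCountBolt extends BaseBasicBolt {
    private Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String level = tuple.getStringByField("level"); // assumed field name
        Long count = counts.get(level);
        count = (count == null) ? 1L : count + 1L;
        counts.put(level, count);
        collector.emit(new Values(level, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("level", "count"));
    }
}
```

A real ESP engine would replace the HashMap with windowed rules or EPL statements, but the bolt boundary stays the same.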
Sam, I'm starting with an AMQP solution for now.
Thank you for the reply. I thought the problem was in the parameters of the
exchange, but it turns out it was in the parameters of the queue. Solved.
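For reference, a small sketch of the kind of declaration that has to match on both sides with the RabbitMQ Java client: re-declaring an existing queue with different durable/exclusive/auto-delete flags fails, so producer and consumer must use identical parameters. Queue and exchange names here are placeholders:

```java
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

public class QueueDeclareExample {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost"); // assumed broker address
        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Both sides must use the same flags: durable=true, exclusive=false,
        // autoDelete=false; a mismatch raises a channel error on re-declare.
        channel.queueDeclare("storm-logs", true, false, false, null);
        channel.exchangeDeclare("logs-exchange", "direct", true);
        channel.queueBind("storm-logs", "logs-exchange", "log");

        channel.close();
        connection.close();
    }
}
```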
Several strategies for collecting spout data sources:
Another option would be using Flume.
Some tools:
* Flume (we use this in our performance monitoring and analytics
services)
* Scribe
* Kafka
* Fluentd
* ...
Advice from the Storm author:
How you get log data into Storm depends on what you want your message processing guarantees to be. If you don't care about dropping messages, then you could do logs -> Scribe -> ScribeSpout. Scribe would have to be able to discover where the ScribeSpout tasks are (since they could be started on any machine), which could be accomplished using Zookeeper. I believe Scribe has this discovery functionality built in already.
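To make the discovery idea concrete, here is a rough sketch (not the author's code) of a spout registering its host and port as an ephemeral ZooKeeper node from open(), using Apache Curator; the connect string and the /scribe-spouts path are assumptions:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

public class SpoutRegistration {
    public static void register(String host, int port) throws Exception {
        // Connect string and znode path are assumptions for this sketch.
        CuratorFramework zk = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181", new ExponentialBackoffRetry(1000, 3));
        zk.start();

        // An ephemeral node disappears if the spout task dies, so Scribe only
        // ever sees live ScribeSpout endpoints when it lists /scribe-spouts.
        String payload = host + ":" + port;
        zk.create()
          .creatingParentsIfNeeded()
          .withMode(CreateMode.EPHEMERAL)
          .forPath("/scribe-spouts/" + host + "-" + port, payload.getBytes());
    }
}
```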
Others:
I have a very similar requirement where I have to pipe data from many log files to a bunch of aggregators so that they can do the aggregation. I was thinking of setting up the log file readers as spouts and the aggregators as bolts.
So now I have to run the log file reader's main and in that emit the log file lines.
A few questions
1. Do you recommend doing this?
2. How can I have a main from which I can emit tuples? I can't package this class as a regular spout because it HAS to run on the right host, and Storm does not give you that control as far as I know.
In 0.8.0 you can write a custom class to dispatch your log tailing spout to the right boxes.
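A hedged sketch of that idea using the pluggable scheduler added in 0.8.0: an IScheduler that assigns the executors of a hypothetical "log-spout" component to the supervisor running on the box that holds the log files, and lets the even scheduler place the rest. Topology name, component id, and hostname are placeholders, and the Cluster/Topologies calls should be checked against your Storm version:

```java
import java.util.List;
import java.util.Map;

import backtype.storm.scheduler.Cluster;
import backtype.storm.scheduler.EvenScheduler;
import backtype.storm.scheduler.ExecutorDetails;
import backtype.storm.scheduler.IScheduler;
import backtype.storm.scheduler.SupervisorDetails;
import backtype.storm.scheduler.Topologies;
import backtype.storm.scheduler.TopologyDetails;
import backtype.storm.scheduler.WorkerSlot;

// Pins the executors of the "log-spout" component onto the supervisor that
// runs on the box holding the log files; everything else is scheduled evenly.
public class LogHostScheduler implements IScheduler {

    public void prepare(Map conf) {
    }

    public void schedule(Topologies topologies, Cluster cluster) {
        TopologyDetails topology = topologies.getByName("log-topology");
        if (topology != null && cluster.needsScheduling(topology)) {
            Map<String, List<ExecutorDetails>> pending =
                    cluster.getNeedsSchedulingComponentToExecutors(topology);
            List<ExecutorDetails> spoutExecutors = pending.get("log-spout");
            if (spoutExecutors != null) {
                for (SupervisorDetails supervisor : cluster.getSupervisors().values()) {
                    if ("loghost01".equals(supervisor.getHost())) {
                        List<WorkerSlot> slots = cluster.getAvailableSlots(supervisor);
                        if (!slots.isEmpty()) {
                            cluster.assign(slots.get(0), topology.getId(), spoutExecutors);
                        }
                    }
                }
            }
        }
        // Let the default scheduler place whatever is still unassigned.
        new EvenScheduler().schedule(topologies, cluster);
    }
}
```

The scheduler class is configured on Nimbus (storm.scheduler in storm.yaml), not in the topology itself.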
Hi =) My options/answers are inline (mind you, I'm learning Storm right now), but I hope I can help some.
I looked at this post and my question is related to it https://groups.google.com/forum/#!msg/storm-user/Zvy6RrT8RHo/OVB03TlT2HgJ
I am looking to tail log files from many servers into Storm and then carry out staged processing on the messages from these log files. So what I would like is to have a spout on each of the servers on which the log files exist that will emit the log lines, and then have the bolts that consume them running on different machines within the same data center.
The questions I have are:
1. Is this not recommended? From the post mentioned above it seems to me that it is not. If so, why?
2. If we are to do this, I need to be able to write a spout which has a main() so that I can read the log and emit the tuples. I can't package these in the topology since I need to control exactly where they should run (where the log files are).
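As a rough sketch of question 2 (not an official recommendation): instead of a standalone main(), the tailer can be an ordinary spout that reads new lines from nextTuple(), with the file path passed through the topology config (the "logfile.path" key is an assumption). Getting it onto the right host still needs something like the custom scheduler mentioned above:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

// Emits one tuple per new line appended to a local log file.
public class LogTailSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private BufferedReader reader;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        try {
            // Assumed config key carrying the path of the log file to tail.
            reader = new BufferedReader(new FileReader((String) conf.get("logfile.path")));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void nextTuple() {
        try {
            String line = reader.readLine();
            if (line != null) {
                collector.emit(new Values(line));
            } else {
                Utils.sleep(100); // nothing new yet; back off briefly
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("line"));
    }
}
```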