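Information Flow Processing: From Data Streams to Complex Event Processing
Notes on reading "Processing Flows of Information: From Data Stream to Complex Event Processing"
I came across this paper by chance. It gives a fairly comprehensive survey of current data stream management systems (DSMSs) and complex event processing (CEP) systems, and compares the characteristics of the representative products in each category.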
An increasing number of distributed applications requires processing of continuously flowing data from geographically distributed sources with unpredictable rate to obtain timely responses to complex queries. After several years of research and development we can say that two models emerged and are today competing: the data stream processing model [Babcock et al. 2002] and the complex event processing model [Luckham 2001].
DSMSs are specialized in dealing with transient data that is continuously updated. On the other side, the complex event processing model sees flowing information items as notifications of events happened in the external world, which have to be filtered and combined to deduce what is happening in terms of higher-level events.
With the term Information Flow Processing (IFP) we refer to an application domain in which users need to collect information produced by multiple, distributed sources, to process it in a timely way, in order to extract new knowledge as soon as the relevant information is collected.
As we mentioned, IFP has attracted the attention of researchers coming from different fields. The first contributions came from the database community in the form of active database systems, which were introduced to allow actions to automatically execute when given conditions arise. Data Stream Management Systems (DSMSs) pushed this idea further, to perform query processing in the presence of continuous data streams. In the same years that saw the development of DSMSs, researchers with different backgrounds identified the need of developing systems capable of processing not generic data but event notifications, coming from different sources, to identify interesting situations [Luckham 2001]. These systems are usually known as Complex Event Processing (CEP) Systems.
Active Database Systems. Traditional DBMSs follow a Human-Active, Database-Passive (HADP) model; active database systems overcome this limitation.
Data Stream Management Systems. The active database systems above are still tied to static, stored data; DSMSs remove this restriction. Users do not have to explicitly ask for updated information; rather, the system actively notifies them according to installed queries. This form of interaction is also called Database-Active, Human-Passive (DAHP).
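The DAHP interaction can be illustrated with a minimal sketch: a query is installed once, and the engine notifies the user whenever a newly arrived item matches it. The names `ContinuousQuery`, `TinyStreamEngine`, and the temperature readings are hypothetical and chosen only for illustration; they do not come from the paper or from any real DSMS.

```python
from typing import Callable, Dict, List

Item = Dict[str, float]


class ContinuousQuery:
    """A standing query: a predicate plus a callback invoked on every match."""

    def __init__(self, predicate: Callable[[Item], bool],
                 on_match: Callable[[Item], None]) -> None:
        self.predicate = predicate
        self.on_match = on_match


class TinyStreamEngine:
    """Hypothetical toy engine: queries are installed once, data flows through."""

    def __init__(self) -> None:
        self.queries: List[ContinuousQuery] = []

    def install(self, query: ContinuousQuery) -> None:
        self.queries.append(query)

    def push(self, item: Item) -> None:
        # The engine, not the user, takes the initiative (Database-Active).
        for query in self.queries:
            if query.predicate(item):
                query.on_match(item)


alerts: List[Item] = []
engine = TinyStreamEngine()
engine.install(ContinuousQuery(lambda r: r["temp"] > 30.0, alerts.append))

for reading in [{"temp": 21.5}, {"temp": 31.2}, {"temp": 35.0}]:
    engine.push(reading)

print(alerts)  # the two readings above 30.0
```

The user never re-issues the query; the only user-side action is installing it, which is the essence of the DAHP model as described above.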
Complex Event Processing Systems. The DSMSs above leave the semantics of the data being processed to the client application. CEP systems instead attach a precise meaning to the items they process: they are notifications of events that happened in the external world and were observed by sources. The CEP engine is responsible for filtering and combining such notifications to deduce what is happening in terms of higher-level events (sometimes also called composite events or situations) to be notified to sinks, which act as event consumers.
DSMSs and CEP engines. The former focus mainly on flowing data and data transformations. CEP engines, whether developed as extensions of publish-subscribe middleware or as entirely new systems, focus on processing event notifications together with their ordering relationships in order to capture complex event patterns, and on the communication aspects involved in event processing.
IFP therefore needs to combine the strengths of DSMSs and CEP: effective data processing, including the ability to capture complex ordering relationships among data, as well as efficient event delivery, including the ability to process data in a strongly distributed fashion.
The IFP functional model:
In summary, an IFP engine operates as follows: each time a new item (including those periodically produced by the Clock) enters the engine through the Receiver, a detection-production cycle is performed. Such a cycle first (detection phase) evaluates all the rules currently present in the Rules store to find those whose condition part is true. At the end of this phase we have a set of rules that have to be executed. The Producer takes this information and executes each triggered rule (production phase).
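The detection-production cycle can be sketched as follows. This is a deliberately simplified illustration, not the paper's full functional model: it assumes rule conditions look only at the current item (no history of past items), and the names `Rule`, `IFPEngine`, and `receive` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Rule:
    name: str
    condition: Callable[[Any], bool]  # evaluated in the detection phase
    action: Callable[[Any], Any]      # executed in the production phase


class IFPEngine:
    """Toy engine: one detection-production cycle per incoming item."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = list(rules)                  # plays the role of the Rules store
        self.output: List[Tuple[str, Any]] = []   # what would be sent to sinks

    def receive(self, item: Any) -> None:
        # Detection phase: find every rule whose condition part is true.
        triggered = [rule for rule in self.rules if rule.condition(item)]
        # Production phase: execute each triggered rule.
        for rule in triggered:
            self.output.append((rule.name, rule.action(item)))


engine = IFPEngine([
    Rule("high_temp", lambda x: x["temp"] > 30, lambda x: f"alert: {x['temp']}"),
    Rule("log_all", lambda x: True, lambda x: x),
])
engine.receive({"temp": 35})
engine.receive({"temp": 20})
print(engine.output)
```

Separating the two phases means all conditions are evaluated against the same state before any action runs, so the effect of one rule's action cannot change which other rules fire within the same cycle.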
Processing model: selection policy; consumption policy (zero consumption policy, selected consumption policy).
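A small sketch may help show how these two policies change the output of the same rule. It assumes a hypothetical rule "an A item followed by a B item". The selection policy decides which stored A items a new B is combined with; the consumption policy decides whether the A items that were used remain available for later matches. The option names `"multiple"`, `"first"`, and `"last"` are labels chosen for this illustration.

```python
from typing import Iterable, List, Tuple


def detect_a_then_b(stream: Iterable[Tuple[str, str]],
                    selection: str = "multiple",
                    consumption: str = "zero") -> List[Tuple[str, str]]:
    """Return (a_id, b_id) pairs for the hypothetical rule 'A followed by B'."""
    if selection not in ("multiple", "first", "last"):
        raise ValueError(f"unknown selection policy: {selection}")
    if consumption not in ("zero", "selected"):
        raise ValueError(f"unknown consumption policy: {consumption}")

    pending_as: List[str] = []
    matches: List[Tuple[str, str]] = []
    for kind, ident in stream:
        if kind == "A":
            pending_as.append(ident)
        elif kind == "B" and pending_as:
            # Selection policy: which stored A items take part in the match.
            if selection == "multiple":
                chosen = list(pending_as)
            elif selection == "first":
                chosen = [pending_as[0]]
            else:  # "last"
                chosen = [pending_as[-1]]
            matches.extend((a, ident) for a in chosen)
            # Consumption policy: 'selected' invalidates the items just used,
            # 'zero' keeps them available for future matches.
            if consumption == "selected":
                pending_as = [a for a in pending_as if a not in chosen]
    return matches


stream = [("A", "a1"), ("A", "a2"), ("B", "b1"), ("B", "b2")]
print(detect_a_then_b(stream, "multiple", "zero"))
# [('a1', 'b1'), ('a2', 'b1'), ('a1', 'b2'), ('a2', 'b2')]
print(detect_a_then_b(stream, "multiple", "selected"))
# [('a1', 'b1'), ('a2', 'b1')]
print(detect_a_then_b(stream, "first", "selected"))
# [('a1', 'b1'), ('a2', 'b2')]
```

With zero consumption every A stays valid forever, so the number of matches grows with the stream; selected consumption bounds that growth by retiring items once they have contributed to a match.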
Deployment model: centralized vs. distributed; a distributed deployment is either clustered or networked.
Clustered and networked engines focus on different aspects: the former on increasing the available processing power by sharing the workload among a set of well-connected machines, the latter on minimizing bandwidth usage by processing information as close as possible to the sources.
Interaction model: push/pull.
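The difference between the two interaction styles can be sketched from the engine's point of view: in push mode the source initiates the transfer by invoking a callback the engine registered, while in pull mode the engine initiates it by asking the source for the next item. `PushSource` and `PullSource` are hypothetical classes used only for this illustration.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


class PushSource:
    """The source decides when data moves: it calls its subscribers."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[str], None]] = []

    def subscribe(self, callback: Callable[[str], None]) -> None:
        self._subscribers.append(callback)

    def emit(self, item: str) -> None:
        for callback in self._subscribers:
            callback(item)


class PullSource:
    """The consumer decides when data moves: it polls the source."""

    def __init__(self, items: List[str]) -> None:
        self._buffer: Deque[str] = deque(items)

    def poll(self) -> Optional[str]:
        return self._buffer.popleft() if self._buffer else None


received: List[str] = []

push = PushSource()
push.subscribe(received.append)
push.emit("e1")                      # source-initiated transfer

pull = PullSource(["e2", "e3"])
while (item := pull.poll()) is not None:
    received.append(item)            # engine-initiated transfer

print(received)  # ['e1', 'e2', 'e3']
```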
Data Model: tuples, records; homogeneous information flows vs. heterogeneous flows.
Time Model: stream-only, absolute, causal, interval.
Rule Model: transforming rules and detecting rules.
Language Type: transforming languages and detecting (or pattern-based) languages.
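The contrast between the two rule styles can be sketched roughly as follows. A transforming rule takes an input flow and produces a new output flow (here a sliding-window average); a detecting rule separates a condition over a pattern of items from the action taken when the pattern is found (here two consecutive readings above a threshold). Both functions and the event name `sustained_high` are hypothetical examples, not taken from the paper.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, List, Optional, Tuple


def windowed_average(stream: Iterable[float], size: int) -> Iterator[float]:
    """Transforming rule: input flow -> output flow of sliding-window averages."""
    window: Deque[float] = deque(maxlen=size)
    for value in stream:
        window.append(value)
        if len(window) == size:
            yield sum(window) / size


def detect_sustained_high(stream: Iterable[float],
                          threshold: float) -> List[Tuple[str, float, float]]:
    """Detecting rule: condition = two consecutive readings above threshold;
    action = emit a composite event carrying both readings."""
    events: List[Tuple[str, float, float]] = []
    previous: Optional[float] = None
    for value in stream:
        if previous is not None and previous > threshold and value > threshold:
            events.append(("sustained_high", previous, value))
        previous = value
    return events


readings = [10.0, 20.0, 30.0, 40.0]
averages = list(windowed_average(readings, 2))
events = detect_sustained_high(readings, 25.0)
print(averages)  # [15.0, 25.0, 35.0]
print(events)    # [('sustained_high', 30.0, 40.0)]
```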