http://highlyscalable.wordpress.com/2013/08/20/in-stream-big-data-processing/
In recent years, this idea got a lot of traction and a whole bunch of solutions like Twitter’s Storm, Yahoo’s S4, Cloudera’s Impala, Apache Spark, and Apache Tez appeared and joined the army of Big Data and NoSQL systems. This article is an effort to explore techniques used by developers of in-stream data processing systems, trace the connections of these techniques to massive batch processing and OLTP/OLAP databases, and discuss how one unified query engine can support in-stream, batch, and OLAP processing at the same time.
The purpose of this article: describe the techniques used in various stream processing systems, analyze how these techniques relate to traditional batch and OLTP/OLAP technologies, and finally discuss the feasibility of a unified query engine that covers in-stream, batch, and OLAP processing.
At Grid Dynamics, we recently faced a necessity to build an in-stream data processing system that aimed to crunch about 8 billion events daily providing fault-tolerance and strict transactionality i.e. none of these events can be lost or duplicated.
We needed an in-stream data processing system with:
speed: about 8 billion events processed daily
fault tolerance
data integrity: no event lost or duplicated
A high-level overview of the environment we worked with is shown in the figure below:
One can see that this environment is a typical Big Data installation: there is a set of applications that produce the raw data in multiple datacenters, the data is shipped by means of Data Collection subsystem to HDFS located in the central facility, then the raw data is aggregated and analyzed using the standard Hadoop stack (MapReduce, Pig, Hive) and the aggregated results are stored in HDFS and NoSQL, imported to the OLAP database and accessed by custom user applications. Our goal was to equip all facilities with a new in-stream engine (shown in the bottom of the figure) that processes most intensive data flows and ships the pre-aggregated data to the central facility, thus decreasing the amount of raw data and heavy batch jobs in Hadoop.
This is essentially a typical Lambda Architecture, which seems to be the author's overall vision for the unified query engine.
The key idea of the Lambda Architecture is to add an in-stream engine on top of the existing batch pipeline in order to reduce end-to-end latency.
Nathan Marz proposed this architecture combining real-time and batch big data processing, and expresses similar ideas in several of his articles.
If a latency of a few hours is acceptable, the Batch layer + Serving layer is already close to a perfect solution: simple, fault-tolerant, and scalable...
Many scenarios cannot tolerate a few hours of latency, though, so a Speed layer is added that uses incremental updates to process recent data in real time.
This layer is comparatively complex and error-prone, because real-time solutions are usually memory-based. The strength of the architecture is that if anything goes wrong in the Speed layer, it can simply be discarded: the Batch layer will eventually produce correct results a few hours later, and all you lose is some temporarily missing or stale data.
The concrete requirements for the in-stream engine were as follows:
The design of the in-stream processing engine itself was driven by the following requirements:
It is clear that distributed in-stream data processing has something to do with query processing in distributed relational databases. Many standard query processing techniques can be employed by an in-stream processing engine, so it is extremely useful to understand classical algorithms of distributed query processing and see how it all relates to in-stream processing and other popular paradigms like MapReduce. Distributed query processing is a very large area of knowledge that has been under development for decades, so we start with a brief overview of the main techniques just to provide a context for further discussion.
Because in-stream data processing is closely related to query processing,
it is worth understanding the classical algorithms of distributed query processing and how they relate to in-stream processing and other popular paradigms like MapReduce.
So this part covers techniques shared by all of these systems.
Distributed and parallel query processing heavily relies on data partitioning to break down a large data set into multiple pieces that can be processed by independent processors. Query processing could consist of multiple steps and each step could require its own partitioning strategy, so data shuffling is an operation frequently performed by distributed databases.
A distributed query involves two operations:
first, the data is broken down and spread across the nodes, which is data partitioning;
then, during query execution, the data often has to be re-distributed and brought back together by key, which is data shuffling.
Distributed Joins
Distributed joins are a typical example through which to look at partitioning and shuffling.
There are two main data partitioning techniques that can be employed by distributed join processing:
- Disjoint data partitioning
Both data sets are re-shuffled by the join key, so data from both sets has to be moved.
The disjoint data partitioning technique shuffles the data into several partitions in such a way that join keys in different partitions do not overlap. Each processor performs the join operation on each of these partitions and the final result is obtained as a simple concatenation of the results obtained from different processors. Consider an example where relation R is joined with relation S on a numerical key k and a simple modulo-based hash function is used to produce the partitions (it is assumed that the data is initially distributed among the processors based on some other policy):
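To make the idea concrete, here is a minimal Python sketch of disjoint partitioning with a modulo-based hash, assuming toy relations R(k, value) and S(k, value) and three simulated processors (all names and data are illustrative, not part of the original example):

```python
# Minimal sketch of a disjoint-partitioned join: both relations are
# re-shuffled by hash(join key) so that matching keys land on the same
# "processor", and each partition is joined independently.

def partition(relation, num_partitions, key_fn):
    """Shuffle tuples into disjoint partitions by a modulo-based hash."""
    partitions = [[] for _ in range(num_partitions)]
    for tup in relation:
        partitions[key_fn(tup) % num_partitions].append(tup)
    return partitions

def local_join(r_part, s_part):
    """Join one pair of co-located partitions on the numeric key k."""
    s_index = {}
    for k, s_val in s_part:
        s_index.setdefault(k, []).append(s_val)
    return [(k, r_val, s_val)
            for k, r_val in r_part
            for s_val in s_index.get(k, [])]

# Toy relations R(k, value) and S(k, value).
R = [(1, "r1"), (2, "r2"), (3, "r3"), (4, "r4")]
S = [(2, "s2"), (3, "s3"), (3, "s3b"), (5, "s5")]

num_processors = 3
R_parts = partition(R, num_processors, key_fn=lambda t: t[0])
S_parts = partition(S, num_processors, key_fn=lambda t: t[0])

# The final result is a simple concatenation of the per-partition joins.
result = [row for r_p, s_p in zip(R_parts, S_parts)
              for row in local_join(r_p, s_p)]
print(result)   # [(3, 'r3', 's3'), (3, 'r3', 's3b'), (2, 'r2', 's2')]
```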
- Divide and broadcast join
The smaller data set is broadcast so that each processor holds a complete copy of it; this only works if that data set is indeed small enough.
In stream processing this corresponds to joining static data (an admixture) to a data stream.
The divide and broadcast join algorithm is illustrated in the figure below. This method divides the first data set into multiple disjoint partitions (R1, R2, and R3 in the figure) and replicates the second data set to all processors. In a distributed database, division typically is not a part of the query processing itself because data sets are initially distributed among multiple nodes.
This strategy is applicable for joining of a large relation with a small relation or two small relations. In-stream data processing systems can employ this technique for stream enrichment i.e. joining a static data (admixture) to a data stream.
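A similarly hedged sketch of the divide and broadcast variant, where a hypothetical small relation S is replicated to every partition of R and each partition is joined locally:

```python
# Divide and broadcast: the large relation R stays partitioned as-is,
# the small relation S is replicated (broadcast) to every processor,
# and each processor joins its R partition against the full copy of S.

def broadcast_join(r_partitions, s_small):
    s_index = {}
    for k, s_val in s_small:              # one lookup table for all of S
        s_index.setdefault(k, []).append(s_val)
    result = []
    for r_part in r_partitions:           # each "processor" works locally
        for k, r_val in r_part:
            for s_val in s_index.get(k, []):
                result.append((k, r_val, s_val))
    return result

R_partitions = [[(1, "r1"), (2, "r2")], [(3, "r3")], [(2, "r2b"), (4, "r4")]]
S_small = [(2, "s2"), (3, "s3")]          # small enough to replicate
print(broadcast_join(R_partitions, S_small))
# [(2, 'r2', 's2'), (3, 'r3', 's3'), (2, 'r2b', 's2')]
```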
GroupBy
A process analogous to map/reduce.
Processing of GroupBy queries also relies on shuffling and is fundamentally similar to the MapReduce paradigm in its pure form.
Consider an example where the data is grouped by a string key and sum of the numerical values is computed in each group:
In this example, computation consists of two steps: local aggregation and global aggregation. These steps basically correspond to Map and Reduce operations. Local aggregation is optional and raw records can be emitted, shuffled, and aggregated on a global aggregation phase.
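A small sketch of this two-step GroupBy, with per-processor local aggregation followed by a global aggregation over the shuffled partial sums (the data and helper names are made up for illustration):

```python
# Sketch of a distributed GroupBy-sum: each "processor" pre-aggregates its
# local records (the optional Map-side step), the partial sums are shuffled
# by key, and a global aggregation produces the final result.
from collections import defaultdict

def local_aggregate(records):
    partial = defaultdict(int)
    for key, value in records:
        partial[key] += value
    return partial.items()

def global_aggregate(partials):
    totals = defaultdict(int)
    for key, value in partials:
        totals[key] += value
    return dict(totals)

# Records (string key, numeric value) spread over three processors.
processors = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("c", 5)],
    [("a", 6), ("c", 7)],
]

shuffled = [kv for p in processors for kv in local_aggregate(p)]
print(global_aggregate(shuffled))   # {'a': 10, 'b': 6, 'c': 12}
```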
In the previous section, we noted that many distributed query processing algorithms resemble message passing networks.
However, this is not enough to organize efficient in-stream processing: all operators in a query should be chained in such a way that the data flows smoothly through the entire pipeline, i.e. no operator should block processing by waiting for a large piece of input data without producing any output or by writing intermediate results to disk. Some operations like sorting are inherently incompatible with this concept (obviously, a sorting block cannot produce any output until the entire input is ingested), but in many cases pipelining algorithms are applicable.
A typical example of pipelining is shown below:
This part discusses processing algorithms that work in a purely pipelined, streaming fashion.
a. Joining a stream with static data: Simple Hash Join
In the example below, stream R1 is joined with the static tables S1, S2, and S3.
The idea is simple: load all static tables into hash tables and let stream R1 flow through the pipeline.
In this example, the hash join algorithm is employed to join four relations: R1, S1, S2, and S3 using 3 processors.
The idea is to build hash tables for S1, S2 and S3 in parallel and then stream R1 tuples one by one through the pipeline that joins them with S1, S2 and S3 by looking up matches in the hash tables. In-stream processing naturally employs this technique to join a data stream with the static data (admixtures).
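A rough Python sketch of this pipelined hash join, with in-memory dicts standing in for the per-stage hash tables of S1, S2, and S3 (the toy data is illustrative, not the original example):

```python
# Pipelined hash join: hash tables for the static relations S1..S3 are
# built up front (possibly in parallel), then R1 tuples stream through
# the pipeline one by one, enriched at every stage by a hash lookup.

def build_hash_table(static_relation):
    table = {}
    for key, value in static_relation:
        table.setdefault(key, []).append(value)
    return table

def pipeline_join(r1_stream, static_relations):
    tables = [build_hash_table(s) for s in static_relations]
    for key, r_val in r1_stream:
        rows = [(key, r_val)]
        for table in tables:              # one stage per static relation
            rows = [row + (s_val,)
                    for row in rows
                    for s_val in table.get(key, [])]
        for row in rows:                  # emit fully joined tuples downstream
            yield row

S1 = [(1, "s1a"), (2, "s1b")]
S2 = [(1, "s2a"), (2, "s2b")]
S3 = [(2, "s3a")]
R1 = [(1, "r-one"), (2, "r-two"), (3, "r-three")]

for joined in pipeline_join(R1, [S1, S2, S3]):
    print(joined)       # only key 2 survives all three lookups
```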
b. Joining streams: Symmetric Hash Join
This is actually a generalization of the simple hash join; the figure below makes it clear.
In relational databases, join operation can take advantage of pipelining by using the symmetric hash join algorithm or some of its advanced variants [1,2]. Symmetric hash join is a generalization of hash join. Whereas a normal hash join requires at least one of its inputs to be completely available to produce first results (the input is needed to build a hash table), symmetric hash join is able to produce first results immediately. In contrast to the normal hash join, it maintains hash tables for both inputs and populates these tables as tuples arrive:
As a tuple comes in, the joiner first looks it up in the hash table of the other stream. If a match is found, an output tuple is produced. Then the tuple is inserted into its own hash table.
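A minimal sketch of symmetric hash join over a single interleaved sequence of tuples from two sides, assuming the left/right labels and toy tuples shown in the code:

```python
# Symmetric hash join: a hash table is kept for each input, and every
# arriving tuple is first probed against the other input's table, then
# inserted into its own table, so results are produced immediately.
from collections import defaultdict

def symmetric_hash_join(interleaved):
    """interleaved: iterable of (side, key, value) with side in {'L', 'R'}."""
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, key, value in interleaved:
        other = "R" if side == "L" else "L"
        for match in tables[other][key]:      # probe the other side first
            yield (key, value, match) if side == "L" else (key, match, value)
        tables[side][key].append(value)       # then remember this tuple

events = [("L", 1, "l1"), ("R", 1, "r1"), ("R", 2, "r2"), ("L", 2, "l2")]
print(list(symmetric_hash_join(events)))
# [(1, 'l1', 'r1'), (2, 'l2', 'r2')]
```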
c. Joining infinite streams: window-based joins
Streams are often unbounded, so stream algorithms are usually window-based.
However, it does not make a lot of sense to perform a complete join of infinite streams. In many cases join is performed on a finite time window or other type of buffer e.g. an LFU cache that contains the most frequent tuples in the stream. Symmetric hash join can be employed if the buffer is large compared to the stream rate, or the buffer is flushed frequently according to some application logic, or the buffer eviction policy is not predictable. In other cases, simple hash join is often sufficient since the buffer is constantly full and does not block the processing:
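A windowed variant of the same idea can be sketched as follows, assuming event timestamps arrive in non-decreasing order and using a simple time-based eviction policy (the window size and data are illustrative):

```python
# Window-based variant: the per-input buffers hold only the events that
# fall into the last `window` time units; older entries are evicted, so
# the join of two unbounded streams stays bounded in memory.
from collections import deque

def windowed_join(interleaved, window):
    """interleaved: (side, timestamp, key, value), timestamps non-decreasing."""
    buffers = {"L": deque(), "R": deque()}
    for side, ts, key, value in interleaved:
        other = "R" if side == "L" else "L"
        # Evict everything that already fell out of the time window.
        for buf in buffers.values():
            while buf and buf[0][0] < ts - window:
                buf.popleft()
        for _, k, v in buffers[other]:
            if k == key:
                yield (key, value, v) if side == "L" else (key, v, value)
        buffers[side].append((ts, key, value))

events = [("L", 0, 1, "l1"), ("R", 5, 1, "r1"), ("R", 20, 1, "r2")]
print(list(windowed_join(events, window=10)))
# [(1, 'l1', 'r1')]  -- 'r2' arrives after 'l1' has left the window
```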
In the previous section, we discussed a number of standard query processing techniques that can be used in massively parallel stream processing.
Thus, on a conceptual level, an efficient query engine in a distributed database can act as a stream processing system and vice versa, a stream processing system can act as a distributed database query engine. Shuffling and pipelining are the key techniques of distributed query processing and message passing networks can naturally implement them. However, things are not so simple. In contrast to database query engines where reliability is not critical because a read-only query can always be restarted, streaming systems should pay a lot of attention to reliable event processing. In this section, we discuss a number of techniques that are used by streaming systems to provide message delivery guarantees and some other patterns that are not typical for standard query processing.
The previous chapter covered distributed query processing techniques,
and the author argues that, conceptually, a distributed database query engine and a stream processing system are equivalent.
But things are not that simple: a query engine does not have to worry much about fault tolerance, because a failed read-only query can simply be run again.
A stream processing system, however, must take care of reliable event processing, so the focus here is on message delivery guarantee techniques.
In other words, this chapter covers techniques specific to in-stream processing.
Stream replay is very valuable for fault tolerance, debugging, and optimization; Kafka supports it, which is its biggest difference from other message queues.
For fault tolerance, the usual approach in stream processing systems is to replay the data when something goes wrong; systems such as Storm and Spark replay from the source, although Spark improves on this with checkpointing.
A Kafka-based system such as Samza keeps the data of every intermediate stage in Kafka, which makes replay more efficient.
Storm itself has no replay capability and has to rely on a system like Kafka.
Ability to rewind data stream back in time and replay the data is very important for in-stream processing systems because of the following reasons:
As a result, the input data typically goes from the data source to the in-stream pipeline via a persistent buffer that allows clients to move their reading pointers back and forth.
The Kafka messaging queue is a well-known implementation of such a buffer that also supports scalable distributed deployments and fault-tolerance, and provides high performance.
As a bottom line, the Stream Replay technique imposes the following requirements on the system design:
In a streaming system, events flow through a chain of processors until the result reaches the final destination (like an external database). Each input event produces a directed graph of descendant events (lineage) that ends with the final results. To guarantee reliable data processing, it is necessary to ensure that the entire graph was processed successfully and to restart processing in case of failures.
An important problem in stream processing is how to guarantee that every tuple is processed successfully: processing one tuple can produce many other tuples, forming a DAG, so this is a fairly complex problem.
Storm
Storm solves this problem quite elegantly.
The process is as follows; the core idea is:
the XOR of two equal values is 0, so the tuple ID is recorded when a tuple is emitted and XORed in again when the tuple is acked; as long as tuple IDs appear in pairs, the final result is 0.
Advantages:
simple logic, elegant design
very space-efficient: tracking a whole DAG is reduced to tracking a single value
Disadvantages:
noticeably more network traffic: the number of messages to send roughly doubles
Let us first consider how Twitter’s Storm tracks the messages to guarantee at-least-once delivery semantics.
The described approach is elegant due to its decentralized nature: nodes act independently sending acknowledgement messages, there is no central entity that tracks all lineages explicitly.
However, it could be difficult to manage transactional processing in this way for flows that maintain sliding windows or other buffers. For example, processing on a sliding window can involve hundreds of thousands events at each moment of time, so it becomes difficult to manage acknowledgements because many events stay uncommitted or computational state should be persisted frequently.
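The XOR bookkeeping described above can be illustrated with a few lines of Python; this is only a sketch of the idea, not Storm's actual acker implementation, and the edge IDs are random 64-bit values as they would be in practice:

```python
# Illustration of XOR-based lineage tracking: the acker keeps one checksum
# per source tuple; every emitted edge ID is XORed in when the descendant
# tuple is anchored and XORed in again when it is acked, so a fully
# processed tuple tree drives the checksum back to zero.
import os

class Acker:
    def __init__(self):
        self.checksums = {}        # root tuple id -> running XOR

    def anchor(self, root_id, edge_id):
        """Called when a descendant tuple is emitted."""
        self.checksums[root_id] = self.checksums.get(root_id, 0) ^ edge_id

    def ack(self, root_id, edge_id):
        """Called when that descendant tuple has been fully processed."""
        self.checksums[root_id] ^= edge_id
        if self.checksums[root_id] == 0:
            del self.checksums[root_id]
            print(f"tuple tree {root_id} fully processed")

acker = Acker()
root = 42
e1, e2 = (int.from_bytes(os.urandom(8), "big") for _ in range(2))
acker.anchor(root, e1)     # spout emits a tuple, spawning edge e1
acker.anchor(root, e2)     # a bolt emits a descendant, edge e2
acker.ack(root, e1)        # first bolt finishes its tuple
acker.ack(root, e2)        # checksum returns to zero -> tree is complete
```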
Spark
Spark is built around the RDD data structure and supports only coarse-grained operations rather than operations on individual tuples, so lineage tracking and replay are much simpler.
An alternative approach is used in Apache Spark [3]. The idea is to consider the final result as a function of the incoming data.
To simplify lineage tracking, the framework processes events in batches, so the result is a sequence of batches where each batch is a function of the input batches.
Resulting batches can be computed in parallel and if some computation fails, the framework simply reruns it. Consider an example:
In this example, the framework joins two streams on a sliding window and then the result passes through one more processing stage.
The framework considers the incoming streams not as streams, but as a set of batches. Each batch has an ID and the framework can fetch it by the ID at any moment of time.
So, stream processing can be represented as a bunch of transactions where each transaction takes a group of input batches, transforms them using a processing function, and persists a result.
In the figure above, one such transaction is highlighted in red. If the transaction fails, the framework simply reruns it. It is important that transactions can be executed in parallel.
This simple but powerful paradigm enables centralized transaction management and inherently provides exactly-once message processing semantics. It is worth noting that this technique can be used both for batch processing and for stream processing because it treats the input data as a set of batches regardless of their streaming or static nature.
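A toy sketch of this batch-as-a-pure-function view follows; the batch IDs, the `transform` function, and the retry logic are assumptions made for illustration and do not correspond to any real Spark API:

```python
# The result of each "transaction" is a pure function of identified input
# batches, so a failed computation can simply be re-run from the same inputs.

input_batches = {          # batch id -> data, fetchable at any time
    ("stream_a", 7): [1, 2, 3],
    ("stream_b", 7): [10, 20],
}

def transform(batch_a, batch_b):
    """Deterministic processing function: here, a tiny cross-sum."""
    return [a + b for a in batch_a for b in batch_b]

def run_transaction(batch_ids, attempts=3):
    for _ in range(attempts):
        try:
            inputs = [input_batches[b] for b in batch_ids]   # re-fetch by id
            return transform(*inputs)
        except Exception:
            continue          # failure: just re-run the same pure function
    raise RuntimeError("transaction failed after retries")

result_batch = run_transaction([("stream_a", 7), ("stream_b", 7)])
print(result_batch)           # [11, 21, 12, 22, 13, 23]
```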
In the previous section we have considered the lineage tracking algorithm that uses signatures (checksums) to provide at-least-once message delivery semantics.
This technique improves reliability of the system, but it leaves at least two major open questions:
Storm's lineage tracking technique above guarantees at-least-once semantics, but it leaves two problems unsolved:
a. exactly-once semantics, e.g. for counters
b. management of intermediate state, e.g. cached counter values
Storm provides transactional topologies to guarantee exactly-once semantics; see Storm - Transactional-topologies.
The core ideas of transactional topologies:
tuples are tracked and replayed in batches (per-tuple tracking has too much overhead)
each transaction is identified by a monotonically increasing transaction ID
a transaction is split into a processing phase and a commit phase; the processing phases of different transactions can run in parallel, but the commit phases must complete strictly in transaction ID order
How does Storm manage state?
A fairly simple approach: rely on an external store, and commit the intermediate state to that store on every transaction commit;
the transaction ID is stored together with the state, which prevents a transaction from being applied more than once.
For other state management techniques, see Apache Samza - Reliable Stream Processing atop Apache Kafka and Hadoop YARN, the State Management section.
Twitter’s Storm addresses these issues by using the following protocol:
This technique allows one to achieve exactly-once processing semantics assuming that data sources are fault-tolerant and can be replayed.
However, persistent state updates can cause serious performance degradation even if large batches are used. For this reason, the intermediate computational state should be minimized or avoided whenever possible.
As a footnote, it is worth mentioning that state writing can be implemented in different ways. The most straightforward approach is to dump in-memory state to the persistent store as part of the transaction commit process. This does not work well for large states (sliding windows and so on). An alternative is to write a kind of transaction log i.e. a sequence of operations that transform the old state into the new one (for a sliding window it can be a set of added and evicted events). This approach complicates crash recovery because the state has to be reconstructed from the log, but can provide performance benefits in a variety of cases.
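A minimal sketch of the commit-with-transaction-ID idea, where a plain dict stands in for the external store and the counter update is purely illustrative:

```python
# Idempotent state commit: the state is persisted together with the ID of
# the transaction that produced it, so replaying a committed transaction
# does not apply its updates twice.

store = {"count": 0, "last_txid": -1}     # stands in for an external store

def commit(txid, batch):
    if txid <= store["last_txid"]:
        return                            # already applied: replay is a no-op
    store["count"] += len(batch)          # apply the exact update
    store["last_txid"] = txid             # persist state and txid together

commit(1, ["e1", "e2"])
commit(2, ["e3"])
commit(2, ["e3"])                         # replayed batch, skipped
print(store)                              # {'count': 3, 'last_txid': 2}
```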
Additivity of intermediate and final computational results is an important property that drastically simplifies design, implementation, maintenance, and recovery of in-stream data processing systems. Additivity means that the computational result for a larger time range or a larger data partition can be calculated as a combination of results for smaller time ranges or smaller partitions.
For example, a daily number of page views can be calculated as a sum of hourly numbers of page views. Additive state allows one to split processing of a stream into processing of batches that can be computed and re-computed independently and, as we discussed in the previous sections, this helps to simplify lineage tracking and reduce complexity of state maintenance.
What is additivity? Roughly, data that can be processed in a divide-and-conquer fashion, where partial results can be combined, e.g. counts.
For example, monthly page views can be computed by independently counting page views per day and then summing them.
If the requirement changes to weekly page views, no problem; half a year of page views, still fine...
Such data is very convenient for window-based processing on streams.
Of course, not all data is additive.
Some values can be made additive by keeping a small amount of auxiliary data, e.g. an average becomes additive if the count is also recorded.
Some values cannot be made additive at all, e.g. unique visitor counts.
For such data, sketches can be used to obtain additive values: for example, use Linear Counting to record hourly UV. Linear Counting sketches for different time ranges can be merged, so the hourly sketches only need to be ORed together and the overall UV estimated from the merged sketch, which solves the additivity problem. See 大数据处理中基于概率的数据结构.
It is not always trivial to achieve additivity:
Sketches is a very efficient way to transform non-additive values into additive. In the previous example, lists of ID can be replaced by compact additive statistical counters. These counters provide approximations instead of precise result, but it is acceptable for many practical applications. Sketches are very popular in certain areas like internet advertising and can be considered as an independent pattern of in-stream processing. A thorough overview of the sketching techniques can be found in [5].
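As a concrete illustration of the Linear Counting example mentioned above, here is a small sketch; the bitmap size and hashing scheme are arbitrary choices, and a real implementation would size the bitmap to the expected cardinality:

```python
# Linear Counting: each hourly sketch is a bitmap of hashed IDs. Bitmaps
# are additive under bitwise OR, so sketches for arbitrary time ranges can
# be merged and the unique-visitor count estimated from the merged bitmap.
import hashlib
import math

M = 1 << 14                                  # bitmap size (assumed)

def sketch(ids):
    bits = 0
    for visitor_id in ids:
        h = int(hashlib.md5(visitor_id.encode()).hexdigest(), 16) % M
        bits |= 1 << h
    return bits

def estimate(bits):
    zero_fraction = (M - bin(bits).count("1")) / M
    return round(-M * math.log(zero_fraction))

hour1 = sketch(f"user{i}" for i in range(1000))
hour2 = sketch(f"user{i}" for i in range(500, 1500))   # 50% overlap
merged = hour1 | hour2                        # additive combination
print(estimate(merged))                       # close to the true 1500 uniques
```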
It is very common for in-stream computations to depend on time: aggregations and joins are often performed on sliding time windows; processing logic often depends on a time interval between events and so on. Obviously, the in-stream processing system should have a notion of application’s view of time, instead of CPU wall-clock.
However, proper time tracking is not trivial because data streams and particular events can be replayed in case of failures. It is often a good idea to have a notion of global logical time that can be implemented as follows:
How can Cassandra be used for stream-based joins?
Store the events from all streams in Cassandra using the join key as the row key, so events with the same join key end up in the same row.
Then a separate process periodically reads out the records that share a join key and performs the join.
We already have discussed that a persistent store can be used for state checkpointing. However, it is not the only way to employ an external store for in-stream processing. Let us consider an example that employs Cassandra to join multiple data streams over a time window. Instead of maintaining in-memory event buffers, one can simply save all incoming events from all data streams to Cassandra using a join key as row key, as shown in the figure below:
On the other side, the second process traverses the records periodically, assembles and emits joined events, and evicts the events that fell out of the time window. Cassandra even can facilitate this activity by sorting events according to their timestamps.
It is important to understand that such techniques can defeat the whole purpose of in-stream data processing if implemented incorrectly – writing individual events to the data store can introduce a serious performance bottleneck even for fast stores like Cassandra or Redis. On the other hand, this approach provides perfect persistence of the computational state and different performance optimizations – say, batch writes – can help to achieve acceptable performance in many use cases.
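A toy model of this store-based join pattern is sketched below; a plain dict stands in for Cassandra, batching and duplicate-emission concerns are ignored, and all names are illustrative:

```python
# Toy model of the store-based join: every incoming event is written under
# its join key (one wide row per key), and a second periodic process sweeps
# the rows, emits joined events, and evicts those outside the time window.
from collections import defaultdict

store = defaultdict(list)     # join_key -> [(stream, timestamp, payload)]

def write_event(stream, timestamp, join_key, payload):
    store[join_key].append((stream, timestamp, payload))   # writer process

def sweep(now, window):
    """Periodic joiner: emit matches within the window, evict old events."""
    for key, events in list(store.items()):
        fresh = [e for e in events if e[1] >= now - window]
        if len({stream for stream, _, _ in fresh}) >= 2:
            yield key, fresh              # events from both streams present
        if fresh:
            store[key] = fresh            # keep only in-window events
        else:
            del store[key]                # whole row fell out of the window

write_event("clicks", 100, "user7", {"page": "/home"})
write_event("purchases", 105, "user7", {"amount": 30})
write_event("clicks", 10, "user9", {"page": "/old"})

for key, joined in sweep(now=110, window=60):
    print(key, joined)        # only user7 has fresh events from both streams
```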
This part describes incremental algorithms over sliding windows.
In-stream data processing frequently deals with queries like “What is the sum of the values in the stream over last 10 minutes?” i.e. with continuous queries on a sliding time window.
A straightforward approach to processing of such queries is to compute the aggregation function like sum for each instance of the time window independently. It is clear that this approach is not optimal because of the high similarity between two sequential instances of the time window. If the window at the time T contains samples {s(0), s(1), s(2), …, s(T-1), s(T)}, then the window at the time T+1 contains samples {s(1), s(2), s(3), …, s(T), s(T+1)}. This observation suggests that incremental processing might be used.
Incremental computations over sliding windows is a group of techniques that are widely used in digital signal processing, in both software and hardware. A typical example is a computation of the sum function. If the sum over the current time window is known, then the sum over the next time window can be computed by adding a new sample and subtracting the eldest sample in the window:
Similar techniques exist not only for simple aggregations like sums or products, but also for more complex transformations. For example, the SDFT (Sliding Discrete Fourier Transform) algorithm [4] is a computationally efficient alternative to per-window calculation of the FFT (Fast Fourier Transform) algorithm.
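For the simple sum case, the incremental update can be sketched as follows (window size and samples are arbitrary):

```python
# Incremental sliding-window sum: instead of re-summing the whole window
# for every new sample, add the newest sample and subtract the one that
# just fell out of the window.
from collections import deque

class SlidingSum:
    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)
        self.total = 0

    def push(self, sample):
        if len(self.window) == self.window.maxlen:
            self.total -= self.window[0]   # evict the eldest sample
        self.window.append(sample)
        self.total += sample
        return self.total

sums = SlidingSum(window_size=3)
print([sums.push(x) for x in [1, 2, 3, 4, 5]])   # [1, 3, 6, 9, 12]
```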
Now let us return to the practical problem that was stated in the beginning of this article.
We have designed and implemented our in-stream data processing system on top of Storm, Kafka, and Cassandra adopting the techniques described earlier in this article. Here we provide just a very brief overview of the solution – a detailed description of all implementation pitfalls and tricks is too large and probably requires a separate article.
Remember the abstract architecture diagram at the beginning of the article? After all of the analysis above, here is the actual in-stream processing architecture built with Storm, Kafka, and Cassandra.
Kafka 0.8 is used as a partitioned fault-tolerant event buffer to enable stream replay and improve system extensibility by easy addition of new event producers and consumers. Kafka’s ability to rewind read pointers also enables random access to the incoming batches and, consequently, Spark-style lineage tracking. It is also possible to point the system input to HDFS to process the historical data.
Cassandra is employed for state checkpointing and in-store aggregation, as described earlier. In many use cases, it also stores the final results.
Twitter’s Storm is the backbone of the system. All active query processing is performed in Storm’s topologies that interact with Kafka and Cassandra. Some data flows are simple and straightforward: the data arrives in Kafka; Storm reads and processes it and persists the results to Cassandra or another destination. Other flows are more sophisticated: one Storm topology can pass the data to another topology via Kafka or Cassandra. Two examples of such flows are shown in the figure above (red and blue curved arrows).
Summary:
First, there are many systems for batch processing, real-time query processing, and in-stream processing (note that real-time and streaming are not the same concept).
The Lambda Architecture can converge all of these systems into a full-stack analytics platform...
Second, these three kinds of systems share a number of techniques, such as shuffling and pipelining.
The differences are that in-stream processing puts more emphasis on strict data delivery guarantees and on the management of intermediate results, and relies more heavily on pipelining.
It is great that the existing technologies like Hive, Storm, and Impala enable us to crunch Big Data using batch processing for complex analytics and machine learning, real-time query processing for online analytics, and in-stream processing for continuous querying.
Moreover, techniques like Lambda Architecture [6, 7] were developed and adopted to combine these solutions efficiently. This brings us to the question of how all these technologies and approaches could converge to a solid solution in the future.
In this section, we discuss the striking similarity between distributed relational query processing, batch processing, and in-stream query processing to figure out the technologies that could cover all these use cases and, consequently, have the highest potential in this area.
The key observation is that relational query processing, MapReduce, and in-stream processing could be implemented using exactly the same concepts and techniques like shuffling and pipelining.
At the same time:
The two statements above imply that tunable persistence (in-memory message passing versus on-disk materialization) and reliability are the distinctive features of the imaginary query engine that provides a set of processing primitives and interfaces to the high-level frameworks:
Among the emerging technologies, the following two are especially notable in the context of this discussion: