Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark

Abstract

With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2× and Apache Kafka Streams by 90×. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.

ACM Reference Format:

M. Armbrust et al. 2018. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. In SIGMOD'18: 2018 International Conference on Management of Data, June 10–15, 2018, Houston, TX, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3183713.3190664

1      Introduction

Many high-volume data sources operate in real time, including sensors, logs from mobile applications, and the Internet of Things. As organizations have gotten better at capturing this data, they also want to process it in real time, whether to give human analysts the freshest possible data or drive automated decisions. Enabling broad access to streaming computation requires systems that are scalable, easy to use and easy to integrate into business applications.

While there has been tremendous progress in distributed stream processing systems in the past few years [2, 15, 17, 27, 32], these systems still remain fairly challenging to use in practice. In this paper, we begin by describing these challenges, based on our experience with Spark Streaming [37], one of the earliest stream processing systems to provide a high-level, functional API. We found that two challenges frequently came up with users. First, streaming systems often ask users to think in terms of complex physical execution concepts, such as at-least-once delivery, state storage and triggering modes, that are unique to streaming. Second, many systems focus only on streaming computation, but in real use cases, streaming is often part of a larger business application that also includes batch analytics, joins with static data, and interactive queries. Integrating streaming systems with these other workloads (e.g., maintaining transactionality) requires significant engineering effort.

Motivated by these challenges, we describe Structured Streaming, a new high-level API for stream processing that was developed in Apache Spark starting in 2016. Structured Streaming builds on many ideas in recent stream processing systems, such as separating processing time from event time and triggers in Google Dataflow [2], using a relational execution engine for performance [12], and offering a language-integrated API [17, 37], but aims to make them simpler to use and integrated with the rest of Apache Spark. Specifically, Structured Streaming differs from other widely used open source streaming APIs in two ways:

•  Incremental query model: Structured Streaming automatically incrementalizes queries on static datasets expressed through Spark's SQL and DataFrame APIs [8], meaning that users typically only need to understand Spark's batch APIs to write a streaming query. Event time concepts are especially easy to express and understand in this model. Although incremental query execution and view maintenance are well studied [11, 24, 29, 38], we believe Structured Streaming is the first effort to adopt them in a widely used open source system. We found that this incremental API generally worked well for both novice and advanced users. For example, advanced users can use a set of stateful processing operators that give fine-grained control to implement custom logic while fitting into the incremental model.

•  Support for end-to-end applications: Structured Streaming's API and built-in connectors make it easy to write code that is “correct by default” when interacting with external systems and can be integrated into larger applications using Spark and other software. Data sources and sinks follow a simple transactional model that enables “exactly-once” computation by default. The incrementalization-based API naturally makes it easy to run a streaming query as a batch job or develop hybrid applications that join streams with static data computed through Spark's batch APIs. In addition, users can manage multiple streaming queries dynamically and run interactive queries on consistent snapshots of stream output, making it possible to write applications that go beyond computing a fixed result to let users refine and drill into streaming data.

Beyond these design decisions, we made several other design choices in Structured Streaming that simplify operation and increase performance. First, Structured Streaming reuses the Spark SQL execution engine [8], including its optimizer and runtime code generator. This leads to high throughput compared to other streaming systems (e.g., 2× the throughput of Apache Flink and 90× that of Apache Kafka Streams in the Yahoo! Streaming Benchmark [14]), as in Trill [12], and also lets Structured Streaming automatically leverage new SQL functionality added to Spark. The engine runs in a microbatch execution mode by default [37], but it can also use low-latency continuous operators for some queries because the API is agnostic to execution strategy [6].

Second, we found that operating a streaming application can be challenging, so we designed the engine to support failures, code updates and recomputation of already outputted data. For example, one common issue is that new data in a stream causes an application to crash, or worse, to output an incorrect result that users do not notice until much later (e.g., due to mis-parsing an input field). In Structured Streaming, each application maintains a write-ahead event log in human-readable JSON format that administrators can use to restart it from an arbitrary point. If the application crashes due to an error in a user-defined function, administrators can update the UDF and restart from where it left off, which happens automatically when the restarted application reads the log. If the application was outputting incorrect data instead, administrators can manually roll it back to a point before the problem started and recompute its results starting from there.

Our team has been running Structured Streaming applications for customers of Databricks' cloud service since 2016, as well as using the system internally, so we end the paper with some example use cases. Production applications range from interactive network security analysis and automated alerts to incremental Extract, Transform and Load (ETL). Users often leverage the design of the engine in interesting ways, e.g., by running a streaming query “discontinuously” as a series of single-microbatch jobs to leverage Structured Streaming's transactional input and output without having to pay for cloud servers running 24/7. The largest customer applications we discuss process over 1 PB of data per month on hundreds of machines. We also show that Structured Streaming outperforms Apache Flink and Kafka Streams by 2× and 90× respectively in the widely used Yahoo! Streaming Benchmark [14].

The rest of this paper is organized as follows. We start by discussing the stream processing challenges reported by users in Section 2. Next, we give an overview of Structured Streaming (Section 3), then describe its API (Section 4), query planning (Section 5), execution (Section 6) and operational features (Section 7). In Section 8, we describe several large use cases at Databricks and its customers. We then measure the system's performance in Section 9, discuss related work in Section 10 and conclude in Section 11.

2      Stream Processing Challenges

Despite extensive progress in the past few years, distributed streaming applications are still generally considered difficult to develop and operate. Before designing Structured Streaming, we spent time discussing these challenges with users and designers of other streaming systems, including Spark Streaming, Truviso, Storm, Dataflow and Flink. This section details the challenges we saw.

2.1        Complex and Low-Level APIs

Streaming systems were invariably considered more difficult to use than batch ones due to complex API semantics. Some complexity is to be expected due to new concerns that arise only in streaming: for example, the user needs to think about what type of intermediate results the system should output before it has received all the data relevant to a particular entity, e.g., to a customer's browsing session on a website. However, other complexity arises due to the low-level nature of many streaming APIs: these APIs often ask users to specify applications at the level of physical operators with complex semantics instead of a more declarative level.

As a concrete example, the Google Dataflow model [2] has a powerful API with a rich set of options for handling event time aggregation, windowing and out-of-order data. However, in this model, users need to specify a windowing mode, triggering mode and trigger refinement mode (essentially, whether the operator outputs deltas or accumulated results) for each aggregation operator. Adding an operator that expects deltas after an aggregation that outputs accumulated results will lead to unexpected results. In essence, the raw API [10] asks the user to write a physical operator graph, not a logical query, so every user of the system needs to understand the intricacies of incremental processing.

Other APIs, such as Spark Streaming [37] and Flink's DataStream API [18], are also based on writing DAGs of physical operators and offer a complex array of options for managing state [20]. In addition, reasoning about applications becomes even more complex in systems that relax exactly-once semantics [32], effectively requiring the user to design and implement a consistency model.

To address this issue, we designed Structured Streaming to make simple applications simple to express using its incremental query model. In addition, we found that adding customizable stateful processing operators to this model still enabled advanced users to build their own processing logic, such as custom session-based windows, while staying within the incremental model (e.g., these same operators also work in batch jobs). Other open source systems have also recently added incremental SQL queries [15, 19], and of course databases have long supported them [11, 24, 29, 38].

2.2         Integration in End-to-End Applications

The second challenge we found was that nearly every streaming workload must run in the context of a larger application, and this integration often requires significant engineering effort. Many streaming APIs focus primarily on reading streaming input from a source and writing streaming output to a sink, but end-to-end business applications need to perform other tasks. Examples include:

(1)  The business purpose of the application may be to enable interactive queries on fresh data. In this case, a streaming job is used to update summary tables in a structured storage system such as an RDBMS or Apache Hive [33]. It is important that when the streaming job updates its result, it does so atomically, so users do not see partial results. This can be difficult with file-based big data systems like Hive, where tables are partitioned across files, or even with parallel loads into a data warehouse.

(2)  An Extract, Transform and Load (ETL) job might need to join a stream with static data loaded from another storage system or transformed using a batch computation. In this case, it is important to be able to reason about consistency across the two systems (e.g., what happens when the static data is updated?), and it is useful to write the whole computation in a single API.

(3)  A team may occasionally need to run its streaming business logic as a batch application, e.g., to backfill a result on old data or test alternate versions of the code. Rewriting the code in a separate system would be time-consuming and error-prone.

We address this challenge by integrating Structured Streaming closely with Spark's batch and interactive APIs.

2.3       Operational Challenges

One of the largest challenges to deploying streaming applications in practice is management and operation. Some key issues include:

•  Failures: This is the most heavily studied issue in the research literature. In addition to single node failures, systems also need to support graceful shutdown and restart of the whole application, e.g., to let operators migrate it to a new cluster.

•  Code Updates: Applications are rarely perfect, so developers may need to update their code. After an update, they may want the application to restart where it left off, or possibly to recompute past results that were erroneous due to a bug. Both cases need to be supported in the streaming system's state management and fault recovery mechanisms. Systems should also support updating the runtime itself (e.g., patching Spark).

•  Rescaling: Applications see varying load over time, and generally increasing load in the long term, so operators may want to scale them up and down dynamically, especially in the cloud. Systems based on a static communication topology, while conceptually simple, are difficult to scale dynamically.

•  Stragglers: Instead of outright failing, nodes in the streaming system can slow down due to hardware or software issues and degrade the throughput of the whole application. Systems should automatically handle this situation.

•  Monitoring: Streaming systems need to give operators clear visibility into system load, backlogs, state size and other metrics.

2.4       Cost and Performance Challenges

Beyond operational and engineering issues, the cost-performance of streaming applications can be an obstacle because these applications run 24/7. For example, without dynamic rescaling, an application will waste resources outside peak hours; and even with rescaling, it may be more expensive to compute a result continuously than to run a periodic batch job. We thus designed Structured Streaming to leverage all the execution optimizations in Spark SQL [8].

So far, we chose to optimize throughput as our main performance metric because we found that it was often the most important metric in large-scale streaming applications. Applications that require a distributed streaming system usually work with large data volumes coming from external sources (e.g., mobile devices, sensors or IoT), where data may already incur a delay just getting to the system. This is one reason why event time processing is an important feature in these systems [2]. In contrast, latency-sensitive applications such as high-frequency trading or physical system control loops often run on a single scale-up processor, or even custom hardware like ASICs and FPGAs [3]. However, we also designed Structured Streaming to support executing over latency-optimized engines, and implemented a continuous processing mode for this task, which we describe in Section 6.3. This is a change over Spark Streaming, where microbatching was “baked into” the API.

Figure 1: The components of Structured Streaming.

3      Structured Streaming Overview

Structured Streaming aims to tackle the stream processing challenges we identified through a combination of API and execution engine design. In this section, we give a brief overview of the overall system. Figure 1 shows Structured Streaming's main components.

Input and Output. Structured Streaming connects to a variety of input sources and output sinks for I/O. To provide “exactly-once” output and fault tolerance, it places two restrictions on sources and sinks, which are similar to other exactly-once systems [17, 37]:

(1)  Input sources must be replayable, allowing the system to re-read recent input data if a node crashes. In practice, organizations use a reliable message bus such as Amazon Kinesis or Apache Kafka [5, 23] for this purpose, or simply a durable file system.

(2)  Output sinks must support idempotent writes, to ensure reliable recovery if a node fails while writing. Structured Streaming can also provide atomic output for certain sinks that support it, where the entire update to the job's output appears atomically even if it was written by multiple nodes working in parallel.
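To make these two requirements concrete, the following is a minimal Scala sketch of a pipeline that reads from a replayable Kafka topic and writes to a file sink whose output can be rewritten idempotently; the broker address, topic name and paths are illustrative assumptions, not taken from the paper:

// A minimal sketch: a replayable source (Kafka) feeding an idempotent sink (files).
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "events")
  .load()

events.writeStream
  .format("parquet")                                   // file sinks can be rewritten idempotently
  .option("checkpointLocation", "/checkpoints/events")
  .start("/out")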

In addition to external systems, Structured Streaming also supports input and output from tables in Spark SQL. For example, users can compute a static table from any of Spark's batch input sources and join it with a stream, or ask Structured Streaming to output to an in-memory Spark table that users can query interactively.

API. Users program Structured Streaming by writing a query against one or more streams and tables using Spark SQL's batch APIs: SQL and DataFrames [8]. This query defines an output table that the user wants to compute, assuming that each input stream is replaced by a table holding all the data received from that stream so far. The engine then determines how to compute and write this output table into a sink incrementally, using similar techniques to incremental view maintenance [11, 29]. Different sinks also support different output modes, which determine how the system may write out its results: for example, some sinks are append-only by nature, while others allow updating records in place by key.

To support streaming specifically, Structured Streaming also adds several API features that fit in the existing Spark SQL API:

(1)  Triggers control how often the engine will attempt to compute a new result and update the output sink, as in Dataflow [2].

(2)  Users can mark a column as denoting event time (a timestamp set at the data source), and set a watermark policy to determine when enough data has been received to output a result for a specific event time, as in [2].

(3)  Stateful operators allow users to track and update mutable state by key in order to implement complex processing, such as custom session-based windows. These are similar to Spark Streaming's updateStateByKey API [37].

Note that windowing, another key feature for streaming, is done using Spark SQL's existing aggregation APIs. In addition, all the new APIs added by Structured Streaming also work in batch jobs.

Execution. Once it has received a query, Structured Streaming optimizes it, incrementalizes it, and begins executing it. By default, the system uses a microbatch model similar to Discretized Streams in Spark Streaming, which supports dynamic load balancing, rescaling, fault recovery and straggler mitigation by dividing work into small tasks [37]. In addition, it can use a continuous processing mode based on traditional long-running operators (Section 6.3).

In both cases, Structured Streaming uses two forms of durable storage to achieve fault tolerance. First, a write-ahead log keeps track of which data has been processed and reliably written to the output sink from each input source. For some output sinks, this log can be integrated with the sink to make updates to the sink atomic. Second, the system uses a larger-scale state store to hold snapshots of operator states for long-running aggregation operators. These are written asynchronously, and may be “behind” the latest data written to the output sink. The system will automatically track which state it has last updated in its log, and recompute state starting from that point in the data on failure. Both the log and state store can run over pluggable storage systems (e.g., HDFS or S3).

Operational Features. Using the durability of the write-ahead log and state store, users can achieve several forms of rollback and recovery. An entire Structured Streaming application can be shut down and restarted on new hardware. Running applications also tolerate node crashes, additions and stragglers automatically, by sending tasks to new nodes. For code updates to UDFs, it is sufficient to stop and restart the application, and it will begin using the new code. In addition, users can manually roll back the application to a previous point in the log and redo the part of the computation starting then, beginning from an older snapshot of the state store.

In addition, Structured Streaming's ability to execute with microbatches lets it “adaptively batch” data so that it can quickly catch up with input data if the load spikes or if a job is rolled back, then return to low latency later. This makes operation significantly simpler (e.g., users can safely update job code more often).

The next sections go into detail about Structured Streaming's API (§4), query planning (§5) and job execution and operation (§6).

4      Programming Model

Structured Streaming combines elements of Google Dataflow [2], incremental queries [11, 29, 38] and Spark Streaming [37] to enable stream processing beneath the Spark SQL API. In this section, we start by showing a short example, then describe the semantics of the model and the streaming-specific operators we added in Spark SQL to support streaming use cases (e.g., stateful operators).

4.1      A Short Example

Structured Streaming operates within Spark's structured data APIs: SQL, DataFrames and Datasets [8]. The main abstraction users work with is tables (represented by the DataFrame or Dataset classes), which each represent a view to be computed from input sources to the system.[1] When users create a table/DataFrame from a streaming input source and attempt to compute it, Spark will automatically launch a streaming computation.

As a simple example, let us start with a batch job that counts clicks by country of origin for a web application. Suppose that the input data is JSON files and the output should be Parquet. This job can be written with Spark DataFrames in Scala as follows:

// Define a DataFrame to read from static data
data = spark.read.format("json").load("/in")

// Transform it to compute a result
counts = data.groupBy($"country").count()

// Write to a static data sink
counts.write.format("parquet").save("/counts")

Changing this job to use Structured Streaming only requires modifying the input and output sources, not the transformation in the middle. For example, if new JSON files are going to continually be uploaded to the /in directory, we can modify our job to continually update /counts by changing only the first and last lines:

// Define a DataFrame to read streaming data
data = spark.readStream.format("json").load("/in")

// Transform it to compute a result
counts = data.groupBy($"country").count()

// Write to a streaming data sink
counts.writeStream.format("parquet")
  .outputMode("complete").start("/counts")

The output mode parameter on the sink here specifies how Structured Streaming should update the sink. In this case, the complete mode means to write a complete result file for each update, because the file output sink chosen does not support fine-grained updates. However, other sinks, such as key-value stores, support additional output modes (e.g., updating just the changed keys).

Under the hood, Structured Streaming will automatically incrementalize the query specified by the transformation(s) from input sources to data sinks, and execute it in a streaming fashion. The engine will also automatically maintain state and checkpoint it to external storage as needed: in this case, for example, we have a running count aggregation since the start of the stream, so the engine will keep track of the running counts for each country.

Finally, the API also naturally supports windowing and event time through Spark SQL's existing aggregation operators. For example, instead of counting data by country, we could count it in 1-hour sliding windows advancing every 5 minutes by changing just the middle line of the computation as follows:

// Count events by windows on the "time" field
data.groupBy(window($"time", "1h", "5min")).count()

The time field here (event time) is just a field in the data, similar to country earlier. Users can also set a watermark on this field to let the system forget state for old windows after a timeout (§4.3.1).

4.2      Programming Model Semantics

Formally, we define the semantics of Structured Streaming's programming model as follows:

(1)  Each input source provides a partially ordered set of records over time. We assume partial orders here because some message bus systems are parallel and do not define a total order across records; for example, Kafka divides streams into “partitions” that are each ordered.

(2)  The user provides a query to execute across the input data that can output a result table at any given point in processing time. Structured Streaming will always produce results consistent with running this query on a prefix of the data in all input sources. That is, it will never show results that incorporate one input record but do not incorporate its ancestors in the partial order. Moreover, these prefixes will be increasing over time.

(3)  Triggers tell the system when to run a new incremental computation and update the result table. For example, in microbatch mode, the user may wish to trigger an incremental update every minute (in processing time); a short sketch combining a trigger with an output mode follows this list.

(4)  The sink's output mode specifies how the result table is written to the output system. The engine supports three distinct modes:

•  Complete: The engine writes the whole result table at once, e.g., replacing a whole file in HDFS with a new version. This is of course inefficient when the result is large.

•  Append: The engine can only add records to the sink. For example, a map-only job on a set of input files results in monotonically increasing output.

•  Update: The engine updates the sink in place based on a key for each record, updating only keys whose values changed.
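As a combined sketch of items (3) and (4), the query from Section 4.1 can be given an explicit trigger and output mode when it is started; the one-minute interval and the paths are illustrative choices:

import org.apache.spark.sql.streaming.Trigger

counts.writeStream
  .format("parquet")
  .outputMode("complete")                        // write the whole result table on each trigger
  .trigger(Trigger.ProcessingTime("1 minute"))   // recompute once per minute of processing time
  .start("/counts")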

Figure 2 illustrates the model visually. One attractive property of the model is that the contents of the result table (which is logically just a view that need never be materialized) are defined independently of the output mode (whether we output the whole table on every trigger, or only deltas). In contrast, APIs such as Dataflow require the equivalent of an output mode on every operator, so users must plan the whole operator DAG keeping in mind whether each operator is outputting complete results or positive or negative deltas, effectively incrementalizing the query by hand.

Figure 2: Structured Streaming's semantics for two output modes. Logically, all input data received up to a point in processing time is viewed as a large input table, and the user provides a query that defines a result table based on this input. Physically, Structured Streaming computes changes to the result table incrementally (without having to store all input data) and outputs results based on its output mode. For complete mode, it outputs the whole result table (left), while for append mode, it only outputs newly added records (right).

A second attractive property is that the model has strong consistency semantics, which we call prefix consistency. First, it guarantees that when input records are relatively ordered within a source (e.g., log records from the same device), the system will only produce results that incorporate them in the same order (e.g., never skipping a record). Second, because the result table is defined based on all data in the input prefix at once, we know that all rows in the result table reflect all input records. In contrast, in some systems based on message-passing between nodes, the node that receives a record might send an update to two downstream nodes, but there is no guarantee that the outputs from these two are synchronized. Prefix consistency also makes operation easier, as users can roll back the system to a specific point in the write-ahead log (i.e., a specific prefix of the data) and recompute outputs from that point.

In summary, with the Structured Streaming model, as long as users understand a regular Spark or DataFrame query, they can understand the content of the result table for their job and the values that will be written to the sink. Users need not worry about consistency, failures or incorrect processing orders.

Finally, the reader might notice that some of the output modes we defined are incompatible with certain types of query. For example, suppose we are aggregating counts by country, as in our code example in the previous section, and we want to use the append output mode. There is no way for the system to guarantee it has stopped receiving records for a given country, so this combination of query and output mode will not be allowed by the system. We describe which combinations are allowed in Section 5.1.

4.3      Streaming-Specific Operators

Many Structured Streaming queries can be written using just the standard operators in Spark SQL, such as selection, aggregation and joins. However, to support some requirements unique to streaming, we added two new types of operators to Spark SQL: watermarking operators tell the system when to “close” an event time window and output results or forget state, and stateful operators let users write custom logic to implement complex processing. Crucially, both of these new operators still fit in Structured Streaming's incremental semantics (§4.2), and both can also be used in batch jobs.

4.3.1 Event Time Watermarks. From a logical point of view, the key idea in event time is to treat application-specified timestamps as an arbitrary field in the data, allowing records to arrive out-of-order [2, 24]. We can then use standard operators and incremental processing to update results grouped by event time. In practice, however, it is useful for the processing system to have some loose bounds on how late data can arrive, for two reasons:

(1)  Allowing arbitrarily late data might require storing arbitrarily large state. For example, if we count data by 1-minute event time window, the system needs to remember a count for every 1-minute window since the application began, because a late record might still arrive for any particular minute. This can quickly lead to large amounts of state, especially if combined with another grouping key. The same issue happens with joins.

(2)  Some sinks do not support data retraction, making it useful to be able to write the results for a given event time after a timeout. For example, custom downstream applications want to start working with a “final” result and might not support retractions. Append-mode sinks also do not support retractions.

Structured Streaming lets developers set a watermark [2] for event time columns using the withWatermark operator. This operator gives the system a delay threshold tC for a given timestamp column C. At any point in time, the watermark for C is max(C) − tC, that is, tC seconds before the maximum event time seen so far in C. Note that this choice of watermark is naturally robust to backlogged data: if the system cannot keep up with the input rate for a period of time, the watermark will not move forward arbitrarily during that time, and all events that arrived within at most tC seconds of being produced will still be processed.

When present, watermarks affect when stateful operators can forget old state (e.g., if grouping by a window derived from a watermarked column), and when Structured Streaming will output data with an event time key to append-mode sinks. Different input streams can have different watermarks.
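For example, a minimal sketch that extends the windowed count from Section 4.1 with a watermark; the 10-minute delay threshold is an illustrative choice:

// Count events in 1-hour windows sliding every 5 minutes, and allow the engine
// to drop a window's state once event time moves 10 minutes past its end.
val counts = data
  .withWatermark("time", "10 minutes")
  .groupBy(window($"time", "1h", "5min"))
  .count()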

4.3.2 Stateful Operators. For developers who want to write custom stream processing logic, Structured Streaming's stateful operators are “UDFs with state” that give users control over the computation while fitting into Structured Streaming's semantics and fault tolerance mechanisms. There are two stateful operators, mapGroupsWithState and flatMapGroupsWithState. Both operators act on data that has been assigned a key using groupByKey, and let the developers track and update a state for each key using custom logic, as well as output records for each key. They are closely based on Spark Streaming's updateStateByKey operator [37].

The mapGroupsWithState operator, on a grouped dataset with keys of type K and values of type V, takes in a user-defined update function with the following arguments:

•  key of type K
•  newValues of type Iterator[V]
•  state of type GroupState[S], where S is a user-specified class.

// Define an update function that simply tracks the
// number of events for each key as its state, returns
// that as its result, and times out keys after 30 min.
def updateFunc(key: UserId, newValues: Iterator[Event],
    state: GroupState[Int]): Int = {
  val totalEvents = state.get() + newValues.size()
  state.update(totalEvents)
  state.setTimeoutDuration("30 min")
  return totalEvents
}

// Use this update function on a stream, returning a
// new table lens that contains the session lengths.
lens = events.groupByKey(event => event.userId)
  .mapGroupsWithState(updateFunc)

Figure 3: Using mapGroupsWithState to track the number of events per session, timing out sessions after 30 minutes.

The operator will invoke this function whenever one or more new values are received for a key. On each call, the function receives all of the values that were received for that key since the last call (multiple values may be batched for efficiency). It also receives a state object that wraps around a user-defined data type S, and allows the user to update the state, drop this key from state tracking, or set a timeout for this specific key (either in event time or processing time). This allows the user to store arbitrary data for the key, as well as implement custom logic for dropping state (e.g., custom exit conditions when implementing session-based windows).

Finally, the update function returns a user-specified return type R for its key. The return value of mapGroupsWithState is a new table with the final R record outputted for each group in the data (when the group is closed or times out). For example, the developer may wish to track user sessions on a website using mapGroupsWithState, and output the total number of pages clicked for each session.

To illustrate, Figure 3 shows how to use mapGroupsWithState to track user sessions, where a session is defined as a series of events with the same userId and gaps less than 30 minutes between them. We output the final number of events in each session as our return value R. A job could then compute metrics such as the average number of events per session by aggregating the result table lens.

The second stateful operator, flatMapGroupsWithState, is very similar to mapGroupsWithState, except that the update function can return zero or more values of type R per update instead of one. For example, this operator could be used to manually implement a stream-to-table join. The return values can either be returned all at once, when the group is closed, or incrementally across calls to the update function. Both operators also work in batch mode, in which case the update function will only be called once.
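As a rough sketch of the difference, the following reuses the hypothetical UserId and Event types from Figure 3 and emits an updated running count per key on every call rather than a single final value; the explicit output mode and timeout arguments reflect the Scala signature of flatMapGroupsWithState as we understand it, and the example is an illustration rather than code from the paper:

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Emit a running event count for each user on every invocation, instead of
// one final value per key when the group closes.
def emitCounts(key: UserId, newValues: Iterator[Event],
    state: GroupState[Int]): Iterator[(UserId, Int)] = {
  val total = state.getOption.getOrElse(0) + newValues.size
  state.update(total)
  Iterator((key, total))   // zero or more records may be returned per call
}

val updates = events.groupByKey(event => event.userId)
  .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(emitCounts)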

5      Query Planning

We implemented Structured Streaming's query planning using the Catalyst extensible optimizer in Spark SQL [8], which allows writing composable rules using pattern matching in Scala. Query planning proceeds in three stages: analysis to determine whether the query is valid, incrementalization and optimization.

5.1     Analysis

The first stage of query planning is analysis, where the engine validates the user's query and resolves the attributes and data types referred to in the query. Structured Streaming uses Spark SQL's existing analysis passes to resolve attributes and types, but adds new rules to check that the query can be executed incrementally by the engine. It also checks that the user's chosen output mode is valid for this specific query. For example, the Append output mode can only be used with queries whose output is monotonic [4]: that is, where a given output record will not be removed once it is written. In this mode, only selections, joins, and aggregations over keys that include event time are allowed (in which case the engine will only output the value for a given event time once its watermark has passed). Similarly, in the Complete output mode, where the whole output table needs to be written on each trigger, Structured Streaming only permits aggregation queries where the amount of state that needs to be tracked is proportional to the number of keys in the result. A full description of the supported modes is available in the Structured Streaming documentation [31].
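To make the monotonicity rule concrete, a minimal sketch contrasting a rejected and an accepted use of the Append mode, reusing the queries from Section 4; the outcome is decided at analysis time, and the comments only restate the rule above:

// Rejected: counts by country can change as new data arrives, so the query
// is not monotonic in Append mode and fails analysis.
counts.writeStream.format("parquet").outputMode("append").start("/counts")

// Accepted: once the watermark passes a window's end, its count is final,
// so the windowed aggregate can safely be appended.
val windowed = data
  .withWatermark("time", "10 minutes")
  .groupBy(window($"time", "1h"))
  .count()
windowed.writeStream.format("parquet").outputMode("append").start("/windowCounts")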

5.2      Incrementalization

The next step of the query planning process is incrementalizing the static query provided by the user to efficiently update results in response to new data. In general, Structured Streaming's incrementalizer aims to ensure that the query's result can be updated in time proportional to the amount of new data received before each trigger or to the amount of new rows that have to be produced, without a dependence on the total amount of data received so far.

The engine can incrementalize a restricted, but growing, class of queries. As of Spark 2.3.0, the supported queries can contain:

•  Any number of selections, projections and SELECT DISTINCTs.

•  Inner, left-outer and right-outer joins between a stream and a table or between two streams. For outer joins against a stream, the join condition must involve a watermarked column.

•  Stateful operators like mapGroupsWithState (§4.3.2).

•  Up to one aggregation (possibly on compound keys).

•  Sorting after an aggregation, only in complete output mode.

The engine uses Catalyst transformation rules to map these supported queries into trees of physical operators that perform both computation and state management. For example, an aggregation in the user query might be mapped to a StatefulAggregate operator that tracks open groups inside Structured Streaming's state store (§6.1) and outputs the desired result. Internally, Structured Streaming also tracks an output mode for each physical operator in the DAG produced during incrementalization, similar to the refinement mode for aggregation operators in Dataflow [2]. For example, some operators may update emitted records (equivalent to update mode), while others may only emit new records (append mode). Crucially, in Structured Streaming, users do not have to specify these intra-DAG modes manually.

Incrementalization is an active area of work in Structured Streaming, but we have found that even the restricted set of queries available today is suitable for many use cases (§8). In other cases, users have leveraged Structured Streaming's stateful operators (§4.3.2) to implement custom incremental processing logic that maintains state of their choice. We expect to add more advanced automatic incrementalization techniques into the engine over time.

Figure 4: State management during the execution of Structured Streaming. Input operators are responsible for defining epochs in each input source and saving information about them (e.g., offsets) reliably in the write-ahead log. Stateful operators also checkpoint state asynchronously, marking it with its epoch, but this does not need to happen on every epoch. Finally, output operators log which epochs' outputs have been reliably committed to the idempotent output sink; the very last epoch may be rewritten on failure.

5.3      Query Optimization

The final stage of planning is optimization. Structured Streaming applies most of the optimization rules in Spark SQL [8], such as predicate pushdown, projection pushdown, expression simplification and others. In addition, it uses Spark SQL's Tungsten binary format for data in memory (avoiding the overhead of Java objects), and its runtime code generator to compile chains of operators to Java bytecode that runs over this format. This design means that most of the work in logical and execution optimization for analytical workloads in Spark SQL automatically applies to streaming.

6      Application Execution

The final component of Structured Streaming is its execution strategy. In this section, we describe how the engine tracks state, and then the two execution modes: microbatching via fine-grained tasks and continuous processing using long-lived operators. We then discuss operational features to simplify managing and deploying Structured Streaming applications.

6.1       State Management and Recovery

At a high level, Structured Streaming tracks state in a manner similar to Spark Streaming [37], in both its microbatch and continuous modes. The state of an application is tracked using two external storage systems: a write-ahead log that supports durable, atomic writes at low latency, and a state store that can store larger amounts of data durably and allows parallel access (e.g., S3 or HDFS). Structured Streaming uses these systems together to recover on failure.

The engine places two requirements on input sources and output sinks to provide fault tolerance. First, input sources should be replayable, i.e., allow re-reading recent data using some form of identifier, such as a stream offset. Durable message bus systems like Kafka and Kinesis meet this need. Second, output sinks should be idempotent, allowing Structured Streaming to rewrite some already written data on failure. Sinks can implement this in different ways.

Given these properties, Structured Streaming performs state tracking using the following mechanism, as shown in Figure 4:

(1)  As input operators read data, the master node of the Spark application defines epochs based on offsets in each input source. For example, Kafka and Kinesis present topics as a series of partitions, each of which are byte streams, and allow reading data using offsets in these partitions. The master writes the start and end offsets of each epoch durably to the log.

(2)  Any operators requiring state checkpoint their state periodically and asynchronously to the state store, using incremental checkpoints when possible. They store the epoch ID along with each checkpoint written. These checkpoints do not need to happen on every epoch or to block processing.[2]

(3)  Output operators write the epochs they committed to the log. The master waits for all nodes running an operator to report a commit for a given epoch before allowing commits for the next epoch. Depending on the sink, the master can also run an operation to finalize the writes from multiple nodes if the sink supports this. This means that if the streaming application fails, only one epoch may be partially written.[3]

(4)  Upon recovery, the new instance of the application starts by reading the log to find the last epoch that has not been committed to the sink, including its start and end offsets. It then uses the offsets of earlier epochs to reconstruct the application's in-memory state from the last epoch written to the state store. This just requires loading the old state and running those epochs with the same offsets while disabling output. Finally, the system reruns the last epoch and relies on the sink's idempotence to write its results, then starts defining new epochs.

Finally, all of the state management in this design is transparent to user code. Both the aggregation operators and custom stateful processing operators (e.g., mapGroupsWithState) automatically checkpoint state to the state store, without requiring custom code to do it. The user's data types only need to be serializable.

6.2       Microbatch Execution Mode

Structured Streaming jobs can execute in two modes: microbatching or continuous operators. The microbatch mode uses the discretized streams execution model from Spark Streaming [37], and inherits its benefits, such as dynamic load balancing, rescaling, straggler mitigation and fault recovery without whole-system rollback.

In this mode, epochs are typically set to be a few hundred milliseconds to a few seconds, and each epoch executes as a traditional Spark job composed of a DAG of independent tasks [36]. For example, a query doing selection followed by stateful aggregation might execute as a set of “map” tasks for the selection and “reduce” tasks for the aggregation, where the reduce tasks track state in memory on worker nodes and periodically checkpoint it to the state store. As in Spark Streaming, this mode provides the following benefits:

•  Dynamic load balancing: Each operator's work is divided into small, independent tasks that can be scheduled on any node, so the system can automatically balance these across nodes if some are executing slower than others.

•  Fine-grained fault recovery: If a node fails, only its tasks need to be rerun, instead of having to roll back the whole cluster to a checkpoint as in most systems based on topologies of long-lived operators. Moreover, the lost tasks can be rerun in parallel, further reducing recovery time [37].

•  Straggler mitigation: Spark will launch backup copies of slow tasks as it does in batch jobs, and downstream tasks will simply use the output from whichever copy finishes first.

•  Rescaling: Adding or removing a node is simple, as tasks will automatically be scheduled on all the available nodes.

•  Scale and throughput: Because this mode reuses Spark's batch execution engine, it inherits all the optimizations in this engine, such as a high-performance shuffle implementation [34] and the ability to run on thousands of nodes.

The main disadvantage of this mode is a higher minimum latency, as there is overhead to launching a DAG of tasks in Spark. In practice, however, latencies of a few seconds are achievable even on large clusters running multi-step computations. Depending on the application, these are on a similar time scale to data collection and alerting systems.

6.3       Continuous Processing Mode

A new continuous processing mode added in Apache Spark 2.3 [6] executes Structured Streaming jobs using long-lived operators, as in traditional streaming systems such as Telegraph and Borealis [1, 13]. This mode enables lower latency at the cost of less operational flexibility (e.g., limited support for rescaling the job at runtime).

The key enabler for this execution mode was choosing a declarative API for Structured Streaming that is not tied to the execution strategy. For example, the original Spark Streaming API had some operators based on processing time that leaked the concept of microbatches into the programming model, making it hard to move programs to another type of engine. In contrast, Structured Streaming's API and semantics are independent of the execution engine: continuous execution is similar to having a much larger number of triggers. Note that unlike systems based purely on unsynchronized message passing, such as Storm [32], we do retain the concept of triggers and epochs in this mode, so the output from multiple nodes can be coordinated and committed together to the sink.

Because the API supports fine-grained execution, Structured Streaming jobs could theoretically run on any existing distributed streaming engine design [1, 13, 17]. In continuous processing, we built a simple continuous operator engine that lives inside Spark and can reuse Spark's scheduling infrastructure and per-node operators (e.g., code-generated operators). The first version released in Spark 2.3.0 only supports “map-like” jobs (i.e., no shuffle operations), which were one of the most common scenarios where users wanted lower latency, but the design can be extended to support shuffles.

Compared to microbatch execution, there are two differences when using continuous processing:

(1)  The master launches long-running tasks on each partition using Spark's scheduler that each read one partition of the input source (e.g., a Kinesis stream) but execute multiple epochs. If one of these tasks fails, Spark will simply relaunch it.

(2)  Epochs are coordinated differently. The master periodically tells nodes to start a new epoch, and receives a start offset for the epoch on each input partition, which it inserts into the write-ahead log. When it asks them to start the next epoch, it also receives end offsets for the previous one, writes these to the log, and tells nodes to commit the epoch when it has written all the end offsets. Thus, the master is not on the critical path for inspecting all the input sources and defining start/end offsets.

We found that the most common use case where organizations wanted low latency and the scale of a distributed processing engine was “stream to stream” map operations to transform data before it is used in other streaming applications. For example, an organization might upload events to Kafka, run some simple ETL transformations as a streaming job, and write the transformed data to Kafka again for consumption by other streaming applications. In this type of design, each transformation job will add latency to all downstream steps, so organizations wish to minimize this latency.

7      Operational Features

We used several properties of our execution strategy and API to design a number of operational features in Structured Streaming that tackle common problems in deployments. Perhaps most importantly across these features, we aimed to make both Structured Streaming's semantics and its fault tolerance model easy to understand. With a simple design, operators can form an accurate model of how the system runs and what various actions will do to it.

7.1     Code Updates

Developers can update User-Defined Functions (UDFs) in their program and simply restart the application to use the new version of the code. For example, if a UDF is crashing on a particular input record, that epoch of processing will fail, so the developer can update the code and restart the application again to continue processing. This also applies to stateful operator UDFs, which can be updated as long as they retain the same schema for their state objects. We also designed Spark's log and state store formats to be binary compatible across Spark framework updates.

7.2      Manual Rollback

Sometimes, an application outputs wrong results for some time before a user notices: for example, a field that fails to parse might simply be reported as NULL. Therefore, rollbacks are a fact of life for many operators. In Structured Streaming, it is easy to determine which records went into each epoch from the write-ahead log and roll back the application to the epoch where a problem started occurring. We chose to store the write-ahead log as JSON to let administrators perform these operations manually.[3] As long as the input sources and state store still have data from the failed epoch, the job can start again from a previous point. Message buses like Kafka are typically configured for several weeks of retention, so rollbacks are often possible.

Manual rollbacks interact well with Structured Streaming's prefix consistency guarantee for execution semantics (§4.2). Specifically, when an administrator rolls back the job to a point in the write-ahead log, she knows which prefix of the input streams this point corresponds to, and the job can recompute output from that point on while retaining consistency within the new output. Beyond this guarantee, Structured Streaming's support for running the same code as a batch job and for rescaling means that administrators can run the recovery on a temporarily larger cluster to catch up quickly, further reducing the operational complexity of manual rollbacks.

7.3        Hybrid Batch and Streaming Execution

The most obvious benefit of Structured Streaming's unified API is that users can share code between batch and streaming jobs, or run the same program as a batch job for testing. However, we have also found this useful for purely streaming scenarios in two ways:

•  “Run-once” triggers for cost savings: Many Databricks customers wanted the transactionality and state management properties of a streaming engine without running servers 24/7. Virtually all ETL workloads require tracking how far in the input one has gotten and which results have been saved reliably, which can be difficult to implement by hand. These functions are exactly what Structured Streaming's state management provides. Thus, several customers implemented ETL jobs by running a single epoch of a Structured Streaming job every few hours as a batch computation, using the provided “run once” trigger that was originally designed for testing (a short sketch follows this list). This leads to significant cost savings (in one case, up to 10× [35]) for lower-volume applications. With all the major cloud providers now supporting per-second or per-minute billing [9], we believe this type of “discontinuous processing” will become more common.

•  Adaptive batching: Even streaming applications occasionally experience large backlogs. For example, a link between two datacenters might go down, temporarily delaying data transfer, or there might simply be a spike in user activity. In these cases, Structured Streaming will automatically execute longer epochs in order to catch up with the input streams, often achieving similar throughput to Spark's batch jobs. This will not greatly increase latency, given that data is already backlogged, but will let the system catch up faster. In cloud environments, operators can also add extra nodes to the cluster temporarily.
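A minimal sketch of the run-once pattern, using the query from Section 4.1; an external scheduler relaunches this program every few hours, and the paths are illustrative:

import org.apache.spark.sql.streaming.Trigger

// Process whatever input has arrived since the last run as a single epoch,
// record progress in the checkpoint, and stop.
counts.writeStream
  .format("parquet")
  .outputMode("complete")
  .trigger(Trigger.Once())
  .option("checkpointLocation", "/checkpoints/counts")
  .start("/counts")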

7.4     Monitoring

Structured Streaming uses Spark's existing metrics API and structured event log to report information such as the number of records processed, bytes shuffled across the network, etc. These interfaces are familiar to operators and easy to connect to a variety of UI tools.

7.5     Fault and Straggler Recovery

As discussed in §6.2, Structured Streaming's microbatch mode can recover from node failures, stragglers and load imbalances using Spark's fine-grained task execution model. The continuous processing mode recovers from node failures, but does not yet protect against stragglers or load imbalance.

Figure 5: Information security platform use case. Using Structured Streaming and Spark SQL, a team of analysts can query both streaming and historical data and easily install queries for new attack patterns as streaming alerts.

8      Production Use Cases

We have supported Structured Streaming on Databricks' managed cloud service [16] since 2016, and today, our cloud is running hundreds of production streaming applications at a given time (i.e., applications running 24/7). The largest of these applications ingest over 1 PB of data per month and run on hundreds of servers. We also use Structured Streaming internally to monitor our services, including the execution of Structured Streaming itself. In this section, we describe three customer workloads that leverage various aspects of Structured Streaming, as well as our internal use case.

8.1        Information Security Platform

A large customer has used Structured Streaming to develop a large-scale security platform to enable over 100 analysts to scour through network traffic logs to quickly identify and respond to security incidents, as well as to generate automated alerts. This platform combines streaming with batch and interactive queries and is thus a great example of the system's support for end-to-end applications.

Figure 5 shows the architecture of the platform. Intrusion Detection Systems (IDSes) monitor all the network traffic in the organization and output logs to S3. From here, a Structured Streaming job ETLs these logs into a compact Apache Parquet based table stored on Databricks Delta [7] to enable fast and concurrent access from multiple downstream applications. Other Structured Streaming jobs then process these logs to produce additional tables (e.g., by joining them with other data). Analysts query these tables interactively, using SQL or DataFrames, to detect and diagnose new attack patterns. If they identify a compromise, they also look back through historical data to trace previous actions from that attacker. Finally, in parallel, the Parquet logs are processed by another Structured Streaming cluster that generates real-time alerts based on pre-written rules.

The key challenges in realizing this platform are (1) building a robust and scalable streaming pipeline, while (2) providing the analysts with an effective environment to query both fresh and historical data. Using standard tools and services available on AWS, a team of 20 people took over six months to build and deploy a previous version of this platform in production. This previous version had several limitations, including only being able to store a small amount of data for historical queries because it used a traditional data warehouse for the interactive queries. In contrast, a team of five engineers was able to reimplement the platform using Structured Streaming in two weeks. The new platform was simultaneously more scalable and able to support more complex analysis using Spark’s ML APIs. Next, we provide a few examples to illustrate the advantages of Structured Streaming that made this possible.

First, Structured Streaming’s ability to adaptively vary the batch size enabled the developers to build a streaming pipeline that deals not only with spikes in the workload, but also with failures and code upgrades. Consider a streaming job that goes offline either due to failure or upgrades. When the cluster is brought back online, it automatically starts processing the data from the moment it went offline. Initially, the cluster uses large batches to maximize throughput. Once it catches up, it switches to small batches for low latency. This allows administrators to regularly upgrade clusters without the fear of excessive downtime.

Second, the ability to join a stream with other streams, as well as with historical tables, has considerably simplified the analysis. Consider the simple task of figuring out which device a TCP connection originates from. It turns out that this task is challenging in the presence of mobile devices, as these devices are given dynamic IP addresses every time they join the network. Hence, from TCP logs alone, it is not possible to track down the end-points of a connection. With Structured Streaming, an analyst can easily solve this problem: she can simply join the TCP logs with DHCP logs to map the IP address to the MAC address, and then use the organization’s internal database of network devices to map the MAC address to a particular machine and user. In addition, users were able to do this join in real time using stateful operators as both the TCP and DHCP logs were being streamed in.
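A minimal sketch of this pattern is shown below; the paths, column names, watermark intervals, and time-range condition are illustrative, since the real log schemas are specific to the deployment. It joins the two live streams on IP address within a bounded time range, then joins the result against a static device table.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder.appName("ConnectionAttribution").getOrCreate()

// Streaming sources and the static device table (illustrative paths/columns).
val tcpLogs  = spark.readStream.format("delta").load("/logs/tcp")   // srcIp, tcpTime, ...
val dhcpLogs = spark.readStream.format("delta").load("/logs/dhcp")  // leaseIp, macAddress, dhcpTime
val devices  = spark.read.table("network_devices")                  // macAddress, hostname, owner

// Watermarks bound the state that each side of the stream-stream join must keep.
val tcp  = tcpLogs.withWatermark("tcpTime", "10 minutes")
val dhcp = dhcpLogs.withWatermark("dhcpTime", "30 minutes")

// Stream-stream join: match each connection to the DHCP lease that assigned its
// source IP, constrained to a time range so old state can be dropped.
val withMac = tcp.join(dhcp, expr(
  "srcIp = leaseIp AND tcpTime BETWEEN dhcpTime AND dhcpTime + interval 1 hour"))

// Stream-static join: resolve the MAC address to a concrete device and user.
val attributed = withMac.join(devices, Seq("macAddress"))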

Finally, using the same system for streaming, interactive queries and ETL has provided developers with the ability to quickly iterate and deploy new alerts. In particular, it enables analysts to build and test queries for detecting new attacks on offline data, and then deploy these queries directly on the alerting cluster. In one example, an analyst developed a query to identify exfiltration attacks via DNS. In this attack, malware leaks confidential information from the compromised host by piggybacking this information into DNS requests sent to an external DNS server owned by the attacker. One simplified query to detect such an attack essentially computes the aggregate size of the DNS requests sent by every host over a time interval. If the aggregate is greater than a given threshold, the query flags the corresponding host as potentially being compromised. The analyst used historical data to set this threshold, so as to achieve the desired balance between false positive and false negative rates. Once satisfied with the result, the analyst simply pushed the query to the alerting cluster. The ability to use the same system and the same API for data analysis and for implementing the alerts led not only to significant engineering cost savings, but also to better security, as it is significantly easier to deploy new rules.
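A hedged sketch of such a detection query appears below; the column names, window length, and threshold are illustrative. Because the same DataFrame code also runs in batch mode, the analyst can tune the threshold on historical data and then start the identical query as a streaming alert.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, sum, col}

val spark = SparkSession.builder.appName("DnsExfilDetector").getOrCreate()

// dnsLogs has columns such as eventTime, srcHost and requestBytes (illustrative).
val dnsLogs = spark.readStream.format("delta").load("/logs/dns")

val suspiciousHosts = dnsLogs
  .withWatermark("eventTime", "1 hour")
  .groupBy(window(col("eventTime"), "10 minutes"), col("srcHost"))
  .agg(sum(col("requestBytes")).as("totalBytes"))
  .where(col("totalBytes") > 10 * 1024 * 1024)   // threshold tuned on historical data

// Each output row flags one (window, host) pair whose DNS traffic exceeded the threshold.
suspiciousHosts.writeStream
  .format("console")       // illustrative sink; the alerting cluster would notify analysts
  .outputMode("update")
  .start()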

8.2        Monitoring Live Video Delivery

A large media company is using Structured Streaming to compute quality metrics for their live video traffic and interactively identify delivery problems. Live video delivery is especially challenging because network problems can severely disrupt utility. For prerecorded video, clients can use large buffers to mask issues, and a degradation at most results in extra buffering time; but for live video, a problem may mean missing a critical moment in a sports match or similar event. This organization collects video quality metrics from clients in real time, performs ETL operations and aggregation using Structured Streaming, then stores the results in a data warehouse. This allows operations engineers to interactively query fresh data to detect and diagnose quality issues (e.g., determine whether an issue is tied to a specific ISP, video server or other cause).

8.3       Analyzing Game Performance

A large gaming company uses Structured Streaming to monitor the latency experienced by players in a popular online game with tens of millions of monthly active users. As in the video use case, high network performance is essential for the user experience when gaming, and repeated problems can quickly lead to player churn. This organization collects latency logs from its game clients to cloud storage and then performs a variety of streaming analyses.

For example, one job joins the measurements with a table of Internet Autonomous Systems (ASes) and then aggregates the performance by AS over time to identify poorly performing ASes. When such an AS is identified, the streaming job triggers an alert, and IT staff can contact the AS in question to remediate the issue.

8.4        Cloud Monitoring at Databricks

At Databricks, we have been using Apache Spark since the start of the company to monitor our own cloud service, understand workload statistics, trigger alerts, and let our engineers interactively debug issues. The monitoring pipeline produces dozens of interactive dashboards as well as structured Parquet tables for ad-hoc SQL queries. These dashboards also play a key role for business users to understand which customers have increasing or decreasing usage, prioritize feature development, and proactively identify customers that are experiencing problems.

We built at least three versions of a monitoring pipeline using a combination of batch and streaming APIs starting four years ago, and in all cases, we found that the major challenges were operational. Despite our best efforts, pipelines could be brittle, experiencing frequent failures when aspects of our input data changed (e.g., new schemas or reading from more locations than before), and upgrading them was a daunting exercise. Worse yet, failures and upgrades often resulted in missing data, so we had to manually go back and re-run jobs to reconstruct it. Testing pipelines was also challenging due to their reliance on multiple distinct Spark jobs and storage systems. Our experience with Structured Streaming shows that it successfully addresses many of these challenges. Not only were we able to reimplement our pipelines in weeks, but the management overhead decreased drastically. Restartability coupled with adaptive batching, transactional sources/sinks and well-defined consistency semantics has enabled simpler fault recovery, upgrades, and rollbacks to repair old results. Moreover, we can test the same code in batch mode on data samples or use many of the same functions in interactive queries.

Figure 6: Throughput results on the Yahoo! benchmark. (a) vs. other systems; (b) system scaling.

Our pipelines with Structured Streaming also regularly combine its batch and streaming capabilities. For example, the pipeline to monitor streaming jobs starts with an ETL job that reads JSON events from Kafka and writes them to a columnar Parquet table in S3. Dozens of other batch and streaming jobs then query this table to produce dashboards and other reports. Because Parquet is a compact and column-oriented format, this architecture consumes drastically fewer resources than having every job read directly from Kafka, and it simultaneously places less load on the Kafka brokers. Overall, streaming jobs’ latencies range from seconds to minutes, and users can also query the Parquet table interactively in seconds.
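As a rough sketch of the first stage of this pipeline (brokers, topic, schema, and paths are illustrative), the Kafka-to-Parquet ETL job can be written as follows.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, col}
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("KafkaToParquet").getOrCreate()

// Illustrative schema for the JSON monitoring events.
val schema = new StructType()
  .add("timestamp", TimestampType)
  .add("service", StringType)
  .add("metric", StringType)
  .add("value", DoubleType)

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // illustrative brokers
  .option("subscribe", "monitoring-events")           // illustrative topic
  .load()
  .select(from_json(col("value").cast("string"), schema).as("e"))
  .select("e.*")

// Continuously append the parsed events to a columnar Parquet table on S3.
events.writeStream
  .format("parquet")
  .option("path", "s3://monitoring/events/")          // illustrative path
  .option("checkpointLocation", "s3://monitoring/ckpt/")
  .start()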

9      Performance Evaluation

In this section, we measure the performance of Structured Streaming using controlled benchmarks. We study performance vs. other systems on the Yahoo! Streaming Benchmark [14], scalability, and the throughput-latency tradeoff with continuous processing.

9.1       Performance vs. Other Streaming Systems

To evaluate performance compared to other streaming engines, we used the Yahoo! Streaming Benchmark [14], a widely used workload that has also been evaluated in other open source systems. This benchmark requires systems to read ad click events, join them against a static table of ad campaigns by campaign ID, and output counts by campaign on 10-second event-time windows.
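In DataFrame form, the core of this benchmark query looks roughly like the sketch below (event parsing is simplified and column names are illustrative; this is not necessarily the exact code used in our runs). Notably, everything is expressed with built-in relational operators.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{window, col}

val spark = SparkSession.builder.appName("YahooBenchmarkSketch").getOrCreate()

// Static ad-campaign table; in our setup it replaces the original Redis store.
val campaigns = spark.read.parquet("/benchmark/campaigns")   // adId, campaignId

// Streaming click events from Kafka, parsed into adId and an event-time column.
val clicks = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "ad-events")
  .load()
  .selectExpr("CAST(value AS STRING) AS adId", "timestamp AS eventTime")  // simplified parsing

// Join against the static table and count per campaign on 10-second event-time windows.
val counts = clicks
  .withWatermark("eventTime", "10 seconds")
  .join(campaigns, "adId")
  .groupBy(window(col("eventTime"), "10 seconds"), col("campaignId"))
  .count()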

We compared Kafka Streams 0.10.2, Apache Flink 1.2.1 and Spark 2.3.0 on a cluster with five c3.2xlarge Amazon EC2 workers (each with 8 virtual cores and 15 GB RAM) and one master. For Flink, we used the optimized version of the benchmark published by dataArtisans for a similar cluster [22]. As in that benchmark, the systems read data from a Kafka cluster running on the workers with 40 partitions (one per core), and write results to Kafka. The original Yahoo! benchmark used Redis to hold the static table for joining ad campaigns, but we found that Redis could be a bottleneck, so we replaced it with a table in each system (a KTable in Kafka Streams, a DataFrame in Spark, and an in-memory hash map in Flink).

Figure 6a shows each system’s maximum stable throughput, i.e., the throughput it can process before a backlog begins to form. We see that streaming system performance can vary significantly. Kafka Streams implements a simple message-passing model through the Kafka message bus, but only attains 700,000 records/second on our 40-core cluster. Apache Flink reaches 33 million records/s. Finally, Structured Streaming reaches 65 million records/s, nearly 2× the throughput of Flink. This particular Structured Streaming query is implemented using just DataFrame operations with no UDF code. The performance thus comes solely from Spark SQL’s built-in execution optimizations, including storing data in a compact binary format and runtime code generation. As pointed out by the authors of Trill [12] and others, execution optimizations can make a large difference in streaming workloads, and many systems based on per-record operations do not maximize performance.

Figure 7: Latency of continuous processing vs. input rate (records/s). Dashed line shows max throughput in microbatch mode.

9.2     Scalability

Figure 6b shows how Structured Streaming’s performance scales for the Yahoo! benchmark as we vary the size of our cluster. We used 1, 5, 10 and 20 c3.2xlarge Amazon EC2 workers (with 8 virtual cores and 15 GB RAM each) and the same experimental setup as in §9.1, including one Kafka partition per core. We see that throughput scales close to linearly, from 11.5 million records/s on 1 node to 225 million records/s on 20 nodes (i.e., 160 cores).

9.3       Continuous Processing

We benchmarked Structured Streaming’s continuous processing mode on a 4-core server to show the latency-throughput tradeoffs it can achieve. (Because partitions run independently in this mode, we expect the latency to stay the same as more nodes are added.) Figure 7 shows the results for a map job reading from Kafka, with the dashed line showing the maximum throughput achievable by microbatch mode. We see that continuous mode is able to achieve much lower latency without a large drop in throughput (e.g., less than 10 ms latency at half the maximum throughput of microbatching). Its maximum stable throughput is also slightly higher because microbatch mode incurs latency due to task scheduling.
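Switching a query between microbatch and continuous execution only requires changing its trigger. A minimal map-only sketch (brokers, topics, and checkpoint path are illustrative) is shown below; Trigger.Continuous takes a checkpoint interval, not a batch interval.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("ContinuousMap").getOrCreate()

val in = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "in-topic")
  .load()

// A record-at-a-time transformation that continuous mode can run with millisecond latency.
in.selectExpr("key", "UPPER(CAST(value AS STRING)) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "out-topic")
  .option("checkpointLocation", "/tmp/ckpt-continuous")
  .trigger(Trigger.Continuous("1 second"))
  .start()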

10      Related Work

Structured Streaming builds on many existing systems for stream processing and big data analytics, including Spark SQL’s DataFrame API [8], Spark Streaming [37], Dataflow [2], incremental query systems [11, 24, 29, 38] and distributed stream processing [21]. At a high level, the main contributions of this work are:

•  An account of real-world user challenges with streaming systems, including operational challenges that are not always discussed in the research literature (§2).

•  A simple, declarative programming model that incrementalizes a widely used batch API (Spark DataFrames/SQL) to provide similar capabilities to Dataflow [2] and other streaming systems.

•  An execution engine providing high throughput, fault tolerance, and rich operational features that combines with the rest of Apache Spark to let users easily build end-to-end applications.

From an API standpoint, the closest work is incremental query systems [11, 24, 29, 38], including recent distributed systems such as Stateful Bulk Processing [25] and Naiad [26]. Structured Streaming’s API is an extension of Spark SQL [8], including its declarative DataFrame interface for programmatic construction of relational queries. Apache Flink also recently added a table API (currently in beta) for defining relational queries that can map to either streaming or batch execution [19], but this API lacks some of the features of Structured Streaming, such as custom stateful operators (§4.3.2).

Other recent streaming systems have language-integrated APIs that operate at a lower, more “imperative" level. In particular, Spark Streaming [37], Google Dataflow [2] and Flink’s DataStream API [18] provide various functional operators but require users to choose the right DAG of operators to implement a particular incrementalization strategy (e.g., when to pass on deltas versus complete results); essentially, these are equivalent to writing a physical execution plan. Structured Streaming’s API is simpler for users who are not experts on incrementalization. Structured Streaming adopts the definitions of event time, processing time, watermarks and triggers from Dataflow but incorporates them in an incremental model.

For execution, Structured Streaming uses concepts similar to discretized streams for microbatch mode [37] and traditional streaming engines for continuous processing mode [1, 13, 21]. It also builds on an analytical engine for performance, like Trill [12]. The most unique contribution here is the integration of batch and streaming queries to enable sophisticated end-to-end applications. As described in §8, Structured Streaming users can easily write applications that combine batch, interactive and stream processing using the same code (e.g., security log analysis). In addition, they leverage powerful operational features such as run-once triggers (running a streaming application “discontinuously" as batch jobs to retain its transactional features but lower costs), code updates, and batch processing to handle backlogs or code rollbacks (§7).

11      Conclusion

Stream processing is a powerful tool, but streaming systems are still difficult to use, operate and integrate into larger applications. We designed Structured Streaming to simplify all three of these tasks while integrating with the rest of Apache Spark. Unlike many other open source streaming engines, Structured Streaming purposefully adopts a very high-level API: incrementalizing an existing Spark SQL or DataFrame query. This makes it accessible to a wide range of users. Although Structured Streaming’s API is more declarative and constrained, we found that it works well for a diverse range of applications, including those that require custom logic for stateful processing. Beyond this focus on a high-level API, Structured Streaming also includes several powerful operational features and achieves high performance using the Spark SQL engine. Experience across hundreds of customer use cases shows that users can leverage the system to build sophisticated business applications.

12      Acknowledgements

We would like to thank the diverse Apache Spark developer community that has contributed to Structured Streaming, Spark Streaming and Spark SQL over the years. We also thank the SIGMOD reviewers for their detailed feedback on the paper.

References

[1] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag S. Maskey, Alexander Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stan Zdonik. 2005. The design of the Borealis stream processing engine. In CIDR. 277–289.

[2] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. 2015. The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-scale, Unbounded, Out-of-order Data Processing. Proc. VLDB Endow. 8, 12 (Aug. 2015), 1792–1803. https://doi.org/10.14778/2824032.2824076

[3] Intel Altera. 2017. Financial/HPC – Financial Offload. https://www.altera.com/solutions/industry/computer-and-storage/applications/computer/financial-offload.html. (2017).

[4] Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak. 2011. Consistency analysis in Bloom: A CALM and collected approach. In Proceedings 5th Biennial Conference on Innovative Data Systems Research. 249–260.

[5] Amazon. 2017. Amazon Kinesis. https://aws.amazon.com/kinesis/. (2017).

[6] Michael Armbrust. 2017. SPARK-20928: Continuous Processing Mode for Structured Streaming. https://issues.apache.org/jira/browse/SPARK-20928. (2017).

[7] Michael Armbrust, Bill Chambers, and Matei Zaharia. 2017. Databricks Delta: A Unified Data Management System for Real-time Big Data. https://databricks.com/blog/2017/10/25/databricks-delta-a-unified-management-system-for-real-time-big-data.html. (2017).

[8] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. 1383–1394. https://doi.org/10.1145/2723372.2742797

[9] Jeff Barr. 2017. New – Per-Second Billing for EC2 Instances and EBS Volumes. https://aws.amazon.com/blogs/aws/new-per-second-billing-for-ec2-instances-and-ebs-volumes/. (2017).

[10] Apache Beam. 2017. Apache Beam programming guide. https://beam.apache.org/documentation/programming-guide/. (2017).

[11] Jose A. Blakeley, Per-Ake Larson, and Frank Wm Tompa. 1986. Efficiently Updating Materialized Views. SIGMOD Rec. 15, 2 (June 1986), 61–71. https://doi.org/10.1145/16856.16861

[12] Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (Dec. 2014), 401–412. https://doi.org/10.14778/2735496.2735503

[13] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Fred Reiss, and Mehul A. Shah. 2003. TelegraphCQ: Continuous Dataflow Processing. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD ’03). ACM, New York, NY, USA, 668–668. https://doi.org/10.1145/872757.872857

[14] Sanket Chintapalli, Derek Dagit, Bobby Evans, Reza Farivar, Tom Graves, Mark Holderbaugh, Zhuo Liu, Kyle Nusbaum, Kishorkumar Patil, Boyang Peng, and Paul Poulosky. 2015. Benchmarking Streaming Computation Engines at Yahoo! https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at. (2015).

[15] Confluent. 2017. KSQL: Streaming SQL for Kafka. https://www.confluent.io/product/ksql/. (2017).

[16] Databricks. 2017. Databricks unified analytics platform. https://databricks.com/product/unified-analytics-platform. (2017).

[17] Apache Flink. 2017. Apache Flink. http://flink.apache.org. (2017).

[18] Apache Flink. 2017. Flink DataStream API Programming Guide. https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/datastream_api.html. (2017).

[19] Apache Flink. 2017. Flink Table & SQL API Beta. https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/table/index.html. (2017).

[20] Apache Flink. 2017. Working with State. https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/stream/state.html. (2017).

[21] Lukasz Golab and M. Tamer Özsu. 2003. Issues in Data Stream Management. SIGMOD Rec. 32, 2 (June 2003), 5–14. https://doi.org/10.1145/776985.776986

[22] Jamie Grier. 2016. Extending the Yahoo! Streaming Benchmark. https://data-artisans.com/blog/extending-the-yahoo-streaming-benchmark. (2016).

[23] Apache Kafka. 2017. Kafka. http://kafka.apache.org. (2017).

[24] Sailesh Krishnamurthy, Michael J. Franklin, Jeffrey Davis, Daniel Farina, Pasha Golovko, Alan Li, and Neil Thombre. 2010. Continuous Analytics over Discontinuous Streams. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD ’10). ACM, New York, NY, USA, 1081–1092. https://doi.org/10.1145/1807167.1807290

[25] Dionysios Logothetis, Christopher Olston, Benjamin Reed, Kevin C. Webb, and Ken Yocum. 2010. Stateful Bulk Processing for Incremental Analytics. In Proceedings of the 1st ACM Symposium on Cloud Computing (SoCC ’10). ACM, New York, NY, USA, 51–62. https://doi.org/10.1145/1807128.1807138

[26] Frank McSherry, Derek Murray, Rebecca Isaacs, and Michael Isard. 2013. Differential dataflow. In Proceedings of CIDR 2013. https://www.microsoft.com/en-us/research/publication/differential-dataflow/

[27] Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. 439–455. https://doi.org/10.1145/2517349.2522738

[28] Pandas. 2017. pandas Python data analysis library. http://pandas.pydata.org. (2017).

[29] X. Qian and Gio Wiederhold. 1991. Incremental Recomputation of Active Relational Expressions. IEEE Trans. on Knowl. and Data Eng. 3, 3 (Sept. 1991), 337–341. https://doi.org/10.1109/69.91063

[30] R. [n. d.]. R project for statistical computing. http://www.r-project.org. ([n. d.]).

[31] Apache Spark. 2017. Spark Documentation. http://spark.apache.org/docs/latest. (2017).

[32] Apache Storm. 2017. Apache Storm. http://storm.apache.org. (2017).

[33] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Ning Zhang, Suresh Anthony, Hao Liu, and Raghotham Murthy. 2010. Hive - a petabyte scale data warehouse using Hadoop. In ICDE, Feifei Li, Mirella M. Moro, Shahram Ghandeharizadeh, Jayant R. Haritsa, Gerhard Weikum, Michael J. Carey, Fabio Casati, Edward Y. Chang, Ioana Manolescu, Sharad Mehrotra, Umeshwar Dayal, and Vassilis J. Tsotras (Eds.). IEEE, 996–1005. http://infolab.stanford.edu/~ragho/hive-icde2010.pdf

[34] Reynold Xin et al. [n. d.]. GraySort on Apache Spark by Databricks. http://sortbenchmark.org/ApacheSpark2014.pdf. ([n. d.]).

[35] Burak Yavuz and Tyson Condie. 2017. Running Streaming Jobs Once a Day For 10x Cost Savings. https://databricks.com/blog/2017/05/22/running-streaming-jobs-day-10x-cost-savings.html. (2017).

[36] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient Distributed Datasets: A Fault-tolerant Abstraction for In-memory Cluster Computing. 15–28.

[37] Matei Zaharia, Tathagata Das, Haoyuan Li, Tim Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-Tolerant Streaming Computation at Scale. In SOSP.

[38] Yue Zhuge, Héctor García-Molina, Joachim Hammer, and Jennifer Widom. 1995. View Maintenance in a Warehousing Environment. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD ’95). ACM, New York, NY, USA, 316–327. https://doi.org/10.1145/223784.223848



[1] Spark SQL offers several slightly different APIs that map to the same query engine. The DataFrame API, modeled after data frames in R and Pandas [28, 30], offers a simple interface to build relational queries programmatically that is familiar to many users. The Dataset API adds static typing over DataFrames, similar to RDDs [36]. Alternatively, users can write SQL directly. All APIs produce a relational query plan.

[2] In Spark 2.3.0, we actually make one checkpoint per epoch, but we plan to make them less frequent in a future release, as is already done in Spark Streaming.

[3] Some sinks, such as Amazon S3, provide no way to atomically commit multiple writes from different writer nodes. In such cases, we have also created Spark data sources that add transactions over the underlying storage system. For example, Databricks Delta [7] offers a consistent view of S3 data for both streaming and batch queries, along with additional features such as index maintenance.

[4] One additional step they may have to do is remove faulty data from the output sink, depending on the sink chosen. For the file sink, for example, it is straightforward to find which files were written in a particular epoch and remove those.

你可能感兴趣的:(StructuredStreaming: A Declarative API for Real-Time)