Spark Summit Europe 2017 Session Abstracts (Streaming Category)

To download all videos and slides, follow the WeChat official account (bigdata_summit) and tap the "Video Download" (视频下载) menu.

Apache Spark Streaming + Kafka 0.10: An Integration Story

by Joan Viladrosa Riera, Billy Mobile
video, slide
Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and the Kafka side, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself. With it comes a new Spark Streaming integration, similar in design to the 0.8 Direct DStream approach but with notable differences in usage and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve different delivery semantics (at-least-once, at-most-once, exactly-once) with code examples. Finally, we will briefly introduce how Billy Mobile uses this integration to ingest and process the continuous stream of events from our AdNetwork. Session hashtag: #EUstr5
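
As a concrete illustration of the 0.10 direct-stream API (a minimal sketch, not the speakers' code; the broker address, topic and group id below are placeholders), an at-least-once consumer processes each batch first and commits offsets afterwards:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object Kafka010AtLeastOnce {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-0-10-demo"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // placeholder broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "demo-group",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)) // commit manually

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Capture offset ranges before any transformation breaks the mapping.
      val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.map(_.value).foreach(println)    // process the batch first ...
      // ... then commit, so a failure replays the batch: at-least-once.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```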

Apache Spark Streaming Programming Techniques You Should Know

by Gerard Maas, Lightbend
video, slide
At its heart, Spark Streaming is a scheduling framework, able to efficiently collect and deliver data to Spark for further processing. While the DStream abstraction provides high-level functions to process streams, several operations also grant us access to deeper levels of the API, where we can directly operate on RDDs, transform them to Datasets to make use of that abstraction, or store the data for later processing. Between these API layers lie many hooks that we can manipulate to enrich our Spark Streaming jobs. In this presentation we will demonstrate how to tap into the Spark Streaming scheduler to run arbitrary data workloads, show practical uses of the forgotten 'ConstantInputDStream', and explain how to combine Spark Streaming with probabilistic data structures to optimize memory use and improve the resource usage of long-running streaming jobs. Attendees of this session will come out with a richer toolbox of techniques to widen the use of Spark Streaming and improve the robustness of new or existing jobs. Session hashtag: #EUstr2
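
A minimal sketch of the ConstantInputDStream trick mentioned above (not the talk's code): the same RDD is replayed every batch interval, which turns the Spark Streaming scheduler into a periodic-job runner for arbitrary workloads.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream

object ConstantInputExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("constant-input-demo"), Seconds(10))

    // One dummy element; the scheduler re-offers this RDD on every interval.
    val tick = ssc.sparkContext.parallelize(Seq(1))
    val ticks = new ConstantInputDStream(ssc, tick)

    ticks.foreachRDD { (_, time) =>
      // Arbitrary work scheduled every 10 seconds,
      // e.g. polling an external store or refreshing a cache.
      println(s"batch fired at $time")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```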

Apache Spark Structured Streaming Helps Smart Manufacturing

by Xiaochang Wu, Intel
video
This presentation introduces how we designed and implemented a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. A traditional production line produces a variety of isolated structured, semi-structured and unstructured data, such as sensor data, machine screen output, log output, and database records. There are two main data scenarios: 1) picture and video data, low in frequency but large in volume; 2) continuous high-frequency data, small per record but very large in aggregate, such as the vibration data used to assess equipment quality. These data have the characteristics of streaming data: real-time, volatile, bursty, out-of-order and unbounded. Making effective real-time decisions to extract value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we were able to build a low-latency, high-throughput and reliable system covering data acquisition, transmission, analysis and storage. An actual user case proved that the system meets the needs of real-time decision-making. The system greatly improves the efficiency of predictive fault repair and production-line material tracking, and can reduce the production-line labor force by about half. Session hashtag: #EUstr1
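
A hedged sketch of the kind of pipeline described here (the Kafka topic, JSON schema and alert threshold below are invented for illustration): a Structured Streaming job that watermarks high-frequency vibration readings and flags machines whose windowed average amplitude exceeds a threshold.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

object VibrationMonitor {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("vibration-monitor").getOrCreate()
    import spark.implicits._

    val schema = new StructType()            // illustrative sensor schema
      .add("machineId", "string")
      .add("amplitude", "double")
      .add("ts", "timestamp")

    val readings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "vibration")      // placeholder topic
      .load()
      .select(from_json($"value".cast("string"), schema).as("r"))
      .select("r.*")

    // Windowed average per machine; the watermark bounds out-of-order data.
    val alerts = readings
      .withWatermark("ts", "1 minute")
      .groupBy(window($"ts", "30 seconds"), $"machineId")
      .agg(avg($"amplitude").as("avgAmp"))
      .where($"avgAmp" > 0.8)                // illustrative fault threshold

    alerts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```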

Building a Business Logic Translation Engine with Spark Streaming for Communicating Between Legacy Code and Microservices

by Patrick Bamba, Attestation Légale
video, slide
Attestation Légale is a social networking service for companies that alleviates the administrative burden European countries impose on client-supplier relationships. It helps companies from the construction, staffing and transport industries digitalize, secure and share their legal documents. With clients ranging from one-person businesses to industry leaders such as Orange or Bouygues Construction, it eases business relationships for a social network of companies that would be equivalent to a 34-billion-dollar industry. While providing a high quality of service through our SaaS platform, we faced many challenges, including refactoring our monolith into microservices, a daunting architectural task many organizations face today. Strategies for tackling that problem primarily revolve around extracting business logic from the monolith, or building new applications with their own logic that interfaces with the legacy. Sometimes, however, especially in companies sustaining significant growth, new business opportunities arise and the logic required by your microservices may differ greatly from the legacy's. We will discuss how we used Spark Streaming and Kafka to build a real-time business logic translation engine that allows loose technical and business coupling between our microservices and legacy code. You will also hear how making Apache Spark part of our consumer-facing product came with technical challenges, especially when it comes to reliability. Finally, we will share the lambda architecture that allowed us to move data in batch (migrating data from the monolith for initialization) and in real time (handling data generated afterwards through use). Key takeaways include: – breaking down this strategy and its derived technical and business benefits – feedback on how we achieved reliability – examples of implementations using RabbitMQ (then Kafka) and GraphX – testing business rules and data transformations. Session hashtag: #EUstr6
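
The heart of such an engine is a pure, testable mapping between the legacy representation and the one each microservice expects. A minimal sketch over a Kafka-fed DStream (the message shapes and rules below are invented; the talk does not publish its code):

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object TranslationEngine {
  // Invented legacy and target representations, for illustration only.
  case class LegacyDoc(id: String, status: Int)
  case class MicroserviceDoc(documentId: String, state: String)

  // The "business logic translation": a pure function, easy to unit-test.
  def translate(l: LegacyDoc): MicroserviceDoc =
    MicroserviceDoc(l.id, if (l.status == 1) "VALID" else "EXPIRED")

  // Stubs standing in for real JSON parsing and Kafka producing.
  def parse(payload: String): LegacyDoc = LegacyDoc(payload, 1)
  def publish(doc: MicroserviceDoc): Unit = println(doc)

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("translation-engine"), Seconds(2))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "translator")

    val legacyEvents = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("legacy-events"), kafkaParams))

    legacyEvents
      .map(r => parse(r.value))       // decode the legacy payload
      .map(translate)                 // apply the translation rules
      .foreachRDD(_.foreach(publish)) // emit to the microservice side

    ssc.start()
    ssc.awaitTermination()
  }
}
```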

Deep Dive into Stateful Stream Processing in Structured Streaming

by Tathagata Das, Databricks
video, slide
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for developers to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood that make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming. In particular, I am going to discuss the following: – different stateful operations in Structured Streaming – how state data is stored in a distributed, fault-tolerant manner using State Stores – how you can write custom State Stores for saving state to external storage systems. Session hashtag: #EUstr7
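
A minimal sketch of the explicit route, mapGroupsWithState (the event and state types are invented for illustration): the running count per user lives in the fault-tolerant State Store between micro-batches.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

object StatefulCounts {
  case class Event(user: String, action: String)
  case class UserCount(user: String, count: Long)

  // Called once per user per trigger; the Long lives in the State Store.
  def updateCount(user: String, events: Iterator[Event],
                  state: GroupState[Long]): UserCount = {
    val total = state.getOption.getOrElse(0L) + events.size
    state.update(total)
    UserCount(user, total)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stateful-demo").getOrCreate()
    import spark.implicits._

    val events = spark.readStream
      .format("socket").option("host", "localhost").option("port", 9999)
      .load().as[String]
      .map { line => val f = line.split(","); Event(f(0), f(1)) }

    val counts = events
      .groupByKey(_.user)
      .mapGroupsWithState[Long, UserCount](GroupStateTimeout.NoTimeout)(updateCount)

    counts.writeStream.outputMode("update").format("console")
      .start().awaitTermination()
  }
}
```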

Fast Data with Apache Ignite and Apache Spark

by Christos Erotocritou, GridGain
video, slide
Spark and Ignite are two of the most popular open source projects in the area of high-performance Big Data and Fast Data. But did you know that one of the best ways to boost performance for your next-generation real-time applications is to use them together? In this session, Christos Erotocritou, lead GridGain solutions architect, will explain in detail how IgniteRDD, an implementation of the native Spark RDD and DataFrame APIs, shares the state of the RDD across other Spark jobs, applications and workers. Christos will also demonstrate how IgniteRDD, with its advanced in-memory indexing capabilities, allows execution of SQL queries many times faster than native Spark RDDs or DataFrames. Furthermore, we will discuss the newest feature additions and what the future holds for this integration. Session hashtag: #EUstr10
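
A minimal sketch of state sharing through IgniteRDD, assuming the ignite-spark module is on the classpath (the cache name and data are invented):

```scala
import org.apache.ignite.configuration.{CacheConfiguration, IgniteConfiguration}
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object IgniteShareDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ignite-share"))

    // Starts embedded Ignite nodes alongside the Spark executors.
    val ic = new IgniteContext(sc, () => new IgniteConfiguration())

    // Indexed types make the cache queryable with SQL.
    val cacheCfg = new CacheConfiguration[Integer, Integer]("sharedNumbers")
    cacheCfg.setIndexedTypes(classOf[Integer], classOf[Integer])

    // An IgniteRDD is a live view over an Ignite cache: pairs saved here
    // remain visible to other Spark jobs and applications using the cache.
    val shared = ic.fromCache(cacheCfg)
    shared.savePairs(
      sc.parallelize(1 to 10000)
        .map(i => (Integer.valueOf(i), Integer.valueOf(i * i))))

    // SQL runs against Ignite's in-memory indexes, not a full RDD scan.
    shared.sql("select _val from Integer where _val > 50000000").show()
  }
}
```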

Monitoring Structured Streaming Applications Using Web UI

by Jacek Laskowski
video
Spark Structured Streaming in Apache Spark 2.2 comes with quite a few unique Catalyst operators, most notably the stateful streaming operators, and three different output modes. Understanding how Structured Streaming manages intermediate state between triggers, and how that affects performance, is paramount. After all, you use Apache Spark to process huge amounts of data, which alone can be tricky to get right, and Structured Streaming adds a further factor: given a stateful structured query, state management can make the data even bigger. This deep-dive talk is going to show you what is included in execution diagrams, logical and physical plans, and the metrics on the SQL tab's Details for Query page. The talk is going to answer the following questions: * What do the blue boxes on the SQL tab's Details for Query page represent?
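
To have something to look at in the web UI while following along, a small stateful query such as this sketch (illustrative only, using the built-in rate source added in Spark 2.2) produces the stateful operators and plans the talk dissects on the SQL tab:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WebUiDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("webui-demo").getOrCreate()
    import spark.implicits._

    // The rate source continuously emits (timestamp, value) rows.
    val rate = spark.readStream
      .format("rate").option("rowsPerSecond", "10").load()

    // A streaming aggregation: its stateful physical operators
    // (e.g. StateStoreRestore/StateStoreSave) appear in the plan shown
    // on the SQL tab's Details for Query page.
    val counts = rate.groupBy(window($"timestamp", "10 seconds")).count()

    counts.writeStream.outputMode("complete").format("console").start()
    spark.streams.awaitAnyTermination()
  }
}
```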

Productionizing Behavioural Features for Machine Learning with Apache Spark Streaming

by Ben Teeuwen, Booking.com
video, slide
We are using Spark Streaming to build online machine learning (ML) features that are used at Booking.com for real-time prediction of the behaviour and preferences of our users and of demand for hotels, and to improve customer-support processes. Our initial goals were to speed up experimentation with real-time features, make features reusable by data scientists (DS) within the company, and reduce the training/serving data skew problem. The tooling we have built and integrated into the company's infrastructure simplifies development of new features to the point that online feature collection can be implemented and deployed into production by DS with very little or no help from developers. That makes this approach scalable and allows us to iterate fast. We use Kafka as a streaming source of real-time events from the website as well as other sources, and with connectivity to Cassandra and Hive we were able to make data more consistent between the training and serving phases of ML pipelines. Our key takeaways: – It is possible to design production pipelines in a way that allows DS to build and deploy them without the help of a developer. – Constructing online features is a much more complex job than offline construction, and business-wise it is not always a priority to invest in their construction even when they are proven to benefit model performance. We plan to invest further in pipelines with Spark Streaming and are happy to see that support for streaming operations in Spark is evolving in the right direction. Session hashtag: #EUstr4
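
A hedged sketch of one such online feature (the event format, window and topic are invented for illustration): a sliding 30-minute page-view count per user, computed with an inverse reduce so each slide only touches the data entering and leaving the window. In production the result would be written to a low-latency store such as Cassandra so serving reads the same values as training.

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object OnlineFeaturePipeline {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("online-features"), Seconds(10))
    ssc.checkpoint("/tmp/feature-checkpoint") // required for windowed state

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "feature-builder")

    // Illustrative payload: "userId,eventType"
    val events = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent,
      Subscribe[String, String](Seq("clicks"), kafkaParams))

    val pageViews30m = events
      .map(_.value.split(","))
      .filter(_(1) == "pageview")
      .map(f => (f(0), 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(30), Seconds(10))

    pageViews30m.foreachRDD(_.foreach { case (user, n) =>
      println(s"feature pageviews_30m($user) = $n") // stand-in for a DB write
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
```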

Story Deduplication and Mutation

by Andrew Morgan, ByteSumo Ltd
video, slide
We demonstrate how to use Spark Streaming to build a global News Scanner that scrapes news in near real time and uses sophisticated text analysis and SimHash. Session hashtag: #EUstr9
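
For readers unfamiliar with SimHash, here is a minimal, self-contained 64-bit sketch (illustrative only; the talk's actual pipeline is not published). Near-duplicate stories yield fingerprints that typically differ in only a few bits, so deduplication reduces to a small-Hamming-distance search.

```scala
object SimHashDemo {
  // 64-bit SimHash over whitespace-separated tokens.
  def simhash(text: String): Long = {
    val v = new Array[Int](64)
    for (token <- text.toLowerCase.split("\\s+") if token.nonEmpty) {
      // Build a 64-bit token hash from two seeded 32-bit murmur hashes.
      val lo = scala.util.hashing.MurmurHash3.stringHash(token, 0).toLong & 0xffffffffL
      val hi = scala.util.hashing.MurmurHash3.stringHash(token, 1).toLong & 0xffffffffL
      val h = (hi << 32) | lo
      // Each bit votes +1/-1; frequent tokens dominate the fingerprint.
      for (bit <- 0 until 64)
        if (((h >>> bit) & 1L) == 1L) v(bit) += 1 else v(bit) -= 1
    }
    (0 until 64).foldLeft(0L) { (acc, bit) =>
      if (v(bit) > 0) acc | (1L << bit) else acc
    }
  }

  def hammingDistance(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)

  def main(args: Array[String]): Unit = {
    val a = simhash("breaking news spark summit europe opens in dublin")
    val b = simhash("breaking news spark summit europe opened in dublin")
    println(s"hamming distance = ${hammingDistance(a, b)}") // expected: small
  }
}
```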
