Spark Summit Europe 2017 Abstracts (Spark Core Technology category)

To download all the videos and slides, follow the WeChat official account (bigdata_summit) and click the "Video Download" (视频下载) menu.

A Developer's View Into Spark's Memory Model

by Wenchen Fan, Databricks
video, slide
As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark's backend execution and push performance closer to the limits of modern hardware. In this talk, we'll take a deep dive into Apache Spark's unified memory model and discuss how Spark exploits the memory hierarchy and leverages application semantics to manage memory explicitly (both on- and off-heap) to eliminate the overheads of the JVM object model and garbage collection. Session hashtag: #EUdd2
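
The snippet below is a minimal, hedged sketch of the knobs this unified model exposes, configuring the shared execution/storage region and Tungsten's off-heap allocation; the values are illustrative placeholders, not recommendations from the talk.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

# Sketch only: the values are placeholders, not tuning advice from the session.
spark = (
    SparkSession.builder
    .appName("unified-memory-demo")
    # Fraction of (JVM heap - reserved memory) shared by execution and storage
    .config("spark.memory.fraction", "0.6")
    # Share of that region protected for cached blocks against eviction by execution
    .config("spark.memory.storageFraction", "0.5")
    # Allow Tungsten to allocate memory outside the JVM heap, bypassing the
    # JVM object model and reducing garbage-collection pressure
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "2g")
    .getOrCreate()
)

# Cached data can likewise be kept in serialized form off the JVM heap
df = spark.range(10000000)
df.persist(StorageLevel.OFF_HEAP)
df.count()
```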


Deep Dive into Deep Learning Pipelines

by Sue Ann Hong, Databricks
video, slide
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple, built on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into:
– the various use cases of Deep Learning Pipelines, such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code;
– how to work with complex data such as images in Spark and Deep Learning Pipelines;
– how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts.
Finally, we discuss integration with popular deep learning frameworks. Session hashtag: #EUdd3
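
As a rough sketch of the "few lines of code" claim (not code from the talk), transfer learning with Deep Learning Pipelines could look like the following, assuming the sparkdl package as published around this summit (readImages, DeepImageFeaturizer with a pre-trained InceptionV3) and hypothetical image directories:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.sql.functions import lit

# sparkdl is the Deep Learning Pipelines Spark package; API names assumed from its 2017 releases
from sparkdl import DeepImageFeaturizer, readImages

# Hypothetical labeled image folders, one per class
cats = readImages("/data/images/cats").withColumn("label", lit(1))
dogs = readImages("/data/images/dogs").withColumn("label", lit(0))
train_df = cats.union(dogs)

# Transfer learning: use a pre-trained InceptionV3 as a fixed featurizer and
# train only a lightweight classifier on top of it
featurizer = DeepImageFeaturizer(inputCol="image", outputCol="features",
                                 modelName="InceptionV3")
lr = LogisticRegression(maxIter=20, regParam=0.05, labelCol="label")
model = Pipeline(stages=[featurizer, lr]).fit(train_df)

# Prediction at scale through the same familiar MLlib API
predictions = model.transform(readImages("/data/images/unlabeled"))
predictions.select("prediction").show()
```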


Deep Dive into Deep Learning Pipelines - continues

by Sue Ann Hong, Databricks
video, slide
(This is the second part of the session above; the abstract is identical to part one.)


Easy, Scalable, Fault-Tolerant Stream Processing with Structured Streaming in Apache Spark

by Tathagata Das, Databricks
video, slide
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Dataset and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees. Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis. In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He'll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks. Session hashtag: #EUdd1
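
A hedged sketch of that end-to-end flow is below; the topic name, JSON schema, and paths are made up for illustration and are not taken from the talk. It reads from Kafka, parses the JSON payload into columns, joins with a static dimension table, aggregates on event time under a watermark, and writes out a table that batch and ad-hoc queries can hit.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession.builder.appName("structured-streaming-kafka-sketch").getOrCreate()

# Hypothetical schema for the JSON records in the Kafka topic
schema = (StructType()
          .add("device", StringType())
          .add("status", StringType())
          .add("event_time", TimestampType()))

# 1. Read the stream from Kafka (broker address and topic are placeholders)
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "events")
       .load())

# 2. Parse the JSON payload into separate columns
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(from_json(col("json"), schema).alias("data"))
          .select("data.*"))

# 3. Enrich by joining with a static table (hypothetical path)
devices = spark.read.parquet("/tables/devices")
enriched = parsed.join(devices, "device")

# 4. Event-time aggregation with automatic state cleanup via a watermark
counts = (enriched
          .withWatermark("event_time", "10 minutes")
          .groupBy(window(col("event_time"), "5 minutes"), col("status"))
          .count())

# 5. Write out as a table ready for batch and ad-hoc queries
query = (counts.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "/tables/event_counts")
         .option("checkpointLocation", "/checkpoints/event_counts")
         .start())
```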


Easy, Scalable, Fault-Tolerant Stream Processing with Structured Streaming in Apache Spark - continues

by Tathagata Das, Databricks
video, slide
(This is the second part of the session above; the abstract is identical to part one.)


From Basic to Advanced Aggregate Operators in Apache Spark SQL 2.2 by Examples and their Catalyst Optimizations

by Jacek Laskowski
video
There are many different aggregate operators in Spark SQL. They range from the very basic groupBy and the not-so-basic groupByKey that shines bright in Apache Spark Structured Streaming's stateful aggregations, through the more advanced cube, rollup and pivot, to my beloved windowed aggregations. It's unbelievable how different their performance characteristics are, even for the same use cases. What is particularly interesting is the comparison of the simplicity and performance of windowed aggregations vs. groupBy. And that's just Spark SQL alone. Then there is Spark Structured Streaming, which has put the groupByKey operator at the forefront of stateful stream processing (and, to my surprise, its performance might not be that satisfactory). This deep-dive talk is going to show all the different use cases for the aggregate operators and functions, as well as their performance differences, in Spark SQL 2.2 and beyond. Code and fun included! Session hashtag: #EUdd5
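
To make the operator families concrete, here is a small, hedged sketch on toy data (not an example from the talk) contrasting groupBy, cube/rollup, pivot, and a windowed aggregation; the typed groupByKey operator belongs to the Scala/Java Dataset API and has no direct PySpark equivalent.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("aggregate-operators-sketch").getOrCreate()

sales = spark.createDataFrame(
    [("EMEA", 2016, "books", 100), ("EMEA", 2017, "music", 40),
     ("APAC", 2016, "books", 70),  ("APAC", 2017, "music", 90)],
    ["region", "year", "category", "amount"])

# Basic groupBy: one output row per group
sales.groupBy("region").agg(F.sum("amount").alias("total")).show()

# cube / rollup: add subtotals for every / hierarchical combination of the grouping columns
sales.cube("region", "year").sum("amount").show()
sales.rollup("region", "year").sum("amount").show()

# pivot: turn the distinct category values into columns
sales.groupBy("region").pivot("category").sum("amount").show()

# Windowed aggregation: per-row aggregates without collapsing the rows
w = Window.partitionBy("region")
sales.withColumn("region_total", F.sum("amount").over(w)).show()
```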


From Basic to Advanced Aggregate Operators in Apache Spark SQL 2.2 by Examples and their Catalyst Optimizations - continues

by Jacek Laskowski
video
(This is the second part of the session above; the abstract is identical to part one.)


Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and TensorFlow

by Alexander Thomas, Indeed
video, slide
Natural language processing is a key component in many data science systems that must understand or reason about text. Common use cases include question answering, paraphrasing or summarization, sentiment analysis, natural language BI, language modeling, and disambiguation. Building such systems usually requires combining three types of software libraries: NLP annotation frameworks, machine learning frameworks, and deep learning frameworks. Ideally, all three of these pieces should be able to be integrated into a single workflow. This makes development, experimentation, and deploying results much easier. Spark's MLlib provides a number of machine learning algorithms, and now there are also projects making deep learning achievable in MLlib pipelines. All that's still missing is the NLP annotation framework. SparkNLP adds NLP annotations into the MLlib ecosystem. This talk will introduce SparkNLP: how to use it, its current functionality, and where it is going in the future. Session hashtag: #EUdd4
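
As a hedged sketch of the integration idea (annotator names and module paths are assumed from the open-source spark-nlp package and may differ from the exact release demonstrated in the talk; the package also needs to be on the Spark classpath), NLP annotators become ordinary pipeline stages sitting next to MLlib ones:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF
from pyspark.ml.classification import LogisticRegression

# spark-nlp annotators; module paths assumed from the John Snow Labs spark-nlp package
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import Tokenizer, Normalizer

spark = SparkSession.builder.appName("sparknlp-mllib-sketch").getOrCreate()

train = spark.createDataFrame(
    [("I love this product", 1.0), ("terrible experience, would not recommend", 0.0)],
    ["text", "label"])

# NLP annotation stages
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokens = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalized = Normalizer().setInputCols(["token"]).setOutputCol("normal")
finisher = (Finisher().setInputCols(["normal"])
            .setOutputCols(["tokens"])
            .setOutputAsArray(True))   # emit plain string arrays for downstream MLlib stages

# Ordinary MLlib stages consume the finished annotations
tf = HashingTF(inputCol="tokens", outputCol="features")
lr = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[document, tokens, normalized, finisher, tf, lr])
model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show(truncate=False)
```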


