To download all of the videos and slides, follow the WeChat official account (bigdata_summit) and click the "视频下载" (Video Download) menu.
Experimental Design for Distributed Machine Learning
by Myles Baker, Databricks
video, slide
Session hashtag: #EUent5
Goal-Based Data Production: The Spark of a Revolution
by Sim Simeonov, Swoop
video, slide
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark's APIs enable a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep, and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Session hashtag: #EUent1
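To make the WHAT-versus-HOW distinction concrete, here is a minimal Scala sketch. The `Goal` case class and `produce` function are invented stand-ins, not Swoop's actual API (which the abstract does not document); only the imperative half uses real Spark calls.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// HOW (imperative): the engineer spells out each transformation step.
def dailyRevenueImperative(spark: SparkSession): DataFrame =
  spark.table("events")
    .where("event_type = 'purchase'")
    .groupBy("event_date")
    .agg(Map("amount" -> "sum"))

// WHAT (goal-based): the engineer declares the desired result; a smart
// warehouse layer on top of Spark decides join order, caching, incremental
// recomputation, etc. These names are hypothetical illustrations.
final case class Goal(
  metrics:    Seq[String],   // what to compute
  dimensions: Seq[String],   // how to slice it
  source:     String,
  filters:    Seq[String])

def produce(spark: SparkSession, goal: Goal): DataFrame =
  spark.sql(
    s"""SELECT ${(goal.dimensions ++ goal.metrics).mkString(", ")}
       |FROM ${goal.source}
       |WHERE ${goal.filters.mkString(" AND ")}
       |GROUP BY ${goal.dimensions.mkString(", ")}""".stripMargin)

// produce(spark, Goal(Seq("sum(amount) AS revenue"), Seq("event_date"),
//                     "events", Seq("event_type = 'purchase'")))
```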
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—Industry 4.0 and Logistics Success Examples
by Francisco J. Lacueva, ITAINNOVA
video, slide
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying the projects and the fluency of communication between the different profiles of people involved in them. In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support building Big Data workflows from already existing functional blocks, and they also support the creation of new functional blocks. The resulting workflow can then be deployed on a Spark infrastructure and used through a REST API. For a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples: one based on an Industry 4.0 success case and another on a logistics success case.
Session hashtag: #EUent6
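The abstract says deployed workflows are "used through a REST API" but does not document it, so the host, path, and JSON payload in this Scala sketch are invented placeholders that only illustrate the idea of triggering a deployed workflow over HTTP.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Hypothetical endpoint and payload: trigger a previously deployed workflow.
val url  = new URL("http://moriarty.example.com/api/workflows/quality-check/run")
val conn = url.openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setRequestProperty("Content-Type", "application/json")
conn.setDoOutput(true)
conn.getOutputStream.write(
  """{"input": "hdfs:///data/sensors/2017-10"}""".getBytes(StandardCharsets.UTF_8))
println(s"Workflow submitted, HTTP ${conn.getResponseCode}")
```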
How Nielsen Utilized Databricks for Large-Scale Research and Development
by Matt VanLandeghem, Nielsen
video, slide
Large-scale testing of new data products, or of enhancements to existing products, in a research and development environment can be a technical challenge for data scientists. In some cases, tools available to data scientists lack production-level capacity, whereas other tools do not provide the algorithms needed to run the methodology. At Nielsen, the Databricks platform provided a solution to both of these challenges. This breakout session will cover a specific Nielsen business case where two methodology enhancements were developed and tested at large scale using the Databricks platform. Development and large-scale testing of these enhancements would not have been possible using standard database tools.
Session hashtag: #EUent4
Improving Traffic Prediction Using Weather Data
by Ramya Raghavendra, IBM
video, slide
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly change how drivers plan their day: alerting users before they travel, finding the best times to travel, and, over time, learning from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and a traffic data provider. As part of this work, we conducted a case study over five large metropolitan areas in the US, with 2.58 billion traffic records and 262 million weather records, to quantify the boost in traffic prediction accuracy from using weather data. We will provide an overview of our lambda architecture, with Apache Spark used to build prediction models from weather and traffic data, and Spark Streaming used to score the models and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark for analyzing geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and work is underway to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling Spark in offline and streaming modes, building statistical and deep-learning pipelines with Spark, and techniques for working with geospatial and time-series data.
Session hashtag: #EUent7
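A minimal Scala sketch of the scoring half of such a lambda architecture: a batch-trained Spark ML pipeline is loaded and applied to micro-batches arriving on a stream. The model path, host, and three-column feature schema are placeholders; the real system's feature set (weather joins, geospatial features) is far richer.

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

val spark = SparkSession.builder.appName("traffic-scoring").getOrCreate()
// Model trained offline in the batch layer of the lambda architecture.
val model = PipelineModel.load("hdfs:///models/traffic-with-weather")

val ssc = new StreamingContext(spark.sparkContext, Seconds(60))
ssc.socketTextStream("sensor-gateway", 9999)
  .foreachRDD { rdd =>
    import spark.implicits._
    // Each record: road segment id, observed speed, current precipitation (mm/h)
    val features = rdd.map(_.split(","))
      .map(f => (f(0), f(1).toDouble, f(2).toDouble))
      .toDF("segment", "speed", "precipitation")
    model.transform(features).select("segment", "prediction").show()
  }
ssc.start()
ssc.awaitTermination()
```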
Powering a Startup with Apache Spark
by Kevin (Sangwoo) Kim, Between VCNC
video, slide
At Between (a mobile app for couples with 20M downloads globally), Apache Spark powers everything from daily batch jobs for extracting metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to Spark's performance and extensibility, data operations have become extremely efficient. The entire team, including business development, global operations, and designers, works with the resulting data, so Spark is empowering the whole company toward data-driven operations and thinking. Kevin, co-founder and data team leader at Between, will present how things are going at Between. After this presentation, listeners will know how a small, agile team lives with data: how we build our organization, culture, and technical base.
Session hashtag: #EUent8
Safely Manage AWS Security in Apache Spark
by Keith Gutfreund, Elsevier
Accessing AWS services from a Spark program requires authentication credentials that, when improperly managed, seriously threaten system security. Spark clusters engage tens, hundreds, or even thousands of machines, and managing authentication credentials across clusters can be very complicated. This complexity increases for systems that scale dynamically and for systems that make use of opportunistic scheduling strategies. For example, how would you disable and then re-issue resource credentials on a cluster of tens or hundreds of machines, all without restarting your application? How would you manage credentials embedded within an application or a sealed image without re-issuing the application? Consider the following very real scenarios:
1. Multiple users access your database and you want to periodically rotate credentials.
2. Your compiled application requires AWS credentials to post events to AWS topics and to read events from AWS queues. The credentials are either embedded within the application or are read from environment variables.
3. One of your AWS EC2 instances within a Spark cluster has come under attack and you are concerned that your security credentials might have been compromised.
4. You want to provide different levels of access to different users.
5. You want security credentials to automatically expire after a certain period.
This session explains these and other common security challenges and then shows safe techniques that reduce and/or eliminate security risk. While the solutions described are specific to Spark running in AWS, the principles are universally suitable for big data applications on all platforms.
Session hashtag: #EUent2
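One widely used technique in this space (a sketch, not necessarily the speaker's exact solution) is to hand a Spark job short-lived STS session credentials instead of long-lived keys, so that a compromised node only holds secrets that expire on their own, addressing scenarios 3 and 5 above:

```scala
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder
import com.amazonaws.services.securitytoken.model.GetSessionTokenRequest
import org.apache.spark.sql.SparkSession

// Obtain temporary credentials from AWS STS; they expire automatically.
val sts   = AWSSecurityTokenServiceClientBuilder.defaultClient()
val creds = sts.getSessionToken(
  new GetSessionTokenRequest().withDurationSeconds(3600)  // valid for 1 hour
).getCredentials

// Pass the temporary credentials to the S3A connector instead of static keys.
val spark = SparkSession.builder.appName("s3-reader").getOrCreate()
val conf  = spark.sparkContext.hadoopConfiguration
conf.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
conf.set("fs.s3a.access.key", creds.getAccessKeyId)
conf.set("fs.s3a.secret.key", creds.getSecretAccessKey)
conf.set("fs.s3a.session.token", creds.getSessionToken)

spark.read.parquet("s3a://my-bucket/events/").show()  // bucket name is a placeholder
```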
Spline: Apache Spark Lineage, Not Only for the Banking Industry
by Jan Scherbaum, Barclays Africa Group Limited
video, slide
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline, a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Session hashtag: #EUent3
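A sketch of how Spline is switched on in a Spark application. The import matches Spline's early documented usage, but package and method names have changed across Spline versions, so treat this as illustrative rather than definitive:

```scala
import org.apache.spark.sql.SparkSession
// Early (0.x) Spline initializer; newer releases moved/renamed this package.
import za.co.absa.spline.core.SparkLineageInitializer._

val spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
spark.enableLineageTracking()  // hooks a listener into Spark's execution plans

// From here on, lineage for every write is captured and stored automatically.
spark.read.parquet("hdfs:///in/trades")
  .groupBy("account").count()
  .write.parquet("hdfs:///out/trade_counts")
```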
The Architecture of the Next CERN Accelerator Logging Service
by Jakub Wozniak, CERN
video, slide
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service. The main purpose of the system is to store and present Controls/Infrastructure-related data gathered from thousands of devices across the whole accelerator complex. During this talk, Jakub will speak about the NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service, and the Spark-based Extraction API, where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN-specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Session hashtag: #EUent9
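A purely illustrative Scala sketch of what querying such a Spark-based Extraction API might look like: a query by device, property, and time window that comes back as a DataFrame. None of the names below are confirmed NXCALS identifiers; they stand in for the API the abstract describes.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical extraction helper: device/property/time-window -> DataFrame.
def extract(spark: SparkSession, device: String, property: String,
            fromUtc: String, toUtc: String): DataFrame =
  spark.read.parquet(s"hdfs:///nxcals/$device/$property")  // placeholder layout
    .where(s"timestamp >= '$fromUtc' AND timestamp < '$toUtc'")

// val df = extract(spark, "SOME.DEVICE", "I_MEAS",
//                  "2017-10-01 00:00:00", "2017-10-02 00:00:00")
```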