Apache Spark-and-Tensorflow-as-a-Service
by Jim Dowling, KTH—Royal Institute of Technology
video, slide
In Sweden, from the Rise ICE Data Center at www.hops.site, we provide researchers with both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks’ Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written in Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both the Spark UI and Tensorboard, and how to examine logs and monitor training.
Session hashtag: #EUai8
Deduplication and Author-Disambiguation of Streaming Records via Supervised Models Based on Content Encoders
by Reza Karimi, Elsevier
video, slide
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself in three ways:
- Databricks and AWS make this a scalable implementation. Compute resources are comparatively lower than with traditional legacy technology running big boxes 24/7. Scalability is crucial because Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years.
- We create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users and journals. This standard representation reduces the recommendation problem to a pairwise similarity search, so it can offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine.
- Traditional author-disambiguation or record deduplication algorithms are batch processes with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. It is therefore crucial to maintain historical profiles, and we have developed a machine learning implementation that deals with data streams and processes them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to turn the raw output of the pairwise similarity function into final clusters.
Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
Session hashtag: #EUai2
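The step from pairwise similarities to final clusters can be illustrated with a minimal sketch. It is not the speakers' method: the talk describes a supervised model, whereas this sketch assumes a fixed cosine-similarity threshold, compares all pairs, and merges linked records with a union-find structure. The record ids, the 3-dimensional fingerprints and the threshold of 0.9 are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster(fingerprints, threshold=0.9):
    """Group record ids whose fingerprints are linked by similarity >= threshold."""
    parent = {rid: rid for rid in fingerprints}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All-pairs comparison: fine for a toy example, quadratic in general.
    for a, b in combinations(fingerprints, 2):
        if cosine(fingerprints[a], fingerprints[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for rid in fingerprints:
        groups.setdefault(find(rid), []).append(rid)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical fingerprints for four author records.
records = {
    "r1": [0.90, 0.10, 0.00],
    "r2": [0.88, 0.12, 0.01],
    "r3": [0.00, 0.20, 0.95],
    "r4": [0.02, 0.18, 0.97],
}
clusters = cluster(records)
print(clusters)
```

Here r1 and r2 end up in one cluster and r3 and r4 in another, because only those pairs exceed the threshold. At the scale quoted in the abstract, an all-pairs loop would not be feasible, so some form of candidate blocking would be needed before the comparison step.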
Extending Apache Spark ML: Adding Your Own Algorithms and Tools
by Holden Karau, IBM
video
Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines, and then looks at how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark’s ML pipelines, you will be able to take advantage of useful meta-algorithms, like parameter searching and pipeline persistence (with a bit more work, of course). Even if you don’t have your own machine learning algorithms that you want to implement, this session will give you an inside look at how the ML APIs are built. It will also help you make even more awesome ML pipelines and customize Spark models for your needs. And if you don’t want to extend Spark ML pipelines with custom algorithms, you’ll still benefit by developing a stronger background for future Spark ML projects. The examples in this talk will be presented in Scala, but any non-standard syntax will be explained.
Session hashtag: #EUai6
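The contract that a custom pipeline stage has to satisfy can be sketched without Spark at all. The following is a framework-free toy in Python, not Spark's API and not the Scala code from the talk: all class names are hypothetical, and plain lists of floats stand in for DataFrames. It mirrors the general idea that an estimator's `fit` returns a fitted model, that the model is itself a transformer, and that fitting a pipeline fits each stage in order while passing the transformed data forward.

```python
class MeanCenterer:
    """Estimator (hypothetical): learns the mean and returns a fitted model."""

    def fit(self, rows):
        return MeanCentererModel(sum(rows) / len(rows))


class MeanCentererModel:
    """Transformer produced by MeanCenterer.fit."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]


class Scaler:
    """Plain transformer (hypothetical): has no fit step."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [x * self.factor for x in rows]


class Pipeline:
    """Fits estimators in order, feeding each stage the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted; plain transformers are used as they are.
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class PipelineModel:
    """Fitted pipeline: a chain of transformers."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


model = Pipeline([MeanCenterer(), Scaler(2.0)]).fit([1.0, 2.0, 3.0])
result = model.transform([4.0])
print(result)
```

Because every stage exposes the same two methods, a custom stage written against this contract composes with the existing ones, which is the property that lets meta-algorithms such as parameter search treat a whole pipeline as a single estimator.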
Getting Ready to Use Redis with Apache Spark
by Dvir Volk, Redis Labs
video, slide
This session is a technical tutorial on integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides. We cover the basic data types provided by Redis and the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark and then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
Session hashtag: #EUai4
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library for Apache Spark
by Miruna Oprescu, Microsoft
video, slide
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, data scientists find themselves struggling with issues such as low-level data manipulation, lack of support for image processing, text analytics and deep learning, as well as the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released the Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, data scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner, and how to efficiently deploy a Spark library across multiple platforms.
Session hashtag: #EUai7
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization Based on Spark SQL
by Mingjie Tang, Hortonworks
video, slide
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains ranging from business intelligence and bioinformatics to self-driving cars. These methods rely heavily on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL and optimized the matrix execution plan based on Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets. These are PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network data (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample.
Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
Session hashtag: #EUai1
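Why the sequencing of matrix operations matters can be shown with the textbook matrix-chain ordering problem. The sketch below is not the optimizer described in the talk: it assumes dense matrices and counts only scalar multiplications, while the abstract's optimizer also accounts for sparsity and communication cost. The function name and the example dimensions are hypothetical.

```python
from functools import lru_cache


def best_chain_order(dims):
    """Cheapest parenthesization of a matrix chain.

    Matrix i has shape dims[i] x dims[i + 1]. Returns (cost, plan), where
    cost is the number of scalar multiplications under a dense cost model.
    """
    n = len(dims) - 1  # number of matrices in the chain

    @lru_cache(maxsize=None)
    def solve(i, j):
        if i == j:
            return 0, f"M{i}"
        best = None
        for k in range(i, j):
            left_cost, left_plan = solve(i, k)
            right_cost, right_plan = solve(k + 1, j)
            # Multiplying a (dims[i] x dims[k+1]) by a (dims[k+1] x dims[j+1]) matrix.
            cost = left_cost + right_cost + dims[i] * dims[k + 1] * dims[j + 1]
            if best is None or cost < best[0]:
                best = (cost, f"({left_plan} x {right_plan})")
        return best

    return solve(0, n - 1)


# M0 is 10x30, M1 is 30x5, M2 is 5x60.
cost, plan = best_chain_order((10, 30, 5, 60))
print(cost, plan)
```

For these shapes, multiplying M0 by M1 first costs 4,500 scalar multiplications in total, while the other order costs 27,000, so the same result is six times cheaper under one plan than the other.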
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark
by Marcin Kulka, 9LivesData
video, slide
Building accurate machine learning models has long been an art practised by data scientists, involving algorithm selection, hyperparameter tuning, feature selection and so on. Recently, efforts to break through this “black art” have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters and features without any manual work. In this talk, we will share how the automation system is designed to exploit the advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which packs 272 CPU cores, 2TB of memory and 17TB of SSD into a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF’17 (#SFds5), but from a more technical perspective.
Session hashtag: #EUai9
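The core loop of an automatic model search can be sketched in a few lines. This is a toy illustration under stated assumptions, not the system from the talk: the candidate families (a mean predictor, a least-squares line, k-nearest neighbours), the data and all function names are hypothetical, and the search runs sequentially on a single hold-out split instead of in parallel on Spark.

```python
def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_linear(xs, ys):
    """Ordinary least squares for a single feature."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def fit_knn(k):
    """k-nearest-neighbour regression; k is the hyperparameter being searched."""
    def fit(xs, ys):
        points = list(zip(xs, ys))

        def predict(x):
            nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
            return sum(y for _, y in nearest) / k
        return predict
    return fit


def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def search(candidates, train, valid):
    """Fit every candidate on train and return (validation MSE, name) of the best."""
    scored = [(mse(fit(*train), *valid), name) for name, fit in candidates.items()]
    return min(scored)


candidates = {
    "mean": fit_mean,
    "linear": fit_linear,
    "knn-1": fit_knn(1),
    "knn-3": fit_knn(3),
}
train = ([0, 1, 2, 3, 4, 5], [1, 3, 5, 7, 9, 11])  # y = 2x + 1
valid = ([6, 7], [13, 15])
best_score, best_name = search(candidates, train, valid)
print(best_name, best_score)
```

Each candidate is independent of the others, which is why a search like this parallelizes naturally: every (algorithm, hyperparameter) pair can be fitted and scored as a separate task.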
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Learning
by Eiti Kimura, Movile
video, slide
Have you ever imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for building an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that could lead to revenue losses. The application monitors distributed systems and sends notifications describing the problem detected, so users can act quickly, reducing the time to action and avoiding serious problems that directly impact the company’s revenue. We will present an architecture that serves not only as a monitoring system but also as an active actor in our outage recoveries. At the end of the presentation you will have access to the source code of our training program, and you will be able to adapt and implement it in your company. This solution already helped to prevent about US$3 million in losses last year.
Session hashtag: #EUai10
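The monitoring idea, predicting a metric with linear regression and alerting when the observed value departs from the prediction, can be sketched as follows. This is a minimal illustration and not the speakers' code: it uses plain Python in place of Spark MLlib, and the metric (hourly transaction counts), the 20% tolerance and the function names are all assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single feature."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def check(history, observed, tolerance=0.2):
    """Compare the latest observation with the value the trend predicts.

    history: past metric values, one per time step (hypothetical hourly counts).
    Returns (predicted, is_anomaly); is_anomaly is True when the observed value
    falls more than `tolerance` (as a fraction) below the prediction.
    """
    xs = list(range(len(history)))
    slope, intercept = fit_line(xs, history)
    predicted = slope * len(history) + intercept
    shortfall = (predicted - observed) / predicted
    return predicted, shortfall > tolerance


hourly_transactions = [100, 110, 120, 130, 140]
predicted, drop_alert = check(hourly_transactions, observed=90)
_, normal_alert = check(hourly_transactions, observed=148)
print(predicted, drop_alert, normal_alert)
```

With this history the trend predicts 150 transactions for the next hour, so an observed value of 90 is flagged while 148 is not. A one-sided check is used because, for revenue leakage, an unexpected drop is the case that matters.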