As a data scientist, you are developing notebooks that use Spark to process data too large to fit on your laptop. What would you do? This is not a trivial problem.
Let’s start with the most naive solutions, which require installing nothing on your laptop.
- “No notebook”: SSH into the remote cluster and use the Spark shell there.
- “Local notebook”: downsample the data and pull it to your laptop.
The problem with “No notebook” is that the developer experience in the Spark shell is unacceptable:
- You cannot easily change the code and see the results printed, the way you can in a Jupyter or Zeppelin notebook.
- It is hard to display images and charts from a shell.
- Version control with git on a remote machine is painful, because you have to set everything up from scratch and run git operations like git diff there.
The second option, “Local notebook”: you downsample the data and pull it to your laptop (downsampling: if you have 100 GB of data on your clusters, you shrink it to, say, 1 GB without losing too much important information). Then you can process the data in your local Jupyter notebook.
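For example, the downsampling step you would run on the cluster might look like the following PySpark sketch (the table name, sampling fraction, and output path are assumptions for this illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("downsample").getOrCreate()

# Read the full dataset on the cluster (hypothetical table name).
df = spark.read.table("events")

# Keep roughly 1% of the rows, with a fixed seed so the sample is reproducible.
sample = df.sample(fraction=0.01, seed=42)

# Write the sample somewhere small enough to pull down to a laptop (placeholder path).
sample.write.mode("overwrite").parquet("s3://my-bucket/events_sample/")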
This approach creates a few new painful problems:
- You have to write extra code to downsample the data.
- Downsampling could lose vital information about the data, especially when you are working on visualization or machine learning models.
- You have to spend extra hours making sure your code also works on the original data. If it does not, it takes even more hours to figure out what went wrong.
- You have to guarantee that the local development environment matches the remote cluster. If it does not, the workflow is error-prone and may cause data issues that are hard to detect.
OK, “No notebook” and “Local notebook” are obviously not the best approaches. What if your data team has access to the cloud, e.g. AWS? AWS provides Jupyter notebooks on its EMR clusters and in SageMaker. The notebook server is accessed through the AWS web console and is ready to use as soon as the clusters are.
This approach is called “Remote notebook on the cloud”.
AWS EMR with Jupyter Notebook, by AWS

The problems with “Remote notebook on the cloud” are:
- You have to set up your development environment every time a cluster spins up.
- If you want your notebook to run on different clusters or in different regions, you have to redo the setup manually each time.
- If the clusters are terminated unexpectedly, you lose the work on those clusters.
This approach, ironically, is the most popular one among data scientists who have access to AWS. This can be explained by the principle of least effort: it provides one-click access to remote clusters, so data scientists can focus on their machine learning models, visualization, and business impact without spending too much time on the clusters themselves.
Besides “No notebook”, “Local notebook”, and “Remote notebook on the cloud”, there are options that point the Spark installation on your laptop to remote Spark clusters. The code is written in a local notebook and submitted to a remote Spark cluster. This approach is called “Bridge local & remote Spark”.
You can set the remote master when you create the SparkSession:
import org.apache.spark.sql.SparkSession

// Point the local Spark session at a remote standalone master (placeholder address).
val spark = SparkSession.builder()
  .appName("SparkSample")
  .master("spark://123.456.789:7077")
  .getOrCreate()
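If your local notebook runs PySpark instead of Scala, a roughly equivalent sketch (using the same placeholder master address) would be:

from pyspark.sql import SparkSession

# Point a local PySpark session at the remote standalone master (placeholder address).
spark = (
    SparkSession.builder
    .appName("SparkSample")
    .master("spark://123.456.789:7077")
    .getOrCreate()
)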
The problems are:
- You have to figure out how to authenticate your laptop to the remote Spark clusters.
- It only works when Spark is deployed as Standalone, not on YARN. If your Spark cluster is deployed on YARN, you have to copy the configuration files under /etc/hadoop/conf from the remote cluster to your laptop and restart your local Spark, assuming you have already figured out how to install Spark on your laptop (see the sketch after this list).
- If you have multiple Spark clusters, you have to switch back and forth by copying configuration files. If the clusters are ephemeral on the cloud, this easily becomes a nightmare.
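As a rough illustration of the YARN case, here is a minimal sketch assuming you have copied the cluster's /etc/hadoop/conf to a local directory (the local path is hypothetical, and in practice HADOOP_CONF_DIR is often better set in your shell before launching the notebook):

import os
from pyspark.sql import SparkSession

# Hypothetical local copy of the remote cluster's /etc/hadoop/conf.
os.environ["HADOOP_CONF_DIR"] = "/Users/me/hadoop-conf"

# With the copied configuration visible, the local session can target the remote YARN cluster.
spark = (
    SparkSession.builder
    .appName("SparkSample")
    .master("yarn")
    .getOrCreate()
)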
“Bridge local & remote Spark” does not work for most data scientists. Luckily, we can turn our attention back to the Jupyter notebook. There is a Jupyter notebook kernel called “Sparkmagic” that can send your code to a remote cluster, under the assumption that Livy is installed on the remote Spark clusters. This assumption is met by the cloud providers, and Livy is not hard to install on in-house Spark clusters with the help of Apache Ambari.
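Under the hood, Sparkmagic talks to Livy’s REST API. The following sketch shows the idea by calling Livy directly with the requests library (the Livy URL is a placeholder and error handling is omitted):

import time
import requests

LIVY_URL = "http://livy-server:8998"  # placeholder for your Livy endpoint

# Start a remote PySpark session.
session = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()
session_url = f"{LIVY_URL}/sessions/{session['id']}"

# Wait until the session is ready.
while requests.get(session_url).json()["state"] != "idle":
    time.sleep(5)

# Submit a code statement; Sparkmagic does something like this for every notebook cell.
stmt = requests.post(
    f"{session_url}/statements", json={"code": "spark.range(100).count()"}
).json()

# Poll for the result and print the output.
while True:
    result = requests.get(f"{session_url}/statements/{stmt['id']}").json()
    if result["state"] == "available":
        print(result["output"])
        break
    time.sleep(2)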
Sparkmagic architecture

It seems “Sparkmagic” is the best solution at this point, so why is it not the most popular one? There are two reasons:
- Many data scientists have not heard of “Sparkmagic”.
- There are installation, connection, and authentication issues that are hard for data scientists to fix.
To solve the second problem, Sparkmagic provides Docker containers that are ready to use. Docker has indeed solved some of the installation issues, but it also introduces new problems for data scientists:
- Docker was designed for shipping applications, and its learning curve is not considered friendly for data scientists.
- It is not intuitive to use for data scientists who come from diverse technical backgrounds.
The discussion of Docker containers stops here; another article explaining how to make Docker containers actually work for data scientists will be published in a few days.
To summarize, we have two categories of solutions:
- Notebook & notebook kernel: “No notebook”, “Local notebook”, “Remote notebook on the cloud”, “Sparkmagic”
- Spark itself: “Bridge local & remote Spark”.
Despite the installation and connection issues, “Sparkmagic” is the recommended solution. However, there are often other unsolved issues that reduce productivity and hurt the developer experience:
- What if other languages, such as Python and R, need to run on the clusters?
- What if the notebook needs to run every day? What if it should run only when another notebook run succeeds?
Let’s go over the current solutions:
- Set up a remote Jupyter server and SSH tunneling. This definitely works, but it takes time to set up, and the notebooks live on the remote servers.
- Set up a cron scheduler. Most data scientists are fine with cron, but what if the notebook fails to run? Yes, a shell script can help, but are the majority of data scientists comfortable writing shell scripts? Even if the answer is yes, data scientists still have to 1. get access to the finished notebook and 2. get a status update. And even if some data scientists are happy writing shell scripts, why should every data scientist write their own script to automate exactly the same thing? (A sketch of running a notebook programmatically follows this list.)
- Set up Airflow. This is a very popular solution among data engineers, and it can get the job done. If there are Airflow servers supported by data engineers or data platform engineers, data scientists can manage to learn Airflow’s operators and get it to work for Jupyter notebooks.
- Set up Kubeflow or another Kubernetes-based solution. Admittedly, Kubeflow can get the job done, but in reality how many data scientists have access to Kubernetes clusters, including managed solutions running on the cloud?
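For the cron option above, one way to run a notebook non-interactively and report its status is to execute it programmatically, for example with papermill; the notebook names and webhook URL below are assumptions for this sketch:

import papermill as pm
import requests

try:
    # Execute the notebook and save the executed copy, including cell outputs.
    pm.execute_notebook("daily_report.ipynb", "daily_report_out.ipynb")
    status = "succeeded"
except Exception as exc:
    status = f"failed: {exc}"

# Send a status update, e.g. to a Slack incoming webhook (placeholder URL).
requests.post(
    "https://hooks.slack.com/services/XXX/YYY/ZZZ",
    json={"text": f"daily_report.ipynb {status}"},
)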
Let’s reframe the problems:
- How to develop on the local laptop with access to remote clusters?
- How to operate on the remote clusters?
The solution implemented by Bayesnote (a new open-source notebook project, https://github.com/Bayesnote/Bayesnote) follows these principles:
- “Auto installation, not manual”: data scientists should not waste their time installing anything on remote servers.
- “Local notebook, not remote notebooks”: local notebooks provide a better development experience and make version control easier.
- “Works for everyone, not someone”: assume data scientists have no access to help from an engineering team, and support data scientists who come from diverse technical backgrounds.
- “Works for every language/framework”: works for any language, including Python, SQL, R, and Spark.
- “Combine development and operation, not separate them”: development and operation of a notebook can be done in one place. Data scientists should not spend time fixing issues caused by the gap between development and operation.
These ideas are implemented by Bayesnote’s “auto self-deployment” feature. In the development phase, the only input required from data scientists is authentication information, such as an IP address and a password. Bayesnote then deploys itself to the remote servers and starts listening for socket messages. Code is sent to a remote server and the results come back to the user.
Bayesnote: auto self-deployment

In the operation phase, a YAML file is specified, and Bayesnote runs the notebooks on the remote servers, pulls the finished notebooks back, and sends a status update to email or Slack.
Workflow YAML by Bayesnote

(Users will configure this by filling out forms rather than YAML files, and the dependencies between notebooks will be visualized nicely.)
The (partial) implementation can be found on GitHub: https://github.com/Bayesnote/Bayesnote
Free data scientists from tooling issues so they can be happy and productive in their jobs.
Translated from: https://towardsdatascience.com/how-to-connect-jupyter-notebook-to-remote-spark-clusters-and-run-spark-jobs-every-day-2c5a0c1b61df