AWS Glue vs EMR

Amazon Web Services provides two services capable of performing ETL: Glue and Elastic MapReduce (EMR). If they both do a similar job, why would you choose one over the other? This article details some fundamental differences between the two.

AWS Glue is a pay-as-you-go, serverless ETL tool that requires very little infrastructure setup. It automates much of the effort involved in writing, executing and monitoring ETL jobs. If your data is structured, you can take advantage of crawlers, which infer the schema, identify file formats and populate metadata in Glue’s Data Catalog. Based on the ETL criteria you specify, Glue can automatically generate Python or Scala code for you, and it provides a nice UI for job monitoring and scheduling.
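
To make the crawler step more concrete, here is a minimal sketch of how a crawler over an S3 prefix might be defined with boto3. The crawler name, IAM role ARN, database name, bucket path and schedule are all hypothetical placeholders, and the AWS call itself is wrapped in a function that is never invoked here, so the snippet runs without credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for Glue's CreateCrawler API call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database the tables land in
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule  # cron expression, evaluated in UTC
    return request


def create_and_start_crawler(request):
    """Send the request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are hypothetical placeholders.
request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",
)
print(request)
```

Calling `create_and_start_crawler(request)` from an environment with suitable permissions would register the crawler and trigger a first run; the tables it discovers would then appear in the Data Catalog.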

In comparison, EMR is a big data platform designed to reduce the cost of processing and analysing huge amounts of data. It is a managed service in which you configure your own cluster of EC2 instances. You have complete control over the configuration and can install Hadoop ecosystem components, which makes EMR an incredibly flexible, and correspondingly complex, service. Its use cases are vast: data scientists can use EMR to run machine learning jobs with the TensorFlow library, analysts can run SQL queries on Presto, engineers can use EMR’s integration with streaming tools such as Kinesis or Spark, and the list goes on!

You could replace Glue with EMR, but not vice versa: EMR has far more capabilities than its serverless counterpart.

Another thing to consider when choosing between these tools is cost. Glue is more expensive than EMR when comparing similar cluster configurations, probably because you’re paying for the serverless privilege and ease of setup.
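
The shape of that comparison can be sketched with a little arithmetic. Glue bills per DPU-hour, while an EMR cluster costs the EC2 instance price plus a per-instance EMR fee. The rates below are made-up placeholders chosen for round numbers, not current AWS prices, and treating one DPU as roughly equivalent to one EMR node is itself an assumption; substitute real figures for your region and instance type before drawing conclusions.

```python
def glue_job_cost(dpus, hours, price_per_dpu_hour):
    """Cost of a Glue job: DPUs x runtime x DPU-hour rate."""
    return dpus * hours * price_per_dpu_hour


def emr_cluster_cost(nodes, hours, ec2_price_per_hour, emr_fee_per_hour):
    """Cost of an EMR cluster: each node pays the EC2 rate plus the EMR fee."""
    return nodes * hours * (ec2_price_per_hour + emr_fee_per_hour)


# Placeholder rates in USD per hour, NOT real AWS prices.
glue = glue_job_cost(dpus=10, hours=2, price_per_dpu_hour=0.40)
emr = emr_cluster_cost(nodes=10, hours=2,
                       ec2_price_per_hour=0.20, emr_fee_per_hour=0.05)

print(f"Glue: ${glue:.2f}  EMR: ${emr:.2f}  difference: ${glue - emr:.2f}")
```

The sketch ignores per-job billing minimums, cluster start-up time and idle capacity, all of which can shift the result in either direction.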

Drop reported a reduction in cold start time and an 80% reduction in cost after migrating its Data Lake solution from Glue to EMR.

At the time of writing, only three Glue worker types are available for configuration, providing a maximum of 32 GB of executor memory. This restriction may become problematic if your business logic involves complex joins: if a join isn’t optimised for performance, executor memory can quickly be consumed and the job may fail. The same can occur if you have to unpack a very large zip or gzip file, because these formats cannot be split and all of the data will be held on one node (such is the workings of Spark!).
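
The reason a gzip file ends up on a single node is that the format is not splittable: a reader cannot start decoding from an arbitrary byte offset, so the work cannot be divided among executors. The following standard-library sketch illustrates the property on a small in-memory example; the sample rows are invented for the demonstration.

```python
import gzip
import zlib


def can_decompress_from(blob, offset):
    """Return True if the gzip data can be decoded starting at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # Without the header and the preceding compressed stream,
        # the remaining bytes are not decodable on their own.
        return False


# Invented sample data: 10,000 small CSV rows compressed as one gzip stream.
rows = "\n".join(f"{i},value-{i}" for i in range(10_000)).encode()
blob = gzip.compress(rows)

print("from start: ", can_decompress_from(blob, 0))
print("from middle:", can_decompress_from(blob, len(blob) // 2))
```

A second worker handed the back half of the file has nothing it can decode, which is why one task has to read the whole archive. Splittable alternatives (for example uncompressed text or columnar formats such as Parquet) avoid the bottleneck.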

In contrast, EMR has a plethora of supported instance types to choose from (although you’d still want to optimise joins to improve performance, and ideally avoid zip and gzip formats!).

One advantage of using AWS Glue is that it automatically sends logs to CloudWatch. This is very handy if your architecture uses multiple AWS services, as it gives you one centralised location for monitoring and alerting. EMR, on the other hand, sends logs to S3 by default, although you can install the CloudWatch agent via EMR’s bootstrap configuration.
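
As a rough illustration of the EMR side, the sketch below builds a cluster request for boto3's `run_job_flow` with an S3 log location and a bootstrap action. The bucket, the bootstrap script path (a script you would have to write yourself to install the CloudWatch agent), the release label and the instance type are all assumptions made for the example, and the AWS call sits in a function that is not invoked here.

```python
def build_cluster_request(name, log_uri, bootstrap_script,
                          instance_type="m5.xlarge", core_nodes=2):
    """Build keyword arguments for EMR's RunJobFlow API call."""
    return {
        "Name": name,
        "LogUri": log_uri,  # S3 location where EMR writes cluster logs
        "ReleaseLabel": "emr-6.15.0",  # assumed release, pick your own
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "Primary", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "BootstrapActions": [
            {"Name": "install-cloudwatch-agent",
             "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []}},
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request):
    """Send the request to AWS. Needs boto3 and valid credentials."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    return boto3.client("emr").run_job_flow(**request)["JobFlowId"]


# Bucket and script path are hypothetical placeholders.
request = build_cluster_request(
    name="etl-cluster",
    log_uri="s3://example-bucket/emr-logs/",
    bootstrap_script="s3://example-bucket/bootstrap/install_cloudwatch_agent.sh",
)
print(request["LogUri"], len(request["Instances"]["InstanceGroups"]))
```

Compared with Glue, this is noticeably more configuration to own, which reflects the trade-off between control and convenience described throughout the article.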

In conclusion, if your workforce is new to AWS configuration and you only want to execute simple ETL, Glue might be a sensible option. However, if you want to leverage Hadoop technologies and perform more complex transformations, EMR is the more viable solution.

Thank you for reading!

Translated from: https://medium.com/swlh/aws-glue-vs-emr-433b53872b30
