Getting started with large-scale ETL jobs using Dask and AWS EMR
Dask is an increasingly popular Python-ecosystem SDK for managing large-scale ETL jobs and ETL pipelines across multiple machines. Albeit somewhat newer than Apache Spark — its best-known competitor — Dask has captured a lot of mindshare in the data science community by virtue of its pandas- and numpy-like API, which makes it easy to use and familiar to Pythonic data practitioners.
In this tutorial, we will walk through setting up a Dask cluster on top of EMR (Elastic MapReduce), AWS’s distributed data platform, that we can interact with and submit jobs to from a JupyterLab notebook running on our local machine. We’ll then run some basic benchmarks on this cluster by performing a basic exploratory data analysis of NYC Open Data’s 2019 Yellow Taxi Trip Dataset.
Why EMR?
The Cloud Deployments page in the Dask docs covers your options for deploying Dask on the cloud. At the time of writing, the three options are: Kubernetes, EMR, and an ephemeral option using the “Dask Cloud Provider”.
My personal opinion is that EMR is the easiest way to get up and running with a distributed Dask cluster (if you want to experiment with it on a single machine, you can create a LocalCluster on your personal machine). Kubernetes is a complex service with a fairly steep learning curve, so I wouldn’t recommend going that route unless you’re already on a Kubernetes cluster and very familiar with how Kubernetes works.
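If you just want to kick the tires before paying for EMR, a LocalCluster takes only a few lines of Python. This is a minimal sketch, assuming dask and dask.distributed are installed locally; the worker count and memory limit here are arbitrary examples:

from dask.distributed import Client, LocalCluster

# Spin up a purely local "cluster" of worker processes on this machine.
cluster = LocalCluster(n_workers=4, threads_per_worker=1, memory_limit="2GB")
client = Client(cluster)

print(client)                  # summary of workers, threads, and memory
print(cluster.dashboard_link)  # local diagnostics dashboard URL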
Note that it’s also possible to deploy Dask on Google Cloud Dataproc or Azure HDInsight — any service that provides managed YARN will work — but there isn’t any specific documentation on these alternative services at the moment.
How EMR works
EMR, short for “Elastic Map Reduce”, is AWS’s big data as a service platform. Here’s how it works.
One of AWS's core offerings is EC2, which provides an API for reserving machines (so-called instances) on the cloud. EC2 provides a wide variety of options, ranging from tiny burstable shared CPUs (e.g. t2.micro) to beefy (and expensive!) GPU servers (e.g. p3.16xlarge). As a first step to launching an EMR cluster, consider what EC2 instance types you will use. For the purposes of this tutorial, I will launch a cluster with one m5.xlarge master node and two m5.xlarge worker nodes (m5.xlarge is AWS's recommended general-purpose CPU instance type). Note that when running on EMR, one of the instances will be reserved for the master node. The remainder become the worker pool.
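If you want to double-check an instance type's vCPU and memory specs programmatically rather than in the EC2 console, boto3 can look them up. A small sketch, assuming boto3 is installed and your AWS credentials are configured; the region is just an example:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
resp = ec2.describe_instance_types(InstanceTypes=["m5.xlarge"])
info = resp["InstanceTypes"][0]

print(info["VCpuInfo"]["DefaultVCpus"])         # 4 vCPUs
print(info["MemoryInfo"]["SizeInMiB"] // 1024)  # 16 GiB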
The next step is choosing your applications. The EMR API lets you reserve a set of EC2 machines that have been preconfigured with certain software onboard. The most popular application is probably HADOOP. Hadoop, which introduced the world to MapReduce way back in 2006, was an important predecessor to Apache Spark. Modern Hadoop consists of several reusable sub-components, many of which are now used by other tools. Two of these have proved to be particularly popular: the Hadoop Distributed File System, HDFS, and the Hadoop scheduling engine, YARN (short for Yet Another Resource Negotiator — very tongue-in-cheek). When you hear someone talk about a tool as being part of the "Hadoop ecosystem", they mean that it uses some Hadoop services as part of its service architecture.
The last major thing to consider is your bootstrap script. The bootstrap script executes on every instance in the cluster immediately after machine initialization is complete, and it’s how you configure the instances with any cluster-wide customizations you want to make yourself.
Let's now look at a concrete example. This example AWS CLI script launches a transient Spark cluster on three m5.xlarge instances on EMR, runs a pyspark_job.py, and then auto-terminates:
$ aws emr create-cluster --name "Spark cluster with step" \
--release-label emr-5.24.1 \
--applications Name=Spark \
--log-uri s3://your-bucket/logs/ \
--ec2-attributes KeyName=your-key-pair \
--instance-type m5.xlarge \
--instance-count 3 \
--bootstrap-actions Path=s3://your-bucket/emr_bootstrap.sh \
--steps Type=Spark,Name="Spark job",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://your-bucket/pyspark_job.py] \
--use-default-roles \
--auto-terminate
There are a few other things going on here worth understanding:
- This script uses the Spark application and will auto-terminate on completion. To make the cluster persistent, remove the --auto-terminate flag.
- S3 acts as both a source (for the bootstrap and execution scripts) and a sink (for logs). You will need to set up a bucket with the right resources and the right access credentials ahead of time.
- This command uses the optional steps API. steps is an EMR feature that lets you submit well-formed jobs to the cluster right from the AWS CLI. Note that this is just a convenience feature: you can also submit jobs to the cluster directly, e.g. using the Spark SDK.
- It's important to pass an EC2 keypair name to the ec2-attributes argument. This key can then be used for SSH access to the cluster master.
That should be enough EMR to get started.
How Dask deployment works
The Dask API has a concept of a cluster — a backend powering Dask's distributed compute. Dask supports a few different types of clusters; the one we are interested in is the YarnCluster, which lets you set up Dask wherever Hadoop Yarn is up and running. You then connect a client to that cluster, which exposes an interface you can submit jobs to. Here's a minimal example showing this process in action:
from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(worker_vcores=1, worker_memory="3GB")
client = Client(cluster)
cluster.scale(8)
Notice that YarnCluster takes a variety of arguments configuring the cluster workers. Here we specify that we want each worker to get a single core and 3GB of memory, and that we want to scale this cluster up to 8 workers total.
Dask will submit this application to Yarn, which takes care of scheduling these workers onto the machines. We have two m5.xlarge worker machines, each with four vCPU cores and 16 GB of RAM. This configuration will cause Yarn to schedule four workers per machine — this works out to one worker per core and 12 GB total per machine (this is close to optimal, as it leaves 4 GB of overhead per machine for the OS and other background processes).
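Here is that layout as a quick back-of-the-envelope calculation (plain arithmetic, nothing Dask-specific):

# One m5.xlarge worker machine: 4 vCPUs, 16 GB RAM.
# Worker settings from the YarnCluster call above: 1 vcore, 3 GB each.
vcores_per_machine, ram_per_machine_gb = 4, 16
worker_vcores, worker_memory_gb = 1, 3

workers_per_machine = vcores_per_machine // worker_vcores   # 4 workers
dask_memory_gb = workers_per_machine * worker_memory_gb     # 12 GB for Dask
headroom_gb = ram_per_machine_gb - dask_memory_gb           # 4 GB for the OS etc.
print(workers_per_machine, dask_memory_gb, headroom_gb)     # 4 12 4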
Dask will ultimately execute our jobs as Python code on the worker machines. How do we ensure that each worker machine has the exact same Python environment? Dask does something very clever here: it leverages the conda-pack or venv-pack tools to do it for you. These command-line tools allow you to pack your current conda or virtualenv environment into a portable tar.gz file. At client initialization time, Dask looks for this file (the default location is $HOME/environment.tar.gz), beams it to the worker nodes, and unpacks it in-situ.
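If you ever need to build that archive yourself (say, to ship extra libraries to the workers), conda-pack also exposes a small Python API. A sketch under the assumption that conda-pack is installed and that "dask-env" is the name of a conda environment containing dask and dask-yarn:

import os
import conda_pack

# Pack the (hypothetical) "dask-env" environment into the archive location
# that dask-yarn looks for by default.
conda_pack.pack(
    name="dask-env",
    output=os.path.expanduser("~/environment.tar.gz"),
)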
Deploying the cluster
With that background out of the way, we are ready to deploy a cluster.
Before you begin, you will need:
- An AWS account (obviously).
- An EC2 keypair, whose private key you have downloaded to somewhere on your local machine. For instructions on creating one, refer here in the docs.
- An S3 bucket you have access to.
The Dask team maintains an example bootstrap script in the dask-yarn repo. To begin, download this script, modify it slightly to fix a known issue with the conda download URL inside, and upload the modified result to your S3 bucket:
$ curl https://raw.githubusercontent.com/dask/dask-yarn/master/deployment_resources/aws-emr/bootstrap-dask \
> bootstrap.sh
$ cat bootstrap.sh | \
sed -E "s|https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh|https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh|" \
> 0_bootstrap.sh
$ aws s3 cp 0_bootstrap.sh s3://your-bucket/bootstrap/0_bootstrap.sh
Next, we will need to create and configure an IAM role. The cluster will use this role for accessing the services it needs to run (primarily, EC2 start/stop and S3 read/write):
$ aws emr create-default-roles
The following incantation launches the cluster:
$ aws emr create-cluster --name "My Dask Trial Cluster" \
--release-label emr-5.30.1 \
--applications Name=HADOOP \
--log-uri s3://aleksey-emr-dask/logs/ \
--ec2-attributes KeyName=alekseys-secret-key \
--instance-type m5.xlarge \
--instance-count 3 \
--bootstrap-actions Path=s3://aleksey-emr-dask/bootstrap/0_bootstrap.sh,Args="[--conda-packages,bokeh,fastparquet,python-snappy,snappy]" \
--use-default-roles
We’ve already discussed EMR in some detail, so this should look familiar. :)
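If you would rather script the launch from Python than shell out to the AWS CLI, boto3's EMR client exposes roughly the same knobs via run_job_flow. The sketch below mirrors the command above; the bucket, key pair, and bootstrap path are placeholders, and you should treat the exact parameter set as an assumption to verify against the boto3 docs:

import boto3

emr = boto3.client("emr", region_name="us-east-2")

response = emr.run_job_flow(
    Name="My Dask Trial Cluster",
    ReleaseLabel="emr-5.30.1",
    Applications=[{"Name": "Hadoop"}],
    LogUri="s3://your-bucket/logs/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "Ec2KeyName": "your-key-pair",
        "KeepJobFlowAliveWhenNoSteps": True,  # persistent cluster (no auto-terminate)
    },
    BootstrapActions=[{
        "Name": "bootstrap-dask",
        "ScriptBootstrapAction": {
            "Path": "s3://your-bucket/bootstrap/0_bootstrap.sh",
            "Args": ["--conda-packages", "bokeh", "fastparquet",
                     "python-snappy", "snappy"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # the new cluster's ID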
Accessing the cluster remotely
Before we start using the cluster, we first need to SSH into the master node to perform some further configuration.
Once the cluster has finished bootstrapping, the EMR page in your web console will provide the public IP address of your master node. You can connect to the instance over that IP address by running a command like the following:
$ ssh -i /Users/me/Desktop/my-secret-key.pem \
hadoop@ec2-18-223-211-150.us-east-2.compute.amazonaws.com
Replace /Users/me/Desktop/my-secret-key.pem with the path to the EC2 keypair secret you created earlier, and hadoop@ec2-18-223-211-150.us-east-2.compute.amazonaws.com with hadoop@YOUR_IP_ADDRESS.YOUR_REGION.compute.amazonaws.com.
You will probably need to configure the security group associated with the instance to include your personal machine’s IP address before this will work (if you don’t, the connection attempt will just time out). See the corresponding page in the AWS User Guide for details (you can also try using this script).
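If you prefer to script that security group change instead of clicking through the console, here is a minimal boto3 sketch; the security group ID and the IP address are placeholders you must replace with your own values:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")

# Allow inbound SSH (port 22) from your personal machine's public IP only.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # the master node's security group (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.10/32"}],  # your public IP (placeholder)
    }],
)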
Successfully logging into the cluster prints a cute bit of ASCII art:
Last login: Thu Jul 30 23:02:54 2020
__| __|_ )
_| ( / Amazon Linux 2 AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-2/
18 package(s) needed for security, out of 71 available
Run "sudo yum update" to apply all updates.
EEEEEEEEEEEEEEEEEEEE MMMMMMMM MMMMMMMM RRRRRRRRRRRRRRR
E::::::::::::::::::E M:::::::M M:::::::M R::::::::::::::R
EE:::::EEEEEEEEE:::E M::::::::M M::::::::M R:::::RRRRRR:::::R
E::::E EEEEE M:::::::::M M:::::::::M RR::::R R::::R
E::::E M::::::M:::M M:::M::::::M R:::R R::::R
E:::::EEEEEEEEEE M:::::M M:::M M:::M M:::::M R:::RRRRRR:::::R
E::::::::::::::E M:::::M M:::M:::M M:::::M R:::::::::::RR
E:::::EEEEEEEEEE M:::::M M:::::M M:::::M R:::RRRRRR::::R
E::::E M:::::M M:::M M:::::M R:::R R::::R
E::::E EEEEE M:::::M MMM M:::::M R:::R R::::R
EE:::::EEEEEEEE::::E M:::::M M:::::M R:::R R::::R
E::::::::::::::::::E M:::::M M:::::M RR::::R R::::R
EEEEEEEEEEEEEEEEEEEE MMMMMMM MMMMMMM RRRRRRR RRRRRR
[hadoop@ip-172-31-13-47 ~]$
Write an example client initialization script to disk and execute it:
$ cat <<EOT >> $HOME/test.py
from dask_yarn import YarnCluster
from dask.distributed import Client
import time
# Create a cluster
# cluster = YarnCluster()
cluster = YarnCluster(worker_vcores=1, worker_memory="3GB")
# Connect to the cluster
client = Client(cluster)
# Scale workers
cluster.scale(8)
# Print client and dashboard link.
print("Client __repr__: ", client)
print("Dashboard link: ", cluster.dashboard_link)
time.sleep(100000)
EOT
$ rm $HOME/.config/dask/yarn.yaml # ensure default config
$ python3 test.py
Assuming everything worked, you’ve now successfully launched a Dask cluster on AWS EMR!
However, we still can’t interact with Dask outside of the EMR cluster. For that we will need to do one more thing: set up SSH port forwarding.
Accessing the cluster locally
Dask has great support for integrations with the Swiss army knife of data science, Jupyter Notebooks. In this section we will see how we can use dask-labextension and SSH port forwarding to interact with and submit jobs to our cluster right from a Jupyter notebook on our local machine!
When you launch the cluster script, part of its output will be a log line that looks something like this:
distributed.scheduler - INFO - Scheduler at: tcp://172.31.10.101:42577
distributed.scheduler - INFO - dashboard at: :45475
When you execute the cluster script, Dask launches two separate processes in the background: a scheduler that performs task management, and a dashboard connected to that scheduler offering a suite of visualizations for monitoring your cluster. Each of these processes listens on a different port on the cluster master node.
However, these ports are local to the master node and are not exposed externally. To make Dask visible on our local machine, we need to use port forwarding. Port forwarding is an SSH feature that allows traffic from one port on a local machine to transparently (and securely — all traffic is encrypted) route to some other port on a remote machine. Here’s how:
# replace the port numbers and key paths here with the values
# specific to your cluster
# port forwarding for the dashboard
$ ssh -i /Users/alekseybilogur/Desktop/ec2-keys/spell2/alekseys-secret-key.pem \
-N -L 8157:ec2-3-22-181-59.us-east-2.compute.amazonaws.com:38943 \
hadoop@ec2-3-22-181-59.us-east-2.compute.amazonaws.com
# port forwarding for the scheduler
$ ssh -i /Users/alekseybilogur/Desktop/ec2-keys/spell2/alekseys-secret-key.pem \
-N -L 8158:ec2-3-22-181-59.us-east-2.compute.amazonaws.com:45909 \
hadoop@ec2-3-22-181-59.us-east-2.compute.amazonaws.com
You should now be able to connect to your EMR Dask cluster from your local machine thusly:
from dask.distributed import Client
client = Client("tcp://localhost:8158")
To verify that everything is working as expected, try printing the client object inside of a cell in a Jupyter Notebook — you should get a nicely formatted summary card that looks something like this:
The dask-labextension provides a native Dask dashboard-like experience from inside of a Jupyter notebook. Once you've installed and enabled the extension (see the repo README for instructions) you can paste the forwarded dashboard address, localhost:8157, into the extension side panel to enable the connection to the cluster (see this SO question for details).
To learn more about the features dask-labextension brings to the table, I highly recommend checking out the dask-labextension quickstart video. Here's a screenshot from that video showing it in action:
You’re now all set up and ready to get running with your new Dask cluster!
Taking your cluster for a test drive
To test out Dask performance, I downloaded a copy of NYC Open Data’s 2019 Yellow Taxi Trips dataset to my local disk. I switched the pickup and dropoff timestamps from string to datetime format, and partitioned the result into twelve parquet files, one per month:
import pandas as pd

taxi = pd.read_csv("~/Downloads/2019_Yellow_Taxi_Trip_Data.csv")
taxi.tpep_pickup_datetime = pd.to_datetime(
    taxi.tpep_pickup_datetime, format='%m/%d/%Y %I:%M:%S %p'
)
taxi.tpep_dropoff_datetime = pd.to_datetime(
    taxi.tpep_dropoff_datetime, format='%m/%d/%Y %I:%M:%S %p'
)

taxi[taxi['tpep_pickup_datetime'] < pd.to_datetime('02/01/2019')]\
    .to_parquet("01-2019-trips.parq")
# repeat for each month of the year...
taxi[(pd.to_datetime('12/01/2019') <= taxi['tpep_pickup_datetime']) &
     (taxi['tpep_pickup_datetime'] < pd.to_datetime('01/01/2020'))]\
    .to_parquet("12-2019-trips.parq")
This is a good dataset to demonstrate the capacities of Dask because it's large, but tidy. Each row in the dataset is a single taxi cab trip, and there are ~84 million such trips in the dataset. Unpacked into pandas memory, this dataset takes up ~16 GB of RAM total, which the 16 GB of RAM on my laptop can just barely handle (with the help of swap space).
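If you want to reproduce that memory estimate on your own copy of the data, pandas will report it directly. A one-off sketch, assuming the taxi dataframe from the preprocessing snippet above is still in memory:

# Total in-memory footprint, including object (string) columns.
in_memory_gb = taxi.memory_usage(deep=True).sum() / 1e9
print(f"{len(taxi):,} rows, ~{in_memory_gb:.1f} GB in pandas memory")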
This dataset is too big to work with on my local machine. That’s where Dask comes to the rescue. I used AWS S3 Sync to upload the partitioned Parquet files to the cloud:
$ aws s3 sync "~/Desktop/2019-taxi-trips/" \
"s3://aleksey-emr-dask/data/2019-taxi-dataset/"
The dask.dataframe.read_parquet method can then read partitioned Parquet files out of an S3 directory straight into Dask cluster memory. This is as simple as:
import dask.dataframe as dd
df = dd.read_parquet(
's3://aleksey-emr-dask/data/2019-taxi-dataset/',
storage_options={
'key': 'YOUR_AWS_KEY',
'secret': 'YOUR_AWS_SECRET_KEY'
},
engine='fastparquet'
)
A couple of important hidden “gotchas” to keep in mind here:
- pd.DataFrame.to_parquet uses snappy compression by default. Dask supports this compression format, but you need to install the snappy and python-snappy peer dependencies from conda-forge first. I did this for you in the aws emr create-cluster command we ran earlier.
- dask.dataframe.read_parquet supports the fastparquet and pyarrow engines. We use fastparquet because pyarrow does not yet support user-partitioned Parquet files out of the box (it expects partitioning to be done programmatically, so certain metadata files written by e.g. Apache Hive Metastore must also exist; see this page in their docs for details).
It takes just about four seconds for Dask to value_counts all 84 million rows in this dataset.
%%time
passenger_counts = df.trip_distance.value_counts().compute()
CPU times: user 16.9 ms, sys: 2.51 ms, total: 19.4 ms
Wall time: 4.03 s
%%time
fare_amount = df.fare_amount.value_counts().compute()
CPU times: user 17.5 ms, sys: 2.46 ms, total: 19.9 ms
Wall time: 4.16 s
%%time
tip_amount = df.tip_amount.value_counts().compute()
CPU times: user 16 ms, sys: 2.04 ms, total: 18 ms
Wall time: 3.84 s
Other ETL operations in Dask will be slower or faster, depending on both the exact cluster configuration used and on the complexity of the (parallelized) algorithm. For more details, check out the DataFrame API or Best Practices pages in the Dask documentation for tips and tricks on performance.
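As one slightly heavier example than value_counts, here is a sketch of a groupby aggregation on the same dataframe. The column names come from this dataset; the timing will depend on your cluster size:

# Mean tip amount by pickup hour, computed lazily on the cluster and
# materialized on the client with .compute(). Groupby aggregations shuffle
# more data between workers than value_counts, so expect them to be slower.
mean_tip_by_hour = (
    df.assign(pickup_hour=df.tpep_pickup_datetime.dt.hour)
      .groupby("pickup_hour")
      .tip_amount.mean()
      .compute()
)
print(mean_tip_by_hour.head())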
Conclusion
In this tutorial, we configured and deployed a Dask cluster on Hadoop Yarn on AWS EMR, using it to perform some basic EDA on 84 million rows of data in just a handful of seconds.
Distributed Dask clusters are one of the most popular and powerful tools for managing ETL jobs on large-scale datasets. Better yet, NVIDIA’s RAPIDS initiative is bringing Dask to GPU clusters as well. This promises to push the envelope on what’s possible even further: the blog post “Dask and RAPIDS: The Next Big Thing for Machine Learning and Data Science at Capital One” demonstrates some really impressive speedups.
Note: This article originally appeared on the Spell.ML Blog.
Enjoyed this article? Check out some more stories from the Spell blog:
Scaling model training in PyTorch using distributed data parallel
It’s 2020, why isn’t deep learning 100% on the cloud yet?